Tomato Soup

How to Query File Attributes 50x faster on Windows

Imagine you’re developing a tool that needs to scan for file changes across thousands of project files. Retrieving file attributes efficiently becomes critical for such scenarios. In this article, I’ll demonstrate a technique to get file attributes that can achieve a surprising speedup of over 50 times compared to standard Windows methods.

Let’s dive in and explore how we can achieve this.

This is a blog post made in collaboration with Bartlomiej Filipek from C++ stories. You can visit his blog here.

The inspiration

The inspiration for this article came from a recent update for Visual Assist – a tool that heavily improves Visual Studio experience and productivity for C# and C++ developers.

In one of their blog posts, they shared:

The initial parse is 10..15x faster!

What’s New in Visual Assist 2024—Featuring lightning fast parser performance [Webinar] – Tomato Soup

After watching the webinar, I noticed some details about efficiently getting file attributes and decided to give it a try on my machine. In other words, I tried to recreate their results.

Disclaimer: Idera, the company behind Visual Assist, helped me write this post and sponsored it.

Understanding File Attribute Retrieval Methods on Windows

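On Windows, there are at least a few options to check for a file change: FindFirstFileEx, GetFileAttributesEx, GetFileInformationByHandleEx, and the portable std::filesystem library.

Below, you can see the basic usage of each approach: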

FindFirstFileEx

FindFirstFileEx is a Windows API function that allows for efficient searching of directories. It retrieves information about files that match a specified file name pattern. The function can be used with different information levels, such as FindExInfoBasic and FindExInfoStandard, to control the amount of file information fetched.

WIN32_FIND_DATA findFileData;
HANDLE hFind = FindFirstFileEx((directory + "\\*").c_str(), FindExInfoBasic, &findFileData, FindExSearchNameMatch, NULL, 0);

if (hFind != INVALID_HANDLE_VALUE) {
    do {
        // Process file information
    } while (FindNextFile(hFind, &findFileData) != 0);
    FindClose(hFind);
}

You can also pass FIND_FIRST_EX_LARGE_FETCH as the additional flag (the last parameter) to indicate that the function should use a larger buffer, which might bring some extra performance.

GetFileAttributesEx

GetFileAttributesEx is another Windows API function that retrieves file attributes for a specified file or directory. Unlike FindFirstFileEx, which is used for searching and listing files, GetFileAttributesEx is typically used for retrieving attributes of a single file or directory.

WIN32_FILE_ATTRIBUTE_DATA fileAttributeData;
if (GetFileAttributesEx((directory + "\\" + fileName).c_str(), GetFileExInfoStandard, &fileAttributeData)) {
    // Process file attributes
}

GetFileInformationByHandleEx

GetFileInformationByHandleEx is a low-level routine that might be tricky to use, but gives us more control over the iteration. The main idea is to get a large buffer of data and read it on the application side, rather than rely on sometimes costly kernel/system calls.

Assuming you have a file open, which is a directory, you can iterate over its children in the following way:

while (true) {
    // Always fill the buffer from its start, since pInfo moves while iterating
    if (!GetFileInformationByHandleEx(
        hDir,
        FileFullDirectoryInfo,
        buffer,
        sizeof(buffer))) {
        DWORD error = GetLastError();
        if (error != ERROR_NO_MORE_FILES) {
            std::wcerr << L"GetFileInformationByHandleEx failed (" << error << L")\n";
        }
        break;
    }

    // Walk all entries returned in this batch
    pInfo = reinterpret_cast<FILE_FULL_DIR_INFO*>(buffer);
    while (true) {
        if (!(pInfo->FileAttributes & FILE_ATTRIBUTE_DIRECTORY)) {
            FileInfo fileInfo;
            fileInfo.fileName = std::wstring(pInfo->FileName, pInfo->FileNameLength / sizeof(WCHAR));
            FILETIME ft{};
            ft.dwLowDateTime = pInfo->LastWriteTime.LowPart;
            ft.dwHighDateTime = pInfo->LastWriteTime.HighPart;
            fileInfo.lastWriteTime = ft;
            files.push_back(fileInfo);
        }
        // NextEntryOffset == 0 marks the last entry, which was processed above
        if (pInfo->NextEntryOffset == 0) {
            break;
        }
        pInfo = reinterpret_cast<FILE_FULL_DIR_INFO*>(
            reinterpret_cast<BYTE*>(pInfo) + pInfo->NextEntryOffset);
    }
}
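The buffer filled by GetFileInformationByHandleEx is a chain of variable-length records, each storing the byte offset to the next one, with 0 marking the last record. To experiment with that traversal without any Windows headers, here is a minimal portable simulation. Note that RecordHeader, AppendRecord and CollectNames are hypothetical names made up for this sketch, and the record layout is a simplified stand-in, not the real FILE_FULL_DIR_INFO.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for FILE_FULL_DIR_INFO: a fixed header
// followed by nameLength bytes of file name (not null-terminated).
struct RecordHeader {
    std::uint32_t nextEntryOffset; // bytes from this record to the next, 0 for the last
    std::uint32_t nameLength;      // length of the name in bytes
};

// Appends one record at an 8-byte aligned position and links the previous
// record to it. lastOffset tracks where the most recent record starts.
void AppendRecord(std::vector<std::uint8_t>& buffer, std::size_t& lastOffset,
                  const std::string& name) {
    const std::size_t start = (buffer.size() + 7) & ~std::size_t{7};
    if (!buffer.empty()) {
        // nextEntryOffset is the first field, so it lives at the record start.
        const auto delta = static_cast<std::uint32_t>(start - lastOffset);
        std::memcpy(buffer.data() + lastOffset, &delta, sizeof(delta));
    }
    buffer.resize(start + sizeof(RecordHeader) + name.size());
    const RecordHeader header{0, static_cast<std::uint32_t>(name.size())};
    std::memcpy(buffer.data() + start, &header, sizeof(header));
    std::memcpy(buffer.data() + start + sizeof(header), name.data(), name.size());
    lastOffset = start;
}

// Walks the chain and collects every name, including the last record.
std::vector<std::string> CollectNames(const std::vector<std::uint8_t>& buffer) {
    std::vector<std::string> names;
    if (buffer.empty()) {
        return names;
    }
    std::size_t offset = 0;
    while (true) {
        RecordHeader header;
        std::memcpy(&header, buffer.data() + offset, sizeof(header));
        names.emplace_back(
            reinterpret_cast<const char*>(buffer.data() + offset + sizeof(header)),
            header.nameLength);
        if (header.nextEntryOffset == 0) {
            break; // the last record has already been processed
        }
        offset += header.nextEntryOffset;
    }
    return names;
}
```

The important detail is the order of operations: process the current record first, then check the offset, and only then advance. Checking the offset of the next record before processing it would silently drop the last entry of every batch.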

std::filesystem

Introduced in C++17, the std::filesystem library provides a modern and portable way to interact with the file system. It includes functions for file attribute retrieval, directory iteration, and other common file system operations.

for (const auto& entry : fs::directory_iterator(directory)) {
    if (entry.is_regular_file()) {
        // Process file attributes
        auto ftime = fs::last_write_time(entry);
        ...
    }
}

The Benchmark

To evaluate the performance of different file attribute retrieval methods, I developed a small benchmark. This application measures the time taken by each method to retrieve file attributes for N files in a specified directory.

Here’s a rough overview of the code:

The FileInfo struct stores the file name and last write time.

struct FileInfo {
    std::wstring fileName;
    std::variant<FILETIME, std::filesystem::file_time_type> lastWriteTime;
};

Each retrieval technique will have to go over a directory and build a vector of FileInfo objects.
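Once such a vector exists, detecting changes comes down to comparing two snapshots of the same directory. Below is a rough sketch of that comparison. SnapshotEntry and FindChangedFiles are hypothetical names for this example, and the timestamp is stored as a plain 64-bit tick count (for instance, the two halves of a FILETIME combined) so that the code stays portable.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical portable stand-in for FileInfo used in this sketch.
struct SnapshotEntry {
    std::wstring fileName;
    std::uint64_t lastWriteTicks; // e.g. FILETIME high and low parts combined
};

// Returns the names of files that are new or whose timestamp differs
// between the two snapshots. Deleted files are not reported.
std::vector<std::wstring> FindChangedFiles(const std::vector<SnapshotEntry>& before,
                                           const std::vector<SnapshotEntry>& after) {
    std::unordered_map<std::wstring, std::uint64_t> previous;
    previous.reserve(before.size());
    for (const auto& entry : before) {
        previous.emplace(entry.fileName, entry.lastWriteTicks);
    }

    std::vector<std::wstring> changed;
    for (const auto& entry : after) {
        const auto it = previous.find(entry.fileName);
        if (it == previous.end() || it->second != entry.lastWriteTicks) {
            changed.push_back(entry.fileName);
        }
    }
    return changed;
}
```

With a hash map the comparison is linear in the number of files, so the cost of a scan stays dominated by how quickly the attributes can be fetched, which is what the benchmark measures.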

BenchmarkFindFirstFileEx

void BenchmarkFindFirstFileEx(const std::wstring& directory,
                              std::vector<FileInfo>& files,
                              FINDEX_INFO_LEVELS infoLevel,
                              DWORD additionalFlags)
{
   WIN32_FIND_DATAW findFileData;
   HANDLE hFind = FindFirstFileExW((directory + L"\\*").c_str(),
                                   infoLevel,
                                   &findFileData,
                                   FindExSearchNameMatch, NULL,
                                   additionalFlags);

   if (hFind == INVALID_HANDLE_VALUE) {
       std::wcerr << L"FindFirstFileEx failed ("
                  << GetLastError() << L")\n";
       return;
   }

   do {
       // Skip subdirectories (including "." and "..")
       if (!(findFileData.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)) {
           FileInfo fileInfo;
           fileInfo.fileName = findFileData.cFileName;
           fileInfo.lastWriteTime = findFileData.ftLastWriteTime;
           files.push_back(fileInfo);
       }
   } while (FindNextFileW(hFind, &findFileData) != 0);

   FindClose(hFind);
}

BenchmarkGetFileAttributesEx

void BenchmarkGetFileAttributesEx(const std::wstring& directory,
                                  std::vector<FileInfo>& files)
{
   WIN32_FIND_DATAW findFileData;
   HANDLE hFind = FindFirstFileW((directory + L"\\*").c_str(),
                                 &findFileData);

   if (hFind == INVALID_HANDLE_VALUE) {
       std::wcerr << L"FindFirstFile failed ("
                  << GetLastError() << L")\n";
       return;
   }

   do {
       if (!(findFileData.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)) {
           // Query the attributes again, one call per file
           WIN32_FILE_ATTRIBUTE_DATA fileAttributeData;
           if (GetFileAttributesExW((directory + L"\\" + findFileData.cFileName).c_str(), GetFileExInfoStandard, &fileAttributeData)) {
               FileInfo fileInfo;
               fileInfo.fileName = findFileData.cFileName;
               fileInfo.lastWriteTime = fileAttributeData.ftLastWriteTime;
               files.push_back(fileInfo);
           }
       }
   } while (FindNextFileW(hFind, &findFileData) != 0);

   FindClose(hFind);
}

BenchmarkStdFilesystem

And the last one, the most portable technique:

void BenchmarkStdFilesystem(const std::wstring& directory,
                            std::vector<FileInfo>& files)
{
    for (const auto& entry : std::filesystem::directory_iterator(directory)) {
        if (entry.is_regular_file()) {
            FileInfo fileInfo;
            fileInfo.fileName = entry.path().filename().wstring();
            // The variant in FileInfo stores file_time_type directly
            fileInfo.lastWriteTime = std::filesystem::last_write_time(entry);
            files.push_back(fileInfo);
        }
    }
}

BenchmarkGetFileInformationByHandleEx

void BenchmarkGetFileInformationByHandleEx(const std::wstring& directory, std::vector<FileInfo>& files) {
    HANDLE hDir = CreateFileW(
        directory.c_str(),
        GENERIC_READ,
        FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
        NULL,
        OPEN_EXISTING,
        FILE_FLAG_BACKUP_SEMANTICS,
        NULL
    );

    if (hDir == INVALID_HANDLE_VALUE) {
        std::wcerr << L"CreateFile failed (" << GetLastError() << L")\n";
        return;
    }

    constexpr DWORD BufferSize = 64 * 1024;
    // Align the raw buffer so it can be read as FILE_FULL_DIR_INFO records
    alignas(FILE_FULL_DIR_INFO) uint8_t buffer[BufferSize];

    while (true) {
        // Always fill the buffer from its start
        if (!GetFileInformationByHandleEx(
            hDir,
            FileFullDirectoryInfo,
            buffer,
            sizeof(buffer))) {
            DWORD error = GetLastError();
            if (error != ERROR_NO_MORE_FILES) {
                std::wcerr << L"GetFileInformationByHandleEx failed (" << error << L")\n";
            }
            break;
        }

        // Walk all entries returned in this batch
        auto* pInfo = reinterpret_cast<FILE_FULL_DIR_INFO*>(buffer);
        while (true) {
            if (!(pInfo->FileAttributes & FILE_ATTRIBUTE_DIRECTORY)) {
                FileInfo fileInfo;
                fileInfo.fileName = std::wstring(pInfo->FileName, pInfo->FileNameLength / sizeof(WCHAR));
                FILETIME ft{};
                ft.dwLowDateTime = pInfo->LastWriteTime.LowPart;
                ft.dwHighDateTime = pInfo->LastWriteTime.HighPart;
                fileInfo.lastWriteTime = ft;
                files.push_back(fileInfo);
            }
            // NextEntryOffset == 0 marks the last entry, which was processed above
            if (pInfo->NextEntryOffset == 0) {
                break;
            }
            pInfo = reinterpret_cast<FILE_FULL_DIR_INFO*>(
                reinterpret_cast<BYTE*>(pInfo) + pInfo->NextEntryOffset);
        }
    }

    CloseHandle(hDir);
}

The Main Function

The main function sets up the benchmarking environment, runs the benchmarks, and prints the results.

std::wstring directory = argv[1];
const auto arg2 = argc > 2 ? std::wstring_view(argv[2]) : std::wstring_view{};

std::vector<std::pair<std::wstring, std::function<void(std::vector<FileInfo>&)>>> benchmarks = {
    {L"FindFirstFileEx (Basic)", [&](std::vector<FileInfo>& files) {
        BenchmarkFindFirstFileEx(directory, files, FindExInfoBasic, 0);
    }},
    {L"FindFirstFileEx (Standard)", [&](std::vector<FileInfo>& files) {
        BenchmarkFindFirstFileEx(directory, files, FindExInfoStandard, 0);
    }},
    {L"FindFirstFileEx (Large Fetch)", [&](std::vector<FileInfo>& files) {
        BenchmarkFindFirstFileEx(directory, files, FindExInfoStandard, FIND_FIRST_EX_LARGE_FETCH);
    }},
    {L"GetFileAttributesEx", [&](std::vector<FileInfo>& files) {
        BenchmarkGetFileAttributesEx(directory, files);
    }},
    {L"std::filesystem", [&](std::vector<FileInfo>& files) {
        BenchmarkStdFilesystem(directory, files);
    }},
    {L"GetFileInformationByHandleEx", [&](std::vector<FileInfo>& files) {
        BenchmarkGetFileInformationByHandleEx(directory, files);
    }}
};

std::vector<std::pair<std::wstring, double>> results;

for (const auto& benchmark : benchmarks) {
    std::vector<FileInfo> files;
    files.reserve(2000); // Reserve space outside the timing measurement

    auto start = std::chrono::high_resolution_clock::now();
    benchmark.second(files);
    auto end = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double> elapsed = end - start;
    results.emplace_back(benchmark.first, elapsed.count());
}

PrintResultsTable(results);

Performance Results

To measure the performance of each file attribute retrieval method, I executed benchmarks on a directory containing 1000, 2000 or 5000 random text files. The tests were performed on a laptop equipped with an Intel i7 4720HQ CPU and an SSD. I measured the time taken by each method and compared the results to determine the fastest approach.

Each test run consisted of two executions: the first with uncached file attributes and the second likely benefiting from system-level caching.

The speedup factor is the time of the slowest technique in a given run divided by the time of the current technique, so the slowest technique always has a factor of 1.
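In code, the speedup column can be derived from the collected timings as shown in the sketch below. ComputeSpeedups is a hypothetical helper written for this example; the benchmark's own PrintResultsTable is not shown in this article and may be implemented differently.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using TimedResults = std::vector<std::pair<std::wstring, double>>;

// Hypothetical helper: for every (name, seconds) pair returns (name, speedup),
// where speedup is the slowest time in the run divided by the method's time.
TimedResults ComputeSpeedups(const TimedResults& results) {
    TimedResults speedups;
    if (results.empty()) {
        return speedups;
    }
    const double slowest = std::max_element(
        results.begin(), results.end(),
        [](const auto& a, const auto& b) { return a.second < b.second; })->second;

    speedups.reserve(results.size());
    for (const auto& [name, seconds] : results) {
        // Guard against a zero reading from a timer that is too coarse.
        speedups.emplace_back(name, seconds > 0.0 ? slowest / seconds : 0.0);
    }
    return speedups;
}
```

For example, with GetFileAttributesEx at 0.2415497 s and FindFirstFileEx (Basic) at 0.0014831 s, the factor for the latter is 0.2415497 / 0.0014831, or roughly 162.87.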

Directory with 1000 files:

Method                         Time (seconds)       Speedup Factor
FindFirstFileEx (Basic)        0.0014831000         162.868
FindFirstFileEx (Standard)     0.0014817000         163.022
FindFirstFileEx (Large Fetch)  0.0011792000         204.842
GetFileAttributesEx            0.2415497000         1.000
std::filesystem                0.0609313000         3.964
GetFileInformationByHandleEx   0.0044168000         54.689

// second run:
Method                         Time (seconds)       Speedup Factor
FindFirstFileEx (Basic)        0.0013805000         44.947
FindFirstFileEx (Standard)     0.0011310000         54.863
FindFirstFileEx (Large Fetch)  0.0009071000         68.404
GetFileAttributesEx            0.0616772000         1.006
std::filesystem                0.0620496000         1.000
GetFileInformationByHandleEx   0.0025246000         24.578

Directory with 2000 files:

Method                         Time (seconds)       Speedup Factor
FindFirstFileEx (Basic)        0.0014455000         150.287
FindFirstFileEx (Standard)     0.0015029000         144.547
FindFirstFileEx (Large Fetch)  0.0012086000         179.745
GetFileAttributesEx            0.2172402000         1.000
std::filesystem                0.0609186000         3.566
GetFileInformationByHandleEx   0.0025069000         86.657

// second run:
Method                         Time (seconds)       Speedup Factor
FindFirstFileEx (Basic)        0.0012020000         50.908
FindFirstFileEx (Standard)     0.0011614000         52.688
FindFirstFileEx (Large Fetch)  0.0008887000         68.856
GetFileAttributesEx            0.0611920000         1.000
std::filesystem                0.0611760000         1.000
GetFileInformationByHandleEx   0.0025835000         23.686

Directory with 5000 random, small text files:

Method                         Time (seconds)       Speedup Factor
FindFirstFileEx (Basic)        0.0077623000         84.975
FindFirstFileEx (Standard)     0.0828258000         7.964
FindFirstFileEx (Large Fetch)  0.0144611000         45.612
GetFileAttributesEx            0.6595977000         1.000
std::filesystem                0.3022779000         2.182
GetFileInformationByHandleEx   0.0051569000         127.906

// second run:
Method                         Time (seconds)       Speedup Factor
FindFirstFileEx (Basic)        0.0069814000         43.844
FindFirstFileEx (Standard)     0.0148472000         20.616
FindFirstFileEx (Large Fetch)  0.0140663000         21.761
GetFileAttributesEx            0.3060932000         1.000
std::filesystem                0.3011346000         1.016
GetFileInformationByHandleEx   0.0051614000         59.304

In the uncached runs, FindFirstFileEx with the “Large Fetch” flag was the fastest method for the 1000- and 2000-file directories, with speedups of about 205x and 180x over GetFileAttributesEx, while GetFileInformationByHandleEx was the fastest for 5000 files at about 128x. In the cached runs, FindFirstFileEx (Basic and Standard) achieved speedups of roughly 45x to 55x in the two smaller directories. The “Large Fetch” flag improved performance there as well, although in the 5000-file directory it was slower than the Basic info level.

For the directory with 2000 files, FindFirstFileEx (Large Fetch) demonstrated a speedup factor of over 179x in the first run, which went down to about 68x in the second run. In the directory with 5000 files, GetFileInformationByHandleEx takes the crown and achieves a 59x speedup in the second run, while the other techniques reach about 44x at most. Notably, in the cached runs std::filesystem performed on par with GetFileAttributesEx.

Further Techniques

Getting file attributes is only part of the story, and while important, they may contribute to only a small portion of the overall performance for the whole project. The Visual Assist team, who contributed to this article, improved their initial parse time performance by avoiding GetFileAttributes[Ex] using the same techniques as this article. But Visual Assist also improved performance through further techniques. My simple benchmark showed 50x speedups, but we cannot directly compare it with the final Visual Assist, as the tool does many more things with files.

The main item being optimised was the initial parse, where VA builds a symbol database when a project is opened for the first time. This involves parsing all code and all headers. They decided that it’s a reasonable assumption that headers won’t change while a project is being loaded, and so the file access is cached during the initial parse, avoiding the filesystem entirely. (Changes after a project has been parsed the first time are, of course, still caught.) The combination of switching to a much faster method for checking filetimes and then avoiding file IO completely contributed to the up-to-15-times-faster performance improvement they saw in version 2024.1 at the beginning of this year.

Read further details on their blog Visual Assist 2024.1 release post – January 2024 and Catching up with VA: Our most recent performance updates – Tomato Soup.

Summary

In this article, we went through a benchmark that compares several techniques for fetching file attributes. In short, it’s best to gather attributes at the same time as you iterate through the directory, using FindFirstFileEx or GetFileInformationByHandleEx. If you need to do this operation hundreds of times, it’s best to measure and choose the fastest technique for your case. What’s more, if you expect to have lots of files in a directory, it’s good to check techniques offering larger buffers.

The benchmark also showed one feature: while C++17 and its filesystem library offer a robust and standardized way to work with files and directories, it can be limited in terms of performance. In many cases, if you need super optimal performance, you need to open the hood and work with the specific operating system API.

Back to you

Share your comments below. And if you’re using C++, you can also download and try Visual Assist yourself for 30 days for free.
