How can I preallocate disk space for a file without it being reported as readable?

Date: July 14, 2016 / year-entry #147
Tags: code
Orig Link: https://blogs.msdn.microsoft.com/oldnewthing/20160714-00/?p=93875
Comments: 18
Summary: Set the file allocation information.

A customer wanted to create a file with the following properties:

  • The file has a known maximum size. (The file is a log file, and when the log file gets full, the program closes the log file and creates a new one.)

  • Disk space for the log file should be preallocated up to the maximum size.

  • Aside from the fact that disk space for the maximum size has been preallocated, the file should behave like a normal file: Code that reads the log file should be able to read up to the last written byte, but if it tries to read past the last written byte, it should get "end of file reached".

The last requirement exists because there are third party tools that read the log files, and those tools are just going to use traditional file I/O to access the log file.

The customer suggested an analogy: "If we were operating on std::vector, then what I'm looking for is vector.reserve() to expand the vector's capacity, and vector.push_back() to append entries. Code that iterates over the vector or reads the vector.size() sees only the vector elements that have been pushed onto the vector."
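
The analogy can be sketched in a few lines of standard C++. This is only an illustration of the customer's mental model, not anything from the file system: LogModel and its member names are hypothetical, invented here to show that reserving capacity does not change what a reader sees.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the log file: capacity is reserved up front,
// but readers only ever see the entries that were actually appended.
struct LogModel {
    std::vector<std::string> entries;

    explicit LogModel(std::size_t max_entries) {
        entries.reserve(max_entries);   // "preallocate": capacity grows, size stays 0
    }

    void append(const std::string& line) {
        entries.push_back(line);        // "write": size grows by one
    }

    std::size_t visible() const { return entries.size(); }      // what readers see
    std::size_t reserved() const { return entries.capacity(); } // what was set aside
};
```

After constructing LogModel log(1000) and appending two lines, log.visible() is 2 while log.reserved() is still at least 1000. That gap between size and capacity is the behavior the customer wants from the file.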

The file system team responded with this solution:

Use the SetFileInformationByHandle function, passing the information class FileAllocationInfo and a FILE_ALLOCATION_INFO structure. "Note that this will decrease fragmentation, but because each write is still updating the file size there will still be synchronization and metadata overhead caused at each append."

The effect of setting the file allocation info lasts only as long as you keep the file handle open. When you close the file handle, all the preallocated space that you didn't use will be freed.

Here's a Little Program. Remember, Little Programs do little to no error checking.

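#include <windows.h>

int __cdecl main(int argc, char** argv)
{
  // Create (or truncate) the log file. CreateFileW is called explicitly
  // because the file name is a wide string literal.
  auto h = CreateFileW(L"test.txt", GENERIC_ALL,
    FILE_SHARE_READ, nullptr, CREATE_ALWAYS,
    FILE_ATTRIBUTE_NORMAL, nullptr);

  // Preallocate disk space without changing the reported file size.
  FILE_ALLOCATION_INFO info;
  info.AllocationSize.QuadPart =
    1024LL * 1024LL * 1024LL * 100; // 100GB
  SetFileInformationByHandle(h, FileAllocationInfo,
    &info, sizeof(info));

  // Append one line every five seconds so there is time to inspect
  // the file from another window while the program runs.
  for (int i = 0; i < 10; i++) {
    DWORD written;
    WriteFile(h, "hello\r\n", 7, &written, nullptr);
    Sleep(5000);
  }

  // Closing the handle releases the unused preallocated space.
  CloseHandle(h);
  return 0;
}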

This program creates a file and preallocates 100GB of disk space for it. It then writes to the file very slowly. While the program is running, you can do a type test.txt to read the contents of the file, and it will print only the contents that were written. Watch the free disk space on the drive, and you'll see that it drops by 100GB while the program is running, and then most of the disk space comes back when the program exits.
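
For the reader's side of the story, here is a minimal sketch of the kind of traditional file I/O a third-party tool might use. The helper name read_all is hypothetical and is not part of the program above; it uses only standard C++ streams. Since end-of-file is governed by the file size rather than the allocation size, a reader written this way should stop at the last written byte, just as type does.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical reader: returns everything from the start of the file
// up to end-of-file, reading in fixed-size chunks.
std::string read_all(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::string data;
    char buf[4096];
    // A full chunk leaves the stream in a good state; a short final chunk
    // sets eofbit/failbit but still reports its length through gcount().
    while (in.read(buf, sizeof(buf)) || in.gcount() > 0) {
        data.append(buf, static_cast<std::size_t>(in.gcount()));
    }
    return data;
}
```

A file that cannot be opened simply yields an empty string here, in keeping with the Little Program spirit of minimal error checking.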

The preallocated disk space is also released when you call SetEndOfFile.

There's a special gotcha about setting the file allocation info: If you set the file allocation info to a nonzero value, then the file contents will be forced into nonresident data, even if it would have fit inside the MFT.


Comments (18)
  1. Matteo Italia says:

    Given that a file is more like an std::deque than an std::vector (data is allocated in biggish chunks and is never copied around) it’s not really clear what kind of performance advantage they are after by preallocating everything; after all, deque itself doesn’t have a reserve method because it’s mostly useless. Even the additional locality shouldn’t matter much, given that a log is normally append only (and is read sequentially). Maybe the customer had some mistaken idea about the inner workings of the file system?

    1. Brian says:

      I see two reasons they may want to do this:
      1) you get fail on open (well, failure around the time the Open happens) rather than fail on write
      2) you get to reserve your space up front, making sure that no other user of the volume can take the space you need (again working to prevent fail on write). It’s like sending someone into the movie theatre early to reserve 8 seats before anyone else arrives.

      1. DWalker says:

        But… but.. let’s say the disk only had 50GB of space left. Which is better: To write 50GB of log, and then fail, or to fail when trying to create a 100 GB log file? In the second choice, nothing gets logged and the program might not even start.

        1. Brian says:

          In that case, in a program bigger than a “little program”, you have a second failure path that creates a smaller log file. In that log file, you write “Couldn’t create Log file – Quitting” (or, you steal as much space as makes sense and pre-allocate a smaller log). The idea is to reduce the likelihood of a fail-on-write-to-the-log as much as you can.

          1. Kevin says:

            Logging is a particularly good example of this pattern because it is a cross-cutting concern. A well-written application is likely to perform logging calls at many different layers of abstraction and in many different contexts. It is not practical to correctly handle a logging failure at every one of these call sites, so most sane logging frameworks just swallow logging errors silently (with perhaps a message to stderr, if you’re lucky). In this regard it is much like how many garbage-collected languages handle throwing finalizers: you can’t clean up from a failed cleanup, nor is the application in a good position to decide what to do about it, so just ignore it and destroy the object anyway.

  2. What are those “LL” in front of 1024? What do they do?

    1. TimothyB says:

      It’s a C++ number suffix to say that the constant is a long long.

    2. Matt Denham says:

      It indicates that they’re of type “long long”, which is important here mostly to ensure that the multiplication ends up with the correct type (if it stayed in a 32-bit type, it’d end up as 0 instead since 100GB = 0 modulo 2^32).
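
The arithmetic in this comment can be checked at compile time. The sketch below is an editorial illustration, and the constant names are made up for it. It deliberately avoids multiplying plain int literals, because signed overflow is undefined behavior in C++; instead it computes the 64-bit value and then looks at the low 32 bits, which is what a 32-bit result would have to hold.

```cpp
#include <cstdint>

// 64-bit arithmetic, as in the article's program: 100 * 2^30 bytes.
constexpr std::int64_t kHundredGiB = 1024LL * 1024LL * 1024LL * 100;

// The low 32 bits of that value. Conversion to an unsigned type is
// well defined and reduces the value modulo 2^32.
constexpr std::uint32_t kLow32 = static_cast<std::uint32_t>(kHundredGiB);

static_assert(kHundredGiB == 107374182400LL, "100 GiB in bytes");
static_assert(kHundredGiB == 25LL * 4294967296LL, "exactly 25 * 2^32");
static_assert(kLow32 == 0, "nothing survives in the low 32 bits");
```

Because 100 * 2^30 equals 25 * 2^32, every bit of the value lives above bit 31, which is why the suffix matters here.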

    3. Steve says:

      LL indicates that the integer literal should be treated as type “long long”.

      http://en.cppreference.com/w/cpp/language/integer_literal

    4. Brian says:

      From https://msdn.microsoft.com/en-us/library/c70dax92.aspx
      To specify an unsigned type, use either the u or U suffix. To specify a long type, use either the l or L suffix. To specify a 64-bit integral type, use the LL, or ll suffix.

      1. Thanks for the collective answers. :)

        Why can’t the compiler decide that the way .NET Framework interpreter and Delphi compiler do? Is this some sort of power-developer feature?

        1. Brian says:

          Well, things like C++, C# and Delphi are different languages.
          auto myint = 0;
          auto mylong = 0L;
          auto myreallylong = 0LL;
          Create two 32-bit numbers (one an int and the other a long) and a 64-bit “long long” in MS C++ (remember, C++ does not specify the bit length of its types). In C#:
          var myint = 0;
          var mylong = 0L;
          specify 32 and 64-bit integers (in the .NET world, the bit-length of integral types is part of the standard).

          1. In a way, I was asking why “C++ does not specify the bit length of its types”? But I guess you implied the answer: The same reason that the Wright brothers’ plane didn’t have a jet engine. So, thanks.

        2. Wear says:

          .NET actually has the same issue.

          long l = 1024 * 1024 * 1024 * 100; -> “Error CS0220: The operation overflows at compile time in checked mode”
          Dim l As Long = 1024 * 1024 * 1024 * 100 -> “Error BC30439: Constant expression not representable in type 'Integer'”

          The compiler treats the literals as int32s and performs int32*int32 multiplication on them, which overflows. If you add the L suffix, everything works because now the literals are all int64s and you are performing int64*int64 multiplication.

          long l = 1024L * 1024L * 1024L * 100L;
          Dim l As Long = 1024L * 1024L * 1024L * 100L

          1. Yes. Interesting how I never run into this on .NET: I never had to manually allocate a very large number to my variables during my career.

          2. cheong00 says:

            Actually, you might need the suffix to declare constants for use in Interop too.

            Taking example for a recent support case in MSDN forum:
            Public Enum ACCESS_MASK As UInteger
            '…
            GENERIC_READ = &H80000000UI
            '…
            End Enum

            Try taking away “UI” at the end and see whether it compiles.

  3. Scarlet Manuka says:

    Reserve disk space with this one weird trick!

    (Sorry, couldn’t help myself. Looks like quite a neat solution actually.)

    I agree with Brian: it’s not a bad thing to make sure you have space to write to your log file. If something goes wrong and the disk fills up, at least you can write ‘couldn’t generate output, disk full’ to your log file. Yes, you’d probably find out from disk usage monitoring, but having it in the log can save time troubleshooting, particularly if it’s only a brief condition – for instance if your app cleans up a large output file after a failed write.

  4. Erik F says:

    This seems similar to fallocate() on Linux, which I have used a couple of times to achieve the same sort of result. File transfer programs don’t seem to use it but I think that this method would be handy when you are copying a large file because you can guarantee beforehand that the copy won’t fail due to lack of space on the destination. I’m sure there’s a good reason but off the top of my head I can’t think of what that might be.

    The documentation for SetFileInformationByHandle() seems to imply that not all file systems support all features: is there any documented guidance regarding what common file systems support which information class?


*DISCLAIMER: I DO NOT OWN THIS CONTENT. If you are the owner and would like it removed, please contact me. The content herein is an archived reproduction of entries from Raymond Chen's "Old New Thing" Blog (most recent link is here). It may have slight formatting modifications for consistency and to improve readability.