One might think that computing the size of a directory would be a simple matter of adding up the sizes of all the files in it.
Oh if it were only that simple.
There are many things that make computing the size of a directory difficult, some of which even call into question the very existence of the concept "size of a directory".
- Reparse points
- We mentioned this last time. Do you want to recurse into reparse points when you are computing the size of a directory? It depends why you're computing the directory size. If you're computing the size in order to show the user how much disk space they will gain by deleting the directory, then you do or don't, depending on how you're going to delete the reparse point.
If you're computing the size in preparation for copying, then you probably do. Or maybe you don't - should the copy merely copy the reparse point instead of tunneling through it? What do you do if the user doesn't have permission to create reparse points? Or if the destination doesn't support reparse points? Or if the user is creating a copy because they are making a back-up?
- Hard links
- Hard links are multiple directory entries for the same file. If you're calculating the size of a directory and you find a hard link, do you count the file at its full size? Or do you say that each directory entry for a hard link carries a fraction of the "weight" of the file? (So if a file has two hard links, then each entry counts for half the file size.)
Dividing the "weight" of the file among its hard links avoids double-counting (or higher), so that when all the hard links are found, the file's total size is correctly accounted for. And it represents the concept that all the hard links to a file "share the cost" of the resources the file consumes. But what if you don't find all the hard links? Is it correct that the file was undercounted?
If you're copying a file and you discover that it has multiple hard links, what do you do? Do you break the links in the copy? Do you attempt to reconstruct them? What if the destination doesn't support hard links?
- Compressed files
- By this I'm talking about filesystem compression rather than external compression algorithms like ZIP.
When adding up the size of the files in a directory, do you add up the logical size or the physical size? If you're computing the size in preparation for copying, then you probably want the logical size, but if you're computing to see how much disk space would be freed up by deleting it, then you probably want physical size.
But if you're computing for copying and the copy destination supports compression, do you want to use the physical size after all? Now you're assuming that the source and destination compression algorithms are comparable.
- Sparse files
- Sparse files have the same problems as compressed files. Do you want to add up the logical or physical size?
- Cluster rounding
- Even for uncompressed non-sparse files, you may want to take into account the size of the disk blocks. A directory with a lot of small files takes up more space on disk than just the sum of the file sizes. Do you want to reflect this in your computations? If you traversed across a reparse point, the cluster size may have changed as well.
- Alternate data streams
- Alternate data streams are another place where a file can occupy disk space that is not reflected in its putative "size".
- Bookkeeping overhead
- There is always bookkeeping overhead associated with file storage. In addition to the directory entry (or entries), space also needs to be allocated for the security information, as well as the information that keeps track of where the file's contents can be found. For a highly-fragmented file, this information can be rather extensive. Do you want to count that towards the size of the directory? If so, how?
There is no single answer to all of the above questions. You have to consider each one, apply it to your situation, and decide which way you want to go. (A sketch of what one set of answers might look like in code follows below.)
(And copying a directory tree is even scarier. What do you do with the ACLs? Do you copy them too? Do you preserve the creation date? It all depends on why you're copying the tree.)
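To make these trade-offs concrete, here is a minimal sketch of one possible traversal, with one defensible answer to each question baked in: skip reparse points, use the physical size via GetCompressedFileSize (so compression and sparseness are accounted for), round each file up to a whole number of clusters, and split the cost of a file evenly among its hard links. This is not Explorer's algorithm; the clusterSize parameter is assumed to be obtained separately (one way is mentioned in the comments below), and alternate data streams and bookkeeping overhead are simply ignored.

#include <windows.h>
#include <string>

// One possible set of answers, not the only defensible one:
//  - do not recurse into (or tunnel through) reparse points
//  - use the physical (compressed/sparse) size, not the logical size
//  - round each file up to a whole number of clusters
//  - charge each hard link 1/nNumberOfLinks of the file's size
ULONGLONG DirectorySize(const std::wstring& dir, DWORD clusterSize)
{
    ULONGLONG total = 0;
    WIN32_FIND_DATAW fd;
    HANDLE hFind = FindFirstFileW((dir + L"\\*").c_str(), &fd);
    if (hFind == INVALID_HANDLE_VALUE) return 0;

    do {
        std::wstring name = fd.cFileName;
        if (name == L"." || name == L"..") continue;

        // Decision 1: treat reparse points as opaque; don't recurse into them.
        if (fd.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) continue;

        std::wstring path = dir + L"\\" + name;
        if (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) {
            total += DirectorySize(path, clusterSize);
            continue;
        }

        // Decision 2: physical size, which reflects compression and sparseness.
        ULARGE_INTEGER size;
        size.LowPart = GetCompressedFileSizeW(path.c_str(), &size.HighPart);
        if (size.LowPart == INVALID_FILE_SIZE && GetLastError() != NO_ERROR) {
            size.LowPart = fd.nFileSizeLow;      // fall back to the logical size
            size.HighPart = fd.nFileSizeHigh;
        }

        // Decision 3: round up to a whole number of clusters.
        if (clusterSize) {
            size.QuadPart = (size.QuadPart + clusterSize - 1) / clusterSize * clusterSize;
        }

        // Decision 4: share the cost of the file among its hard links.
        DWORD links = 1;
        HANDLE hFile = CreateFileW(path.c_str(), 0,
            FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
            nullptr, OPEN_EXISTING, 0, nullptr);
        if (hFile != INVALID_HANDLE_VALUE) {
            BY_HANDLE_FILE_INFORMATION info;
            if (GetFileInformationByHandle(hFile, &info) && info.nNumberOfLinks != 0)
                links = info.nNumberOfLinks;
            CloseHandle(hFile);
        }
        total += size.QuadPart / links;
    } while (FindNextFileW(hFind, &fd));

    FindClose(hFind);
    return total;
}

Flip any one of those four decisions and you get a different, equally defensible number, which is really the point.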
The same fun comes when making quota decisions! Was that literal disk block bytes or user-data bytes, and which user-data bytes would that refer to, the primary data stream only or the alternate and reparse streams as well?
Regarding sparse files, is there any compelling reason to use them instead of just using a standard file structure? I'm assuming that the programmer of the application wants to allocate a larger block of space to allow for file expansion, without creating file fragments all over the place.
I took a look at a MS link:
http://www.microsoft.com/resources/documentation/Windows/XP/all/reskit/en-us/Default.asp?url=/resources/documentation/Windows/XP/all/reskit/en-us/prkc_fil_aixf.asp
It appears that sparse files are controlled at the application level, not by users, correct?
The NTFS change journal is one use of a sparse file. Sparse files are also handy when you know that most of a file isn’t going to be used but you don’t want to design an internal directory structure.
In XP, tooltips for folders give folder sizes (approximate in some cases) – what is the algorithm behind that?
What he’s doing there might give you nightmares.
Mat Hal: I don't know why the Size column rounds up instead of rounding to the nearest. It seems to go out of its way to do that…
Sriram: I don’t know what the algorithm is; why don’t you experiment and report back.
All these features and issues have existed for decades on Unix, where most software gets it right. Of course it is way easier to give the "-mount" flag to find than to convince Explorer to only copy files from the same filesystem (i.e., ignore junctions etc.). And multiple streams/extended attributes make life even more fun.
However it is always worthwhile double checking the Unix Haters Handbook (available free online and in book form) where chapter 13 details many of the issues with Unix filesystems. Some of the issues are now fixed.
Chapter 14 covers the wart that is NFS. I once wrote an SMB/CIFS server and consequently know that Microsoft also did a hairy job in their networked filesystem.
I don't think it's possible to get it "right", just to have a generally accepted version that isn't correct under all circumstances. Of course, if you only care about being 'good enough' under most circumstances, then Windows already has easy solutions built in.
The question is whether junction points are like hardlinks or like symlinks in Unix parlance.
A hardlink is automatically followed, and programs even have to use extra functions to determine if a file is actually a hardlink.
However, symlinks are NEVER automatically followed by any program (unless you explicitly tell it to do so).
You cannot create a hardlink to a directory, so that problem has a simple solution.
If junction points are more like hardlinks, so that they are followed automatically unless every program does extra work to find out whether a directory is linked, well then this might be… harmful. :)
But I think this concept is more like a bind-mount, so it's more like a mount point than a link.
Recursion through mount points needs to be explicitly forbidden (-xdev) in Unix too. (so the default is to follow)
But regular users are not trusted to mount things (~create junction points) and admins (presumably) don’t create infinite recursions and don’t bindmount directories in the wrong places. :)
Sickboy: In UNIX, every regular file is a hard link. Directory entries (filenames) are merely (hard) links to the inode, which is the actual file. Files usually only have one link, hence the confusion.
You can only tell if a file has more than one link (directory entry). When you have two or more links to the same file, they are all equal. You can’t say that one of them is the real file and the others are hard links.
Similar to the cluster-rounding, but sort of the inverse: If you have small enough files they’re probably MFT-resident and don’t take up any space other than their MFT entry.
By "bookkeeping overhead" you meant attributes, right? That gets more complicated as there are resident attributes, some that live in extents, and others that can live in other MFT entries.
There’s the added question of whether or not to count MFT entries at all. Are they in the MFT zone? If so those clusters are reserved for the MFT anyway so they don’t take up free disk space. Or do they? Depends on what you want to count.
Journaling could add even more questions about what to count.
All this talk of reparse points reminds me of a message I wrote in another forum in May 2003. (I assume nothing’s changed since then, but I don’t know for sure, and the message will explain why this is.)
*****
Junction points considered harmful?
I’ve been using junction points to relocate some folders to a more capacious partition, and it has worked very well. However, I was recently horrified to discover that unlike a hard link, deleting a junction point through Explorer deletes the target as well as the junction! Fortunately, I didn’t lose anything, but it pretty much turns a very useful feature into a very dangerous feature. It seems to me that if you use junctions at all, you can’t delete any directory without first considering whether it or some subdirectory many levels down contains a junction to something you want to keep. Not good.
I’ve been unable to solve this problem by using ACLs and would welcome any tips. Otherwise, I think I’m going to give up on junctions altogether. I think I understand now why Windows comes with no native tools for creating them.
*****
Note that your own programs can choose how to treat reparse points in SHFileOperation by using the FOF_NORECURSEREPARSE flag.
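A minimal sketch of how that flag might be wired up (the path is made up for illustration, error handling is omitted, and you need to link against shell32):

#include <windows.h>
#include <shellapi.h>

// Delete a hypothetical directory without tunneling into any reparse
// points inside it; they are treated as objects, not as containers.
int DeleteTreeWithoutTunneling()
{
    // SHFileOperation requires a double-null-terminated path list; the
    // explicit \0 plus the literal's own terminator provides that.
    wchar_t from[] = L"C:\\example\\scratch\0";

    SHFILEOPSTRUCTW op = {};
    op.wFunc  = FO_DELETE;
    op.pFrom  = from;
    op.fFlags = FOF_NOCONFIRMATION | FOF_NORECURSEREPARSE;
    return SHFileOperationW(&op);
}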
So when you right-click and pick Properties for a folder in Explorer, what does *it* do? You get the "size" and "size on disk" (which presumably covers the compression/cluster/overhead problem by giving you two "correct" answers), but how about all the other potential minefields?
(And on a sort of related note, what’s with the fact that the "size" column in the details view in Explorer is rounded up while the size reported in the status bar is rounded down? For example, a 225,718 byte file (220.4k) is reported as 221k in the size column and 220k in the status bar. It’s bugged me for years!)
Raymond – I haven’t tried out hardlinks yet – but my guess is that
(a) Explorer caches folder sizes whenever calculated. So for example, if I’ve used the ‘Properties’ dialog of a folder recently, it gives back the right answer instantly.
(b) For a folder containing other folders, it scans to see whether it has the children's folder sizes cached. If all of them are cached, it sums them up and puts it in the tooltip. If only a few are cached, it sums up the few and says "More than" followed by the size of the children it knows the size of.
This leads to some silly tooltips – e.g., a folder containing over 3 GB of sub-folders had a tooltip saying "More than 4.63 MB", which, though correct, wasn't very helpful.
I'm not sure what actions cause the folder sizes to be cached – but I'm sure there are more than the one I reported. For example, a folder whose contents I burned on to a CD recently using Nero had its size show up correctly in a tooltip.
A general rule of thumb seems to be that recently/frequently used folders report their sizes properly while lesser-used folders don't.
Have to play around with hard links(and the other cases you mention) and see what happens.
Re Junctions: Why shouldn't the user interfaces that allow creating junctions check whether the junction introduces an "infinite loop" in the file system, and simply reject such actions?
I think the worst thing is to see only the effects of some decision, without considering the goals. There are two goals, from which we’re still far away on NT platform:
– Eliminate the logical drive letters that came from DOS.
– Allow system and data files on separate disks. Junctions should allow this one, I guess (never tried)? Other usage scenarios do not have to be reasonably supported. Therefore, the "normal" programs should traverse right through junctions and not avoid them.
Re Sizes: I also think the only reasonable solution is really to have more than one size, just like "Size" and "Size on Disk". Clusters have practically always existed, so the "size on disk" was never the same as the logical size. Then came Stacker (remember that one?) and that made it even more obvious.
AC: Blocking it at the UI level doesn’t stop someone from creating an infinite loop at the API level. And if you block it at the API level you get into the weird situation where for example, "I can’t attach my 1394 drive – the system tells me that adding it would create an infinite loop between my system drive and the new drive. But I can’t see what the infinite loop is until I mount the drive!"
Sriram: Explorer calculates the directory size recursively, but after 3 seconds it gives up and says "at least xxx" where xxx is the amount it found so far. That explains why it doesn’t see your 3GB hidden off in subdirectories – it didn’t find them in time.
I’m reluctant to go into detail because people might start relying on this behavior, which is clearly open to change at any time.
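For what it's worth, the general shape of such a "give up after a deadline and report a lower bound" traversal might look something like the sketch below. This is emphatically not Explorer's code (per the previous comment); the function, the use of GetTickCount64, and the example path are all just illustrative, and it ignores every subtlety from the article above.

#include <windows.h>
#include <string>

// Sketch only: a traversal that gives up after a deadline. *complete is set
// to false if the deadline expired first, in which case the result is only
// a lower bound ("at least this much").
ULONGLONG SizeWithDeadline(const std::wstring& dir, ULONGLONG deadline, bool* complete)
{
    ULONGLONG total = 0;
    WIN32_FIND_DATAW fd;
    HANDLE h = FindFirstFileW((dir + L"\\*").c_str(), &fd);
    if (h == INVALID_HANDLE_VALUE) return 0;
    do {
        if (GetTickCount64() > deadline) { *complete = false; break; }
        std::wstring name = fd.cFileName;
        if (name == L"." || name == L"..") continue;
        if (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
            total += SizeWithDeadline(dir + L"\\" + name, deadline, complete);
        else
            total += (ULONGLONG(fd.nFileSizeHigh) << 32) | fd.nFileSizeLow;
    } while (FindNextFileW(h, &fd));
    FindClose(h);
    return total;
}

// Usage: bool complete = true;
//        ULONGLONG size = SizeWithDeadline(L"C:\\example", GetTickCount64() + 3000, &complete);
//        // if (!complete), display "at least <size>"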
Joe: You can pass FSCTL_GET_NTFS_VOLUME_DATA into DeviceIoControl to get cluster size (and lots of other information).
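A rough sketch of that call (opening the volume handle like this typically requires administrative rights, and the drive letter is just an example):

#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

// Ask NTFS for the cluster size (and much more) via FSCTL_GET_NTFS_VOLUME_DATA.
int main()
{
    HANDLE hVolume = CreateFileW(L"\\\\.\\C:", GENERIC_READ,
        FILE_SHARE_READ | FILE_SHARE_WRITE, nullptr,
        OPEN_EXISTING, 0, nullptr);
    if (hVolume == INVALID_HANDLE_VALUE) return 1;

    NTFS_VOLUME_DATA_BUFFER data;
    DWORD bytesReturned;
    if (DeviceIoControl(hVolume, FSCTL_GET_NTFS_VOLUME_DATA,
                        nullptr, 0, &data, sizeof(data), &bytesReturned, nullptr)) {
        printf("Bytes per cluster: %lu\n", data.BytesPerCluster);
    }
    CloseHandle(hVolume);
    return 0;
}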
This article stood out for me, since I’ve been working on a program to calculate folder sizes. Looks like I have a few bugs to fix now. :)
If people here are interested, check out my Explorer column shell extension at http://foldersize.sourceforge.net.
Sorry if this sounds like an advertisement… but with the topic here, this seems like a program that people might find useful!
The "cluster size" issue is even more problematic, given that there doesn’t seem to be any API call that will tell you what the cluster size *is*. GetDiskFreeSpace won’t report the correct cluster size on partitions larger than 2 GB, and GetDiskFreeSpaceEx doesn’t report the cluster size at all.
@Brio – That’s pretty neat, thanks. :)
AC: Logical drives actually have a really fantastic use: they're short and convenient. It's so much easier to see or type or browse to P: than /branch/mod1/[user]/data or \\fserv\accounting\receipts. I've seen linux shops set up aliases like /p for that purpose. An additional perk is that you can restrict which drives are visible in NT, which, coupled with drive names, makes life simpler for the majority of users who aren't really knowledgeable and don't need to see that stuff.
I once found a really cool "directory size" problem at a client. There was a Win2k server running out of space on C: drive. Free reported space was shrinking to under 10MB when I was called.
I selected all files on C: and found that the reported size was about 4GB, but the 10GB partition was full. So where was the 6GB going? I ultimately traced it to a bug in a custom program that would read the ACL, add a new entry, then *add the whole thing back*. Since this was called on many small files, thousands of times a day… a few GB of data files took up an extra 6GB in ACL data.
Explorer wouldn’t show the ACL data as part of the file size, but free disk space didn’t lie :)
Raymond: Of course, blocking creation of infinite loops at the API level must take the trade-offs into account, and the only reasonable solution for some occasions is that traversal functions implemented in the OS know how to recognize that they have exceeded some reasonable limit (e.g. N levels of directory traversal). The OS has a duty to keep itself stable and to allow the user to correct errors (e.g. not to try to traverse junctions infinitely if the user only wants to remove the problematic junction). Of course it can't protect the user from all possible errors, but each time some protection against a trivial error is introduced, it saves a lot of time for the whole world.
foxyshadis: Being able to give a shorter path is trivial, and you really don't need logical drives just for that. For me, the best consequence of having logical drives is that I can then play with more "current directories" at the same time (and at a single command line). But that is just a convenience, not something the system has to depend on.
Still, I would really prefer a system that doesn't depend on fixed directory locations. And in Windows, you can't move anything trivially, which is a pity. Wouldn't it be great if you could install any application just by dragging it into some folder, and "uninstall" it by simply deleting it? The Windows platform is still so far from real simplicity and elegance.
I always found Spacemonger to be a handy program when wanting to know the size of directories on my disk: http://www.werkema.com/software/spacemonger.html
Also JDiskReport is nice too: http://www.jgoodies.com/freeware/jdiskreport/
My favorite file manager for Windows, Total Commander (http://www.ghisler.com), has a "Directory Size" function. Select a directory, press the spacebar, and see the directory size. It does not follow links, but the information is accurate enough to decide how much one can free up by deleting the directory, or whether the directory will fit on a USB drive or floppy.
Why would one follow the link anyway; what is the practical sense in doing that? What is the size of the Documents and Settings\User\Start Menu directory? Even better, what is the size of the Documents and Settings\User\Favorites directory, where the links point to URLs?
Matthew Lock and Michael J.: Those other programs are why I had to write the Explorer column extension. Why start up another program to see folder sizes, when they should just be right there, in Explorer, all the time? There’s an empty hole in the Size column right where that data should be.
Actually, the column extension API makes it look like I might be able to add data to an existing column, but I couldn’t get that to work. Would any column extension gurus know if that’s possible?
The "alternate data stream" scenario is especially grim. All released versions of NTFS have supported multiple data streams, but Windows Server 2003 is the first version with a *documented* API to enumerate the stream names.
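If I'm not mistaken, the documented API in question is FindFirstStreamW/FindNextStreamW; a minimal sketch (the path in the usage comment is illustrative):

#include <windows.h>
#include <stdio.h>

// Enumerate a file's streams, including alternate data streams, with the
// FindFirstStreamW/FindNextStreamW API added in Windows Server 2003.
void ListStreams(const wchar_t* path)
{
    WIN32_FIND_STREAM_DATA fsd;
    HANDLE h = FindFirstStreamW(path, FindStreamInfoStandard, &fsd, 0);
    if (h == INVALID_HANDLE_VALUE) return;
    do {
        // The unnamed (default) stream shows up as "::$DATA"; alternate
        // streams show up as ":name:$DATA".
        wprintf(L"%ls  %lld bytes\n", fsd.cStreamName, fsd.StreamSize.QuadPart);
    } while (FindNextStreamW(h, &fsd));
    FindClose(h);
}

// e.g. ListStreams(L"C:\\example\\report.doc");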
It’s a very simple directory enumerator.