Computing over a high-latency network means you have to bulk up

Date: April 7, 2006 / year-entry #125
Tags: other
Orig Link: https://blogs.msdn.microsoft.com/oldnewthing/20060407-25/?p=31613
Comments: 65
Summary:One of the big complaints about Explorer we've received from corporations is how often it accesses the network. If the computer you're accessing is in the next room, then accessing it a large number of times isn't too much of a problem since you get the response back rather quickly. But if the computer you're...

One of the big complaints about Explorer we've received from corporations is how often it accesses the network. If the computer you're accessing is in the next room, then accessing it a large number of times isn't too much of a problem since you get the response back rather quickly. But if the computer you're talking to is halfway around the world, then even if you can communicate at the theoretical maximum possible speed (namely, the speed of light), it'll take 66 milliseconds for your request to reach the other computer and another 66 milliseconds for the reply to come back. In practice, the signal takes longer than that to make its round trip. A latency of a half second is not unusual for global networks. A latency of one to two seconds is typical for satellite networks.

Note that latency and bandwidth are independent metrics. Bandwidth is how fast you can shovel data, measured in data per unit time (e.g. bits per second); latency is how long it takes the data to reach its destination, measured in time (e.g. milliseconds). Even though these global networks have very high bandwidth, the high latency is what kills you.

(If you're a physicist, you're going to see the units "data per unit time" and "time" and instinctively want to multiply them together to see what the resulting "data" unit means. Bandwidth times latency is known as the "pipe". When doing data transfer, you want your transfer window to be the size of your pipe.)
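
(To put numbers to that, using figures I've picked purely for illustration: a link with 10 megabits per second of bandwidth and 500 milliseconds of latency has a pipe of

    10,000,000 bits/s × 0.5 s = 5,000,000 bits = 625,000 bytes

so a transfer window much smaller than about 625KB leaves the sender idle, waiting for acknowledgements instead of shoveling data.)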

High latency means that you should try to issue as few I/O requests as possible, although it's okay for each of those requests to be rather large if your bandwidth is also high. Significant work went into reducing the number of I/O requests issued by Explorer during common operations such as enumerating the contents of a folder.

Enumerating the contents of a folder in Explorer is more than just getting the file names. The file system shell folder needs other file metadata, such as the last-modification time and the file size, in order to build up its SHITEMID, which is the unit of item identification in the shell namespace. One of the other pieces of information that the shell needs is the file's index, a 64-bit value that is different for each file on a volume. Now, this information is not returned by the "slow" FindNextFile function. As a result, the shell would have to perform three round-trip operations per file to retrieve this extra information (sketched in code after the list):

  • CreateFile(),
  • GetFileInformationByHandle() (which returns the file index in the BY_HANDLE_FILE_INFORMATION structure), and finally
  • CloseHandle().
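
Just to make those three round trips concrete, here is a minimal sketch of the per-file slow path in Win32 C. (This is illustrative code of my own, not Explorer's actual implementation; error handling is pared to the bone.)

    #include <windows.h>

    // Sketch only. On a remote volume, each of the three calls below
    // can cost a full network round trip.
    BOOL GetFileIndex(LPCWSTR pszPath, ULONGLONG *pullIndex)
    {
        HANDLE h = CreateFileW(pszPath,
                               0,  // query metadata only, no read/write access
                               FILE_SHARE_READ | FILE_SHARE_WRITE |
                               FILE_SHARE_DELETE,
                               NULL, OPEN_EXISTING,
                               FILE_FLAG_BACKUP_SEMANTICS, NULL); // round trip 1
        if (h == INVALID_HANDLE_VALUE) return FALSE;

        BY_HANDLE_FILE_INFORMATION info;
        BOOL fOk = GetFileInformationByHandle(h, &info);          // round trip 2
        if (fOk) {
            *pullIndex = ((ULONGLONG)info.nFileIndexHigh << 32) |
                         info.nFileIndexLow;
        }
        CloseHandle(h);                                           // round trip 3
        return fOk;
    }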

If you assume a 500ms network latency, then these three additional operations add a second and a half for each file in the directory. If a directory has even just forty files, that's a whole minute spent just obtaining the file indices. (As we saw last time, the FindNextFile function does its own internal batching to avoid this problem when doing traditional file enumeration.)

And that's where this "fast mode" came from. The "fast mode" query is another type of bulk query to the server which returns all the normal FindNextFile information as well as the file indices. As a result, the file index information is piggybacked on top of the existing FindNextFile-like query. That's what makes it fast. In "fast mode", enumerating 200 files from a directory would take just a few seconds (two "bulk queries" that return the FindNextFile information and the file indices at one go, plus some overhead for establishing and closing the connection). In "slow mode", getting the normal FindNextFile information takes a few seconds, but getting the file indices would add another 1.5 seconds for each file, for an additional 1.5 × 200 = 300 seconds, or five minutes.

I think most people would agree that reducing the time it takes to obtain the SHITEMIDs for all the files in a directory from five minutes to a few seconds is a big improvement. That's why the shell is so anxious to use this new "fast mode" query.

If your program is going to be run by multinational corporations, you have to take high-latency networks into account. And this means bulking up.

Sidebar: Some people have accused me of intentionally being misleading with the characterization of this bug. Any misleading on my part was unintentional. I didn't have all the facts when I wrote up that first article, and even now I still don't have all the facts. For example, FindNextFile using bulk queries? I didn't learn that until Tuesday night when I was investigating an earlier comment—time I should have been spending planning Wednesday night's dinner, mind you. (Yes, I'm a slacker and don't plan my meals out a week at a time like organized people do.)

Note that the exercise is still valuable as a thought experiment. Suppose that FindNextFile didn't use bulk queries and that the problem really did manifest itself only after the 101st round-trip query. How would you fix it?

I should also point out that the bug in question is not my bug. I just saw it in the bug database and thought it would be an interesting springboard for discussion. By now, I'm kind of sick of it and will probably not bother checking back to see how things have settled out.


Comments (65)
  1. Brad Corbin says:

    Don’t leave us hanging like that, Raymond!

    You have to at least let us know how they ultimately decide to resolve this one. I think it’s been a great discussion-starter, and a good peek into the true difficulty of programming in the real world.

  2. "By now, I’m kind of sick of it and will probably not bother checking back to see how things have settled out."

    My friends think I’m crazy because I’m always happy to help people find and fix their bugs, but you nailed it on the head with this last comment. If it’s somebody else’s bug, I get to have the fun mind-bending experiment and if it gets too tough, I get to say "sorry, I don’t have any more time." I rarely actually cut out on them like that (or else they wouldn’t ask me for help), but it’s always nice to have an escape plan.

  3. Keith says:

    SHITEMID; it always makes me laugh.

  4. JS says:

    If merely failing to plan one’s meals a week at a time makes one a slacker, I’m afraid I must downgrade myself from "slacker" to something even more unsavory.

  5. Dave says:

    "But if the computer you’re talking to is halfway around the world, then even if you can communicate at the theoretical maximum possible speed (namely, the speed of light), it’ll take 66 milliseconds for your request to reach the other computer and another 66 milliseconds for the reply to come back."

    Management says we need to work on making this more performant. How can we reduce this serious bottleneck for our customers?

  6. Brad Corbin says:

    > Management says we need to work on making this more performant. How can we reduce this serious bottleneck for our customers?

    We could drill a cable directly through the center of the earth. That should reduce the cable distance, but might increase maintenance costs. I’ll start working on some competitive bids :)

  7. Grant says:

    Dave,

    Running fiber through the center of the earth should increase performance by about 3.14 times.

  8. Centaur says:

    The documentation on BY_HANDLE_FILE_INFORMATION says:

    > Note that this value is useful only while the file is open by at least one process. If no processes have it open, the index may change the next time the file is opened.

    > The identifier (low and high parts) and the volume serial number uniquely identify a file on a single computer. To determine whether two open handles represent the same file, combine this identifier and the volume serial number for each file and compare them.

    So the file index is something that allows us to tell one open file from other open files, by comparing the index of the file in question with the indices of other files we have opened. And, once we close the file, the index may lose its meaning, so Explorer couldn’t possibly be storing them anywhere. So the only meaningful thing Explorer can do with each file index is to compare it with the indices of other open files it has and determine whether they are the same file or not.

    This prompts two questions.

    First, if Explorer has no remote files open, the comparison will always yield false. What legitimate reason is there for Explorer to open any remote files?

    Second, what use is the knowledge that this particular file is the one it has open? What would it do when/if it found that out?

    I think Explorer has no business keeping any remote files open. And even if it does, I do not see why it would need to treat open files specially. So I would vote for removal of this unnecessary file opening/querying/closing.
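
    For concreteness, the comparison the documentation describes boils down to something like this (a sketch only; nobody is claiming this is what Explorer actually does):

        // Documented "same file?" test: compare the volume serial number
        // plus the 64-bit file index of two open handles.
        BOOL IsSameFile(HANDLE h1, HANDLE h2)
        {
            BY_HANDLE_FILE_INFORMATION a, b;
            if (!GetFileInformationByHandle(h1, &a) ||
                !GetFileInformationByHandle(h2, &b)) return FALSE;
            return a.dwVolumeSerialNumber == b.dwVolumeSerialNumber &&
                   a.nFileIndexHigh      == b.nFileIndexHigh &&
                   a.nFileIndexLow       == b.nFileIndexLow;
        }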

  9. Grant,

    Management says that they need something more specific than "about 3.14 times". Please give the exact number.

  10. ChrisR says:

    Hmm, it seems to me that it would be pi/2 times.  The distance travelled halfway around the globe is pi, and the distance straight through the globe is 2.

    Or maybe I need more coffee…who knows.

  11. Billy says:

    So as someone who telecommutes most of the time and is forced to use Explorer (and SourceSafe) over the VPN, how do I enable ‘fast mode’ in XP?

  12. /df says:

    Centaur

    "…What legitimate reason is there for Explorer to open any remote files?"

    Reading the file’s icon.

    Reading the MP3 ID tag.

    Reading the start of the file to determine its type à la Unix (does Explorer still do this?)

    Etc.

  13. Gabe says:

    Alas, I find myself longing for the days of File Manager, when viewing a directory listing on a remote server was no slower than just doing a DIR. I would be satisfied if Explorer’s "Fast Mode" would simply not open any files (no icons, no previews, no properties that aren’t in the directory).

  14. An unimportant remark that might be interesting to one person, at least: you don’t actually need to open the file to retrieve its icon. I once wrote an archive viewer that used standard shell API to get the icon for files in the archive, even when the files don’t exist in the file system. This simply uses the file’s extension, and so will not call icon extension (different ‘extension’) handlers (which would probably open the file).

  15. David Conrad says:

    I think you probably should have called it SH_ITEM_ID instead.

  16. DmitryKo says:

    "I’m kind of sick of it and will probably not bother checking back to see how things have settled out."

    Raymond, people will lose sleep if you won’t tell us the outcome… I certainly will :)

  17. Wang-Lo says:

    "…Tuesday night …time I should have been spending planning Wednesday night’s dinner, mind you. (Yes, I’m a slacker and don’t plan my meals out a week at a time like organized people do.)"

    Men often compensate for a lack of domesticity with meticulous grocery lists and weekly meal planning.  Women can run a household almost instinctively and rarely need to think more than one or two meals ahead.  So I guess you eat like a girl too.

    -Wang-Lo.

  18. steveg says:

    "But if the computer you’re talking to is halfway around the world, then even if you can communicate at the theoretical maximum possible speed (namely, the speed of light), it’ll take 66 milliseconds for your request to reach the other computer and another 66 milliseconds for the reply to come back."

    We’re putting a satellite into VLOE (very low earth orbit) to reduce network latency between Australia and the US. We are simply going to dig a tunnel (or use the existing Verne tunnel in Iceland, depending on the cost of shipping the satellite there) to the centre of the earth, and place the satellite there.

    The brilliance of this plan is we only need to use enuf power to send the signal halfway through the planet, the satellite relays the signal through the other half (free thermal power, you see). We are having heat dissipation issues with the P4 chips we’re using, though…

  19. /df says:

    Justin Olbrantz

    "…you don’t actually need to open the file to retrieve its icon…"

    It can be debated whether it was a good design decision, but Explorer wants to display the exact shell icon, which is not always a generic icon for the file type: for instance, the Word icon in winword.exe, or the file itself for *.ico. Otherwise, it’s confusing for users to see a file listing with different (generic) icons. Potentially file-specific icons could be overridden for remote filesystems, but, as in newer versions of Windows, it’s better to hive off the icon updating to a background thread. You can wait a long time to display a remote folder in Win98 if its contents are executables or icons.

  20. jim says:

    People plan meals?

  21. Gabe says:

    I should point out that typical files that Explorer opens to get the icon are EXE and ICO files. However, in theory any iconhandler can access the file to get its icon. For example, Photoshop could install an iconhandler that opens up PSD files to use their preview as an icon.

    It would be nice if there was a "fast mode" in Explorer where it just acted like Fileman (i.e. ignoring icon handlers), so all operations would be fast.

    Incidentally, it would be appropriate to mention that the reason NSS (native structured storage) never shipped is that it was horrible for use over the network. If a document had 10 streams in it, reading the file would take 10 round trips to open them, 10 round trips to read them, and 10 round trips to close them. When they’re all stored in one file like they are now, you would only have 1 round trip to open, 10 to read them, and 1 to close. This means NSS would only be useful on local files, but most of the users with complex docfiles (the ones that would benefit most from NSS) keep most of their documents on file servers. It was a shame because the technology was so cool and lots of people spent lots of time on it, but it just didn’t provide any real benefit.

  22. Chris Moorhouse says:

    "By now, I’m kind of sick of it…"

    And the lousier suggestions, of course, although I note that your dribble-the-facts-out strategy has managed to filter out pretty much all the finger-pointing by this post. :D

    Seriously, though, this was excellent. I know I got to sit down and figure out what "should" be done, and what I probably "would" do, given my own business situation. My "final draft" was two pages of reasoning and balancing; even my own real code issues haven’t gotten me to analyse them that heavily.

    More than what gets done about it, I’d like to know why, but for various reasons I kinda doubt that information will be available. At any rate, it’s been very entertaining and useful, and I hope you are willing to share other stuff in as much depth as this.

    As an aside, this is one of the first topics on which I’ve noticed your own post-post comments deal with the meat of the issue, rather than responding to some of the more pointless commentary (my own excluded). It’s nice, but it also seems to tell on you.

  23. If I make a post-post comment, it’s always a brief one. The more substantial commentary I convert into new posts and toss into the queue.

  24. Reinder says:

    "Running fiber through the center of the earth should increase performance by about 3.14 times."

    Make that 1.5. I can see a physicist round π/2 to 2 (for large values of π, if you will), but not to 3, and certainly not to anything with two decimals that is larger than 3.

  25. Iain says:

    On the subject of Explorer and networks – I use Windows 2000 at work. If I’ve pointed Explorer at a \\UNC\c$ address… and if I have the Folders pane open on the left…

    Then Explorer merrily loads every server and workstation it can find in our whole domain into that explorer tree. This is tens of thousands of machines. !$*^!!

    Does anyone know how to stop it doing this?

  26. yeah, close the folders pane.

  27. __int128 says:

    I just got around to reading up on this whole slow-mode/fast-mode issue, and it might not even be relevant anymore, but I’m going to write my thoughts on it anyway.

    1. Under no circumstance should Explorer prompt the user. It’s not the user’s problem.

    2. Disabling fast-mode by default is a really, really, REALLY bad idea. You don’t disable features just because they predictably misbehave in 1 out of 100 cases. You implement workarounds for those scenarios.

    3. My current main idea, which I don’t think has been invalidated, would be to upgrade the protocol to ask for version info as well (at least if it’s a Samba). If it’s a Samba and it doesn’t reply with a version string, then you go into slow mode. They’ll fix it and make it work. They shouldn’t have released untested code in the first place. If they have been able to implement the feature, they should have been able to test it right as well. Of course, extending the protocol like this might not be backward-compatible. You might choose to integrate the version information directly into the family id, such as Samba|3.0.22 instead of just Samba.

    4. Another way of ensuring backward compatibility would be to always connect using fast-mode, but during the first FindFirstFile you would perform as many roundtrips as needed to determine whether you’re running on a buggy server. If it’s a buggy server, you just do it all again using slow-mode. All of this happens inside FindFirstFile(), before sending any information upstream.

    I had a couple more ideas, but I think these two should cover it. I’ve become quite adept over time at thinking up ways of hacking code for compatibility reasons.

    Under no circumstance should things break in new versions. The only change users will see will be your new code, and all the blame will fall on you, rather than on the 3rd parties.

  28. Repost as no one noticed that last time I posted it :-).

    FYI: This is the fix that tridge suggested to Raymond. No reason why this shouldn’t work – and be completely transparent to application code. No reason to bother userspace code about it, no reason for GUI changes or looking for specific versions or detecting Samba as opposed to any other server, no need to keep things in "slow" mode now the bug is fixed.

    Jeremy

    From tridge:

    "If we had run across the error you

    described (INVALID_LEVEL from a continue)

    then we would have added a bit flag on the

    current connection structure to mark this

    connection so it won’t use that level in

    future, then repeat the search using a

    different level. That means you would get

    one useless search on the network with each

    connection to a buggy server, but no impact

    against non-buggy servers and no user observable

    affects. The denial of service attack you

    mention with this type of fix doesn’t happen

    as the extra bit is per-connection, not long

    lived (trying to remember long lived info

    about specific servers is a losing game)."

  29. Mark Steward says:

    Conscious that a lot of people are bored of this (and that I got it a bit wrong last time), I’ll try to keep this brief:

    Tridge’s suggestion does have the race problem, but as mentioned, there’s a race condition with > 100 files on Samba anyway.

    Will the user blame the server if you just merge the queries and they see both names for a renamed file?  Hopefully, although I know quite a few who would just blame Microsoft’s protocol.  Perhaps you could show the error message iff the second query comes up with a different listing?  Then everyone’s happy :-)

  30. Jeremy, reposting since you didn’t notice my response.

    The fact that you fixed the problem quickly was noted in the original article.   It is also completely and utterly irrelevant to Raymond’s post.

    There are distributions out there that have the bug, and there are embedded devices (NAS boxes) that have the bug.  That means that Microsoft needs to come up with a workaround for YOUR bug.  Because you don’t have a mechanism in place to ensure that ALL the devices containing the buggy code are fixed (you depend on 3rd parties to pick up the fix ad-hoc), there is no way for Microsoft to ensure that all devices have the fix.  

    The consequences of NOT working around the bug are catastrophic (users will believe they lose data), and they will blame Vista for the problem, and not your code (they have no other way of knowing it’s your old bug; they installed Vista and thousands of their files disappeared).

    This is the hell of developing commercial solutions that are deployed to hundreds of millions of customers.  You MUST fix all the bugs in the other guys' products.

    It’s WONDERFUL that you guys jumped on the bug and fixed it quickly.  It would be even MORE wonderful if everybody who picked up the buggy code would fix their distribution.

  31. I guess I didn’t follow the solution. How does one "repeat the search using a different level" without having to keep track of the items returned by the previous search and then doing merging/filtering?

  32. You don’t keep track of the items returned by the previous search, you junk them. You just re-issue the search from scratch with a different info level and mark that server (in memory) as unable to use that info level.

    The "race condition" complaint is bogus, as there are no consistency guarentees in the face of renames in SMB/CIFS anyway.

    Jeremy.

  33. But how do you "take back" all the items you returned from the first request? The first call to NtQueryDirectoryFile succeeds; it’s the second call that fails. How do you go back and say, "Oh, that first call that succeeded? Yeah, um, I’d like to change my mind. It actually failed, sorry."

  34. You didn’t understand why I posted the response.

    Yes, I understand all the above. I understand it’s irrelevant that the bug is now fixed. Microsoft must simply work around the buggy code out there. This isn’t a big deal, and is the kind of thing they have to do for older versions of their own software.

    What I was showing was that working around this is *easy* and does not require any of the drastic responses that people have been posting in these forums. No need to detect version strings, show dialog boxes to users etc. Just a simple, low level, transparent workaround, which works without any issues for buggy and non-buggy code out there.

    This is the same kind of workaround we do all the time to deal with ‘differences’ (some could say ‘bugs’ on occasion) in Microsoft clients and the way they drive Samba as a server. This is the sort of thing that client and server vendors have to do every day; it’s called "interoperability testing", and if Microsoft bothered to turn up at the CIFS conference, or even at Connectathon, then they’d find engineers from all the major server vendors ready and eager to help work through these issues to make life better for all our customers.

    Rather than spending the money creating the nice-looking but ultimately irrelevant port 25 web site, I’d like to see them use that cash to fund some travel for the client and server engineers so they can actually learn to play well with others in the industry. They’d be welcomed with open arms.

    Jeremy.

  35. I missed this part, which is quite humorous :-)

    LarryOsterman wrote: "This is the hell of developing commercial solutions that are deployed to hundreds of millions of customers.  You MUST fix all the bugs in the other guys' products."

    Tell me about it :-). Ok, we don’t deploy to *hundreds* of millions of customers, but we do deploy to *tens* of millions of customers.

    Believe me that’s quite enough to have to fix all the bugs in the other guy’s products too :-). If Windows clients break with Samba, they don’t blame Windows, they blame Samba. Trust me on this – remember, because CIFS has no spec, whatever Windows clients do is considered "right", and thus by definition if we don’t do the same we’re wrong. I have some *very* large customers who beat me with this every day :-) :-).

    Jeremy.

  36. Gabe says:

    It sounds like tridge is suggesting just returning an error on the first enumeration from a slow server.

    However, it does not sound like this is necessary. I don’t see why the client can’t do this:

    1. Issue FindFirst 128 files FAST command to server

    2. Return first 128 files to calling program

    3. Issue FindNext 128 files FAST command to server

    4. If this works, return files as expected and ignore the rest of these steps; if this fails, close the FAST search and mark this server as buggy

    5. Issue FindFirst 128 files SLOW command to server

    6. Discard the results, assuming they will be the same as the previous 128 (they may not be, but Samba doesn’t guarantee consistency anyway)

    7. Issue FindNext 128 files SLOW command to server

    8. Return files to programs using SLOW mode from now on with this connection

  37. Mark Steward says:

    On the topic of roundtrips, does anyone have an opinion on requesting only one file in FindFirst, and counting the added roundtrip as a "handshake" for Fast mode (you’d be roundtripping for the ID in slow mode anyway)?

    Is cutting out the roundtrip more important than accuracy in Explorer?  Is there a way to cut out the roundtrip in future clients (without the hassle of a new dialect)?  Is it a bit too hacky?  Is it impossible (i.e. the protocol doesn’t allow it)?

    Cheers,

    Mark

  38. Moi says:

    SH_ITEM_ID

    I’m glad I wasn’t the only one whose brain threw an exception when parsing SHITEMID :-)

  39. BryanK says:

    Raymond:

    > But how do you "take back" all the items you returned from the first request?

    You don’t.  You never return them to the client to begin with; you’re thinking at too high of a level.  The fix must be done in the redirector.  (Not the implementation of FindFirstFile or NtQueryDirectoryFile or whatever, the CIFS client code that decides which packets to send when.)

    This client code sends one packet with a file-count of 1 (call it a "fast-mode handshake", as Mark Steward said; I do like this idea), then sends another packet with a file-count of 100 (this is a good optimization of my original proposal, which was to send a second packet with a size of 1).  Do not return ANYTHING to the client until the response from the second packet comes back.  If it’s an error, then restart in slow mode.  If it’s not an error, then start returning the 101 results to the client.

    As you said, slow mode costs 1 roundtrip per 100 results, plus 3 per file.  Today, fast mode costs 1 roundtrip per 100 results (without the extra 3n roundtrips).  With this change, fast mode would cost 1 roundtrip per 100 results, plus 1; the buggy Samba servers would cost 1 roundtrip per 100 results, plus 3 per file, plus 2.  XP-style "use slow mode all the time" would be the same as it is today.  Vista-style "use fast mode all the time" would be the same as it is today (faster, but inconsistent when looking at a Samba server with the bug).

    These extra 1 or 2 roundtrips take time, yes (especially over a high-latency link).  But compared to the latency introduced by the rest of the code (especially slow mode, which is what everyone’s used to today!), it’s tiny.  Especially when there are large numbers of files and you’re using fast mode; the extra handshake time is a constant cost, where the rest of the roundtrip times are all O(n).

    But this post did explain something useful: The saved cost of fast mode is definitely significant.

  40. BryanK: Yes, I understood this as your proposal. I was reacting to Jeremy/Andrew’s, since they claim "no impact against non-buggy servers" – but your proposal does have an impact on non-buggy servers (though a small one, namely one additional round-trip).

  41. Nekto2 says:

    Agreed with Gabe’s idea

  42. Raymond,

    > But how do you "take back" all the items you returned from the first request?

    BryanK is correct. You don’t take them back. You’re thinking at too high a level. Calls like NtQueryDirectoryFile should mean nothing to the redirector – it’s just told to fetch directory contents when going through the underlying Windows VFS layer. What the client APIs ask for is irrelevant – so there’s nothing to take back – the call hasn’t returned to the client application yet.

    Whenever you get a "scan this directory for this pattern" request at the redirector level, you just issue at least two calls – the findfirst and the findnext. This readahead will let you know if you’re talking to a server with a bug here, and allow you to redo the query with a different level if so.

    Jeremy.

  43. In which case I don’t get it. If the error comes back on the continuation request, why does it fail at the 129th file instead of the 2nd?

    The continuation request is a different network protocol request number (SMBfindnext) than the first (SMBfindfirst). The bug lies in the two missing lines in the switch statement. So the Samba SMBfindfirst code handles the new info level correctly, but the Samba SMBfindnext code returns NT_STATUS_INVALID_INFO_LEVEL instead (at least the buggy versions do).

    So at the redirector level you always issue an SMBfindfirst call followed by an SMBfindnext with the new info level. If you get back correct results then great – not a buggy server; just cache the values returned and prepare to return them up through the VFS later. If you get back NT_STATUS_INVALID_INFO_LEVEL after the SMBfindnext request, you just junk both replies and re-issue an SMBfindfirst and SMBfindnext with a lower info level (possibly then doing another SMBfindfirst/SMBfindnext pair to get the extra info you would have got the first time from a non-buggy server). Then you mark that connection as "buggy" and make sure you never issue the new info level on that connection again as long as it exists.

    You never give anything back to the VFS until this is finished – it’s all batched and asynchronous anyway, so this has no impact on non-buggy servers and only two extra round trips on buggy servers.
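
    In rough pseudocode (every name here is invented for illustration – this is not actual redirector source):

        /* Speculative read-ahead smokes out a buggy server before anything
           is handed back up through the VFS. */
        NTSTATUS DoDirectorySearch(CONNECTION *conn, SEARCH *search)
        {
            ULONG level = conn->fAvoidNewLevel ? LEVEL_OLD : LEVEL_NEW;

            NTSTATUS status = SendSMBfindfirst(conn, search, level);
            if (!NT_SUCCESS(status)) return status;

            status = SendSMBfindnext(conn, search, level);
            if (status == NT_STATUS_INVALID_INFO_LEVEL && level == LEVEL_NEW) {
                conn->fAvoidNewLevel = TRUE;  /* per-connection, not long lived */
                DiscardCachedReplies(search); /* junk both replies */
                return DoDirectorySearch(conn, search); /* redo at lower level */
            }
            return status; /* cache the replies for the VFS to drain later */
        }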

    Jeremy

  45. Okay, I see. Of course, there is additional virtualization work needed to regurgitate the cached values correctly. (The client’s second call to NtQueryDirectoryFile might pass a different-sized buffer from the speculative request.)

  46. (However I don’t understand why the problem shows up on the 129th query if the wire protocol is "one file at a time".)

  47. BryanK says:

    Ah, OK, I get it.  I hope.  ;-)

    It may be worthwhile to do what Gabe has said, which sounds like it was built on Andrew’s fix (throw away the first 128 results from the *new* query when you requery in slow mode after seeing the error).  My hack is still faster than slow mode today — at least as far as Explorer is concerned — although it doesn’t really gain anything in terms of consistency.  This hack would be even faster, at the expense of … well, nothing that I can see at the moment.

    In any mode, the results might be inconsistent between two 128-file queries anyway, because CIFS doesn’t guarantee consistency (or at least, I don’t see how it could, short of a shadow copy on the server).  The same problem exists today; what happens if a file is changed between the first 128-file query and the second?

  48. No, the wire protocol isn’t one file at a time. It’s one "block" of files at a time, where "block" is a maximum number of returns that will fit into the allowed buffer. The latencies would be horrible if it were one file at a time.

    Jeremy.

  49. In which case there will need to be virtualization.

    App says "Please give me 4KB of results."

    Client sends to server "please give me 4KB of results the fast way" followed by "please give me another 4KB of results" to see if an error occurs.

    If an error occurs, then the client switches to slow mode.

    If no error occurs, then the client returns the result of the first 4KB query to the application, and saves the second 4KB for later.

    The application now submits a 2KB query. The client has to hand back 2KB of results and save the other 2KB for next time. (Or if the application submits an 8KB request, the client has to hand back the 4KB result and then issue another 4KB query to fill the second half of the buffer.)
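
    In other words, something like this (invented names; real code would also have to split only on directory-entry boundaries):

        /* Carve cached wire results to fit whatever buffer the caller passes. */
        ULONG ReturnSearchResults(SEARCH *s, BYTE *buf, ULONG cbWanted)
        {
            ULONG cbReturned = 0;
            while (cbReturned < cbWanted) {
                if (s->cbCached == 0 &&
                    !FetchNextBlockFromServer(s))  /* refill cache from the wire */
                    break;                         /* no more results */
                ULONG cbCopy = (cbWanted - cbReturned < s->cbCached)
                                   ? cbWanted - cbReturned : s->cbCached;
                memcpy(buf + cbReturned, s->pbCached, cbCopy);
                s->pbCached  += cbCopy;            /* leftovers wait for next call */
                s->cbCached  -= cbCopy;
                cbReturned   += cbCopy;
            }
            return cbReturned;
        }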

  50. Of course. That’s how network redirectors work. Application requests don’t drive the network redirector directly – they cause the redirector to do things on the application’s behalf – but the redirector is in charge of how these things are done.

    Redirectors must already batch calls together to improve performance. I don’t understand why you’re having a problem with this – it’s standard network programming.

    It’s like using fopen()/fread()/fwrite() at the application level instead of open()/read()/write(). I know Microsoft programming guides seem to prefer people to use the system calls (possibly in order to tie people to Windows and not use portable constructs) but the performance will be massively improved if aggregating calls are used.

    Jeremy.

  51. I can see batching happening for common operations like reading and writing, but lower-level things like locking and "reading the contents of a directory directly" I assumed were more pass-through.

  52. DeepICE2 says:

    "I know Microsoft programming guides seem to prefer people to use the system calls (possibly in order to tie people to Windows and not use portable constructs) but the performance will be massively improved if aggregating calls are used."

    I think I now understand why Microsoft does not send developers to the Samba conferences – who would want to go when the Samba guys profess such "Evil M$" attitudes?

  53. BryanK says:

    Locking, yes, that would (probably) have to be close to pass-through.  IIRC, that’s the reason that NFSv2 systems have issues with locking: various operations (including locking) can be batched at the client.

    But why would reading the contents of a directory have to be pass-through?  I’m coming from the Unix "everything is a file" mindset, but isn’t a directory just a special type of file, whose contents are all the filenames in that directory?  (Along with a pointer to where each file’s info is stored on the disk, of course.)  If that’s true, then there’s really no reason to have any kind of difference (in the redirector anyway) between enumerating a directory and reading from a file.

    But maybe I’m assuming too much of a "sane" implementation of a directory in NTFS/FAT.  :-P  I know I’ve been assuming a bunch of things about the redirector’s operation throughout much of this topic (most importantly, that the redirector is what does the "batching", and it just returns results to the implementation of NtQueryDirectoryFile out of its local cache when possible), and those assumptions may be false.

  54. My thought was that the batching happened at a layer above the redirector (but below the application). Otherwise, every redirector would have to reimplement all the batching logic.

  55. BryanK says:

    Raymond:

    > Otherwise, every redirector would have to reimplement all the batching logic.

    What I understand about redirectors makes me think that this is really where the batching should be done anyway, regardless of the duplication.  I am assuming that there’s one (and only one) redirector for CIFS, one for the local hard drive(s) (if one is needed for them), one for Exchange’s "M drive" (a view into the Jet database where Exchange keeps all the emails for every user), and one for any other required non-NTFS/non-FAT entity that wants to look like a filesystem.  (For example: WebDAV, NFS, etc., etc.)

    The optimal batching rules for all these could potentially be different (especially Exchange).  The difference in requirements for the cases is what made me think that the redirector is what did the batching.  Though I don’t know for sure.

    It probably is a decent-sized chunk of duplicated code, yes, but if some virtualized FSes need it and others don’t, then having it happen for all FSes may not be a good idea.

  56. DeepICE2 wrote :

    "I think I now understand why microsoft does not send developers to the samba confrences – Who would want to go when the samba guys profess such "Evil M$" attitudes"

    Don’t be silly, and stop using such childish abbreviations. The CIFS conference was a Microsoft *initiated* conference. It used to be held every year in Redmond (it’s a lot more convenient to me now it’s here in Santa Clara though :-). Connectathon is a Sun-sponsored conference (but is attended by IBM, HP, NetApp, EMC – all Sun competitors). Microsoft would be very welcome at both of these events.

    I wish Microsoft pushed people towards more standards-based APIs, that’s all (gssapi rather than sspi, etc.). There are *reasons* for these standards, you know. Some Windows apps have *horrible* network performance because of the "use system calls, they’re faster" mentality.

    Jeremy.

  57. Ivan Cronyn says:

    I wish there was more stuff out there on dealing with high-latency scenarios. In my office environment (with a 100Mb/s connection between NYC and London, UK) we’ve had to implement Tacit (http://www.tacitnetworks.com/) to get acceptable file-browsing and are still battling with SQL Server.

    Does anyone know how to get SQL Server to stop waiting for acknowledgements after sending TDS data blocks? We’re currently maxing out at 80KB/s on our line, with the possibility of increasing performance 4x if we try messing with the "network packet size" settings on client and server, but this will still not be good enough and may cause instability.

    Surely TDS isn’t a worse protocol than FTP, where we see bursting all the way up to our 100Mb/s capacity?

  58. BryanK says:

    Well, I doubt that FTP requires any kind of handshaking above what TCP already provides.  (The data connection is only open for one file; each new file transferred gets its own data connection.)  If that’s true, then it would only be possible to provide performance that’s similar to FTP if the TDS protocol also didn’t require any higher-level synchronization.

    But it sounds like it does require more synchronization, since it’s waiting for something.  Maybe there are reasons for this requirement, I’m not sure (I’d guess that if there are, they have to do with data consistency).  But whether there are reasons or not, it won’t work like FTP if the protocol requires more synchronization.

  59. I am definitely glad to hear that some work is being done on Windows Explorer for Vista with regard to WAN performance. I have done demonstrations for WAN accelerators (Riverbed Steelheads), and Windows Explorer for XP really is the poster child for applications that barf bigtime on a high-latency connection. As a matter of fact, I always wondered why MS didn’t just buy Riverbed to incorporate some of their stuff into Windows?

    On a separate note, now that Samba is widespread enough for Samba<->Windows interoperability issues to matter to MS, does that mean that maybe it is time for MS to share full documentation on the network interface with Samba? I may be wrong, but I assumed that they were still working out the interface by reverse engineering etc., which surely is more likely to lead to issues like this. Obviously MS has no reason to be overly charitable, but based on this sort of issue, it sounds like it is/will be in MS’s best interest. Probably look good to the Europeans too :)

    ps: Raymond, I am glad you have a thick skin – as a highly visible blogger within the shell team you definitely cop a lot of abuse on behalf of MS and Windows that you don’t deserve. Keep up the great work.

  60. Neil says:

    "It’s like using fopen()/fread()/fwrite() at the application level instead of open()/read()/write(). I know Microsoft programming guides seem to prefer people to use the system calls (possibly in order to tie people to Windows and not use portable constructs) but the performance will be massively improved if aggregating calls are used."

    In my possibly limited experience there was only one case when stdio was faster, and that was in certain Windows 3.1 configurations when all the network I/O went through the real mode redirector so that each read required a transition in and out of real mode. Unfortunately for me the compiler I was using only supported 20 stdio handles and I needed 65.

  61. Cutting corners of a circle says:

    If by "halfway around the world" you mean the other side of the Earth, then the theoretical minimum latency is closer to 42ms rather than 66ms :)

  62. BryanK says:

    Neil: Maybe you didn’t notice non-stdio calls being slower, but I would bet that they were.

    stdio buffers everything.  Reading from memory is *always* faster than reading from either disk or network.  Therefore, reading from a buffer is going to be faster than reading from the network or the disk.
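
    A toy example of the difference (my own illustration, not from the original discussion):

        #include <stdio.h>

        /* fgetc() is served from stdio's in-memory buffer; the OS (or the
           network redirector) is only consulted when the buffer runs dry. */
        long CountBytes(const char *path)
        {
            FILE *f = fopen(path, "rb");
            if (!f) return -1;
            long n = 0;
            while (fgetc(f) != EOF)  /* no system call on most iterations */
                n++;
            fclose(f);
            return n;
        }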

  63. Steve Loughran says:

    I’m fed up with blaming Samba, and don’t think there is a good solution.

    But I think Raymond is missing something when he says that enterprise users are complaining about how long Explorer takes, so the solution should be to do queries in bulk.

    My issues are not just that things take ages, but that Explorer hangs for a minute when trying to talk to a share on a PC on the LAN that is switched off. And it hangs for a minute if you mount a fully qualified host at work (\\host1.example.org\c$) and then, outside the firewall, that hostname still resolves. It just plain sucks. What is wrong with doing stuff in a separate thread?

  64. I keep on doing this, clearly it’s evidence of a lack of imagination on my part…

    Raymond’s post a…

  65. So you can write on them.

Comments are closed.

