Adding a new flag to enable behavior that previously was on by default

Date: April 19, 2006 / year-entry #139
Tags: other
Orig Link: https://blogs.msdn.microsoft.com/oldnewthing/20060419-14/?p=31473
Comments: 73

One of the suggestions for addressing the network compatibility problem was to give up on fast mode and have a new "fast mode 2". (Equivalently, add a flag to the server capabilities that means "I support fast mode, and I'm not buggy.") This is another example of changing the rules after the game is over, by adding a flag to work around driver bugs.

Consider a hypothetical program that uses fast mode on Windows XP. It runs against a Windows Server 2003 server and everybody is happy. Suppose you make a change to Windows Vista so that it requires that servers set a new "fast mode 2" flag in order to support fast mode. When the customer upgrades their client from Windows XP to Windows Vista, they would find that their hypothetical program ran much slower. Whose fault is it? Not the hypothetical program that was using fast mode on Windows XP; that program is using fast mode correctly. Not the Windows Server 2003 machine; that server supports fast mode correctly. Is it Windows Vista, then, that is at fault?
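
The regression in this scenario can be sketched in a few lines. This is only an illustration, not the actual client logic: the bit values and the names `CAP_FAST_MODE`, `CAP_FAST_MODE_2`, `old_client_uses_fast_mode`, and `new_client_uses_fast_mode` are assumptions made up for the example.

```python
# Hypothetical capability bits; the article does not give the real protocol values.
CAP_FAST_MODE = 0x00000001    # "I support fast mode"
CAP_FAST_MODE_2 = 0x00000002  # proposed: "I support fast mode, and I'm not buggy"


def old_client_uses_fast_mode(server_caps: int) -> bool:
    """Existing rule: trust the original fast-mode bit."""
    return bool(server_caps & CAP_FAST_MODE)


def new_client_uses_fast_mode(server_caps: int) -> bool:
    """Proposed rule: use fast mode only if the new bit is set."""
    return bool(server_caps & CAP_FAST_MODE_2)


# A correct server that shipped before the new bit existed can set only the old one.
old_correct_server = CAP_FAST_MODE

print(old_client_uses_fast_mode(old_correct_server))  # True: fast mode, as before
print(new_client_uses_fast_mode(old_correct_server))  # False: falls back to slow mode
```

Neither the program nor the server changed, yet the upgraded client stops using fast mode against a server that implements it correctly.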

"Hey, don't blame me," you answer. "It's that guy over there. That guy you've never heard of. He made me do it. Blame him!"

To describe this sort of behavior I like to steal a phrase from Albert Einstein: "Spooky action at a distance". (Einstein used it to describe what in modern physics is known as quantum entanglement.) In this particular situation, we have a conversation between two participants (the client software and the server software) mediated by a third (Windows) which collapses due to the mere existence of a fourth party not involved in the conversation! It's as if your CD player suddenly lost the ability to play any of your music CDs because some company you've never heard of halfway around the world pressed a bunch of bad CDs for a few months earlier this year.

Some people suggested, "Why not have a flag that says 'I support fast mode'?" Indeed that flag already exists; that's why Windows Vista was trying to use fast mode in the first place. The problem wasn't that the server didn't support fast mode. The problem was that the server had a bug in its fast mode implementation.

"Okay, then add a new flag that says 'My fast mode isn't buggy.'" Consider also how this course of action would look after a few revisions of the specification:

In response to the QUERY_CAPABILITIES request, the server shall return a 32-bit value consisting of zero or more of the following bits:

0x00000001  This server supports fast mode
0x00000002  This server supports fast mode and doesn't have the bug where enumerating a directory with more than 128 files fails on the 129th query
0x00000004  This server supports fast mode and doesn't have the bug where the long file name is reported incorrectly in the response packet
0x00000008  This server supports fast mode and doesn't have the bug where directories whose names consist entirely of digits are misreported as files
0x00000010  This server supports fast mode and doesn't have the bug where the enumeration resets if a file is created in the directory while the enumeration is in progress
0x00000020  This server supports fast mode and doesn't have the bug where FindNext returns failure even though there are still files to be enumerated
...

If a new capabilities flag were created for every single server bug that was discovered, the capabilities mask would quickly fill up with all these random bits for bugs that were fixed ages ago. And each time a bug was found in any one server, all servers would have to be updated to add the new capabilities bit that says, "I'm not that buggy server you found on April 8th, 2006," even the servers sitting in a locked closet whose operating systems are burned into EPROMs. And if you're the author of a new server, which capabilities bits do you set? Do you claim that you don't have the bug where FindNext returns failure even though the enumeration hasn't completed? What if, six months after you ship, somebody finds a bug in your server of exactly that sort? I guess this means that the next revision of the protocol will have to have a new flag:

0x00000040  This server supports fast mode and doesn't have the bug where it claims that it doesn't have the "FindNext returns failure even though there are still files to be enumerated" bug, even though it actually does have the bug, but in a more subtle manner

Or maybe you're convinced that you don't have any bugs in your "fast mode" implementation. Do you report 0xFFFFFFFF to say "I have no bugs at all, not even the ones people might discover later in other implementations"? What happens when the 33rd "fast mode" bug is found? Do we have to have a QUERY_CAPABILITIES2 function? If a capabilities bit is created for every single bug that ever existed in a networking protocol implementation, you'd have a few thousand capability bits all of which mean "I don't have that bug where..."
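
To make the arithmetic concrete, here is a small sketch of a one-bit-per-bug scheme running out of room. The helper `not_buggy_bit` and the constant `MASK_BITS` are hypothetical names for illustration, and the sketch simplifies by treating all 32 bits as available for bug flags.

```python
MASK_BITS = 32  # QUERY_CAPABILITIES returns a 32-bit value


def not_buggy_bit(bug_number: int) -> int:
    """Return the bit meaning "I don't have bug N" (N is 1-based).

    Simplification: every bit is treated as available for a bug, ignoring
    the bit already used for plain "supports fast mode".
    """
    if bug_number < 1:
        raise ValueError("bug numbers start at 1")
    if bug_number > MASK_BITS:
        raise OverflowError(f"no room in a {MASK_BITS}-bit mask for bug #{bug_number}")
    return 1 << (bug_number - 1)


# A server claiming to be free of all 32 bugs known so far sets every bit.
all_clear = 0
for n in range(1, MASK_BITS + 1):
    all_clear |= not_buggy_bit(n)
print(hex(all_clear))  # 0xffffffff

# The 33rd bug has nowhere to go.
try:
    not_buggy_bit(33)
except OverflowError as exc:
    print(exc)  # no room in a 32-bit mask for bug #33
```

Note also that a server reporting 0xFFFFFFFF today is implicitly making claims about bugs that have not been discovered yet.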

Now, I'm not saying that this course of action is out of the question. Sometimes you have to do it, but you also have to realize that the cost for making this type of change is very high, and the benefit had better be worth it.


Comments (73)
  1. Name required says:

    "the capabilities mask would quickly fill up with all these random bits for bugs that were fixed ages ago"

    You mean the bugs that you complain are still in place on so many servers and network attached devices all over the world? Do you think they will go away in six months? A year? What? If so then you shouldn’t worry about this problem at all, because by the time Vista ships the problem won’t exist any more.

  2. vince says:

    Well the obvious solution is that Microsoft post the full CIFS and SMB specification, making sure to have Errata for all MS OS’s ever released.

    Giving people incomplete info and then asking them to solve the problem is only going to make both sides look foolish.

  3. Rutger says:

    I am glad you posted a followup…

    I would be surprised if a solution was found that would satisfy the majority of the readers of this blog (or even the minority for that matter). A lot of people see this as something to rant about.

    oh vince, you assume that if the specs are on the street (and they might be I don’t know) everybody is going to make a perfect implementation. yeah right…..

    I am glad that these kinds of problems get attention. It feels like the problem you once described, where a video driver would report "yeah, I can do that" to any DirectX query that was fired at it. Unfortunately, I don’t think this one is as easily solved.

  4. Joe Dietz says:

    Well in keeping with the law of fixes "do the fix at the lowest possible abstraction layer", you have to fix this problem at the protocol level (and make the protocol just a bit uglier) or just do nothing.  What wasn’t clear from the original problem description was that XP clients also use fast enumeration if available.  In that case XP clients interacting with old Samba servers already have been suffering from this for years.  That more or less tips the balance towards ‘do nothing’.

  5. vince says:

    > you assume that if the specs are on
    > the street (and they might be I don’t
    > know) everybody is going to make a
    > perfect implementation. yeah right

    No, I assume if the specs were on the street, people here could have looked them up and posted intelligent solutions.

    As it is, people posted things based on the limited info that Raymond gave, and then he comes and shoots them down for being naive… well of course everyone is naive, they aren’t getting the full story.

    Then later the actual Samba developers come post, and they post much more useful info than anyone from MS gave, and it turns out even Raymond really didn’t have a good handle of what was going on.  So this whole series turns into a bunch of pointless speculating rather than any sort of good example of how to fix a bug.

  6. Dflare says:

    Joe, Windows XP uses slow mode by default (so basically everyone is using it); that’s why the problem is not really common. You can turn fast mode on (I don’t really know how to do it, but you can). With Windows Vista, they are trying to make fast mode the default (so everyone can benefit from it), and that’s where the problem will show up for everyone.

  7. Neal says:

    I know zip about networks so I ask, how often do servers get swapped out invisibly (does windows monitor their presence or only see them when you attempt to access, is there some unique identifier that would change with any change)?  

    Why not add a service that is run whenever this network/protocol/whatever is available.  Add a hook to the appropriate network drivers so this service can be notified to do its business on connect or first access.

    The first time a connection is made it makes any necessary tests/queries to determine if a problem exists and sets mode accordingly.  Any time the connection is dropped the service requeries and resets the mode before next access.  

    Benefits – Aside from the hook, the network drivers don’t change.  New/multiple/selectable checks can be added by changing the service and not the network drivers (although the service could only work around bugs that could be fixed by mode or setup changes).  Don’t have to even run the service or checks if you’re certain of the hardware.

    Probs – several thousand windows machines doing test queries at once when they detect a server or disconnect/reconnect… lessened if checked only when an attempt to access is made?  Wouldn’t work if servers can be swapped invisibly.  Listening to a moron pulling guesses outta his behind?

  8. Bart says:

    I suggested this because it will keep functioning, although in a degraded manner.

    One alternative way of handling this is for both client and server to exchange capabilities and version/implementation information, and then handshake until a set can be found that both agree to support.

    At least in this way the ‘newest’ version can decide the right compatibility mode that both can support correctly.

    But you can’t add signatures and workarounds for yet undiscovered problems.

  9. Bart: Whether you do it via a capabilities mask or by negotiation you have the same problem. With negotiation, you’ll have the client asking the server, "So, do you have that 129-file bug?" and the server will  say "Huh? What 129-file bug?"

  10. Bart says:

    You don’t ask it if it has a bug, you ask it what it thinks it can do.

    Then you ask it what implementation it is.

    If you know about bugs in its implementation you remove those from the set of things it can do.

    Things it reports it can do but you don’t know about, you don’t support and you can safely drop those.

    The server does the reverse with you.

    This results in two sets of possible things that both think the other can do correctly, take the intersection of these sets and you have what is usable according to the information available to you.
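
A rough sketch of the scheme this comment describes, assuming features are modelled as plain sets of strings. The feature names, the `("ExampleServer", "1.0")` implementation tag, and the helpers `trusted_features` and `negotiate` are all made up for illustration.

```python
def trusted_features(advertised, implementation, understood, known_bugs):
    """Features the peer advertises that we understand and have no bug report against."""
    return (advertised & understood) - known_bugs.get(implementation, set())


def negotiate(client_trusts_of_server, server_trusts_of_client):
    """Usable features are the intersection of what each side trusts in the other."""
    return client_trusts_of_server & server_trusts_of_client


# Hypothetical data for illustration only.
understood = {"slow_enum", "fast_enum"}
known_bugs = {("ExampleServer", "1.0"): {"fast_enum"}}
server_advertises = {"slow_enum", "fast_enum", "shiny_new_thing"}

buggy = trusted_features(server_advertises, ("ExampleServer", "1.0"), understood, known_bugs)
unknown = trusted_features(server_advertises, ("ExampleServer", "1.1"), understood, known_bugs)

print(sorted(buggy))    # ['slow_enum']
print(sorted(unknown))  # ['fast_enum', 'slow_enum']
print(sorted(negotiate(buggy, {"slow_enum", "fast_enum"})))  # ['slow_enum']
```

The second result shows the limit of the approach: a version with no entry in `known_bugs` is trusted by default, whether or not it actually has the bug.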

  11. Joe Butler says:

    To Vince:

    No one is trying to make anyone look foolish.  The point of Raymond’s post, if I’ve understood it correctly, was not to obtain a free solution, but to highlight the issues involved in implementing certain fixes.  

    This obviously generated a lot of interest – because we like to solve problems.  

    Some of the suggestions were impractical (due to the negative impact they would have on end users), so this post was presumably made to focus on a smaller point and highlight some of the reasons why.

    Believing that everything would be fine if the specs were available is being unrealistic.  If I remember the original networking thread correctly, the developer of the samba issue admitted that it was an oversight on his part (it was not due to having limited knowledge of the SMB protocol).  It was also pointed out by someone else, that the 3 minutes or so end-to-end fix time seemed suspiciously fast, if it included testing, etc.  Read between the lines of that comment and it might show the real reason such errors creep in.

  12. kbiel says:

    Raymond,

    Why are you reluctant to put the user in charge of this issue?  All that needs to be done to solve it is this:

    Vista uses fast mode by default.  Everything works great and the user is happy with the faster response in Vista until Vista encounters a samba server with the fast mode bug.  Vista pops a dialog saying, "An error occurred while attempting to communicate with \xyz in enhanced mode.  Using legacy mode might alleviaate the problem.  Would you like to reattempt connecting to \xyz using legacy mode?"  Include a check box "Always use legacy mode for \xyz?".

    That seems to solve the problem.  You have given a brief and non-scary explanation of the problem and placed the blame where it belongs, on server \xyz.  Then you have given a solution to the problem that the user can implement if they choose and permanently if they prefer to not receive this message in future sessions.  Now the list of "bad" servers is controlled by the user, the user is aware of the issue and can possibly report it to the proper people, etc.

  13. Joe Butler says:

    To kbiel,

    Try this thought experiment:

    Remove all compatibility fixes in your OS (the ones that insulate you from bugs in other software/hardware).  Replace them all with ‘non-scary’ dialogs.  Make sure you use words that everyone will understand.  E.g. "\xyz", "enhanced mode", "legacy mode", "alleviaate", and throw in some big bang words like, "problem" just to make it less scary for the user.  Make that word "locale RED" so that it stands out even more.  Now pop that dialog up every time an app calls an offending function (let’s assume it hasn’t crashed the PC, and that an unattended PC isn’t brought to its knees while attempting to display the millionth ‘non-scary’ dialog).

    Now, go about your business and see how long it takes before you scream, "stupid $4it4ead fuc£ing computer!".

  14. Andy Blues says:

    I’m a little tired of this thread about this problem supposedly caused by the current Samba implementation.

    I am pretty sure that, before the moment Vista will be released to the general public, Samba will be already corrected and support FastMode.

  15. andy says:

    Andy Blues: The bug is already fixed in Samba (e.g. read KJK:Hyperion’s and Joe Butler’s comments just above).

    But the big problem is that a version of Samba with this bug is shipped together with lots of NAS devices, making it hard to fix for home users.

    In addition it is really interesting to read, for some people like me, about pros/cons of fixes for bugs. I especially liked the link to the "Spooky action at distance" making me read a little bit of quantum mechanics in the end :)

  16. "oh vince, you assume that if the specs are on the street (and they might be I don’t know) everybody is going to make a perfect implementation. yeah right….."

    Oh but how much harder is it to create an implementation when you don’t have specs at all!!  That’s the point.

    "What wasn’t clear from the original problem description was that XP clients also use fast enumeration if available."

    Yeah, I thought from the original description of the problem that is something newly implemented (or activated) in Vista.  

  17. Hyperion,

    Thank you very much for your comment.  You said all the things I was afraid to say.

    I wish that n00b Jeremy Allison who commented on the original post about this could see what you just said.  His response of why this happened in Samba was that it was a failure to communicate by Microsoft.

    James

  18. vince says:

    Hyperion:  I suggest you check out devel.samba.org.  I quote:

    > Important: In order to avoid any
    > potential licensing issues we
    > require that anyone who has signed
    > the Microsoft CIFS Royalty Free
    > Agreement not submit patches to
    > Samba, nor base patches on the
    > referenced specification.

  19. Nate Silva says:

    Does Microsoft have a validation suite that open source implementations can use?

    The Samba guys had to reverse engineer the protocol, so it’s not surprising they got some details wrong. Hiding the details from "competitors" like Samba harms Microsoft, too.

  20. > "Hey, don’t blame me," you answer. "It’s that guy over
    > there. That guy you’ve never
    > heard of. He made me do it. Blame him!"

    It’s very funny to read you mocking this kind of answer Raymond, given it’s pretty much exactly what you said to me in a newsgroup posting some years ago. In fact, let me look at what you did say…

    > Because that idiot who wrote broken code shipped first.  Certain
    > idiots who write broken code are #1 on the software charts in
    > their area.

    Well I don’t know whether I’d heard of them, because you wouldn’t say who it was!

    As it happens it’s precisely this attitude from Microsoft that caused me to give up Windows altogether, both as a developer and a user.

    I just can’t deal with a platform where API contracts are broken willy-nilly in the service of backwards compatibility with nameless software.

    (the article was: http://groups.google.com/group/microsoft.public.platformsdk.ui_shell/msg/f46ba9e9d45c7bd3 )

  21. Sean,

    Despite how much of a Microsoft fan I am, even I must admit that MS has upset me a time or two.  However, I contend that MS is still far better than the competition.  Would you rather deal with situations like this?

    http://news.zdnet.com/2100-3513_22-6061491.html

    James

  22. peterchen says:

    Did I misunderstand the problem?

    As I read the original post, the SERVER is handling the request wrong, and the XP client runs happily in SLOW mode. Or am I mistaken?

    In this case, nothing would change for a Vista client, it would use slow mode until the server is updated and indicates "fast is ok (really now)".

    (and a version number could accumulate most of the bits)

  23. peterchen: But older servers (that don’t have the bug) won’t set the "really now" bit. Result: You upgrade to Vista and your performance drops.

  24. Mike says:

    While it is way too late in the game for this comment to have any effect on this particular bug:

    This would likely never have happened had Microsoft actually documented SMB/CIFS, and followed up on the "CIFS" thing (namechange, and provably non-honest "attempt" to define and open up the protocol).

    I also think this whole spectacle actually serves as fuel for the EU problem MS now faces. Microsoft has proven to be almost completely incompetent in the area of documenting this disaster.

    I hope Microsoft (finally) learns from this mistake, that communication protocols (be them over-the-wire, wrapped in RPC, syscalls or just plain function calls within a process) must be documented. Not just that the specific "call" exists and what arguments/parameters it expects, but also preconditions, invariants and postconditions.

  25. paul says:

    kbiel: Putting the user in charge only works if there is a user! Displaying a confirmation dialog assumes that if a user existed they would know what decision to take…

    This doesn’t take into account services without a visible desktop, software running on embedded systems or even ‘normal’ software on a pc stuck in a server room and never actually viewed directly – how would a confirmation dialog help in these scenarios.

  26. sayler says:

    peterchen: But older servers (that don’t have the bug) won’t set the "really now" bit. Result: You upgrade to Vista and your performance drops.

    Just to make sure that I understand, it’s really "and your performance doesn’t get any better" — because current clients use the one-at-a-time mode?

  27. sayler: No, your performance actually drops. Please go back and re-read the second paragraph.

  28. vince says:

    > No, your performance actually drops.
    > Please go back and re-read the second
    > paragraph.

    So, how fast is this "fast" mode.  Can we see some numbers?  Are we talking a massive (like twice as fast) speed-up?  Or is it something like a 10% speedup that might be lost in the noise?

  29. BryanK says:

    > This server supports fast mode and doesn’t have the bug where enumerating a directory with more than 128 files fails on the 129th query

    I hope people realize by now that this isn’t a good description of this bug.  Yes, it does describe the behavior that a user program would see when running against a server with the bug.  But the bug was that the SMBfindfirst packet succeeded, and then the SMBfindnext packet (the second one in the enumeration) returned an error.  The 128/129 number is merely an artifact of the redirector’s batching process — if the redirector batched 64 files into one SMB packet, then you’d see the problem on the 65th file coming back from NtQueryDirectoryFile.

    (Unless the protocol forces a "count" of 128 in all SMBfindfirst and SMBfindnext packets?  That would seem extremely braindead to me.)

    That’s not to say that the right fix is to add a flag that means "I don’t have this bug", though.  (This is partly because of the reasons you cite, but also partly because requiring a server to turn a flag *on* to get behavior that it *already* does correctly seems completely backwards to me.)
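
The point about batching can be illustrated with a toy model. This is a sketch under the assumption stated in this comment (the first request of an enumeration succeeds and every later one fails on a buggy server); the function `entries_before_failure` and its parameters are hypothetical, not part of any real redirector.

```python
def entries_before_failure(total_files: int, batch_size: int, find_next_broken: bool) -> int:
    """Return how many directory entries the client receives before the
    enumeration either completes or hits the failing follow-up request."""
    delivered = 0
    first_request = True
    while delivered < total_files:
        if not first_request and find_next_broken:
            break  # the second request in the enumeration returns an error
        delivered += min(batch_size, total_files - delivered)
        first_request = False
    return delivered


print(entries_before_failure(200, 128, True))   # 128: failure surfaces on the 129th file
print(entries_before_failure(200, 64, True))    # 64: same bug, now seen on the 65th file
print(entries_before_failure(100, 128, True))   # 100: small directories never hit the bug
print(entries_before_failure(200, 128, False))  # 200: a correct server
```

The number at which the failure appears tracks the client's batch size, not anything in the server.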

  30. vince: I already gave this information in earlier entries in this series. Perhaps I should give up on this series since people just keep asking questions I’ve already answered.

  31. vince says:

    oldenewthing:

    > I already gave this information in
    > earlier entries in this series.
    > Perhaps I should give up on this
    > series since people just keep asking
    > questions I’ve already answered.

    I meant real numbers.  As in, ones measured on an actual system under normal behavior.

    Your previous answer started with "assume a latency of 500ms"…  do you expect most people to have their NAS servers available over a satellite link or something?

  32. Many companies have very high latency links; a half second is not unreasonable. The latency doesn’t depend on whether you are talking to a Windows XP workstation, a NAS box, a Windows 2000 server, or a Novell server. You don’t want to lose the benefits of fast mode when talking to a server that supports it.

  33. Have a field that allows the server to identify itself, by vendor name and version.  Compare HTTP’s Server: header.

    Build a database of workarounds like:

    * Vendor Acme versions 3.4 through 3.9 have the bug where enumerating a directory with more than 128 files fails on the 129th query

    * Vendor Bilix versions 12.7 through 12.11 have the bug where the long file name is reported incorrectly in the response packet

    * Vendor Colefa versions …

    Apply workarounds for the vendor/version/bugs in the database

    Make sure to give the vendor an opportunity to put a cap on the version number with the bug so that future versions of their software are not fighting a permanent workaround
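
A minimal sketch of such a workaround table, assuming versions can be compared as (major, minor) tuples. The vendor names come from the list in this comment; the bug identifiers, the table layout, and the function `workarounds_for` are invented for illustration.

```python
# (vendor, first affected version, last affected version, workaround id)
# The upper bound is the "cap" that keeps later releases from inheriting the workaround.
WORKAROUNDS = [
    ("Acme", (3, 4), (3, 9), "enum_fails_on_129th_query"),
    ("Bilix", (12, 7), (12, 11), "long_file_name_misreported"),
]


def workarounds_for(vendor: str, version: tuple) -> set:
    """Return the workarounds that apply to this vendor and version."""
    return {
        bug
        for known_vendor, first, last, bug in WORKAROUNDS
        if known_vendor == vendor and first <= version <= last
    }


print(workarounds_for("Acme", (3, 5)))           # {'enum_fails_on_129th_query'}
print(workarounds_for("Acme", (4, 0)))           # set(): past the capped range
print(workarounds_for("UnknownVendor", (1, 0)))  # set(): not in the database
```

As with any lookup of this kind, a server that is absent from the table, or that misreports its vendor or version, gets no workaround.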

  34. KJK::Hyperion says:

    vince, you have absolutely no bloody clue about the issue. The Samba people are kept updated about all such changes to the protocol by none other than Microsoft. They as a project own several copies of the Windows filesystem SDK

    =====

    "Fast mode" isn’t even an extension of the protocol, it’s just a new function code for the "find first" and "find next" operations, with its associated new data structure

    Some specifics: the new function code is FileIdBothDirectoryInformation, and the associated structure is FILE_ID_BOTH_DIR_INFORMATION. The former is just a number – if they knew how to encode FileBothDirectoryInformation, they’d have no problem with the new one. The latter is FILE_BOTH_DIR_INFORMATION with an extra field, of fixed size. Again, if they supported FILE_BOTH_DIR_INFORMATION, they should have no problem supporting FILE_ID_BOTH_DIR_INFORMATION

    (actually the new function codes are two, and the second is FileIdFullDirectoryInformation, but it doesn’t really matter, because the same considerations apply)

    Some background: the new modes return the file inode alongside attributes, name, short name, etc. They were introduced in Windows XP (like anyone who doesn’t live in the fairytown of "open protocols" knew), I presume to support natively the UNIX readdir() function, following the commitment of Microsoft into Services For UNIX. The alternative to enumerate the files and their inodes had historically been to open each file, query for its inode, and close the file, which is much much much slower and cannot even be batched – hence "slow mode" and "fast mode". I also have to assume that the inode is now used in Win32 as well, other than for the purpose of backups (BackupRead allowed you to retrieve the inode. In fact, BackupRead + BackupWrite used to be the only legal – if a bit roundabout – way to hard-link files before the official API)

    =====

    Information about *these* structures and constants is public and has been public for years now. Before it was, you could rely on the work of the likes of Gary Nebbett and Bo Branten, as all non-bigots like me did, proficiently. If you choose to be a "purist", insisting that those structures are part of the CIFS "protocol" (rather than the mere serialization of Windows I/O constructs they actually are) and should be documented as such, that’s your own damn choice to be a stubborn idiot

    =====

    So, what was the bug about? a stupid, stupid oversight. A goddamn typo. The "find next" operation returned an error code (STATUS_INVALID_INFO_CLASS, meaning FileIdBothDirectoryInformation isn’t supported) that was only legal for "find first" (the "checking the flag" Raymond mentioned)

    Whoever implemented FileIdBothDirectoryInformation (which was – ahem, "open protocols", ahem – *inexplicably* implemented to begin with) forgot to add a case: to a switch(). A stupid copy-paste error. An honest mistake

    Please stop blowing this out of proportion

  35. vince says:

    > Many companies have very high latency links; a half
    > second is not unreasonable. The latency doesn’t
    > depend on whether you are talking to a Windows XP
    > workstation, a NAS box, a Windows 2000 server, or a
    > Novell server. You don’t want to lose the benefits
    > of fast mode when talking to a server that supports
    > it.

    Well yes, but since it has been years and only now is Vista turning on this mode, it can’t be that much faster.

    I’m an engineer… give me real world numbers, not worst case hand-waving estimates.

    Say a person is copying 100 power-point files to the Z: drive.  Say you run some sort of disk benchmark.

    Say you are doing a large compile of source off on a network drive.  Give us numbers for stuff like that.

    From your previous example it almost sounded like the "fast" mode was added as a workaround to poor Explorer behavior, rather than because it really sped up the common case.

    The sad point is, probably none of us posting here could post these benchmarks if we wanted to, it would probably void the EULA.

  36. PaulJBis says:

    And that quote, vince, would be relevant if the original bug had really been caused by lack of knowledge of CIFS specs, instead of by a Samba developer forgetting to add a case to a "switch", *as Jeremy Allison himself mentioned*.

    Yes, I know, I know, an OSS developer making a mistake on his own. Gosh, is that even possible?

  37. vince: There is no real-world data for this specific issue because no beta version of Vista ever had "slow mode inode queries". When inode queries were added, they were added the fast way.

    I spent a half hour looking for something similar (you owe me a half hour of my life). One company computed that the extra Explorer network accesses on their network with 43ms latency was costing them US$35 million/year.

  38. Norman Diamond says:

    You don’t need additional bits in the protocol, or version number, or anything like that.

    The registry has enough space for users to use a Control Panel applet to store the usernames and passwords to be used in accessing servers.  There is room in the registry to add 2 bits for each of these servers to remember if the user said to enable fast mode or to disable fast mode or didn’t say, and another 2 bits to remember if you’ve detected success or detected failure or not detected yet.  As for what the default should be when the user didn’t say and you didn’t detect yet whether the server has this bug.  I just know that there’s room to store what you need to store.

    As for the badness of displaying error message boxes on unattended systems, well, don’t display an error message box when Windows Explorer isn’t browsing the affected server.  If Windows Explorer is open and browsing then it’s slightly safer to assume that someone intends to attend.

    Now, is it really possible to detect bugs, automatically store information in the registry, and publish articles telling users how to make adjustments?  In your favourite version of Windows, how many times did I open regedit and delete a NOIDE flag?  How did I know that this was the way to undo a decision that Windows had made when Windows detected something that it didn’t tell me about, but when I felt I had fixed the problem?  The answer is yes it’s possible, and SOME Knowledge Base articles are properly written and properly findable.

  39. Kuwanger says:

    "Would you rather deal with situations like this?

    http://news.zdnet.com/2100-3513_22-6061491.html

    James"

    Although not directed towards me, let me put in my own two cents on the subject of proprietary vs open drivers in an open OS.  One can actually draw something of an analogy with the current situation here.  The main problem, of course, is that samba had a bug in it.  However, to correct this problem is difficult because hardware wasn’t designed to be upgraded.  By analogy, many drivers have problems in them (bugs) and it’s very difficult to correct them because they’re proprietary.  The end result is that there tends to be no clean solution.

    At some point it becomes better to simply ignore the original hardware/drivers and start over with new hardware/drivers.  Now, clearly not all hardware/drivers are buggy, but I think it’d be silly to claim most aren’t the case.  But so long as there’s a simple means to update and fix those problems, it’s no longer a huge hurdle to overcome nor much reason to dump the hardware/drivers, that surely you’ve already invested a good bit of effort into, for something new that you will also have to invest greatly into.

    It’s for this reason that I’m interested primarily into buying things that are open.  While the open version might be buggier now, I won’t have to give up my investment at any point unless *I* choose to.  In the long term, it will probably cost me less to go with more open systems.  This is why things like DRMed hardware to only run signed binaries or ROM images that can’t be upgraded are bad.

    Of course, if making drivers and ROMs work well was not actually an area that was most prone to cost cutting (as they’re a necessity, not money drivers), perhaps they’d be less prone to be buggy and I wouldn’t have to invest much other than money into them (my time is probably more valuable, considering how much time is truly worth to anyone).  Until then, it doesn’t make sense to me to accept non-open software into things that they’ve such little vested interest in retaining proprietariness over.  After all, I’m buying the physical 3D video card or the physical NAS; being obsessed enough about the software end to keep it closed ignores that the software is useless without the hardware and people are buying specific products chiefly because of said good hardware.

  40. Jorge Coelho says:

    In that Newsgroup post mentioned by Sean, Raymond said: "You can say ‘It’s a bug in the application’ as many times as you like to the customer; the customer won’t believe you."

    Well, he might believe you, but deep inside he will really be thinking ‘So what? It’s Windows that crashes, not the other app, so _you_ fix it!’.

    I have a similar problem: my applications need to browse file system and virtual folders. Suddenly some customers started complaining that the software was crashing when trying to browse the Windows Control Panel.

    This usually indicates a corrupt control panel applet, but, in this particular case, Explorer itself seemed perfectly happy with it, instead of also crashing like it does (used to do?) with corrupt cpl files.

    After many hours trying to track down the problem, I finally found the pattern: every system with the problem was running a Control Panel applet from a very popular vendor of web cams, keyboards, mice, etc…

    From there, figuring out what was happening was relatively quick: when enumerating items in the Control Panel using the IShellFolder interface (isfFolder.EnumObjects, isEnumIds.Next, isfFolder.GetDisplayNameOf, etc…) at *some point* the vendor’s Control Panel applet corrupts my stack.

    Everything runs fine UNTIL my enumeration routine tries to return to its caller – at that point the application just crashes (it doesn’t even report an error, just disappears) because the return address in the stack has been blasted to oblivion.

    I’ve tried a couple of other applications (not mine) and they also crash when trying to display the contents of the Control Panel, so I’m sure it’s not a problem with my code.

    Now the big problem: since Explorer is able to defend itself from this kind of stack corruption and doesn’t crash, who do you think the user is going to blame? Worse, since Windows is apparently immune to this problem, the vendor has absolutely no incentive to fix this bug in their control panel applet.

    Because this is a very popular hardware vendor (which means a lot of my existing and potential customers will be running their buggy Control Panel applet), I tried contacting them through normal support channels, and, needless to say, I got nowhere – but, even if I had and the vendor fixed the problem, there would still be a lot of buggy old versions of that control panel applet floating around.

    With my software being written in Visual Basic, I currently haven’t got a clue on how to protect my stack from rogue Control Panel DLLs. So at least for the moment, I’m stuck.

    If only Windows, instead of ‘pretending’ the problem isn’t there with some clever hack, crashed as my applications and the others do, then the vendor would have found and fixed the problem long before it started shipping its buggy Control Panel applet!

  41. Jonathan Wilson says:

    Here is how I would deal with the original bug that started this series.

    From what I can tell, what happens is this:

    1. Some program asks Windows to get the directory listing for a network drive.

    2. Windows asks the (buggy) server if it supports fast mode and the server says yes.

    3. Windows sends a fast mode query to the server.

    4. The server returns some records.

    5. Windows returns something to the application.

    6. The application iterates: it asks Windows for the next record, processes it, asks for the next record, processes that, and so on.

    7. This continues until a request is made and Windows suddenly says "I have run out of records". Presumably at this point (before it returns to the caller) Windows will be able to detect that it is talking to the buggy server.

    Also, I am assuming Windows knows how many records it has returned and where in the list it is.

    The fix is, once Windows detects the buggy server, to request the records again in slow mode. Once it has the records, it can skip past the records it has already returned to the application. Then, it can return the item the application would have been given had the server not had the bug. And then things can continue.
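    The retry-and-skip idea above can be sketched roughly as follows. This is only an illustration, not how Windows works: `list_directory`, `BuggyServerError`, and the two fake fetch functions are invented names, and the sketch assumes the server returns records in the same order in both modes, which the proposal depends on.

```python
class BuggyServerError(Exception):
    """Hypothetical error raised when a fast-mode continuation fails."""


def list_directory(fetch_fast, fetch_slow):
    """Yield each record once, restarting in slow mode if fast mode fails.

    fetch_fast and fetch_slow are hypothetical callables returning an
    iterator of records. Both are assumed to use the same ordering.
    """
    returned = 0  # number of records already handed to the application
    try:
        for record in fetch_fast():
            returned += 1
            yield record
        return
    except BuggyServerError:
        pass  # fall through and repeat the whole query in slow mode
    for index, record in enumerate(fetch_slow()):
        if index >= returned:  # skip what the application already saw
            yield record


# Toy stand-ins for a server that fails after the first 128 records.
FILES = [f"file{i}" for i in range(300)]


def fake_fast():
    for name in FILES[:128]:
        yield name
    raise BuggyServerError("continuation rejected")


def fake_slow():
    return iter(FILES)
```

    With these stand-ins, `list(list_directory(fake_fast, fake_slow))` yields all 300 names exactly once. If the directory changes between the two queries, or the ordering differs between modes, the skip count no longer lines up.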

  42. Here we go again….

    KJK::Hyperion wrote :

    "The Samba people are kept updated about all such changes to the protocol by none other than Microsoft."

    Not true anymore. We used to be quite close with SMB development (I still happily have my "Microsoft Confidential" copy of the NT SMB specs here in front of me) but that isn’t true anymore. Others can speculate as to why this is.

    Joe Butler wrote :

    "It was also pointed out by someone else, that the 3 minutes or so end-to-end fix time seemed suspiciously fast, if it included testing, etc.  Read between the lines of that comment and it might show the real reason such errors creep in. "

    Oh for heaven’s sake. Have you never looked at a bug report and realized *exactly* what your cut-and-paste error was ? That was the 3 minutes to closure. It was such an obvious bug reported by a long-time collaborator that I trusted him to re-open the bug if it didn’t fix his problem. Later an extension to smbtorture was added to prevent regressions.

    Here’s a link to the description of the problem :

    https://bugzilla.samba.org/show_bug.cgi?id=3526

    and here (again) is the fix tridge proposed for Vista. Works fine, just needs someone to implement it.

    From tridge:

    "If we had run across the error you described (INVALID_LEVEL from a continue) then we would have added a bit flag on the current connection structure to mark this connection so it won’t use that level in future, then repeat the search using a different level. That means you would get one useless search on the network with each connection to a buggy server, but no impact against non-buggy servers and no user observable effects. The denial of service attack you mention with this type of fix doesn’t happen as the extra bit is per-connection, not long lived (trying to remember long lived info about specific servers is a losing game)."

    Jeremy.
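    A minimal sketch of the per-connection flag described in the quote, under stated assumptions: `Connection`, `InvalidLevel`, `BuggyServer`, and the `search(path, level)` signature are all invented for illustration and are not Samba's or Windows' actual code.

```python
class InvalidLevel(Exception):
    """Stand-in for an INVALID_LEVEL error returned by the server."""


class Connection:
    """One client connection; the flag lives and dies with it."""

    def __init__(self, server):
        self.server = server            # hypothetical object with search()
        self.fast_level_broken = False  # the per-connection bit flag

    def search(self, path):
        if not self.fast_level_broken:
            try:
                return self.server.search(path, level="fast")
            except InvalidLevel:
                # Remember for the rest of this connection only, then retry.
                self.fast_level_broken = True
        return self.server.search(path, level="slow")


class BuggyServer:
    """Toy server that rejects the fast level and records each request."""

    def __init__(self):
        self.requests = []

    def search(self, path, level):
        self.requests.append(level)
        if level == "fast":
            raise InvalidLevel(path)
        return ["a.txt", "b.txt"]
```

    Two searches on one `Connection` to a `BuggyServer` produce the request sequence fast, slow, slow: one wasted search per connection, as the quote says. A new `Connection` starts with the flag cleared, so nothing about the server is remembered long term.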

  43. bramster says:

    Raymond. . .

    you are one smart dude!

  44. AC says:

    I think the solution is clear. The automotive industry invented vehicle identification numbers. We just need Software Identification Numbers. Every api from hereon out will take an additional SIN parameter, which can be defined as follows:

    #include <windows.h>  /* GUID, SYSTEMTIME, DWORD */

    struct SoftwareIdentificationNumber
    {
        GUID guid;
        int version_major;
        int version_minor;
        int version_build;
        int version_revision;
        SYSTEMTIME releasedate;
        SYSTEMTIME builddate;
        DWORD vendorID;
        DWORD productID;
        DWORD revisionID;
        DWORD reserved1;
        DWORD reserved2;
        DWORD reserved3;
    };

    This way, instead of having this discussion we could’ve simply checked if the vendor/product/revision indicated that it was a particular type of buggy server, verified that the build date was before the bug was fixed, and decided to use the slow mode instead. Or we could decide that buggy implementations of undocumented protocols are not to be supported (doubly so for documented protocols), put out a KB article, and be done with it…
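    The lookup this comment has in mind might look something like the sketch below. Everything in it is hypothetical: the `KNOWN_BUGGY` table, its IDs and date, and the `choose_mode` function are invented to show the shape of the check, not taken from any real product.

```python
from datetime import date

# Hypothetical table of known-buggy servers: (vendor, product, revision)
# mapped to the build date on which the bug was fixed. Values are invented.
KNOWN_BUGGY = {
    (0x1234, 0x0001, 3): date(2006, 3, 1),
}


def choose_mode(vendor_id, product_id, revision_id, build_date):
    """Return "slow" for builds known to be buggy, otherwise "fast"."""
    fixed_on = KNOWN_BUGGY.get((vendor_id, product_id, revision_id))
    if fixed_on is not None and build_date < fixed_on:
        return "slow"
    return "fast"
```

    Note that the table can only list bugs, and fix dates, that were known when the client shipped.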

  45. Dean Harding says:

    Jorge: It’s possible that Windows actually isn’t doing anything specific to work around the problem, but just that their stack *happens* to be set up in such a way that the problem doesn’t appear. It’s quite possible, since I presume Logit- uh, the "vendor" only tested with Explorer.

  46. Amos Houndsbreath says:

    I think SINs are part of the solution, but won’t SAMBA just claim to be Windows XP or whatever the ‘right’ answer is?

    I think you need to have some cryptographically secure way to see if it’s a trusted implementation – i.e. one from inside Microsoft or a licensed implementation based on MS code and tested in the WHQL labs – or an untrusted one. Put the untrusted ones in legacy mode, i.e. slow (XP speed) but reliable. The trusted ones will get the best possible performance based on the version level they report.

    It’s just like drivers need to be signed before they can be run, and the signing is tied to automatic testing. Hell if Microsoft are feeling generous (or scared of lawyers) they could offer to sign third party SMB implementations for a nominal fee after they’ve been tested.

  47. Paul says:

    Jonathan Wilson: The problem Raymond is trying to address is that currently there is no way for a server to identify itself as supporting fast mode correctly, therefore no point in asking. If such a mechanism were introduced, then none of the existing servers, even the ones that did work, would report true until they had been updated. And if a bug is present that prevents them from working, they may still report true anyway…

    Your point 7 is the original problem (as far as I can see) – the server says it has no more records to return despite the existence of further directory entries. This doesn’t seem to offer a way for windows to magically detect the existence of a buggy server.

  48. cheong00 says:

    Just a thought: while we do not have any indicator for systems that DON’T work, we have a pretty reliable indicator for machines that DO work (the "operating system" information for the domain machine account, which Samba servers don’t fill in).

    Perhaps just popping up a warning when running fast mode on systems that do not return "operating system" information will do.

    And then just telling the Samba developers to add that information (perhaps "samba X.X.XX") when registering the machine account will be sufficient for advertising the "new version which doesn’t have this bug", and it helps to identify the version if other bugs are present too.

  49. Ilya says:

    From what I understand, while you cannot distinguish various versions of SAMBA based on their handshake, you can distinguish SAMBA and Windows. If so, why not assume all Windows servers are Fast-Mode compatible and all SAMBA servers are not, until they provide an extended handshake (e.g. identify by the name of "SAMBA-Foobar" or append a version number)?

    As to the issue of an arrogant programmer passing 0xFFFFFFFF in the compatibility flags field, that’s the reason why I’d make it represent versions instead and make the Vista client aware of quirks in certain versions of certain products. It bloats the client a bit, but at least you remain in control.

  50. BryanK says:

    Paul: No, it doesn’t say "it has no more records to return despite the existence of further directory entries".  Instead, it returns an error to the second packet that it sees.

    From what I can tell, this is how it works (if this isn’t how it works, it should be how it works; clean separation of code and all that):

    User code calls NtQueryDirectoryFile with some set of parameters to indicate "I just want the first one".  The OS asks the redirector for one file.  The redirector asks the server for 128 files in one batch, and sticks them all into a buffer.  It then returns the first result out of the buffer.  Then the program asks for the next result, and the next, and so on, with the first 128 results in total coming out of the redirector’s buffer.

    When the program asks for the 129th result, the OS asks the redirector, just like always.  But instead of returning stuff from the buffer, the redirector goes back to the server with an SMBfindnext packet — and the response from the server to this second packet is an error.

    (The handler in Samba for the SMBfindfirst packet type had support for the information level that Vista is requesting, but the handler for the SMBfindnext packet type did not.)

    Ilya, AC: Making a database of "broken versions" has problems other than bloat.  What is Windows supposed to do if no fixed version exists when it’s released?  Windows can’t see into the future; it doesn’t know that version Y will work when version Y doesn’t exist yet.  Sure, patches may work for that, but do you really want to be patching app-compat databases for the rest of your OS’s life?  And who’s going to force people to provide a correct SIN structure when calling APIs, versus just copying the SIN structure that worked in the past for some other unrelated program?  Or some program made by some other company even?

    Rule Number One of JavaScript / DOM programming in the browser:  NEVER base your behavior on user-agent strings or other methods of browser detection.  JS supports detection of whether certain methods exist; use that type of feature detection instead of browser detection.  (Well, whenever possible at least.  In some cases, you can’t tell by the existence of a property or method; the bug you’re seeing is that the function you’re calling doesn’t do what the DOM standard says it should do.  But that’s rare.)

    In this case, the redirector can detect the error by sending an SMBfindfirst / SMBfindnext pair on the first request from a program, instead of only sending an SMBfindfirst.  (If it can’t because of the architecture of the OS / redirector interface, then that architecture is bad; there isn’t a clean-enough separation between components.)  Then the redirector would not return anything until the SMBfindnext succeeds.  Yes, this is one more round trip (and even one more for broken servers, but they should be rare), but it’s the only good way to detect this particular bug.
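    As a rough sketch of the probing idea in this comment: issue the first/next pair up front and only then hand anything to the caller. The names here (`open_listing`, `ServerError`, `ToyServer`, and the `find_first` / `find_next` methods) are hypothetical stand-ins, not the real redirector interface or SMB packet handling.

```python
class ServerError(Exception):
    """Stand-in for the error a buggy server returns to the second packet."""


def open_listing(server, path):
    """Probe with a first/next pair before returning any record.

    server is a hypothetical object: find_first(path, level) returns
    (handle, names) and find_next(handle, level) returns more names.
    """
    try:
        handle, first = server.find_first(path, "fast")
        second = server.find_next(handle, "fast")
        return "fast", first + second
    except ServerError:
        # Nothing has reached the application yet, so restarting is safe.
        handle, first = server.find_first(path, "slow")
        second = server.find_next(handle, "slow")
        return "slow", first + second


class ToyServer:
    """Toy server whose find_next may not support the fast level."""

    def __init__(self, next_supports_fast):
        self.next_supports_fast = next_supports_fast

    def find_first(self, path, level):
        return 1, ["a", "b"]

    def find_next(self, handle, level):
        if level == "fast" and not self.next_supports_fast:
            raise ServerError("unsupported information level")
        return ["c"]
```

    The cost is the extra round trip mentioned above, paid on every listing, in exchange for never having to undo results that were already returned.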

  51. vince says:

    > I spent a half hour looking for something similar (you owe me a half hour of my life).

    Well considering how many hours of my life have been wasted dealing with MS software, I think you still have the advantage on me…

    > One company computed that the extra Explorer network accesses on their network with 43ms latency was costing them US$35 million/year.

    I wonder how much that compares to the cost they had to spend on anti-virus and anti-malware due to other bugs with MS software?  Total Cost of Ownership is a complicated equation it seems.

    In any case, I re-read the posts making sure I didn’t miss more detailed bandwidth info, and I’ve changed my mind and decided this series is useful after all, if only because it is eye-opening to see how many people posting here have a lot to learn about systems programming.

    Problems like this happen often. One example with Linux was the ECN bit that helped reduce network congestion… only to turn out that most Cisco routers couldn’t handle it and would just throw out packets with the bit set.

    It takes a lot of patience and time to get bugs like this fixed properly.

  52. vince: Was it me personally who cost you those hours? If not, then don’t take it out on me.

  53. peterchen says:

    raymond: re "You upgrade to Vista and your performance drops." –

    But the applications that use a fast protocol run "happily" only until they hit a location with >128 files (?!). So these old apps don’t really work very well  – maybe it’s this tradeoff (working fast for most users but broken for a few vs. working slow for all until everything is upgraded) you shun.

    (sorry for being so insisting.)

    BTW.

    I’ve bookmarked the original post under "Why Windows is so complicated" – and I think Microsoft desperately needs a few more guys like Raymond. Software problems don’t end at checkin.

  54. peterchen: If you change the protocol to require all servers to set the "I am not buggy" bit as proposed here, then when you connect to an old non-buggy server (i.e., one that doesn’t set the "I am not buggy" bit because it didn’t need to before), Vista will run in slow mode, whereas XP will run in fast mode.
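    To make the regression concrete, here is a toy comparison of the two client rules. The function names and the dictionary keys (`supports_fast`, `not_buggy_bit`) are invented for this sketch and do not correspond to real protocol fields.

```python
def xp_client_mode(server):
    """Existing rule: use fast mode whenever the server advertises it."""
    return "fast" if server["supports_fast"] else "slow"


def proposed_client_mode(server):
    """Proposed rule: fast mode only if the new "not buggy" bit is set."""
    if server["supports_fast"] and server.get("not_buggy_bit", False):
        return "fast"
    return "slow"


# Hypothetical servers; the keys are assumptions made for illustration.
OLD_CORRECT_SERVER = {"supports_fast": True}  # predates the new bit
UPDATED_SERVER = {"supports_fast": True, "not_buggy_bit": True}
```

    Against `OLD_CORRECT_SERVER`, the existing rule picks fast mode while the proposed rule picks slow mode, even though that server never had the bug.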

  55. Ilya Birman says:

    > I should give up on this series

    Raymond, please don’t, this is very interesting :-) What you could do, however, is give links to previous articles like "make sure you read this, this and this before continuing".

  56. peterchen: Most servers support fast mode fine. That’s why the problem was found only recently. If no servers supported fast mode, then the bug would’ve been found as soon as Explorer switched to using fast mode!

  57. There’s not really a chance to fix it properly afterwards. The only thing you can do is act at the beginning: when designing an API for external parties, pass a sort of user-agent string in the handshake. A newer client can then, if it wants to, take care of a "buggy" version or program.

  58. vince says:

    > vince: Was it me personally who cost you those hours? If not, then don’t take it out on me.

    Was it my fault you didn’t have actual numbers on hand for the performance slowdown this causes?

    I’m still curious about the numbers, as I’m still worried this is a case of premature optimization.

    I’m willing to believe this is a huge problem, I just would like it quantified a bit more.  It doesn’t look like anyone has such numbers available, so I’ll just leave it at that.

  59. Vince: I’m sorry, I didn’t realize this was a peer-reviewed journal. I thought it was a blog.

  60. 8 says:

    James Summerlin: Just running a proprietary operating system is NOT a solution for proprietary drivers. And I do not worry about having a proprietary nVidia driver for more than a couple of minutes weekly/monthly or so. There is nothing in the GNU GPL preventing me from using an nVidia or ATi driver also. It also does not prevent mere aggregation of those drivers on a given storage medium. It *does* prevent GNU/Linux distributors from shipping a kernel and Xorg with those drivers if they wanted to, provided nVidia and ATi agree as well. ZDnet covered an issue in great detail; perhaps that made the problem appear to you like a big thing for all GNU/Linux users and developers, but in reality it isn’t. Exaggerated articles draw more attention, that’s a simple journalistic fact. The problem exists, sure, but you don’t have to worry about it. From personal experience, I find Linux better because I don’t have to install a driver for every PCI card I throw at it, only video acceleration.

  61. peterchen says:

    So there *are* servers where fast mode is working correctly? *Now* I understand your problem :)

    I simply assumed the broken server was the first time fast mode ever worked (more or less, in which case "now really works" is protocol change enough to warrant a flag/version check).  –  Sorry I missed this.

    > I should give up on this series

    please don’t. My vote now goes for "Requery in slow mode if the error occurs", and I’m looking forward to you picking it apart.

  62. Heh? says:

    > Vista will run in slow mode, whereas XP will run in fast mode

    You mean Vista will get the correct results and XP quite possibly won’t? Is Samba support so deeply integrated into XP that it couldn’t be updated through Windows Update?

  63. "You mean Vista will get the correct results and XP quite possibly won’t?"

    That’s not what the customer sees. The customer sees their program that worked just fine on XP (because they used a non-buggy server) slow to a crawl on Vista.

  64. kbiel says:

    Joe Butler>> Remove all compatibility fixes in your OS (the ones that insulate you from bugs in other software/hardware).  Replace them all with ‘non-scary’ dialogs.

    Thanks for the strawman, but I’ll pass.  There are obviously bugs that the user never needs to know existed and can silently be corrected.  On the other hand, Raymond has set this up as a bug that will affect a user and cannot be easily corrected silently.

    paul>> Putting the user in charge only works if there is a user! Displaying a confirmation dialog assumes that if a user existed they would know what decision to take…

    You bring up a valid point I hadn’t considered.  Then again MS is not averse to taking the dialog route and just pumping the dialog message into the event log.  Check your application event log and you’re likely to find something similar:

    Event Type: Error
    Event Source: Userenv
    Event Category: None
    Event ID: 1058
    Date: 4/21/2006
    Time: 10:42:41 AM
    User: XXXXXX\kbiel
    Computer: XXXXXXXXXXXXXX
    Description:
    Windows cannot access the file gpt.ini for GPO CN={XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX},CN=Policies,CN=System,DC=corp,DC=XXXXXXXXXXXX,DC=XXX. The file must be present at the location <\\XXXXXXXXXXX\SysVol\XXXXXXXXXXXXX\Policies\{XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX}\gpt.ini>. (The network path was not found. ). Group Policy processing aborted.

    For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

  65. David Conrad says:

    KJK::Hyperion:

    > "Fast mode" isn’t even an extension of the protocol, it’s just a new function code for the "find first" and "find next" operations, with its associated new data structure

    If new features aren’t an extension of the protocol, what would constitute an extension of the protocol, in your view?

    Vince: Raymond did look for some numbers for you. On the other hand, he didn’t find any, so that’s not really worth much. I propose the following solution: in payment, you Fedex him a shiny copper penny.

    But please don’t Fedex it in "fast mode"; fast mode is buggy in the current version of Fedex.

    Now can we ring down the curtain on this reenactment of The Rape of the Lock?

  66. James Risto says:

    Sir, I am confused. Your original article said "XP always used the slow query". Yet your comments here say "upgrade to Vista and your performance drops". I don’t get it.

    Anyway, seems like a swamp. If we don’t want Vista to get blamed for that buggy-fast-mode server, then we do nothing and continue to use slow mode. Maybe a KB article for sysadmins to do something different if they know their environment is ok.

  67. James Risto: This article was a hypothetical scenario. "Consider a hypothetical program that uses fast mode on Windows XP." That program isn’t Explorer; it’s some other program that uses fast mode on XP.

  68. ender says:

    How does one enable the fast mode on Windows XP/2003? I’m regularly accessing a fileserver which takes almost half a minute to load certain directories, and wouldn’t mind trying a speedup :)

  69. "Fast mode" is already supported in Windows XP. You can read the Samba bug report for details on how to try it out. But I doubt it’ll help your scenario since you probably aren’t asking for the inodes.

  70. dorin says:

    I am sorry, but isn’t the standard solution to such a problem to fall back to the previous version when an error is detected (in this case for the duration of the session, or hopefully just for that unlucky connection)?

    Faulty drivers, hardware and software will always be around, and the idea of dealing with them by specially hacking the original code and putting such hacks into dis is no smaller a mistake than the original bugs, for example because it can give unexpected results with other software…

    Another question specifically to Hyperion:

    Why do you think programmers on the Samba team should know IFS? Or rather, why should they read Windows documentation at all?

    As I see it:

    The Samba team boldly declares that they provide SMB/CIFS services. Nothing more, and it is definitely beyond their scope to emulate the CIFS of whatever Vista or XP does.

    In my opinion it would be wise of MS to stop this "hacking" attitude for once (using things that work and slowly "documenting" them somewhere, somehow), and start working as normal engineers do in at least some big companies. For example, to think before the final design about how things can go wrong.

  71. KJK::Hyperion says:

    vince: how many times do I have to repeat it? those structures aren’t part of CIFS, at all. They are part of the Windows IFS (Installable FileSystem) API, which is now officially and 100% public (and used to be mostly public – through books, websites, Bo Branten’s ntifs.h project, etc. – for ages). I have known about the new enumeration types for years now

    And somehow, *inexplicably*, *unbelievably*, the Samba team has known about them for quite some time too!

    David: it’s not my fault if you don’t understand how the CIFS protocol works

    Roughly, the protocol has a packet type for each Windows I/O request. Most I/O requests have a unique way of passing parameters, so every new request necessitates a new packet type with a unique encoding. But many many many operations fall under the umbrella of very few requests – for example, all file enumeration operations (about a dozen) fall under a single request code (which also serves "query attributes" requests) and share the same parameters. CIFS is a file sharing protocol, so the expectation is that such parameters are passed unchanged to the local filesystem underlying a network share.

    A "change of the protocol" would be the introduction of a new request/packet type, which will have a unique and unseen encoding of inline parameters (notable exceptions are output data and status code, which are returned in the same identical way for all requests). And even in that case, you can fail the request in a standard way (with a STATUS_INVALID_DEVICE_REQUEST error code).

    In fact, you have to design a CIFS server as a thin serialization layer over local services and drivers, because that’s the closest you can get to the real thing; then your underlying "driver" can handle (and fail) the requests like any standard Windows driver, in what is a publicly documented behavior. Any other design can only be motivated by delusional concepts of "open protocols".

  72. KJK::Hyperion says:

    dorin: because. CIFS is designed as a very bare network serialization of Windows security (SSPI) and I/O (IFS) APIs. CIFS is not an open standard, CIFS is a matter of fact

    See also the Single UNIX Specification: it purports to be an open standard, and you sure can implement it in a completely naive way, from scratch, clean-room, etc… but if you follow the original UNIX design, it’s so much easier

    As for the "hacking" attitude, thinking "before", etc. IFS is probably the one and only I/O API that was designed since the beginning with native network filesystem capabilities in mind – "networked" attribute for files, "opened from network redirector" flag, "fast network query information" operation, etc. not to mention DMA support in the cache manager, whose entire motivation lies in efficiently transferring data from a filesystem to a NIC

Comments are closed.


*DISCLAIMER: I DO NOT OWN THIS CONTENT. If you are the owner and would like it removed, please contact me. The content herein is an archived reproduction of entries from Raymond Chen's "Old New Thing" Blog (most recent link is here). It may have slight formatting modifications for consistency and to improve readability.
