Date: February 1, 2007 / year-entry #36
Tags: other
Orig Link: https://blogs.msdn.microsoft.com/oldnewthing/20070201-03/?p=28193
Comments: 60
Summary: I post this entry with great reluctance, because I can feel the heat from the pilot lights of the flame throwers all the way from here. The struggle with the network interoperability problem continued for several months after I brought up the topic. In that time, a significant number of network attached storage devices were...
I post this entry with great reluctance, because I can feel the heat from the pilot lights of the flame throwers all the way from here.

The struggle with the network interoperability problem continued for several months after I brought up the topic. In that time, a significant number of network attached storage devices were found that did not implement "fast mode" queries correctly. (Buried in this query are some of them; there are others.) Some of them were Samba-based devices whose vendors did not have an upgrade available that fixed the bug. But many of them used custom implementations of CIFS; consequently, any Samba-specific solution would not have helped those devices. (Most of the auto-detection suggestions people proposed addressed only the Samba scenario. Those non-Samba devices would still not have worked.) Even worse, most of the devices are low-cost solutions which aren't firmware-upgradable and have no vendor support. Some of the reports came from people running fully-patched, well-known Linux distributions. So much for the fix being in all the new commercially supported offerings over the next couple of months.

Furthermore, those buggy non-Samba implementations mishandled fast mode queries in different ways. For example, one of them I was asked to look at didn't return any error codes at all. It just returned garbage data (most noticeably, corrupting the file name by deleting the first five characters). How do you detect that this has happened? If the server reports "I have a file called e.txt", is Windows supposed to say, "Oh, I don't think so. I bet you're one of those buggy servers that chops off the first five letters of file names and that you really meant to say (scrunches forehead in concentration) readme.txt"? What if you really had a file called e.txt? What if the server said, "This directory has two files, 1.txt and 2.txt"? Is this a buggy server? Maybe the files are really abcde1.txt and defgh2.txt, or maybe the server wasn't lying and the files really are 1.txt and 2.txt.

One device simply crashed if asked to perform a fast mode query. Another wedged up and had to be reset. "Oh, looks like somebody brought their Vista laptop from home and plugged it into the corporate network. Our document server crashed again."

Given the much broader ways that servers mishandled fast mode queries, any attempt at auto-detecting them will necessarily be incomplete and fail to detect broken servers. This is fundamentally the case for servers which return perfectly formed, but incorrect, data. And even if the detection were perfect, if it left the server in a crashed or hung state, that wouldn't be much consolation.

Given this new information, the solution that was settled on was simply to stop using "fast mode" queries for anything other than local devices. The most popular file system drivers for local devices (NTFS, FAT, CDFS, UDF) are all under Microsoft's control, and they have already been tested with fast mode queries. Such is the sad but all-too-true cost of interoperability and compatibility.

(To address other minor points: It's not the case that the Vista developers "knew the [fast mode query] would break Samba-based devices since late 2005". The fast mode query was added, and the incompatibility with Samba wasn't discovered until March 2006. "Why didn't you notify the Samba team?" Because by the time we found the problem, they had already fixed it.)
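[Editor's illustration] To make the final decision concrete, here is a rough user-mode sketch of the policy described above. This is an illustration under assumptions, not Vista's actual redirector code: the real decision is made in kernel mode based on which file system driver owns the volume, and GetDriveTypeW is used here only as a convenient user-mode stand-in for "local versus remote."

```cpp
#include <windows.h>
#include <cstdio>

// Sketch of the policy: only consider the "fast mode" directory query for
// local volumes (NTFS, FAT, CDFS, UDF); anything reached through the
// network redirector always gets the classic enumeration, with no
// detection heuristics attempted.
static bool WouldUseFastModeQuery(const wchar_t* rootPath)
{
    switch (GetDriveTypeW(rootPath)) {
    case DRIVE_FIXED:       // local hard disks (NTFS, FAT)
    case DRIVE_REMOVABLE:
    case DRIVE_CDROM:       // CDFS, UDF
        return true;        // file systems Microsoft controls and has tested
    case DRIVE_REMOTE:      // SMB/CIFS redirector: too many broken servers
    default:
        return false;
    }
}

int wmain(int argc, wchar_t** argv)
{
    const wchar_t* root = (argc > 1) ? argv[1] : L"C:\\";
    wprintf(L"%ls -> %ls enumeration\n", root,
            WouldUseFastModeQuery(root) ? L"fast mode" : L"classic");
    return 0;
}
```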
Comments (60)
Comments are closed.
I think the important question is what does it do for network devices? Does it revert to WinXP behavior or a slower version of Vista’s local device behavior?
Raymond, thanks for posting this message.
I stayed out of the original discussion but a lot of the comments struck me as being somewhat naive in terms of what is really out there and the procedures for getting such things fixed.
You have once more proved to me that, despite many problems, there are still many competent people at Microsoft (not that I really doubted that anyway), notwithstanding the many reports to the contrary (of which undoubtedly a lot are also truthful).
To end with a technical note, is it possible for an administrator to configure Vista in some way to use fast-mode queries with specific non-local devices (e.g. by IP address)?
Well, my question is, why did this problem occur in the first place and what could have prevented these buggy implementations from being buggy?
Rosyna,
There’s a link to the previous article. Did you read it?
A time machine. Raymond’s being a lazy SOB and still hasn’t coded that up.
> … what could have prevented these buggy implementations from being buggy?
The way to prevent buggy implementations is not to ship features that aren’t actually used. If a protocol contains functionality that isn’t widely exercised, people can get away with wrong implementations that may seem to work ‘just fine’ out in the wild. Until someone later tries to use that functionality, that is :-)
If the original protocol had been smaller and simpler, and as long as it included a version querying mechanism, Vista could have extended the protocol and been able to reliably query for devices that supported the new extension. But then you’re into a whole other world of pain, and people start to complain about shifting targets, constantly evolving ‘standards’, etc.
Perfect solutions tend not to exist in the real world.
I too am curious if there’s going to be any way to turn on fast-mode queries for (some/specific/all) network devices.
Shawn: And what if the protocol did include a version querying mechanism, and some subset of devices had buggy implementations of that mechanism?
I do feel sorry for the poor saps at Microsoft who have to try to cover up for hardware vendors that ship second-rate products with third-rate drivers. There’s no way to win, and I think the overall hardware/driver quality is actually dropping.
In the end you probably made the right decision here. No hacks, no unsolvable problems for users, and most users probably won’t care anyway since they’ve been using XP’s "slow" mode for years.
Rosyna,
First you must answer this question: What could have prevented stupid questions from being stupid?
> … what could have prevented these buggy implementations from being buggy?
A complete, easily accessible specification that contains an exhaustive list of test cases that any implementation must implement and that can be checked automatically?
You’ve just moved the bug to the version check & handler.
Have you seen how many installers (or other programs) can't properly figure out which version of Windows they're running on? Look at the remarks for Win32's GetVersion() API, which explain why MS has since added the version-checking APIs GetVersionEx() and VerifyVersionInfo():
======================
The GetVersionEx function was developed because many existing applications err when examining the packed DWORD value returned by GetVersion, transposing the major and minor version numbers. GetVersionEx forces applications to explicitly examine each element of version information. VerifyVersionInfo eliminates further potential for error by comparing the required system version with the current system version for you.
======================
Consequence for the future: Publish your internal standards in time and cooperate with other vendors (including Linux developers). This is in the best interest of your customers, too.
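[Editor's illustration] A minimal sketch of the transposition mistake the GetVersionEx remark describes, next to the GetVersionEx equivalent. This is my own example, not from the comment; it only shows the packed-DWORD decoding.

```cpp
#include <windows.h>
#include <cstdio>

// GetVersion packs the OS version into a DWORD: the low byte of the low
// word is the major version, the high byte of the low word is the minor.
int main()
{
    DWORD packed = GetVersion();

    DWORD major = LOBYTE(LOWORD(packed));            // correct decoding
    DWORD minor = HIBYTE(LOWORD(packed));

    DWORD transposedMajor = HIBYTE(LOWORD(packed));  // the classic mistake:
    DWORD transposedMinor = LOBYTE(LOWORD(packed));  // XP (5.1) becomes "1.5"

    printf("correct: %lu.%lu   transposed: %lu.%lu\n",
           major, minor, transposedMajor, transposedMinor);

    // GetVersionEx sidesteps the packing entirely with named fields.
    OSVERSIONINFO osvi = { sizeof(osvi) };
    if (GetVersionEx(&osvi)) {
        printf("GetVersionEx: %lu.%lu\n",
               osvi.dwMajorVersion, osvi.dwMinorVersion);
    }
    return 0;
}
```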
My question is, why isn’t there a field like HTTP’s User-Agent so you can tell whose implementation you’re working with and, if it’s known to be buggy, adjust accordingly?
This story adds to my suspicion that "standard documents" don’t work very well for software. Having some committee approved document that specifies a "standard" which other companies try (and often fail) to implement is not very efficient. A document is always going to contain some vagueness or ambiguity. If it did not, then it would be code. And even if the document is perfect, someone somewhere is going to screw up an implementation. In this model, all mistakes made by everyone combine together to erode away the functionality specified by the standard.
A more efficient model is if everyone just runs the same code. Any mistake made in implementation automatically becomes “the standard". A single workaround will work for everyone.
This is just another reason why there is such a large Windows monoculture. The software business is a natural monopoly.
Because if there were a user-agent field, three out of four bad implementors would leave it set at the value they found coming out of some other product they poked at.
Sadly, there is very little MS can do to deal with situations like these – the folks who make these cheap boxes don’t have a lot of patience for testing, and they just ship it if it seems to work.
The only thing MS might physically be able to do is to require devices to go through something like WHQL before they’ll interoperate with them. That, however, would put them in the position of gatekeeper to the marketplace for darn near everything. Which would get them yelled at and probably sued (I imagine such a move would appear to be an anti-trust violation).
If they provided an exact spec along with test cases, that would certainly help (and I bet the Samba team would love such a thing), but I also bet that most vendors wouldn't bother to run the tests. Or even read the spec thoroughly.
Sadly, when consumers demand cheap products, they get cheap products. No matter what the cost.
Somebody: If documented standards don’t work, then how are we posting to this blog? HTTP is a documented standard, as are both TCP and IP.
Now yes, for TCP and IP, most people run the original BSD stack (or something derived from it). But HTTP is not done that way; lots of people run Apache, and lots of people run IE, but those aren’t the only servers or clients out there by a long shot.
More on topic: Yeesh. ;-) Given this info, my opinion is changing on this issue. I figured it’d be easy enough to work around the issue before: send two half-size requests to the server in fast mode (inside FindFirstFile), and if the second fails, restart in slow mode. Unfortunately for MS, this makes buggy NASes even slower, but at least the (many?) non-buggy ones get rewarded. Oh, and do this checking inside the CIFS-specific filesystem driver, not inside any user-mode code at all, including FindNextFile / NtQueryDirectoryFile. The CIFS driver already batches up queries, so simply modifying its batching code has the potential to work.
But since other devices were failing in other ways, that screws up that entire solution… :-(
Why not introduce a "fast mode 2" query type that does the same thing as "fast mode", but is unrecognizable by these old devices and Samba versions?
Of course, this would lock out current versions of Samba and Windows as well, but it sounds like you’re already doing that anyway. Later, both could be patched to support "fast mode 2" queries.
"But HTTP is not done that way; lots of people run Apache, and lots of people run IE, but those aren’t the only servers or clients out there by a long shot."
Yes… and ironically this exact page doesn’t show correctly in Thunderbird’s RSS reader pane for me (the "menu" on the right overlaps the content).
Argh! And I just discovered that the Submit button isn’t working on my current (admittedly nightly) build of Firefox… resubmitting with IE.
What was your point again?
Samba "mostly" works…
"Hooray, there are three file enumerators in the shell. -Raymond"
TRUE, FALSE, FILE_NOT_FOUND?
It’s a real shame that so many problems are ascribed to Microsoft when, in reality, they stem from vendors that are on the platform.
I’m happy how, as Vista is being pushed into the marketplace, Microsoft seems to be getting tougher with respect to hardening drivers and giving people a good OBE.
Ray: You’re confusing levels of abstraction; neither RSS nor this page’s submit button are HTTP or TCP or IP. XML doctypes (or schemas, either way) are not the same thing as the protocols used to transmit them, and HTML/JS are not the same thing as the protocols used to transmit them either.
The protocols are working fine, it’s just that your user agent’s idea of what it should do when it gets a certain string of characters is different than the server’s idea of what the user agent should do. That’s not HTTP’s fault, it’s the fault of whoever decided to serve up that specific RSS file structure, or that specific JS code. Or your nightly browser build.
(I’m assuming your submit button issue is due to some kind of script that runs when you hit it. I don’t know, because there’s very little information in your post.)
BrianK:
HTTP, TCP, IP, etc. in isolation are useless. There has to be a client to transfer something meaningful. So in this case, because the standard specifies so little functionality, there is not much room to screw up. So interoperability will mostly work. But there are still problems with implementations. Does anyone remember SYN floods? The ping of death?
Where is the SMB specification published on Microsoft’s site anyway? I’d love to see what goes on in the fast mode query that could cause so many people to be confused. I assume it’s laid out pretty clear in the specification.
I’m a little surprised that you (that’s the collective you) didn’t add a new query command to the protocol and deprecate the old one due to its being implemented wrongly. The commands that Windows actually uses are rather more likely to be implemented correctly first time.
Just remember that anyone other than Microsoft implementing SMB/CIFS in an MS-compatible way is most likely doing it without any help or information from Microsoft (at least I think that's the case; it's certainly the case for the Samba people).
Why not just release the same implementation under a totally new API? This would give you and them a fresh start…
I agree that disabling FastMode for network devices is the right idea. It's been broken and made unusable.
Extend the protocol and add a "NewFastMode" mechanism (identical functionality to FastMode, but a different query).
Document it and share it with the Samba folks. Offer patches for Win2K/WinXP to support it (Microsoft is already doing this for other Vista features – why not this?).
As people upgrade their servers (WinXP & Samba), they share in the bounty that is Vista.
And I too wonder what the thoughts were on adding a ‘new fast’ query protocol.
Legolas, "A", and anyone else who proposed a "Fast Mode 2" option clearly didn’t check Raymond’s linked articles, where he already answered your question…
http://blogs.msdn.com/oldnewthing/archive/2006/04/19/578991.aspx
"Legolas, "A", and anyone else who proposed a "Fast Mode 2" option clearly didn’t check Raymond’s linked articles, where he already answered your question…"
Evidently that article was written before the decision was made to disable fast mode completely on network drives. He claims that switching to "fast mode 2" would result in slower performance because older implementations only support the original fast mode. But obviously the situation now is different: using "fast mode 2" in Explorer as opposed to not using fast mode at all could only improve performance. (Assuming, of course, they didn’t already come up with some alternate solution that performs just as well.)
Brent Royal-Gordon asks:
"why isn’t there a field like HTTP’s User-Agent so you can tell whose implementation you’re working with and, if it’s known to be buggy, adjust accordingly?"
If it really *was* like HTTP’s user-agent, then one side would be unnecessarily cautious in adjusting accordingly (as with web sites that check for IE and reject everything else) and the other side would have an "impersonation" switch (as with browsers that can lie about their identity to make those sites work).
Sadly, compatibility is a war. It’s best to assume that the other side is actually "hostile" and really is trying to break you, albeit for (in their opinion) legitimate reasons.
> why isn’t there a field like HTTP’s User-Agent so you can tell whose implementation you’re working with and, if it’s known to be buggy, adjust accordingly?
Yeah, User-Agent sniffing on the interweb has so far worked out to be *just great*!
I continue to love this story for its intriguing consequences…
So: In an interoperability world, if any large enough code base (one product, several products using one code base, even everyone copying the same example code) has a bug, then in practice everyone else is pretty much required to work around it somehow.
This feels strange, when you have a, shall we say, ‘dirty mind’ ;-)
So if MS wants to 'attack' Linux, it could just code up something that breaks when Linux servers are present (like this example, and I'm sure there are many more interactions between systems). Finding something like that and implementing it at the right time would mean that Linux would effectively get banned from whole networks, because it crashes the Windows servers (but the danger is that it turns around and Windows gets banned instead).
Also, I wonder (and Raymond, this is an invitation to write about this, if it’s interesting!) if it is at all possible to avoid this, even in theory.
My initial guess is not really: you can't ever really 'force' the other side to write bug-free code. Come to think of it, an OS actually has pretty much the same thing with drivers, and website makers have the same thing with browsers that are buggy or only partially implement certain standards (*cough* IE6 *cough* ;-).
It seems that if we could ‘fix’ this we could get a lot closer to software behaving like an engineering discipline, or is that pushing it a bit?
Jonathan Wilson, that's the point I was trying to get at. Many, if not most, of these companies had to reverse-engineer the CIFS protocol with little to no help from Microsoft. In other words, if fast mode wasn't used on the wire, there was no way for them to implement it correctly, as many of these reverse-engineering efforts began by looking at the network packets.
Even now, this is likely to happen in future implementations. IIRC, part of the DOJ/EU settlements was that MS properly document protocols for interoperability reasons. Part of the CIFS documentation is available at http://www.microsoft.com/downloads/details.aspx?FamilyID=c4adb584-7ff0-4acf-bd91-5f7708adb23c&displaylang=en. However, you need Windows 2000 or Windows XP (Vista isn't mentioned, oddly). It cannot be read on other platforms and therefore may not be accessible for embedded device makers. I'm not even sure if it includes the Vista additions to CIFS.
As for Samba, there's a document on how it was made at http://samba.org/ftp/tridge/misc/french_cafe.txt. Note the date, which was after XP was released and while Longhorn was being worked on.
I wouldn’t use HTTP as an example of a completely-specified protocol that has led to perfect implementations.
One example: The abundance of ways to specify when a request/response ends (Connection: Close, Content-Length, Transfer-Encoding, etc.), and the missing details of what to do if more than one is present, has led to a whole new class of security attacks – HTTP smuggling.
I think a better example would be RPC.
For one, that will force you to ship a list of "buggy" drivers along with the OS… Which will only be correct until someone else writes a new buggy driver that you don’t know about. Either that or people will complain that MS is unfairly downgrading perf against their devices.
I work in a corporate environment where I know what all our file sharing is done by, or where I can make a case for not supporting skunkworks Sambas. Can I get a GPO switch to get super fast mode everywhere?
What about my idea of having the clients send a random challenge, which the server software then signs with a private key and returns? The clients would then be able to recognize tested servers, which would know the right key.
http://blogs.msdn.com/oldnewthing/archive/2006/03/31/565878.aspx#566266
The idea is that you have a whitelist of known-good server implementations, on which it's safe to use advanced CIFS features. Legacy servers would use the old mode.
So Vista SP1 could contain the key on the server versions, and the code to challenge and authenticate on the client versions. Older OSs and non-server variants would run in legacy mode. Incidentally, if the Samba people shipped code which managed to pass the authentication but still had bugs, you could just revoke the key and hold a press conference.
Actually, maybe literally having a key is a bad idea. The client could authenticate the server by sneakier means, like looking for quirks in the way it responds to requests. I bet you could figure out a way to tell a Vista server from Samba or XP if you knew enough about its internals, for example. Once you know it's Vista, you know it's safe to use the new features, but it's important that it can't fool you.
First, I would like to thank Raymond for having the courage to post this. Now we finally can see the end of this saga.
To all clamoring for documentation: it wouldn’t have helped, since the Samba team already knew how that part of the protocol should work. It was an implementation bug on their side. More documentation would not have made any difference in this particular case.
And to all saying "look at IP/TCP/HTTP, it is fully documented and works fine all the time": I take it you never heard of the problems with ECN, or about PMTU blackholes, or about the problems with some routers when using window scaling. And this is just a small sample.
"Given this new information, the solution that was settled on was simply to stop using "fast mode" queries for anything other than local devices. The most popular file system drivers for local devices (NTFS, FAT, CDFS, UDF) are all under Microsoft’s control and they have already been tested with fast mode queries. "
That is reasonable as a default, but there should be a way to reenable fast mode queries for network devices.
"Given this new information, the solution that was settled on was simply to stop using "fast mode" queries for anything other than local devices. The most popular file system drivers for local devices (NTFS, FAT, CDFS, UDF) are all under Microsoft’s control and they have already been tested with fast mode queries. "
That is reasonable as a default, but there should be a way to reenable fast mode queries for network devices.
And work with the vendors to fix the bugs, if that hasn't happened already. And even if fast mode access is reenabled in general, it should be possible to disable it for specific servers, and vice versa.
I take it there will be no registry entry or Group Policy setting to force-enable the feature and let me take my own chances?
"That doesn’t help the servers that crash when given an invalid information level"
… wow. I give up
Here goes nothing, let’s explain the technical issue *all over again*
“Fast mode” is internally called FileIdBothDirectoryInformation. I do not know this because I have access to supa-seekrit Microsoft information, I know this because it’s public and because it has been documented to be the issue at hand. In other words, I know this because I can read
FileIdBothDirectoryInformation is actually a manifest constant with value 37, the “File*Information” form is standard and it means it’s an information class for file object, “Directory” means it applies to directory enumerations, “Both” (as opposed to “Full”) means it also returns the short 8.3 name, and “Id” is the “fast” part
“Id” means the information class returns the file id (the inode number for NTFS, a random unspecified value for philistine filesystems). Basically, “fast” enumerations are functionally equivalent to UNIX readdir. Before “fast” enumerations, which were introduced in Windows XP, the way to “readdir” was to query for FileBothDirectoryInformation, and then *open* each and every enumerated file, query for its id (FileInternalInformation, if you were wondering) and close it. Very disruptive, even locally, but especially so over a network connection
Now, CIFS is pretty much just a serialization of NT I/O semantics over the wire (for some reason this surprises and confuses people from the UNIX/TCP camps. What do they expect?). This means the “enumerate directory” command has a field that mirrors the information class. Now, the polite thing to do is to check said field, ensure it’s within a range of supported values and fail otherwise, with a status code that tells the client “I’m afraid I can’t do that, Dave”. The client can then set a compatibility bit on the server for the rest of the session and go with the slower version of the enumeration.
The correct status code would be something like STATUS_INVALID_INFO_CLASS, or even STATUS_INVALID_PARAMETER. Returning success is catastrophically wrong, because the new FileId field hasn't been added at the end of the structure – it couldn't be! The structure has a variable length, and the last field must be the filename, so FileId has been added as the new *next-to-last* field. Success almost certainly means the structure has been filled incorrectly.
Returning the dreaded catch-all STATUS_UNSUCCESSFUL (“A device connected to the system is not functioning”) is wrong, as well – it makes the server automatically useless, as no directory enumeration will work ever again
How would I have fixed it: I would have defined a "wrong" information class, to be used as a test for broken implementations that ignore that field. At the first enumeration request for an information class newer than a certain revision, a broken request would be sent as a probe: if the server answers anything but STATUS_INVALID_INFO_CLASS or STATUS_INVALID_PARAMETER, it's immediately flagged. All future requests for that class on a flagged server automatically fail, so that the application (remember that, until here, it's the client driver we're talking about) can retry with a downlevel request.
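[Editor's illustration] A rough user-mode sketch of that probe, under some assumptions: the real check would live in the client-side redirector, the bogus class value 0x7F is an arbitrary out-of-range choice, and NtQueryDirectoryFile is resolved at run time purely to keep the sample self-contained.

```cpp
#include <windows.h>
#include <winternl.h>
#include <cstdio>

// Status codes not exposed by winternl.h, spelled out here.
#ifndef STATUS_INVALID_INFO_CLASS
#define STATUS_INVALID_INFO_CLASS ((NTSTATUS)0xC0000003L)
#endif
#ifndef STATUS_INVALID_PARAMETER
#define STATUS_INVALID_PARAMETER  ((NTSTATUS)0xC000000DL)
#endif

// NtQueryDirectoryFile, looked up dynamically so no ntdll.lib is needed.
typedef NTSTATUS (NTAPI *NtQueryDirectoryFile_t)(
    HANDLE FileHandle, HANDLE Event, PVOID ApcRoutine, PVOID ApcContext,
    PIO_STATUS_BLOCK IoStatusBlock, PVOID FileInformation, ULONG Length,
    FILE_INFORMATION_CLASS FileInformationClass, BOOLEAN ReturnSingleEntry,
    PUNICODE_STRING FileName, BOOLEAN RestartScan);

// Returns true if the server rejects an unknown information class politely,
// i.e. with one of the two status codes a careful implementation would use.
// Anything else (success, garbage, the catch-all STATUS_UNSUCCESSFUL...)
// means the server should be flagged and only downlevel queries used.
bool ServerValidatesInfoClass(HANDLE dir)
{
    NtQueryDirectoryFile_t pNtQueryDirectoryFile =
        (NtQueryDirectoryFile_t)GetProcAddress(
            GetModuleHandleW(L"ntdll.dll"), "NtQueryDirectoryFile");
    if (!pNtQueryDirectoryFile) return false;

    IO_STATUS_BLOCK iosb = {};
    BYTE buffer[4096];
    NTSTATUS status = pNtQueryDirectoryFile(
        dir, nullptr, nullptr, nullptr, &iosb, buffer, sizeof(buffer),
        (FILE_INFORMATION_CLASS)0x7F,   // deliberately bogus class
        FALSE,                          // ReturnSingleEntry
        nullptr,                        // no filename filter
        TRUE);                          // RestartScan
    return status == STATUS_INVALID_INFO_CLASS ||
           status == STATUS_INVALID_PARAMETER;
}

int wmain(int argc, wchar_t** argv)
{
    if (argc < 2) { wprintf(L"usage: probe <directory>\n"); return 1; }
    HANDLE dir = CreateFileW(argv[1], FILE_LIST_DIRECTORY,
        FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
        nullptr, OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, nullptr);
    if (dir == INVALID_HANDLE_VALUE) { wprintf(L"open failed\n"); return 1; }

    wprintf(L"%ls\n", ServerValidatesInfoClass(dir)
        ? L"rejects unknown info classes (safe to try newer ones)"
        : L"mishandled the probe (stick to downlevel queries)");
    CloseHandle(dir);
    return 0;
}
```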
Okay then, time for a new file sharing protocol. If Microsoft is prevented from improving the existing protocol in order to maintain compatibility with buggy non-MS implementations, then get rid of it.
Samba has always been horribly slow, and without the possibility of it getting faster it’s time to transition away from it.
The Visual Studio guys already started something with an "extra fast file sharing driver" for SourceSafe, since VSS runs off of samba. Run with something like that.
Make it redistributable for 2000/XP/Vista. We’ll install this new network service driver, and uninstall file and printer sharing.
And what's more, Microsoft will tag it with "we don't care if future updates to our operating system crash your buggy file servers. Use this protocol at your peril." And there will be much rejoicing.
I don’t have any broken devices.
Why should I have to put up with the crappy slow behaviour?
Does SMB2 include a provision for exactly identifying the software version at the other end so that this kind of thing doesn’t happen in the future? Let me guess….
Some of these buggy implementations crash when asked to perform this operation. If I’m not mistaken, this means that there are a significant number of buggy 3rd-party SMB servers out there that’re vulnerable to a denial of service attack by anybody who can authenticate read-only in order to attempt to enumerate a directory’s contents.
And some of the manufacturers of these devices no longer support them or have provided no firmware update mechanism.
Ouch.
Personally I think Microsoft’s handling of this issue makes perfect sense, though it must be frustrating. It’s also clear that client-side server discrimination (versioning and user-agent info) is unlikely to be effective because servers will get these wrong too or actively lie about them ("Suuuuure, I’m Win2k3 Server… really"). They might do so for what seems like a good reason at the time, but that’ll still break the whole system.
I’ll admit being a little surprised that an equivalent operation that won’t be recognised as such by older servers isn’t being added, but I’m sure there’re good reasons for that.
I’m one of those "surprised and confused" UNIX guys when it comes to how CIFS is done (and its incredible complexity)… but I can see the attraction from the PoV of the Windows-based implementer. You see the same thing in other places – especially apps that save files that’re essentially serializations of their in-memory document structures. They tend to suffer in robustness and be an absolute nightmare to work with from other software, but be very, very fast and very easy for the first implementer.
The comment you linked to (http://blogs.msdn.com/oldnewthing/archive/2006/03/31/565878.aspx#566308) specifically mentioned "testing" versions. You're surprised that people running fully patched stable versions reported the bug?
Bob and DrPizza: The answer is compatibility.
As Raymond has said many times, if you buy a new operating system and discover that some of your applications suddenly don’t work, who are you going to blame, the application (which worked fine before you upgraded the OS) or the new OS?
In a perfect world, or in an Apple world, where you control the OS and (almost) all of the hardware, you can break things that were implemented incorrectly.
See:
http://blogs.msdn.com/oldnewthing/archive/2006/11/06/999999.aspx
http://blogs.msdn.com/oldnewthing/archive/2005/01/18/355177.aspx
http://blogs.msdn.com/oldnewthing/archive/2003/12/24/45779.aspx
and any other blog entry that mentions "compatibility".
I’m with DrPizza on this one. Why should my computer performance be penalized because somebody wrote a buggy server that doesn’t validate the known parameters without crashing?
*They* are the ones that should be outed; not Microsoft and not me.
Rosyna:
> Well, my question is, why did this problem occur in the first place and what could have prevented these buggy implementations from being buggy?
Actually implementing a standard with an open review process.
[It’s easy to say “Just make an exhaustive list”, but it’s impossible to do. An “exhaustive” list would be infinitely long. -Raymond]
HTTP works pretty well, even with its warts.
> My question is, why isn’t there a field like HTTP’s User-Agent so you can tell whose implementation you’re working with
Because SMB wasn’t intended to interoperate with anything.
My take on this whole thing is that you have to allocate another network port and use it for a separate protocol. This protocol can behave similarly to SMB, with lessons learned applied; this means that the protocol should be clearly documented and include such things as command encapsulation and versioning.
[I had trouble parsing this. What should have an open review process, the standard or the implementation? -Raymond]
The standard should be developed in an open manner, with public review (basically, look at IETF and follow their lead). This ends up with a reference implementation which is both compliant and pedantic, the better to measure other implementations against.
DrPizza: the broken implementations of the future will just lie about it…
Craig: I don't find CIFS all that hard. I can appreciate a utilitarian design, especially compared to the overdesigned NFS or the byzantine WebDAV, both of which are gross tag soups developed against checklists of buzzwords-of-the-day. Windows networking isn't, of course, just CIFS, and I won't go into that…
[And that would prevent other implementations from having bugs? Less likely, perhaps, but prevention? -Raymond]
Prevention is not the goal. A process that minimizes errors and misunderstanding is.
[And that would prevent other implementations from having bugs? Less likely, perhaps, but prevention? -Raymond]
While it is practically impossible to create an exhaustive test suite for most “interesting” protocols or languages, even an incomplete test makes it easier for an implementor to check for specification compliance.
An example is the “acid2” browser test for HTML/CSS; one quick check enables me to see that my firefox is still broken. This is useful even if the test is not nearly complete.
Of course, testing a CSS rendering engine is much easier than testing a protocol. But the problems at issue here do not concern anything intractable like timing behaviour, but only “simple” behavioural bugs, like sending the wrong error codes or no error codes at all, or crashing on malformed or unrecognized messages.
> That probably would have passed the simple validation test.
Not if the test had required both an SMBfindfirst and an SMBfindnext request to pass.
The problem showed up after 128 files, yes — but *only* because the local machine’s network redirector asked for the first 128 files from the server when user-mode asked it for the first file. Once user-mode asked for the 129th file, the network redirector asked the server for the next batch of 128, and the server failed at that point.
If the test had existed, and had tested both find-first and find-next, then the handlers for both requests on the server would have been tried, and the find-next handler would have been exposed as broken. (The test has to be written with the low-level protocol in mind, not the particular implementation of the client side that was chosen for the redirector. If you were writing a test-suite for HTTP, for instance, you wouldn’t leave the POST verb out because no browser issues it as a "first" request to the server (because they all issue a GET for the form page first). You’d focus on testing the implementation of the protocol itself, and that means both POST and GET requests.)
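[Editor's illustration] A minimal sketch of such a find-first/find-next exercise, assuming a test share pre-populated with more files than one server-side batch and a manifest to compare against (both assumptions of mine, not part of the comment). The 128-per-batch figure comes from the comment above; this sketch doesn't control it, it just makes sure follow-up "find next" requests are sent.

```cpp
#include <windows.h>
#include <cstdio>
#include <set>
#include <string>

// Enumerate everything in the given directory and count what came back.
// Run against a share holding, say, 300 known files so the redirector has
// to issue follow-up "find next" requests, exercising that server handler.
int wmain(int argc, wchar_t** argv)
{
    if (argc < 2) {
        wprintf(L"usage: enumtest \\\\server\\share\\testdir\n");
        return 1;
    }

    std::wstring pattern = std::wstring(argv[1]) + L"\\*";
    std::set<std::wstring> seen;

    WIN32_FIND_DATAW fd;
    HANDLE h = FindFirstFileW(pattern.c_str(), &fd);
    if (h == INVALID_HANDLE_VALUE) {
        wprintf(L"FindFirstFile failed: %lu\n", GetLastError());
        return 1;
    }
    do {
        seen.insert(fd.cFileName);          // every name the server returned
    } while (FindNextFileW(h, &fd));
    DWORD lastError = GetLastError();       // ERROR_NO_MORE_FILES if clean
    FindClose(h);

    wprintf(L"enumerated %u entries, final error %lu\n",
            (unsigned)seen.size(), lastError);
    // A real test would compare 'seen' against the manifest of files placed
    // on the share, catching servers that drop entries or mangle names
    // (e.g. the "first five characters" corruption described in the post).
    return 0;
}
```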
When Microsoft wrote the File and Printer Sharing for NetWare Networks service, I am assuming they reverse-engineered NCP in much the same way that everyone else reverse-engineered SMB, and it apparently doesn’t actually report the version that it implements.
Here’s how to stop these sorts of bugs in the future:
Publish the full and complete SMB/CIFS specs under a non-discriminatory (i.e. GPL friendly, non-NDA) free-of-charge licence.
Also, an SMB/CIFS implementation verifier (a Windows program, plus a fileset for a test share) would go a long way to helping.
The problem is entirely Microsoft’s fault. Forcing people to either buy Windows or rely on reverse-engineering in order to use their network is silly.
Of course, this will never happen. It is in Microsoft’s best interests to keep the SMB/CIFS specs away from the Samba team, so they can come up with PR lines like "we had to slow down all SMB/CIFS networking because of a bug in Samba".
Maybe now that Microsoft has *finally* complied with the EU order to release protocol specs we might get somewhere. I won’t hold my breath.