Date: | April 19, 2006 / year-entry #139 |
Tags: | other |
Orig Link: | https://blogs.msdn.microsoft.com/oldnewthing/20060419-14/?p=31473 |
Comments: | 73 |
Summary: | One of the suggestions for addressing the network compatibility problem was to give up on fast mode and have a new "fast mode 2". (Equivalently, add a flag to the server capabilities that means "I support fast mode, and I'm not buggy.") This is another example of changing the rules after the game is over, by... |
One of the suggestions for addressing the network compatibility problem was to give up on fast mode and have a new "fast mode 2". (Equivalently, add a flag to the server capabilities that means "I support fast mode, and I'm not buggy.") This is another example of changing the rules after the game is over, by adding a flag to work around driver bugs.

Consider a hypothetical program that uses fast mode on Windows XP. It runs against a Windows Server 2003 server and everybody is happy. Suppose you make a change to Windows Vista so that it requires that servers set a new "fast mode 2" flag in order to support fast mode. When the customer upgrades their client from Windows XP to Windows Vista, they would find that their hypothetical program ran much slower. Whose fault is it? Not the hypothetical program that was using fast mode on Windows XP; that program is using fast mode correctly. Not the Windows Server 2003 machine; that server supports fast mode correctly. Is it Windows Vista, then, that is at fault? "Hey, don't blame me," you answer. "It's that guy over there. That guy you've never heard of. He made me do it. Blame him!"

To describe this sort of behavior I like to steal a phrase from Albert Einstein: "Spooky action at a distance". (Einstein used it to describe what in modern physics is known as quantum entanglement.) In this particular situation, we have a conversation between two participants (the client software and the server software) mediated by a third (Windows) which collapses due to the mere existence of a fourth party not involved in the conversation! It's as if your CD player suddenly lost the ability to play any of your music CDs because some company you've never heard of halfway around the world pressed a bunch of bad CDs for a few months earlier this year.

Some people suggested, "Why not have a flag that says 'I support fast mode'?" Indeed that flag already exists; that's why Windows Vista was trying to use fast mode in the first place.
The problem wasn't that the server didn't support fast mode. The problem was that the server had a bug in its fast mode implementation. "Okay, then add a new flag that says 'My fast mode isn't buggy.'" Consider also how this course of action would look after a few revisions of the specification, with the capabilities list accumulating entries like "This server supports fast mode and doesn't have the bug where enumerating a directory with more than 128 files fails on the 129th query."
If a new capabilities flag were created for every single server bug that was discovered, the capabilities mask would quickly fill up with all these random bits for bugs that were fixed ages ago. And each time a bug was found in any one server, all servers would have to be updated to add the new capabilities bit that says, "I'm not that buggy server you found on April 8th 2006," even the servers sitting in a locked closet whose operating systems are burned into EPROMs. And if you're the author of a new server, which capabilities bits do you set? Do you claim that you don't have the bug where enumerating a directory with more than 128 files fails on the 129th query?
Or maybe you're convinced that you don't have any bugs in your "fast mode" implementation. Do you report that your implementation is bug-free?

Now, I'm not saying that this course of action is out of the question. Sometimes you have to do it, but you also have to realize that the cost for making this type of change is very high, and the benefit had better be worth it. |
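To make the cost concrete, here is a minimal sketch of the "fast mode 2" proposal. All flag names are invented for illustration; they are not real SMB capability bits.

```python
# Hypothetical capability bits under the proposed scheme. Every bug
# discovered in the field would claim another "I'm not buggy" bit.
CAP_FAST_MODE           = 1 << 0  # "I support fast mode"
CAP_NOT_129TH_QUERY_BUG = 1 << 1  # invented "fast mode 2" bit

def client_may_use_fast_mode(server_caps: int) -> bool:
    """Fast mode now requires the base capability *plus* every
    'I'm not that buggy server' bit discovered so far."""
    required = CAP_FAST_MODE | CAP_NOT_129TH_QUERY_BUG
    return server_caps & required == required

# An older, perfectly correct server predates the new bit and
# therefore gets demoted to slow mode after the client upgrades:
old_correct_server = CAP_FAST_MODE
assert not client_may_use_fast_mode(old_correct_server)
```

This is exactly the "performance drops after upgrading to Vista" scenario from the second paragraph: the correct-but-older server is punished for a bug it never had.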
Comments (73)
Comments are closed. |
"the capabilities mask would quickly fill up with all these random bits for bugs that were fixed ages ago"
You mean the bugs that you complain are still in place on so many servers and network attached devices all over the world? Do you think they will go away in six months? A year? What? If so then you shouldn’t worry about this problem at all, because by the time Vista ships the problem won’t exist any more.
Well the obvious solution is that Microsoft post the full CIFS and SMB specification, making sure to have Errata for all MS OS’s ever released.
Giving people incomplete info and then asking them to solve the problem is only going to make both sides look foolish.
I am glad you posted a followup…
I would be surprised if a solution was found that would satisfy the majority of the readers of this blog (or even the minority for that matter). A lot of people see this as something to rant about.
oh vince, you assume that if the specs are on the street (and they might be I don’t know) everybody is going to make a perfect implementation. yeah right…..
I am glad that these kinds of problems get attention; it feels like the problem you once described where a video driver would report "yeah, I can do that" to any DirectX query that was fired at it. Unfortunately I don't think this one is as 'easily' solved.
Well in keeping with the law of fixes "do the fix at the lowest possible abstraction layer", you have to fix this problem at the protocol level (and make the protocol just a bit uglier) or just do nothing. What wasn’t clear from the original problem description was that XP clients also use fast enumeration if available. In that case XP clients interacting with old Samba servers already have been suffering from this for years. That more or less tips the balance towards ‘do nothing’.
Joe, in Win XP it uses the slow mode by default (so basically everyone is using it); that's why the problem is not really common. You can turn the fast mode on (I don't really know how to do it, but you can). With WinVista, they are trying to make the fast mode the default (so it can benefit from it), and that's where the problem will show up for everyone.
I know zip about networks so I ask: how often do servers get swapped out invisibly? (Does Windows monitor their presence or only see them when you attempt to access, and is there some unique identifier that would change with any change?)
Why not add a service that is run whenever this network/protocol/whatever is available. Add a hook to the appropriate network drivers so this service can be notified to do its business on connect or first access.
The first time a connection is made it makes any necessary tests/queries to determine if a problem exists and sets mode accordingly. Any time the connection is dropped the service requeries and resets the mode before next access.
Benefits – Aside from the hook, the network drivers don’t change. New/multiple/selectable checks can be added by changing the service and not the network drivers (although the service could only work around bugs that could be fixed by mode or setup changes). Don’t have to even run the service or checks if you’re certain of the hardware.
Probs – several thousand windows machines doing test queries at once when they detect a server or disconnect/reconnect… lessened if checked only when an attempt to access is made? Wouldn’t work if servers can be swapped invisibly. Listening to a moron pulling guesses outta his behind?
Why I suggested this is that it will keep functioning, although in a degraded manner.
One alternative way of handling this is for client and server to exchange capabilities and version/implementation information, and then handshake until a set can be found that both agree to support.
At least in this way the 'newest' version can decide the right compatibility mode that both can support correctly.
But you can't add signatures and workarounds for yet undiscovered problems.
Bart: Whether you do it via a capabilities mask or by negotiation you have the same problem. With negotiation, you’ll have the client asking the server, "So, do you have that 129-file bug?" and the server will say "Huh? What 129-file bug?"
You dont ask it if it has a bug, you ask it what it thinks it can do.
Then you ask it what implementation it is.
If you know about bugs in its implementation you remove those from the set of things it can do.
Things it reports it can do but you don’t know about, you don’t support and you can safely drop those.
The server does the reverse with you.
This results in two sets of possible things that both think the other can do correctly, take the intersection of these sets and you have what is usable according to the information available to you.
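Bart's negotiation can be sketched as follows. Every implementation name, version string, and feature name here is invented for illustration; this is not real SMB negotiation code.

```python
# Features known (at the time the client shipped) to be broken in a
# given peer implementation/version. Entirely hypothetical entries.
KNOWN_BAD = {
    ("acme-server", "3.4"): {"fast_mode"},
}

def usable_features(my_features, peer_features, peer_impl, peer_version):
    """Start from what the peer claims, subtract what we know is broken
    in that implementation, and intersect with what we support.
    Features we don't recognize are safely dropped."""
    peer_ok = set(peer_features) - KNOWN_BAD.get((peer_impl, peer_version), set())
    return set(my_features) & peer_ok

# Against a known-buggy release, fast mode is quietly removed:
caps = usable_features({"slow_mode", "fast_mode"},
                       {"slow_mode", "fast_mode"},
                       "acme-server", "3.4")
assert caps == {"slow_mode"}
```

Note that this scheme shares the limitation conceded above: it can only subtract bugs the client already knew about when it shipped, not bugs discovered afterwards.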
To Vince:
No one is trying to make anyone look foolish. The point of Raymond’s post, if I’ve understood it correctly, was not to obtain a free solution, but to highlight the issues involved in implementing certain fixes.
This obviously generated a lot of interest – because we like to solve problems.
Some of the suggestions were impractical (due to the negative impact they would have on end users), so this post was presumably made to focus on a smaller point and highlight some of the reasons why.
Believing that everything would be fine if the specs were available is unrealistic. If I remember the original networking thread correctly, the developer of the Samba issue admitted that it was an oversight on his part (it was not due to having limited knowledge of the SMB protocol). It was also pointed out by someone else that the 3 minutes or so end-to-end fix time seemed suspiciously fast, if it included testing, etc. Read between the lines of that comment and it might show the real reason such errors creep in.
Raymond,
Why are you reluctant to put the user in charge of this issue? All that needs to be done to solve it is this:
Vista uses fast mode by default. Everything works great and the user is happy with the faster response in Vista until Vista encounters a samba server with the fast mode bug. Vista pops a dialog saying, "An error occurred while attempting to communicate with \xyz in enhanced mode. Using legacy mode might alleviaate the problem. Would you like to reattempt connecting to \xyz using legacy mode?" Include a check box "Always use legacy mode for \xyz?".
That seems to solve the problem. You have given a brief and non-scary explanation of the problem and placed the blame where it belongs, on server \xyz. Then you have given a solution to the problem that the user can implement if they choose, and permanently if they prefer to not receive this message in future sessions. Now the list of "bad" servers is controlled by the user, the user is aware of the issue and can possibly report it to the proper people, etc.
To kbiel,
Try this thought experiment:
Remove all compatibility fixes in your OS (the ones that insulate you from bugs in other software/hardware). Replace them all with ‘non-scary’ dialogs. Make sure you use words that everyone will understand. E.g. "\xyz", "enhanced mode", "legacy mode", "alleviaate", and throw in some big bang words like "problem" just to make it less scary for the user. Make that word "locale RED" so that it stands out even more. Now pop that dialog up every time an app calls an offending function (let’s assume it hasn’t crashed the PC, and that an unattended PC isn’t brought to its knees while attempting to display the millionth ‘non-scary’ dialog).
Now, go about your business and see how long it takes before you scream, "stupid $4it4ead fuc£ing computer!".
I’m a little tired of this thread about this problem supposedly caused by the current Samba implementation.
I am pretty sure that, before Vista is released to the general public, Samba will already have been corrected to support fast mode.
Andy Blues: The bug is already fixed in Samba (e.g. read KJK:Hyperion’s and Joe Butler’s comments just above).
But the big problem is that a version of Samba with this bug is shipped with lots of NAS devices, making it hard to fix for home users.
In addition it is really interesting, for some people like me, to read about the pros/cons of fixes for bugs. I especially liked the link to "Spooky action at a distance", making me read a little bit of quantum mechanics in the end :)
"oh vince, you assume that if the specs are on the street (and they might be I don’t know) everybody is going to make a perfect implementation. yeah right….."
Oh but how much harder is it to create an implementation when you don’t have specs at all!! That’s the point.
"What wasn’t clear from the original problem description was that XP clients also use fast enumeration if available."
Yeah, I thought from the original description of the problem that this was something newly implemented (or activated) in Vista.
Hyperion,
Thank you very much for your comment. You said all the things I was afraid to say.
I wish that n00b Jeremy Allison who commented on the original post about this could see what you just said. His response of why this happened in Samba was that it was a failure to communicate by Microsoft.
James
Hyperion: I suggest you check out devel.samba.org. I quote:
> Important: In order to avoid any
> potential licensing issues we
> require that anyone who has signed
> the Microsoft CIFS Royalty Free
> Agreement not submit patches to
> Samba, nor base patches on the
> referenced specification.
Does Microsoft have a validation suite that open source implementations can use?
The Samba guys had to reverse engineer the protocol, so it’s not surprising they got some details wrong. Hiding the details from "competitors" like Samba harms Microsoft, too.
Sean,
Despite how much of a Microsoft fan I am, even I must admit that MS has upset me a time or two. However, I contend that MS is still far better than the competition. Would you rather deal with situations like this?
http://news.zdnet.com/2100-3513_22-6061491.html
James
Did I misunderstand the problem?
As I read the original post, the SERVER is handling the request wrong, and the XP client runs happily in SLOW mode. Or am I mistaken?
In this case, nothing would change for a Vista client, it would use slow mode until the server is updated and indicates "fast is ok (really now)".
(and a version number could accumulate most of the bits)
peterchen: But older servers (that don’t have the bug) won’t set the "really now" bit. Result: You upgrade to Vista and your performance drops.
While it is way too late in the game for this comment to have any effect on this particular bug:
This would likely never have happened had Microsoft actually documented SMB/CIFS, and followed up on the "CIFS" thing (name change, and provably non-honest "attempt" to define and open up the protocol).
I also think this whole spectacle actually serves as fuel for the EU problem MS now faces. Microsoft has proven to be almost completely incompetent in the area of documenting this disaster.
I hope Microsoft (finally) learns from this mistake, that communication protocols (be they over-the-wire, wrapped in RPC, syscalls or just plain function calls within a process) must be documented. Not just that the specific "call" exists and what arguments/parameters it expects, but also preconditions, invariants and postconditions.
kbiel: Putting the user in charge only works if there is a user! Displaying a confirmation dialog assumes that if a user existed they would know what decision to take…
This doesn’t take into account services without a visible desktop, software running on embedded systems or even ‘normal’ software on a pc stuck in a server room and never actually viewed directly – how would a confirmation dialog help in these scenarios.
peterchen: But older servers (that don’t have the bug) won’t set the "really now" bit. Result: You upgrade to Vista and your performance drops.
Just to make sure that I understand, it’s really "and your performance doesn’t get any better" — because current clients use the one-at-a-time mode?
sayler: No, your performance actually drops. Please go back and re-read the second paragraph.
> This server supports fast mode and doesn’t have the bug where enumerating a directory with more than 128 files fails on the 129th query
I hope people realize by now that this isn’t a good description of this bug. Yes, it does describe the behavior that a user program would see when running against a server with the bug. But the bug was that the SMBfindfirst packet succeeded, and then the SMBfindnext packet (the second one in the enumeration) returned an error. The 128/129 number is merely an artifact of the redirector’s batching process — if the redirector batched 64 files into one SMB packet, then you’d see the problem on the 65th file coming back from NtQueryDirectoryFile.
(Unless the protocol forces a "count" of 128 in all SMBfindfirst and SMBfindnext packets? That would seem extremely braindead to me.)
That’s not to say that the right fix is to add a flag that means "I don’t have this bug", though. (This is partly because of the reasons you cite, but also partly because requiring a server to turn a flag *on* to get behavior that it *already* does correctly seems completely backwards to me.)
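The batching arithmetic in the comment above can be modelled in a couple of lines. The batch sizes are assumptions about how the redirector happens to batch entries, not protocol constants.

```python
# Per the comment: the server's "find first" packet succeeds and the
# very next "find next" packet fails, so the user program sees the
# error on the first entry of the *second* batch.
def first_failing_entry(batch_size: int) -> int:
    """Entries 1..batch_size arrive in the successful findfirst packet;
    entry batch_size + 1 is the one that triggers the failing findnext."""
    return batch_size + 1

assert first_failing_entry(128) == 129  # the 128/129 numbers reported
assert first_failing_entry(64) == 65    # a different batch size would move the symptom
```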
vince: I already gave this information in earlier entries in this series. Perhaps I should give up on this series since people just keep asking questions I’ve already answered.
oldenewthing:
> I already gave this information in
> earlier entries in this series.
> Perhaps I should give up on this
> series since people just keep asking
> questions I’ve already answered.
I meant real numbers. As in, ones measured on an actual system under normal behavior.
Your previous answer started with "assume a latency of 500ms"… do you expect most people to have their NAS servers available over a satellite link or something?
Many companies have very high latency links; a half second is not unreasonable. The latency doesn’t depend on whether you are talking to a Windows XP workstation, a NAS box, a Windows 2000 server, or a Novell server. You don’t want to lose the benefits of fast mode when talking to a server that supports it.
Have a field that allows the server to identify itself, by vendor name and version. Compare HTTP’s Server: header.
Build a database of workarounds like:
* Vendor Acme versions 3.4 through 3.9 have the bug where enumerating a directory with more than 128 files fails on the 129th query
* Vendor Bilix versions 12.7 through 12.11 have the bug where the long file name is reported incorrectly in the response packet
* Vendor Colefa versions …
Apply workarounds for the vendor/version/bugs in the database
Make sure to give the vendor an opportunity to put a cap on the version number with the bug so that future versions of their software are not fighting a permanent workaround
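A sketch of the suggested workaround database follows. Every vendor name and version range here is hypothetical, and the half-open ranges implement the "cap on the version number" point above so that fixed releases escape the workaround.

```python
# Hypothetical quirks table: (vendor, first_bad, first_fixed, broken_feature).
# The upper bound is exclusive, so a vendor can "cap" the bug at the
# release that fixed it. All entries are invented examples.
QUIRKS = [
    ("Acme",  (3, 4),  (3, 10),  "fast_mode"),
    ("Bilix", (12, 7), (12, 12), "long_file_names"),
]

def workarounds_for(vendor: str, version: tuple) -> set:
    """Return the features to avoid for this vendor/version."""
    return {feat for (v, lo, hi, feat) in QUIRKS
            if v == vendor and lo <= version < hi}

assert workarounds_for("Acme", (3, 5)) == {"fast_mode"}
assert workarounds_for("Acme", (3, 10)) == set()   # fixed release: no workaround
assert workarounds_for("Colefa", (1, 0)) == set()  # unknown vendor: trusted
```

The obvious weakness, as with the capability-bit idea, is that the table only helps against bugs known when the client shipped, and it relies on servers identifying themselves honestly.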
vince, you have absolutely no bloody clue about the issue. The Samba people are kept updated about all such changes to the protocol by none other than Microsoft. They as a project own several copies of the Windows filesystem SDK.
=====
"Fast mode" isn’t even an extension of the protocol, it’s just a new function code for the "find first" and "find next" operations, with its associated new data structure
Some specifics: the new function code is FileIdBothDirectoryInformation, and the associated structure is FILE_ID_BOTH_DIR_INFORMATION. The former is just a number – if they knew how to encode FileBothDirectoryInformation, they’d have no problem with the new one. The latter is FILE_BOTH_DIR_INFORMATION with an extra field, of fixed size. Again, if they supported FILE_BOTH_DIR_INFORMATION, they should have no problem supporting FILE_ID_BOTH_DIR_INFORMATION
(actually the new function codes are two, and the second is FileIdFullDirectoryInformation, but it doesn’t really matter, because the same considerations apply)
Some background: the new modes return the file inode alongside attributes, name, short name, etc. They were introduced in Windows XP (like anyone who doesn’t live in the fairytown of "open protocols" knew), I presume to support natively the UNIX readdir() function, following the commitment of Microsoft into Services For UNIX. The alternative to enumerate the files and their inodes had historically been to open each file, query for its inode, and close the file, which is much much much slower and cannot even be batched – hence "slow mode" and "fast mode". I also have to assume that the inode is now used in Win32 as well, other than for the purpose of backups (BackupRead allowed you to retrieve the inode. In fact, BackupRead + BackupWrite used to be the only legal – if a bit roundabout – way to hard-link files before the official API)
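The structural relationship Hyperion describes can be sketched with Python's ctypes. The field names follow the public Windows definitions; exact offsets will differ off-Windows (ctypes' `c_wchar` is not 2 bytes everywhere), but the point survives: the two layouts differ only by one fixed-size `FileId` field.

```python
import ctypes

# Fields shared by FILE_BOTH_DIR_INFORMATION and its "Id" variant,
# following the public Windows declarations (sketch, not the SDK headers).
_COMMON = [
    ("NextEntryOffset", ctypes.c_uint32),
    ("FileIndex",       ctypes.c_uint32),
    ("CreationTime",    ctypes.c_int64),
    ("LastAccessTime",  ctypes.c_int64),
    ("LastWriteTime",   ctypes.c_int64),
    ("ChangeTime",      ctypes.c_int64),
    ("EndOfFile",       ctypes.c_int64),
    ("AllocationSize",  ctypes.c_int64),
    ("FileAttributes",  ctypes.c_uint32),
    ("FileNameLength",  ctypes.c_uint32),
    ("EaSize",          ctypes.c_uint32),
    ("ShortNameLength", ctypes.c_byte),
    ("ShortName",       ctypes.c_wchar * 12),
]

class FILE_BOTH_DIR_INFORMATION(ctypes.Structure):
    _fields_ = _COMMON + [("FileName", ctypes.c_wchar * 1)]

class FILE_ID_BOTH_DIR_INFORMATION(ctypes.Structure):
    # Identical, except for one extra fixed-size field before the name.
    _fields_ = _COMMON + [("FileId",   ctypes.c_int64),
                          ("FileName", ctypes.c_wchar * 1)]

old = {name for name, _ in FILE_BOTH_DIR_INFORMATION._fields_}
new = {name for name, _ in FILE_ID_BOTH_DIR_INFORMATION._fields_}
assert new - old == {"FileId"}  # the entire difference between the two
```

Which is Hyperion's argument in miniature: a server that can serialize the first structure has no conceptual obstacle to serializing the second.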
=====
Information about *these* structures and constants is public and has been public for years now. Before it was, you could rely on the work of the likes of Gary Nebbett and Bo Branten, as all non-bigots like me did, proficiently. If you choose to be a "purist", insisting that those structures are part of the CIFS "protocol" (rather than the mere serialization of Windows I/O constructs they actually are) and should be documented as such, that’s your own damn choice to be a stubborn idiot
=====
So, what was the bug about? a stupid, stupid oversight. A goddamn typo. The "find next" operation returned an error code (STATUS_INVALID_INFO_CLASS, meaning FileIdBothDirectoryInformation isn’t supported) that was only legal for "find first" (the "checking the flag" Raymond mentioned)
Whoever implemented FileIdBothDirectoryInformation (which was – ahem, "open protocols", ahem – *inexplicably* implemented to begin with) forgot to add a case: to a switch(). A stupid copy-paste error. An honest mistake
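The shape of that bug can be sketched like so. The information-class constants match the published FILE_INFORMATION_CLASS values, but the dispatch functions are invented stand-ins, not Samba code.

```python
# Invented status codes and the two published information-class values.
STATUS_OK = 0
STATUS_INVALID_INFO_CLASS = 0xC0000003

FileBothDirectoryInformation   = 3    # the old, "slow mode" class
FileIdBothDirectoryInformation = 37   # the newer, "fast mode" class

def find_first(info_class):
    # The new class *was* added to the "find first" dispatch...
    if info_class in (FileBothDirectoryInformation,
                      FileIdBothDirectoryInformation):
        return STATUS_OK
    return STATUS_INVALID_INFO_CLASS

def find_next(info_class):
    # ...but the "find next" dispatch never got its case added,
    # so it returns an error that is only legal from "find first".
    if info_class == FileBothDirectoryInformation:
        return STATUS_OK
    return STATUS_INVALID_INFO_CLASS

# The client checks the first call, sees success, then trips on the second:
assert find_first(FileIdBothDirectoryInformation) == STATUS_OK
assert find_next(FileIdBothDirectoryInformation) == STATUS_INVALID_INFO_CLASS
```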
Please stop blowing this out of proportion
And that quote, vince, would be relevant if the original bug had really been caused by lack of knowledge of CIFS specs, instead of by a Samba developer forgetting to add a case to a "switch", *as Jeremy Allison himself mentioned*.
Yes, I know, I know, an OSS developer making a mistake on his own. Gosh, is that even possible?
vince: There is no real-world data for this specific issue because no beta version of Vista ever had "slow mode inode queries". When inode queries were added, they were added the fast way.
I spent a half hour looking for something similar (you owe me a half hour of my life). One company computed that the extra Explorer network accesses on their network with 43ms latency were costing them US$35 million/year.
You don’t need additional bits in the protocol, or version number, or anything like that.
The registry has enough space for users to use a Control Panel applet to store the usernames and passwords to be used in accessing servers. There is room in the registry to add 2 bits for each of these servers to remember if the user said to enable fast mode or to disable fast mode or didn’t say, and another 2 bits to remember if you’ve detected success or detected failure or not detected yet. As for what the default should be when the user didn’t say and you haven’t yet detected whether the server has this bug, I just know that there’s room to store what you need to store.
As for the badness of displaying error message boxes on unattended systems, well, don’t display an error message box when Windows Explorer isn’t browsing the affected server. If Windows Explorer is open and browsing then it’s slightly safer to assume that someone intends to attend.
Now, is it really possible to detect bugs, automatically store information in the registry, and publish articles telling users how to make adjustments? In your favourite version of Windows, how many times did I open regedit and delete a NOIDE flag? How did I know that this was the way to undo a decision that Windows had made when Windows detected something that it didn’t tell me about, but when I felt I had fixed the problem? The answer is yes it’s possible, and SOME Knowledge Base articles are properly written and properly findable.
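The per-server state proposed above fits in four bits. A sketch with an invented encoding (nothing here reflects any real registry layout):

```python
# Two 2-bit fields per server: what the user said, and what was detected.
USER_UNSET, USER_FAST, USER_SLOW = 0, 1, 2
DET_UNKNOWN, DET_OK, DET_BUGGY = 0, 1, 2

def pack(user: int, detected: int) -> int:
    """Pack both fields into the low four bits of one value."""
    return (user & 0b11) | ((detected & 0b11) << 2)

def unpack(bits: int) -> tuple:
    return bits & 0b11, (bits >> 2) & 0b11

# Round-trips losslessly for every meaningful combination:
assert unpack(pack(USER_SLOW, DET_BUGGY)) == (USER_SLOW, DET_BUGGY)
```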
"Would you rather deal with situations like this?
http://news.zdnet.com/2100-3513_22-6061491.html
James"
Although not directed towards me, let me put in my own two cents on the subject of proprietary vs open drivers in an open OS. One can actually draw something of an analogy with the current situation here. The main problem, of course, is that samba had a bug in it. However, to correct this problem is difficult because hardware wasn’t designed to be upgraded. By analogy, many drivers have problems in them (bugs) and it’s very difficult to correct them because they’re proprietary. The end result is that there tends to be no clean solution.
At some point it becomes better to simply ignore the original hardware/drivers and start over with new hardware/drivers. Now, clearly not all hardware/drivers are buggy, but I think it'd be silly to claim most aren't. But so long as there's a simple means to update and fix those problems, it's no longer a huge hurdle to overcome nor much reason to dump the hardware/drivers, that surely you've already invested a good bit of effort into, for something new that you will also have to invest greatly into.
It’s for this reason that I’m interested primarily into buying things that are open. While the open version might be buggier now, I won’t have to give up my investment at any point unless *I* choose to. In the long term, it will probably cost me less to go with more open systems. This is why things like DRMed hardware to only run signed binaries or ROM images that can’t be upgraded are bad.
Of course, if making drivers and ROMs work well were not actually an area so prone to cost cutting (as they're a necessity, not money drivers), perhaps they'd be less prone to be buggy and I wouldn't have to invest much other than money into them (my time is probably more valuable, considering how much time is truly worth to anyone). Until then, it doesn't make sense to me to accept non-open software into things that they've such little vested interest in retaining proprietariness over. After all, I'm buying the physical 3D video card or the physical NAS; being obsessed enough about the software end to keep it closed ignores that the software is useless without the hardware and people are buying specific products chiefly because of said good hardware.
In that Newsgroup post mentioned by Sean, Raymond said: "You can say ‘It’s a bug in the application’ as many times as you like to the customer; the customer won’t believe you."
Well, he might believe you, but deep inside he will really be thinking ‘So what? It’s Windows that crashes, not the other app, so _you_ fix it!’.
I have a similar problem: my applications need to browse file system and virtual folders. Suddenly some customers started complaining that the software was crashing when trying to browse the Windows Control Panel.
This usually indicates a corrupt control panel applet, but, in this particular case, Explorer itself seemed perfectly happy with it, instead of also crashing like it does (used to do?) with corrupt cpl files.
After many hours trying to track down the problem, I finally found the pattern: every system with the problem was running a Control Panel applet from a very popular vendor of web cams, keyboards, mice, etc…
From there to figure out what was happening was relatively quick: when enumerating items in the Control Panel using the IShellFolder interface (isfFolder.EnumObjects, isEnumIds.Next, isfFolder.GetDisplayNameOf, etc…) at *some point* the vendor’s Control Panel applet corrupts my stack.
Everything runs fine UNTIL my enumeration routine tries to return to its caller – at that point the application just crashes (it doesn’t even report an error, just disappears) because the return address in the stack has been blasted to oblivion.
I’ve tried a couple of other applications (not mine) and they also crash when trying to display the contents of the Control Panel, so I’m sure it’s not a problem with my code.
Now the big problem: since Explorer is able to defend itself from this kind of stack corruption and doesn’t crash, who do you think the user is going to blame? Worse, since Windows is apparently immune to this problem, the vendor has absolutely no incentive to fix this bug in their control panel applet.
Because this is a very popular hardware vendor (which means a lot of my existing and potential customers will be running their buggy Control Panel applet), I tried contacting them through normal support channels, and, needless to say, I got nowhere – but, even if I had and the vendor fixed the problem, there would still be a lot of buggy old versions of that control panel applet floating around.
With my software being written in Visual Basic, I currently haven’t got a clue on how to protect my stack from rogue Control Panel DLLs. So at least for the moment, I’m stuck.
If only Windows, instead of ‘pretending’ the problem isn’t there with some clever hack, crashed as my applications and the others do, then the vendor would have found and fixed the problem long before it started shipping its buggy Control Panel applet!
Here is how I would deal with the original bug that started this series.
From what I can tell, what happens is this:
1. Some program asks Windows to go get the directory listing for a network drive.
2. Windows asks the (buggy) server if it supports fast mode and the server says yes.
3. Windows sends a fast mode query to the server.
4. The server returns some records.
5. Windows returns something to the application.
6. The application iterates through: it asks Windows for the next record, processes it, asks for the next record, processes it, and so on.
7. This continues until a request is made and Windows suddenly says "I have run out of records". Presumably at this point (before it returns to the caller) Windows will be able to detect that it is talking to the buggy server.
Also, I am assuming Windows knows how many records it has returned and where in the list it is.
The fix is, once windows detects the buggy server, to request the records again in slow mode. Once it has the records, it can skip past the records it has already returned to the application. Then, it can return the item the application would have been given had the server not had the bug. And then things can continue.
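The restart-and-skip idea above can be sketched as follows. The server classes are invented stand-ins for the redirector's view of a buggy server, not real networking code.

```python
class BuggyFastServer:
    """Fast mode yields one batch of entries, then fails -- the shape of
    the Samba bug. Slow mode always works. Batch size is an assumption."""
    def __init__(self, files, batch=128):
        self.files, self.batch = files, batch

    def enumerate_fast(self):
        yield from self.files[:self.batch]
        raise IOError("error on the second find-next packet")

    def enumerate_slow(self):
        yield from self.files

def list_directory(server):
    """Try fast mode; on mid-enumeration failure, re-run the query in
    slow mode and skip the entries already handed to the application."""
    returned = []
    try:
        for entry in server.enumerate_fast():
            returned.append(entry)
    except IOError:
        slow = server.enumerate_slow()
        for _ in returned:       # skip what the app already saw
            next(slow)
        returned.extend(slow)    # continue from where fast mode died
    return returned

files = [f"file{i}" for i in range(200)]
assert list_directory(BuggyFastServer(files)) == files  # no entry lost or repeated
```

The cost, as tridge notes further down, is one wasted network enumeration per buggy server, while non-buggy servers pay nothing.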
Here we go again….
KJK::Hyperion wrote :
"The Samba people are kept updated about all such changes to the protocol by none other than Microsoft."
Not true anymore. We used to be quite close with SMB development (I still happily have my "Microsoft Confidential" copy of the NT SMB specs here in front of me) but that isn’t true anymore. Others can speculate as to why this is.
Joe Butler wrote :
"It was also pointed out by someone else, that the 3 minutes or so end-to-end fix time seemed suspicously fast, if it included testing, etc. Read between the lines of that comment and it might show the real reason such errors creep in. "
Oh for heaven's sake. Have you never looked at a bug report and realized *exactly* what your cut-and-paste error was? That was the 3 minutes to closure. It was such an obvious bug reported by a long-time collaborator that I trusted him to re-open the bug if it didn’t fix his problem. Later an extension to smbtorture was added to prevent regressions.
Here’s a link to the description of the problem :
https://bugzilla.samba.org/show_bug.cgi?id=3526
and here (again) is the fix tridge proposed for Vista. Works fine, just needs someone to implement it.
From tridge:
"If we had run across the error you
described (INVALID_LEVEL from a continue)
then we would have added a bit flag on the
current connection structure to mark this
connection so it won’t use that level in
future, then repeat the search using a
different level. That means you would get
one useless search on the network with each
connection to a buggy server, but no impact
against non-buggy servers and no user observable
affects. The denial of service attack you
mention with this type of fix doesn’t happen
as the extra bit is per-connection, not long
lived (trying to remember long lived info
about specific servers is a losing game)."
Jeremy.
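tridge's per-connection fallback can be sketched like this. All names are invented; the real proposal operates at the SMB level, not on Python objects.

```python
class Connection:
    """On an INVALID_LEVEL error from a continue, mark *this connection*
    (not the server) to avoid the fast level, and repeat the search at a
    different level. The flag is per-connection, not long-lived."""
    def __init__(self, server):
        self.server = server
        self.avoid_fast_level = False

    def search(self):
        if not self.avoid_fast_level:
            try:
                return self.server.search_fast()
            except IOError:              # INVALID_LEVEL from a continue
                self.avoid_fast_level = True
        return self.server.search_slow()

class FakeBuggyServer:
    """Stand-in server whose fast level always fails."""
    def __init__(self):
        self.fast_attempts = 0
    def search_fast(self):
        self.fast_attempts += 1
        raise IOError("INVALID_LEVEL")
    def search_slow(self):
        return ["a", "b"]

conn = Connection(FakeBuggyServer())
assert conn.search() == ["a", "b"]
conn.search()
# Exactly one useless fast search per connection, as tridge describes:
assert conn.server.fast_attempts == 1
```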
Raymond. . .
you are one smart dude!
I think the solution is clear. The automotive industry invented vehicle identification numbers. We just need Software Identification Numbers. Every api from hereon out will take an additional SIN parameter, which can be defined as follows:
struct SoftwareIdentificationNumber
{
    GUID guid;
    int version_major;
    int version_minor;
    int version_build;
    int version_revision;
    SYSTEMTIME releasedate;
    SYSTEMTIME builddate;
    DWORD vendorID;
    DWORD productID;
    DWORD revisionID;
    DWORD reserved1;
    DWORD reserved2;
    DWORD reserved3;
};
This way, instead of having this discussion we could’ve simply checked whether the vendor/product/revision indicated that it was a particular type of buggy server, verified that the build date was before the bug was fixed, and decided to use the slow mode instead. Or we could decide that buggy implementations of undocumented protocols are not to be supported (doubly so for documented protocols), put out a KB article, and be done with it…
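As a toy illustration of how such a SIN check might look (everything here is hypothetical: the struct is the commenter's invention, no such API exists, and the vendor/product/revision values are made up):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy illustration of the commenter's hypothetical SIN check.
 * Only the three ID fields matter for this sketch. */
typedef struct {
    unsigned vendorID;
    unsigned productID;
    unsigned revisionID;
} SinIds;

/* Pretend vendor 0x1234's product 7 shipped the fast-mode bug
 * in revisions before 3. */
static bool use_fast_mode(const SinIds *sin)
{
    if (sin->vendorID == 0x1234 && sin->productID == 7 &&
        sin->revisionID < 3)
        return false;   /* known-buggy build: use slow mode */
    return true;
}
```

The catch, as the later comments point out, is that this only works if callers fill the structure in honestly and the bug is already known when the client ships.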
Jorge: It’s possible that Windows actually isn’t doing anything specific to work around the problem, but just that their stack *happens* to be set up in such a way that the problem doesn’t appear. It’s quite possible, since I presume Logit- uh, the "vendor" only tested with Explorer.
I think SINs are part of the solution, but won’t SAMBA just claim to be Windows XP or whatever the ‘right’ answer is?
I think you need to have some cryptographically secure way to see whether it’s a trusted implementation – i.e. one from inside Microsoft, or a licensed implementation based on MS code and tested in the HQL labs – or an untrusted one. Put the untrusted ones in legacy mode, i.e. slow (XP speed) but reliable. The trusted ones will get the best possible performance based on the version level they report.
It’s just like drivers needing to be signed before they can be run, with the signing tied to automatic testing. Hell, if Microsoft is feeling generous (or scared of lawyers), they could offer to sign third-party SMB implementations for a nominal fee after they’ve been tested.
Jonathan Wilson: The problem Raymond is trying to address is that currently there is no way for a server to identify itself as supporting fast mode, so there is no point in asking. If such a mechanism were introduced, then all of the existing software, even the servers that did work, would report false until they had been updated. And if a bug is present that prevents them from working, they may still report true anyway…
Your point 7 is the original problem (as far as I can see) – the server says it has no more records to return despite the existence of further directory entries. This doesn’t seem to offer a way for Windows to magically detect the existence of a buggy server.
Just a thought: while we do not have any indicator for systems that DON’T work, we have a pretty reliable indicator for machines that DO work (the "operating system" information for the domain machine account, which Samba servers don’t fill in).
Perhaps just popping up warnings when running fast mode against systems that do not return "operating system" information will do.
And then just telling the Samba developers to add that information (perhaps "samba X.X.XX") when registering the machine account will be sufficient for advertising the "new version which doesn’t have this bug", and it helps to identify the version if other bugs are present too.
From what I understand, while you cannot distinguish various versions of SAMBA based on their handshake, you can distinguish SAMBA from Windows. If so, why not assume all Windows servers are fast-mode compatible and all SAMBA servers are not, until they provide an extended handshake (e.g. identify by the name "SAMBA-Foobar" or append a version number)?
As to the issue of an arrogant programmer passing 0xFFFFFFFF in the compatibility flags field, that’s the reason why I’d make it represent versions instead and make the Vista client aware of quirks in certain versions of certain products. It bloats the client a bit, but at least you remain in control.
Paul: No, it doesn’t say "it has no more records to return despite the existence of further directory entries". Instead, it returns an error to the second packet that it sees.
From what I can tell, this is how it works (if this isn’t how it works, it should be how it works; clean separation of code and all that):
User code calls NtQueryDirectoryFile with some set of parameters to indicate "I just want the first one". The OS asks the redirector for one file. The redirector asks the server for 128 files in one batch, and sticks them all into a buffer. It then returns the first result out of the buffer. Then the program asks for the next result, and the next, and so on, with the first 128 results in total coming out of the redirector’s buffer.
When the program asks for the 129th result, the OS asks the redirector, just like always. But instead of returning stuff from the buffer, the redirector goes back to the server with an SMBfindnext packet — and the response from the server to this second packet is an error.
(The handler in Samba for the SMBfindfirst packet type had support for the information level that Vista is requesting, but the handler for the SMBfindnext packet type did not.)
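The batching just described can be sketched as follows; the structure and names are invented for illustration, not actual redirector code:

```c
#include <assert.h>

/* Invented sketch of the batching above: the redirector fetches
 * BATCH entries per wire round trip (SMBfindfirst first, then
 * SMBfindnext) and serves single-entry queries from its buffer. */

#define BATCH 128

struct redirector {
    int buffered;    /* entries still in the local buffer */
    int wire_calls;  /* SMB request packets sent so far */
};

/* Simulates one NtQueryDirectoryFile("give me the next one") call. */
static void query_next(struct redirector *r)
{
    if (r->buffered == 0) {
        r->wire_calls++;        /* goes back to the server */
        r->buffered = BATCH;
    }
    r->buffered--;
}
```

This makes it easy to see why the bug only bites at the 129th entry: the first 128 queries never touch the wire again after the initial SMBfindfirst.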
Ilya, AC: Making a database of "broken versions" has problems other than bloat. What is Windows supposed to do if no fixed version exists when it’s released? Windows can’t see into the future; it doesn’t know that version Y will work when version Y doesn’t exist yet. Sure, patches may work for that, but do you really want to be patching app-compat databases for the rest of your OS’s life? And who’s going to force people to provide a correct SIN structure when calling APIs, versus just copying the SIN structure that worked in the past for some other unrelated program? Or some program made by some other company even?
Rule Number One of JavaScript / DOM programming in the browser: NEVER base your behavior on user-agent strings or other methods of browser detection. JS supports detection of whether certain methods exist; use that type of feature detection instead of browser detection. (Well, whenever possible at least. In some cases, you can’t tell by the existence of a property or method; the bug you’re seeing has more to do with the function you’re calling not doing what the DOM standard says it should. But that’s rare.)
In this case, the redirector can detect the error by sending an SMBfindfirst / SMBfindnext pair on the first request from a program, instead of only sending an SMBfindfirst. (If it can’t because of the architecture of the OS / redirector interface, then that architecture is bad; there isn’t a clean-enough separation between components.) Then the redirector would not return anything until the SMBfindnext succeeds. Yes, this is one more round trip (and even one more for broken servers, but they should be rare), but it’s the only good way to detect this particular bug.
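The probe proposed above could look roughly like this; both wire operations are hypothetical stand-ins, simulated here so the control flow is visible:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical probe: issue an SMBfindfirst *and* an SMBfindnext
 * up front, and only commit to fast mode once the findnext is seen
 * to succeed. The two functions are stand-ins for the wire ops;
 * a buggy server accepts findfirst but fails findnext. */

struct probe_result {
    bool fast_ok;
    int round_trips;
};

static bool find_first(bool server_buggy) { (void)server_buggy; return true; }
static bool find_next(bool server_buggy)  { return !server_buggy; }

static struct probe_result probe_fast_mode(bool server_buggy)
{
    struct probe_result r = { false, 0 };
    r.round_trips++;
    if (!find_first(server_buggy))
        return r;
    r.round_trips++;        /* the extra round trip mentioned above */
    r.fast_ok = find_next(server_buggy);
    return r;
}
```

The cost is the extra round trip on every first enumeration, which is exactly the trade-off the comment concedes.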
vince: Was it me personally who cost you those hours? If not, then don’t take it out on me.
raymond: re "You upgrade to Vista and your performance drops." –
But the applications that use the fast protocol run "happily" only until they hit a location with >128 files (?!). So these old apps don’t really work very well – maybe it’s this tradeoff (working fast for most users but broken for a few, vs. working slowly for all until everything is upgraded) that you shun.
(Sorry for being so insistent.)
BTW.
I’ve bookmarked the original post under "Why Windows is so complicated" – and I think Microsoft desperately needs a few more guys like Raymond. Software problems don’t end at check-in.
peterchen: If you change the protocol to require all servers to set the "I am not buggy" bit as proposed here, then when you connect to an old non-buggy server (i.e., one that doesn’t set the "I am not buggy" bit because it didn’t need to before), Vista will run in slow mode, whereas XP will run in fast mode.
peterchen: Most servers support fast mode fine. That’s why the problem was found only recently. If no servers supported fast mode, then the bug would’ve been found as soon as Explorer switched to using fast mode!
There’s not really a chance to fix it afterwards. The only thing you can do, at the beginning, when making an API for external parties, is pass in some sort of user-agent string at the handshake. Then a newer client can, if it wants, work around a "buggy" version or program.
Vince: I’m sorry, I didn’t realize this was a peer-reviewed journal. I thought it was a blog.
James Summerlin: Just running a proprietary operating system is NOT a solution for proprietary drivers. And I do not worry about having a proprietary nVidia driver more than a weekly/monthly couple of minutes or so. There is nothing in the GNU GPL preventing me from using an nVidia or ATi driver either. It also does not prevent mere aggregation of those drivers on a given storage medium. It *does* prevent GNU/Linux distributors from shipping a kernel and Xorg with those drivers if they wanted to, provided nVidia and ATi agreed as well. ZDNet covered the issue in great detail; perhaps that made the problem appear like a big thing for all GNU/Linux users and developers to you, but in reality it isn’t. Exaggerated articles draw more attention; that’s a simple journalistic fact. The problem exists, sure, but you don’t have to worry about it. From personal experience, I find Linux better because I don’t have to install a driver for every PCI card I throw at it, only for video acceleration.
So there *are* servers where fast mode is working correctly? *Now* I understand your problem :)
I simply assumed the broken server was the first time fast mode ever worked (more or less, in which case "now really works" is protocol change enough to warrant a flag/version check). – Sorry I missed this.
> I should give up on this series
please don’t. My vote now goes for "Requery in slow mode if the error occurs", and I’m looking forward to you picking it apart.
"You mean Vista will get the correct results and XP quite possibly won’t?"
That’s not what the customer sees. The customer sees their program that worked just fine on XP (because they used a non-buggy server) slow to a crawl on Vista.
Joe Butler>> Remove all compatibility fixes in your OS (the ones that insulate you from bugs in other software/hardware). Replace them all with ‘non-scary’ dialogs.
Thanks for the strawman, but I’ll pass. There are obviously bugs that the user never needs to know existed and that can be silently corrected. On the other hand, Raymond has set this up as a bug that will affect a user and cannot easily be corrected silently.
paul>> Putting the user in charge only works if there is a user! Displaying a confirmation dialog assumes that if a user existed they would know what decision to take…
You bring up a valid point I hadn’t considered. Then again, MS is not averse to taking the dialog route and just pumping the dialog message into the event log. Check your application event log and you’re likely to find something similar:
Event Type: Error
Event Source: Userenv
Event Category: None
Event ID: 1058
Date: 4/21/2006
Time: 10:42:41 AM
User: XXXXXX\kbiel
Computer: XXXXXXXXXXXXXX
Description:
Windows cannot access the file gpt.ini for GPO CN={XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX},CN=Policies,CN=System,DC=corp,DC=XXXXXXXXXXXX,DC=XXX. The file must be present at the location <\\XXXXXXXXXXX\SysVol\XXXXXXXXXXXXX\Policies\{XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX}\gpt.ini>. (The network path was not found.) Group Policy processing aborted.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
KJK::Hyperion:
> "Fast mode" isn’t even an extension of the
> protocol, it’s just a new function code for
> the "find first" and "find next" operations,
> with its associated new data structure
If new features aren’t an extension of the protocol, what would constitute an extension of the protocol, in your view?
Vince: Raymond did look for some numbers for you. On the other hand, he didn’t find any, so that’s not really worth much. I propose the following solution: in payment, you Fedex him a shiny copper penny.
But please don’t Fedex it in "fast mode"; fast mode is buggy in the current version of Fedex.
Now can we ring down the curtain on this reenactment of The Rape of the Lock?
Sir, I am confused. Your original article said "XP always used the slow query". Yet your comments here say "upgrade to Vista and your performance drops". I don’t get it.
Anyway, seems like a swamp. If we don’t want Vista to get blamed for that buggy-fast-mode server, then we do nothing and continue to use slow mode. Maybe a KB article for sysadmins to do something different if they know their environment is ok.
James Risto: This article was a hypothetical scenario. "Consider a hypothetical program that uses fast mode on Windows XP." That program isn’t Explorer; it’s some other program that uses fast mode on XP.
How does one enable the fast mode on Windows XP/2003? I’m regularly accessing a fileserver which takes almost half a minute to load certain directories, and wouldn’t mind trying a speedup :)
"Fast mode" is already supported in Windows XP. You can read the Samba bug report for details on how to try it out. But I doubt it’ll help your scenario since you probably aren’t asking for the inodes.
I am sorry, but isn’t the standard solution to such a problem to fall back to the previous version when the error is detected (in this case, for the duration of the session, or hopefully just for that unlucky connection)?
Faulty drivers, hardware, and software will always be around, and the idea of dealing with them by specially hacking the original code and putting such hacks into the distribution is no smaller a mistake than the original bugs. For example, because it can give unexpected results with other software…
Another question specifically to Hyperion:
Why do you think programmers on the Samba team should know IFS? Better said, why should they actually read Windows documentation?
As I see it:
The Samba team boldly declares that they provide SMB/CIFS services. Nothing more, and it is definitely beyond their scope to emulate the CIFS of Vista or XP or whatever.
In my opinion it would be wise of MS to stop this "hacking" attitude for once (using things that work and slowly "documenting" them somewhere, somehow), and start working as normal engineers do in at least some big companies. For example, to think, before the final design, about how things can go wrong.
vince: How many times do I have to repeat it? Those structures aren’t part of CIFS at all. They are part of the Windows IFS (Installable File System) API, which is now officially and 100% public (and used to be mostly public – through books, websites, Bo Branten’s ntifs.h project, etc. – for ages). I have known about the new enumeration types for years now.
And somehow, *inexplicably*, *unbelievably*, the Samba team has known about them for quite some time too!
David: it’s not my fault if you don’t understand how the CIFS protocol works.
Roughly, the protocol has a packet type for each Windows I/O request. Most I/O requests have a unique way of passing parameters, so every new request necessitates a new packet type with a unique encoding. But many, many, many operations fall under the umbrella of very few requests – for example, all file enumeration operations (about a dozen) fall under a single request code (which also serves "query attributes" requests) and share the same parameters. CIFS is a file sharing protocol, so the expectation is that such parameters are passed unchanged to the local filesystem underlying a network share.
A "change of the protocol" would be the introduction of a new request/packet type, which would have a unique and previously unseen encoding of inline parameters (notable exceptions are output data and status code, which are returned in the same identical way for all requests). And even in that case, you can fail the request in a standard way (with a STATUS_INVALID_DEVICE_REQUEST error code).
In fact, you have to design a CIFS server as a thin serialization layer over local services and drivers, because that’s the closest you can get to the real thing; then your underlying "driver" can handle (and fail) the requests like any standard Windows driver, in what is publicly documented behavior. Any other design can only be motivated by delusional concepts of "open protocols".
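The "fail unknown requests in the one standard way" idea can be sketched like this; the dispatcher and request codes are invented for illustration, though the status values mirror NT's:

```c
#include <assert.h>

/* Invented dispatcher illustrating the "thin serialization layer"
 * idea: known request codes pass through; anything unrecognized is
 * failed the one standard way. Not real server code. */

#define STATUS_SUCCESS                0x00000000u
#define STATUS_INVALID_DEVICE_REQUEST 0xC0000010u

enum { REQ_FIND = 1, REQ_QUERY_ATTR = 2 };

static unsigned dispatch(int request_code)
{
    switch (request_code) {
    case REQ_FIND:        /* parameters passed unchanged to the
                             local filesystem, as described above */
    case REQ_QUERY_ATTR:
        return STATUS_SUCCESS;
    default:
        return STATUS_INVALID_DEVICE_REQUEST;
    }
}
```

A server built this way never needs special knowledge of a new info level: it either forwards the parameters or fails in the standard, documented way.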
dorin: because. CIFS is designed as a very bare network serialization of Windows security (SSPI) and I/O (IFS) APIs. CIFS is not an open standard, CIFS is a matter of fact
See also the Single UNIX Specification: it purports to be an open standard, and you sure can implement it in a completely naive way, from scratch, clean-room, etc… but if you follow the original UNIX design, it’s so much easier
As for the "hacking" attitude, thinking "before", etc. IFS is probably the one and only I/O API that was designed since the beginning with native network filesystem capabilities in mind – "networked" attribute for files, "opened from network redirector" flag, "fast network query information" operation, etc. not to mention DMA support in the cache manager, whose entire motivation lies in efficiently transferring data from a filesystem to a NIC