Date: | May 18, 2004 / year-entry #194 |
Tags: | code |
Orig Link: | https://blogs.msdn.microsoft.com/oldnewthing/20040518-00/?p=39283 |
Comments: | 31 |
Summary: | Just because you have the ASCII table memorized doesn't mean you know how sorting works. I remember a bug filed where somebody said that the "sort" command was sorting underscores incorrectly: AAA__ AAAAA this was claimed to be wrong "because underscore character follow uppercase letters and precedes lowercase letters". Well perhaps it does if you... |
Just because you have the ASCII table memorized doesn't mean you know how sorting works. I remember a bug filed where somebody said that the "sort" command was sorting underscores incorrectly: AAA__ AAAAA this was claimed to be wrong "because underscore character follow uppercase letters and precedes lowercase letters". Well perhaps it does if you think ASCII sorting is the way sorting should be. ASCII sorting is so last century. The CompareString function describes two types of sorting, "word sort" and "string sort". And both of them sort punctuation before letters. |
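To make the disagreement in the bug report concrete, here is a minimal Python sketch. It compares a plain code-point sort with a toy key that ranks punctuation ahead of letters. The `punctuation_first_key` helper is hypothetical and only imitates the "punctuation before letters" behavior described above; it is not the actual CompareString algorithm.

```python
import string


def punctuation_first_key(text):
    """Toy sort key: punctuation ranks ahead of letters and digits.

    Assumption: a simplified stand-in for "punctuation sorts before
    letters", not the real CompareString collation rules.
    """
    return [(0 if ch in string.punctuation else 1, ch.upper(), ch) for ch in text]


names = ["AAAAA", "AAA__"]

# Plain sorted() compares code points, so 'A' (0x41) precedes '_' (0x5F).
ascii_order = sorted(names)

# With the toy key, the underscore outranks any letter.
punctuation_first = sorted(names, key=punctuation_first_key)

print(ascii_order)        # ['AAAAA', 'AAA__']
print(punctuation_first)  # ['AAA__', 'AAAAA']
```

The second ordering is the one the bug report objected to.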
Comments (31)
Comments are closed. |
There is also StrCmpLogicalW which does nice sorting if the strings contain numbers.
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/shellcc/platform/shell/reference/shlwapi/string/strcmplogicalw.asp
For total chaos, try sorting Unicode (or even UTF-8) by numeric value ("ASCII code")… (how do you weigh multi-byte chars?)
I’m sure there are still some pitifully Unicode-unaware apps out there that try this.
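A small Python sketch of why "numeric value" is ambiguous for Unicode: the same two characters sort one way by code point (or by UTF-8 bytes) and the other way by UTF-16 code units. The two sample characters are my own choice for illustration.

```python
bmp_char = "\uff21"         # U+FF21 FULLWIDTH LATIN CAPITAL LETTER A (inside the BMP)
astral_char = "\U00010000"  # U+10000, the first code point beyond the BMP

chars = [astral_char, bmp_char]

# Python compares str values by code point.
by_code_point = sorted(chars)

# UTF-8 byte order happens to agree with code-point order.
by_utf8_bytes = sorted(chars, key=lambda s: s.encode("utf-8"))

# UTF-16 stores U+10000 as the surrogate pair D800 DC00, which compares
# lower than the single code unit FF21.
by_utf16_units = sorted(chars, key=lambda s: s.encode("utf-16-be"))

print(by_code_point == by_utf8_bytes)   # True
print(by_code_point == by_utf16_units)  # False
```

So an app that compares raw UTF-16 code units and one that compares code points can disagree, and neither order means anything to a reader.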
Raymond, does this tie into the fact that in Windows Explorer, "gf99999a" sorts before "gf100000"? Or is that something else altogether?
That’s something else altogether. See Jonathan Payne’s remark above.
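For readers wondering what "nice sorting if the strings contain numbers" looks like, here is a rough Python sketch. The `natural_key` helper is hypothetical and only approximates the idea of comparing digit runs as integers; it makes no claim to match StrCmpLogicalW's actual rules.

```python
import re


def natural_key(name):
    """Split a name into text and digit runs; digit runs compare as integers."""
    parts = re.split(r"(\d+)", name)
    # re.split with a capturing group puts the digit runs at odd indexes.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


names = ["gf100000", "gf99999a"]

plain_order = sorted(names)                      # '1' < '9' decides it
natural_order = sorted(names, key=natural_key)   # 99999 < 100000 decides it

print(plain_order)    # ['gf100000', 'gf99999a']
print(natural_order)  # ['gf99999a', 'gf100000']
```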
And that’s not always a Good Thing. If the creator of the files wanted them to sort by number, surely he/she would format them to a constant number of digits?
Centaur,
Sure, my mom should definitely know that if she has 10 pictures she should call them picture01.jpg through picture10.jpg so that they’re in the right order.
I told mine.
One down, 65 million to go…
I wonder whether her reaction was "That makes sense" or whether it was "Computers are hard to use"…
Jeremy, don’t you mean picture00.jpg through picture09.jpg? Everyone knows that computers like it better when you start counting things from zero ;-)
Of course computers are hard to use. That makes sense :)
John Clyman: if you named them like that, there would be no need for the leading zero. That proves your point :)
Anyway, a file listing just looks better with numbers nicely right-aligned, no?
No need for the leading zero, Centaur? Only until you add the 11th picture…
As an aside: While I prefer to have the computer do what the user expects rather than the converse, if I did have to instruct somebody on how to change their file naming scheme to make sorting work better, I’m not sure I’d choose the leading zeros approach. I’d probably opt instead for a trick like starting numbering with 1eN, where N is some suitably large number to accommodate future expansion. So the first picture could be 1000 (or 1001), and increment from there. I suspect most people would be comfortable with this approach, as it doesn’t require any fundamental behavior change once you’ve selected the starting point (assuming you left yourself enough digits for expansion).
These are all cute hacks, but you have to admit; they’re all just hacks. People walk away shaking their heads. "Computers are strange. You can’t call your first picture ‘picture number 1’. You have to call it ‘picture number 1000’. I’ll never understand this computer stuff. Why do they make it so hard to use?"
Ah. Missed that. Thanks!
Absolutely. "Do what I mean" (from the user’s perspective) is the way to go, and I was being playful with my earlier comments.
Of course, taking this a step further, one might ask why users even need to think about filenames for pictures in the first place…
As long as there’s a way to get things into ASCII-betic sequence, I’m happy. I’m working with C# libraries that are interfaced with old (*very* old) C code, and the expectation is that — in some cases — things sort the way they did under C.
Slightly off topic, and I hope this is not a silly westerner’s question, but can someone explain how sorting works with Asian characters, especially in printed media for everyday use such as a phone book? I asked a Chinese friend this long ago; his answer was rather vague, and I gathered that it was not particularly easy to look someone up in a big phone book. I guess my question also applies to many other languages, such as Arabic.
This whole picture-naming problem comes from one wrong decision. Namely, the files were never meant to be numbered; they were meant to be given a /long descriptive name/.
John Clyman:
> why users even need to think about
> filenames for pictures
They don’t. If the pictures are from a photo camera, they are already nicely named like …, P5090039.JPG, P5090040.JPG, … Not that these are good, descriptive, easy to remember names, but they start with 0 or 1 and are sequential and sort sequentially.
Ross:
> how sorting works in Asian
> characters
Take, for example, Japanese kanji.
First, you determine the radical, the main part of the kanji. There are 214 or so radicals, they are all numbered, and the radical is the primary sort key.
Second, you count the strokes in the remaining part of the kanji. This becomes the secondary sort key.
When two kanji have the same radical and the same stroke count, well, there seems to be some tertiary sort key.
Also, one could sort by reading.
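The radical-then-stroke-count scheme above maps naturally onto a tuple sort key. Here is a minimal Python sketch. The `KANJI_INFO` table is hand-entered for five characters purely as illustration, and using the code point as the tertiary key is my assumption, since the real tiebreak is not specified above.

```python
# Illustrative, hand-entered data:
# kanji -> (Kangxi radical number, strokes outside the radical)
KANJI_INFO = {
    "休": (9, 4),   # radical 人 (9) plus 木 (4 strokes)
    "体": (9, 5),   # radical 人 (9) plus 本 (5 strokes)
    "明": (72, 4),  # radical 日 (72) plus 月 (4 strokes)
    "村": (75, 3),  # radical 木 (75) plus 寸 (3 strokes)
    "林": (75, 4),  # radical 木 (75) plus 木 (4 strokes)
}


def radical_stroke_key(kanji):
    """Primary key: radical number. Secondary key: remaining stroke count."""
    radical, extra_strokes = KANJI_INFO[kanji]
    # Assumed tertiary key: the code point, only to make the order total.
    return (radical, extra_strokes, ord(kanji))


shuffled = ["林", "明", "体", "村", "休"]
ordered = sorted(shuffled, key=radical_stroke_key)

print(ordered)  # ['休', '体', '明', '村', '林']
```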
Now an example of “Do What I Mean”.
I created a bunch of files, like 00.txt, 01.txt, …, 09.txt, 0A.txt, 0B.txt, …, 0F.txt, 10.txt, and so on. Guess how they are sorted. Adding 0x at the left does not help. Adding h at the right does not help. Computers are so hard to use: when you expect them to do the logical thing, they suddenly decide they’re going to be smart and guess What I Mean (incorrectly). Yes, a typical user is unlikely to number files in hexadecimal. But then an atypical user has to use an atypical tool which is specifically created by atypical users for atypical users.
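A quick Python sketch of how a digit-aware sort can scramble hexadecimal names like these. The `natural_key` helper is a hypothetical approximation of "treat digit runs as numbers", not Explorer's actual algorithm, so the exact scrambled order may differ from what Explorer shows.

```python
import re


def natural_key(name):
    """Approximate digit-aware key: digit runs compare as integers."""
    parts = re.split(r"(\d+)", name)
    # Captured digit runs sit at odd indexes of the split result.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


names = ["10.txt", "0A.txt", "09.txt", "0F.txt", "01.txt"]

# Plain code-point order keeps fixed-width uppercase hex in numeric order.
plain_order = sorted(names)

# The digit-aware key reads "0A" as the number 0 followed by "A", so the
# A-F names jump ahead of 01.
natural_order = sorted(names, key=natural_key)

print(plain_order)    # ['01.txt', '09.txt', '0A.txt', '0F.txt', '10.txt']
print(natural_order)  # ['0A.txt', '0F.txt', '01.txt', '09.txt', '10.txt']
```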
what is the big deal … try sorting in different languages, french or german etc.
the german umlauts are a big mess to sort (there are actually 2 ways of handling them; the phone book and the ‘Duden’ (our grammar bible) do it differently) …
to tell you the truth, having been responsible for the sorting algorithm in a database product about 15 years ago, I can’t get excited about punctuation … it’s just the tip of the iceberg
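The two German conventions can be sketched in a few lines of Python. This assumes the usual description of the two schemes: dictionary order treats ä/ö/ü like a/o/u, while phone-book order expands them to ae/oe/ue. The key functions are simplified illustrations, not a full collation implementation.

```python
# Dictionary-style: umlauts sort like their base vowels.
DICTIONARY_MAP = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})
# Phone-book-style: umlauts expand to vowel + e.
PHONE_BOOK_MAP = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def dictionary_key(word):
    # The original word is a tiebreak so equal foldings sort deterministically.
    return (word.lower().translate(DICTIONARY_MAP), word)


def phone_book_key(word):
    return (word.lower().translate(PHONE_BOOK_MAP), word)


names = ["Muster", "Müller", "Mugler"]

dictionary_order = sorted(names, key=dictionary_key)
phone_book_order = sorted(names, key=phone_book_key)

print(dictionary_order)  # ['Mugler', 'Müller', 'Muster']
print(phone_book_order)  # ['Müller', 'Mugler', 'Muster']
```

"Müller" folds to "muller" in one scheme and "mueller" in the other, which is enough to move it past "Mugler".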
Riiight, what an extremely useful ASCII table that you have to scroll down page after page and only ever see a small part of it. But it’s got binary! =)
Here, let me show you some better ones to be found on the Net, like
http://www.asciitable.com/
or
http://www.torsten-horn.de/techdocs/ascii.htm
or
http://www.cdrummond.qc.ca/cegep/informat/Professeurs/Alain/files/ascii.htm
or even
http://web.cs.mun.ca/~michael/c/ascii-table.html
After looking at sorting Japanese, my comment about sorting Welsh consonants (dd, ch, ll, etc.) seems a little bit redundant.
If you don’t like how the files are sorted, open up TweakUI, go to the Explorer node, then uncheck ‘Use intuitive filename sorting’ and hit OK.
TweakUI download: http://www.microsoft.com/windowsxp/pro/downloads/powertoys.asp
Now that’s a truly must-have tool by atypical users for atypical users :)
"I’ll never understand this computer stuff. Why do they make it so hard to use?"
The problem is that you cannot simply make computers totally "intuitive", precisely because users do not have a logical and consistent way of thinking; humans also base their expectations on context and information that the computer does not even have. I would consider a "totally intuitive computer interface" to be an AI-complete task.
"I have those historical document files named 1500BC.doc, 66BC.doc, 17AD.doc, 300AD.doc, 1648AD. How could the computer be so dumb that it sorts them so strangely? It is obvious that those are just year numbers!" You can always imagine a "naive user" who won’t understand some feature of the interface.
I for one (OK, I am a geek) would rather like to use a simple, consistent, logical (…) interface than an interface that tries to guess what I mean (not what I say) and does things in a way that I would not be able even to explain.
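The year-named files show how much context a sort needs. Here is a minimal Python sketch where the context is supplied explicitly rather than guessed. The `year_key` helper is hypothetical, and it handles only labels of the exact form digits plus BC or AD.

```python
import re


def year_key(label):
    """Map '66BC' to -66 and '17AD' to 17 so labels sort chronologically."""
    match = re.fullmatch(r"(\d+)(BC|AD)", label)
    if match is None:
        raise ValueError(f"not a year label: {label!r}")
    year = int(match.group(1))
    return -year if match.group(2) == "BC" else year


labels = ["1648AD", "17AD", "1500BC", "300AD", "66BC"]

plain_order = sorted(labels)                 # character by character
chronological = sorted(labels, key=year_key)

print(plain_order)    # ['1500BC', '1648AD', '17AD', '300AD', '66BC']
print(chronological)  # ['1500BC', '66BC', '17AD', '300AD', '1648AD']
```

The sort only "knows" these are years because someone told it; a generic file listing has no such hint.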
Okay, so files in Explorer are now sorted according to rules that I, as a geek, don’t understand anymore. So sorting was changed from something that (supposedly) only geeks could understand to something that virtually nobody understands. Bravo, what an improvement!
At least files sort the way people expect, don’t they? Sometimes they do. But it doesn’t work e.g. with fractional numbers (1.5 < 1.20) and dates.
Note that these wouldn’t work with classic file sorting, either. But with traditional sorting, at least I know WHY.
The problem is that for certain fraction-like strings, it is not always obvious how they should be sorted. Is "v5.950" greater than or less than "v5.1106"? Even GNU leans towards using component-by-component digit sorting for filenames. (strverscmp http://www.delorie.com/gnu/docs/glibc/libc_75.html)
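The ambiguity is easy to show in Python: the same two strings compare in opposite directions depending on whether you read them as decimal fractions or as dot-separated components. Both helpers are hypothetical illustrations; neither reproduces strverscmp's exact rules.

```python
def as_decimal(version):
    """Read 'v5.950' as the decimal number 5.95."""
    return float(version.lstrip("v"))


def as_components(version):
    """Read 'v5.950' as the component tuple (5, 950)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


a, b = "v5.950", "v5.1106"

decimal_says_a_is_greater = as_decimal(a) > as_decimal(b)           # 5.95 > 5.1106
components_say_a_is_smaller = as_components(a) < as_components(b)   # 950 < 1106

print(decimal_says_a_is_greater)    # True
print(components_say_a_is_smaller)  # True
```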
It seems the argument hinges on your feelings about the following statements:
"This isn’t a matter of life or death. Get the 99% case right." vs "If you can’t get 100%, then don’t even try."
"The computer should adapt to the way users do things" vs. "The user should adapt to the way computers do things."
$ ls
1.0 1.1 1.10 1.2 2.1
$ ls -v
1.0 1.1 1.2 1.10 2.1
Cool, learn a new thing every day.
Of course, /bin/ls is the only thing that uses it. Konqueror at least does not have that option (it only has the case sensitive/insensitive switch and the "folders first" mode).
The only other program I found on my box using it besides /bin/ls and /bin/dir was /usr/bin/a2ps, so I guess nobody really uses it.
I agree with Foo. Fractional numbers are irritating, as are other numbers.
I *suppose* one could argue for a sort routine that treated numerical subsequences as "numbers" and sorted those differently. But, cool as this might be, it would be even harder to describe to the end user. (and it doesn’t fix the date thing, but I don’t have lots of files named 304BC…).
*sniff* I liked ASCII.
Because "Sort according to the default sort order for the column whose title is… Name" is too long.