String sorting is not done by ASCII code any more.

Date:May 18, 2004 / year-entry #194
Tags:code
Orig Link:https://blogs.msdn.microsoft.com/oldnewthing/20040518-00/?p=39283
Comments:    31

Just because you have the ASCII table memorized doesn't mean you know how sorting works.

I remember a bug filed where somebody said that the "sort" command was sorting underscores incorrectly:

AAA__
AAAAA

This was claimed to be wrong "because underscore character follow uppercase letters and precedes lowercase letters".

Well, perhaps it does, if you think ASCII sorting is the way sorting should be.

ASCII sorting is so last century.

The CompareString function describes two types of sorting, "word sort" and "string sort". And both of them sort punctuation before letters.
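
To make the difference concrete, here is a minimal Python sketch that sorts the two strings from the bug report both ways. The `punctuation_first_key` function is a hypothetical toy weighting made up for this illustration. It is not how CompareString computes its ordering; it only reproduces the "punctuation before letters" outcome described above.

```python
def ordinal_key(s):
    # Compare by raw character code, which is what "ASCII sorting" means.
    return [ord(ch) for ch in s]

def punctuation_first_key(s):
    # Hypothetical toy weighting (an assumption, not the Windows algorithm):
    # anything that is not a letter or digit gets class 0, letters and
    # digits get class 1, so punctuation always sorts ahead of letters.
    return [(1, ch.casefold()) if ch.isalnum() else (0, ch) for ch in s]

names = ["AAAAA", "AAA__"]

ordinal_order = sorted(names, key=ordinal_key)
linguistic_order = sorted(names, key=punctuation_first_key)

print(ordinal_order)     # underscore (0x5F) comes after 'A' (0x41)
print(linguistic_order)  # underscore is weighted ahead of every letter
```

With code-point comparison the underscore lands after the uppercase letters, which is the result the bug filer expected. With punctuation weighted first, AAA__ comes before AAAAA.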


Comments (31)
  1. Wilhelm Svenselius says:

    For total chaos, try sorting Unicode (or even UTF-8) by numeric value ("ASCII code")… (how do you weigh multi-byte chars?)

    Still, I’m sure there are some pitifully Unicode-unaware apps out there that try this.
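
A rough Python sketch of the problem raised in this comment: the same two characters come out in a different order depending on whether the "numeric value" being compared is a code point, a UTF-8 byte, or a UTF-16 code unit. The two sample characters are arbitrary picks for illustration.

```python
fullwidth_tilde = "\uff5e"    # U+FF5E, fits in a single UTF-16 code unit
grinning_face = "\U0001f600"  # U+1F600, needs a surrogate pair in UTF-16

chars = [grinning_face, fullwidth_tilde]

# Python compares str values code point by code point.
by_code_point = sorted(chars)

# UTF-8 was designed so that byte order matches code point order.
by_utf8_bytes = sorted(chars, key=lambda s: s.encode("utf-8"))

# Big-endian UTF-16 bytes compare the same way as 16-bit code units.
# The surrogate pair starts with 0xD83D, which is below 0xFF5E.
by_utf16_units = sorted(chars, key=lambda s: s.encode("utf-16-be"))

def describe(seq):
    return [f"U+{ord(c):04X}" for c in seq]

print(describe(by_code_point))
print(describe(by_utf8_bytes))
print(describe(by_utf16_units))
```

None of these three orders is a linguistic sort. The sketch only shows that "sort by number" is not even a single well-defined order once characters stop fitting in one unit.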

  2. Raymond, does this tie into the fact that in Windows Explorer, "gf99999a" sorts before "gf100000"? Or is that something else altogether?

  3. Raymond Chen says:

    That’s something else altogether. See Jonathan Payne’s remark above.

  4. Centaur says:

    And that’s not always a Good Thing. If the creator of the files wanted them to sort by number, surely he/she would format them to a constant number of digits?

  5. Jeremy Morton says:

    Centaur,

    Sure, my mom should definitely know that if she has 10 pictures she should call them picture01.jpg through picture10.jpg so that they’re in the right order.

  6. Foo says:

    I told mine.

  7. Raymond Chen says:

    One down, 65 million to go…

    I wonder whether her reaction was "That makes sense" or whether it was "Computers are hard to use"…

  8. John Clyman says:

    Jeremy, don’t you mean picture00.jpg through picture09.jpg? Everyone knows that computers like it better when you start counting things from zero ;-)

  9. Centaur says:

    Of course computers are hard to use. That makes sense :)

    John Clyman: if you named them like that, there would be no need for the leading zero. That proves your point :)

    Anyway, a file listing just looks better with numbers nicely right-aligned, no?

  10. John Clyman says:

    No need for the leading zero, Centaur? Only until you add the 11th picture…

    As an aside: While I prefer to have the computer do what the user expects rather than the converse, if I did have to instruct somebody on how to change their file naming scheme to make sorting work better, I’m not sure I’d choose the leading zeros approach. I’d probably opt instead for a trick like starting numbering with 1eN, where N is some suitably large number to accommodate future expansion. So the first picture could be 1000 (or 1001), and increment from there. I suspect most people would be comfortable with this approach, as it doesn’t require any fundamental behavior change once you’ve selected the starting point (assuming you left yourself enough digits for expansion).

  11. Raymond Chen says:

    These are all cute hacks, but you have to admit, they’re all just hacks. People walk away shaking their heads. "Computers are strange. You can’t call your first picture ‘picture number 1’. You have to call it ‘picture number 1000’. I’ll never understand this computer stuff. Why do they make it so hard to use?"

  12. Ah. Missed that. Thanks!

  13. John Clyman says:

    Absolutely. "Do what I mean" (from the user’s perspective) is the way to go, and I was being playful with my earlier comments.

    Of course, taking this a step further, one might ask why users even need to think about filenames for pictures in the first place…

  14. As long as there’s a way to get things into ASCII-betic sequence, I’m happy. I’m working with C# libraries that are interfaced with old (*very* old) C code, and the expectation is that — in some cases — things sort the way they did under C.

  15. Ross says:

    Slightly off topic, and I hope this is not a silly westerner’s question, but can someone explain how sorting works for Asian characters, especially in printed media for everyday use such as a phone book? I asked a Chinese friend this long ago and his answer was rather vague; I gathered that it was not particularly easy to look someone up in a big phone book. I guess my question also applies to many other languages such as Arabic.

  16. Centaur says:

    All this picture-naming-problem comes from one wrong decision. Namely, they were not meant to be numbered, they were to be given a /long descriptive name/.

    John Clyman:

    > why users even need to think about

    > filenames for pictures

    They don’t. If the pictures are from a photo camera, they are already nicely named like …, P5090039.JPG, P5090040.JPG, … Not that these are good, descriptive, easy to remember names, but they start with 0 or 1 and are sequential and sort sequentially.

    Ross:

    > how sorting works in Asian

    > characters

    Take, for example, Japanese kanji.

    First, you determine the radical, the main part of the kanji. There are 214 or so radicals, they are all numbered, and the radical is the primary sort key.

    Second, you count the strokes in the remaining part of the kanji. This becomes the secondary sort key.

    When two kanji have the same radical and the same stroke count, well, there seems to be some tertiary sort key.

    Also, one could sort by reading.
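
The radical-then-stroke-count ordering described above can be sketched as a multi-level sort key. This is only an illustration: the four sample entries were entered by hand, the `Kanji` record is a hypothetical structure, and the tie-breaker on the character itself is an assumption standing in for the tertiary key, which the description above leaves unspecified.

```python
from typing import NamedTuple

class Kanji(NamedTuple):
    char: str
    radical: int   # Kangxi radical number, 1 through 214
    residual: int  # strokes remaining after the radical is removed

# Hand-entered sample data, for illustration only.
SAMPLE = [
    Kanji("林", 75, 4),  # tree radical plus 4 strokes
    Kanji("体", 9, 5),   # person radical plus 5 strokes
    Kanji("村", 75, 3),  # tree radical plus 3 strokes
    Kanji("休", 9, 4),   # person radical plus 4 strokes
]

def dictionary_key(k):
    # Primary key: radical number. Secondary key: residual stroke count.
    # The final component is an assumed tie-breaker, not a real rule.
    return (k.radical, k.residual, k.char)

ordered = [k.char for k in sorted(SAMPLE, key=dictionary_key)]
print(ordered)
```

Sorting by reading instead would need a separate key built from the pronunciation, which this sketch does not attempt.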

    Now an example of “Do What I Mean”.

    I created a bunch of files, like 00.txt, 01.txt, …, 09.txt, 0A.txt, 0B.txt, …, 0F.txt, 10.txt, and so on. Guess how they are sorted. Adding 0x at the left does not help. Adding h at the right does not help. Computers are so hard to use: when you expect them to do the logical thing, they suddenly decide they’re going to be smart and guess What I Mean (incorrectly). Yes, a typical user is unlikely to number files in hexadecimal. But then an atypical user has to use an atypical tool which is specifically created by atypical users for atypical users.

  17. What is the big deal? Try sorting in different languages: French, German, etc.

    The German umlauts are a big mess to sort: there are actually two ways of handling them, and the phone book and the ‘Duden’ (our grammar bible) do it differently.

    To tell you the truth, having been responsible for the sorting algorithm in a database product about 15 years ago, I can’t get excited about punctuation. It’s just the tip of the iceberg.
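
The two umlaut conventions mentioned in this comment can be sketched in Python. As I understand them, the dictionary convention files an umlaut with its base vowel (ä as a), while the phone book convention expands it (ä as ae). The function names and the sample surnames are made up for illustration, and the tie-breaker on the original spelling is an assumption, not part of either convention.

```python
# Dictionary style: an umlaut sorts like its base vowel.
DICTIONARY = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})

# Phone book style: an umlaut sorts like the vowel followed by "e".
PHONE_BOOK = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})

def dictionary_key(name):
    # casefold() lowercases and also turns "ß" into "ss".
    # The original spelling is an assumed tie-breaker for equal keys.
    return (name.casefold().translate(DICTIONARY), name)

def phone_book_key(name):
    return (name.casefold().translate(PHONE_BOOK), name)

names = ["Muller", "Müller", "Mufti", "Mueller"]

dictionary_order = sorted(names, key=dictionary_key)
phone_book_order = sorted(names, key=phone_book_key)

print(dictionary_order)
print(phone_book_order)
```

In the dictionary order Müller sits next to Muller, after Mufti. In the phone book order it sits next to Mueller, before Mufti. Same names, same language, two defensible answers.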

  19. Florian says:

    Riiight, what an extremely useful ASCII table that you have to scroll down page after page and only ever see a small part of it. But it’s got binary! =)

    Here, let me show you some better ones to be found on the Net, like

    http://www.asciitable.com/

    or

    http://www.torsten-horn.de/techdocs/ascii.htm

    or

    http://www.cdrummond.qc.ca/cegep/informat/Professeurs/Alain/files/ascii.htm

    or even

    http://web.cs.mun.ca/~michael/c/ascii-table.html

  20. Rhys says:

    After looking at sorting Japanese, my comment about sorting Welsh consonants (dd, ch, ll, etc.) seems a little bit redundant.

  21. Moi says:

    You can’t call your first picture ‘picture number 1’. You have to call it ‘picture number 1000’

    They’d be better off calling it "Nice picture of my mother" or something else descriptive…

  22. Mike Dimmick says:

    If you don’t like how the files are sorted, open up TweakUI, go to the Explorer node, then uncheck ‘Use intuitive filename sorting’ and hit OK.

    TweakUI download: http://www.microsoft.com/windowsxp/pro/downloads/powertoys.asp

  23. Centaur says:

    Now that’s a truly must-have tool by atypical users for atypical users :)

  24. Petr Kadlec says:

    "I’ll never understand this computer stuff. Why do they make it so hard to use?"

    The problem is that you cannot simply make computers totally "intuitive", because users do not have a logical and consistent way of thinking; humans also base their expectations on context and information that the computer does not even have. I would consider a "totally intuitive computer interface" to be an AI-complete task.

    "I have those historical document files named 1500BC.doc, 66BC.doc, 17AD.doc, 300AD.doc, 1648AD. How could the computer be so dumb that it sorts them so strangely? It is obvious that those are just year numbers!" You can always imagine a "naive user" who won’t understand some feature of the interface.

    I for one (OK, I am a geek) would rather use a simple, consistent, logical (…) interface than an interface that tries to guess what I mean (not what I say) and does things in a way that I would not even be able to explain.

  25. Foo says:

    Okay, so files in Explorer are now sorted according to rules that I, as a geek, don’t understand anymore. So sorting was changed from something that (supposedly) only geeks could understand to something that virtually nobody understands. Bravo, what an improvement!

    At least files sort the way people expect, don’t they? Sometimes they do. But it doesn’t work e.g. with fractional numbers (1.5 < 1.20) and dates.

    Note that these wouldn’t work with classic file sorting, either. But with traditional sorting, at least I know WHY.

  26. Raymond Chen says:

    The problem is that for certain fraction-like strings, it is not always obvious how they should be sorted. Is "v5.950" greater than or less than "v5.1106"? Even GNU leans towards using component-by-component digit sorting for filenames. (strverscmp http://www.delorie.com/gnu/docs/glibc/libc_75.html)
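
A minimal Python sketch of component-by-component digit sorting, for anyone who wants to try the examples from this thread. The `digit_run_key` function is a hypothetical helper written for this illustration. It is neither strverscmp nor whatever Explorer does internally, and it ignores details such as leading zeros that a real implementation has to settle.

```python
import re

def digit_run_key(name):
    # Split into alternating text and digit runs. Because the pattern is
    # a capturing group, odd positions in the result are always digits.
    parts = re.split(r"([0-9]+)", name)
    # Digit runs compare as integers, text runs compare case-insensitively.
    return [int(p) if i % 2 else p.casefold() for i, p in enumerate(parts)]

explorer_example = sorted(["gf100000", "gf99999a"], key=digit_run_key)
version_example = sorted(["v5.1106", "v5.950"], key=digit_run_key)
ls_example = sorted(["1.0", "1.1", "1.10", "1.2", "2.1"], key=digit_run_key)
fraction_example = sorted(["1.20", "1.5"], key=digit_run_key)

print(explorer_example)
print(version_example)
print(ls_example)
print(fraction_example)
```

Under this key "v5.950" sorts before "v5.1106" and "1.5" sorts before "1.20". That is right if the strings are version numbers and wrong if they are decimal fractions, which is exactly the ambiguity at issue.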

    It seems the argument hinges on your feelings about the following statements:

    "This isn’t a matter of life or death. Get the 99% case right." vs "If you can’t get 100%, then don’t even try."

    "The computer should adapt to the way users do things" vs. "The user should adapt to the way computers do things."

  27. Cesar Eduardo Barros says:

    $ ls

    1.0 1.1 1.10 1.2 2.1

    $ ls -v

    1.0 1.1 1.2 1.10 2.1

    Cool, learn a new thing every day.

    Of course, /bin/ls is the only thing that uses it. Konqueror at least does not have that option (it only has the case sensitive/insensitive switch and the "folders first" mode).

    The only other program I found on my box using it besides /bin/ls and /bin/dir was /usr/bin/a2ps, so I guess nobody really uses it.

  28. Marc Wallace says:

    I agree with Foo. Fractional numbers are irritating, as are other numbers.

    I *suppose* one could argue for a sort routine that treated numerical subsequences as "numbers" and sorted those differently. But, cool as this might be, it would be even harder to describe to the end user. (and it doesn’t fix the date thing, but I don’t have lots of files named 304BC…).

    *sniff* I liked ASCII.

  29. Raymond Chen says:

    Commenting on this entry has been closed.

  30. Because "Sort according to the default sort order for the column whose title is… Name" is too long.
