Case mapping on Unicode is hard

Date:September 5, 2003 / year-entry #48
Tags:code
Orig Link:https://blogs.msdn.microsoft.com/oldnewthing/20030905-00/?p=42643
Comments:    2

Occasionally, I'm asked, "I have to identify strings that are identical, case-insensitively.  How do I do it?"

The answer is, "Well, it depends. Whose case-mapping rules do you want to use?"

Sometimes the reply is, "I want this to be language-independent."

Now you have a real problem.

Every locale has its own case-mapping rules. Many of them are in conflict with the rules for other locales. For example, which of the following pairs of words compare equal case-insensitively?

1. gif GIF
2. Maße MASSE
3. Maße Masse
4. même MEME

Answers:

  1. no in Turkey, yes in US
  2. no in US, yes in Germany
  3. no in US, no in Germany, yes in Switzerland! (Though you would likely never see it written as "Maße" in Switzerland.)
  4. yes in France, no in Quebec!

(And I've heard that the capitalization rules for German are context-sensitive. Maybe that changed with the most recent spelling reform.) Unicode Technical Report #21 has more examples.
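The first two pairs can be reproduced in a few lines of Python. This is only a minimal sketch: Python's built-in string methods are locale-independent, so the Turkish rule is hand-coded in a hypothetical helper (`upper_turkish`, not a library function), and the other helper names are likewise made up for illustration.

```python
# Python's str.upper()/lower()/casefold() ignore the locale, so the Turkish
# dotted/dotless i rule is hand-coded here. Illustration only, not a library API.
_TURKISH_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})


def upper_turkish(s: str) -> str:
    """Uppercase s with the Turkish rule: i -> U+0130, U+0131 -> I."""
    return s.translate(_TURKISH_UPPER).upper()


def equal_default(a: str, b: str) -> bool:
    """Compare using Unicode default (locale-independent) case folding."""
    return a.casefold() == b.casefold()


def equal_turkish(a: str, b: str) -> bool:
    """Compare by uppercasing both sides with the hand-coded Turkish rule."""
    return upper_turkish(a) == upper_turkish(b)


# Pair 1: equal under default folding, different under the Turkish rule,
# because "gif" uppercases to "G\u0130F" (capital I with a dot above).
print(equal_default("gif", "GIF"), equal_turkish("gif", "GIF"))

# Pair 2: full case folding expands the sharp s to "ss", so the words match;
# plain lowercasing leaves the sharp s alone, so they do not.
print(equal_default("Maße", "MASSE"), "Maße".lower() == "MASSE".lower())
```

Note that default case folding also treats "Maße" and "Masse" as equal, so a single locale-independent rule cannot give different answers for pairs 2 and 3.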

Just because you're using Unicode doesn't mean that all your language problems are solved. Indeed, the ability to represent characters in nearly all of the world's languages means that you have more things to worry about, not fewer.


Comments (2)
  1. David says:

    Raymond, I am no expert in Unicode, but from a seminar I attended a few years ago and from http://www.unicode.org/reports/tr10/#French_Accents it would appear that your example 4 is incorrect (i.e., no in France). This is an oddity of the French language that I do not remember having learnt at school in France, but discovered while living in England!

  2. Raymond Chen says:

    It is my recollection that the French rule for capitalization is that when an accented character is converted to uppercase, it loses its accent mark. Therefore, a capital E compares equal to any accented lowercase e, because there is no such thing as an accented capital E. (However, an unaccented e does not compare equal to an accented e.) French Canadian, however, preserves the accent mark on capitalization. That was the rule I was trying to exhibit (and failed).
