Converting from traditional to simplified Chinese, part 1: Loading the dictionary

Date: July 11, 2005 / year-entry #186
Tags: code
Orig Link: https://blogs.msdn.microsoft.com/oldnewthing/20050711-14/?p=34973
Comments: 10
Summary:One step we had glossed over in our haste to get something interesting on the screen in our Chinese/English dictionary program was the conversion from traditional to simplified Chinese characters. The format of the hcutf8.txt file is a series of lines, each of which is a UTF-8 encoded string consisting of a simplified Chinese character...

One step we had glossed over in our haste to get something interesting on the screen in our Chinese/English dictionary program was the conversion from traditional to simplified Chinese characters.

The format of the hcutf8.txt file is a series of lines, each of which is a UTF-8 encoded string consisting of a simplified Chinese character followed by its traditional equivalents. Often, multiple traditional characters map to a single simplified character. Much more rarely—only twice in our data set—multiple simplified characters map to a single traditional character. Unfortunately, one of the cases is the common syllable 麼, which has two simplifications, either 么 or 麽, the first of which is far more productive. We'll have to keep an eye out for that one.

(Note also that in real life, the mapping is more complicated than a character-for-character substitution, but I'm willing to forgo that level of complexity because this is just for my personal use, and people will have realized I'm not a native speaker long before I get caught up in language subtleties like that.)

One could try to work out a fancy data structure to represent this mapping table compactly, but it turns out that simple is better here: an array of 65536 WCHARs, each holding the corresponding simplification. Most of the array will lie unused, since the characters we are interested in lie in the range U+4E00 to U+9FFF. Consequently, the active part of the table is only about 40KB, which fits easily inside the L2 cache.

It is important to know when a simple data structure is better than a complex one.

The hcutf8.txt file contains a lot of fluff that we aren't interested in. Let's strip that out ahead of time so that we don't waste our time parsing it at run-time.

#!perl
$_ = <> until /^# Start zi/; # ignore uninteresting characters
while (<>) {
 s/\r//g;
 next if length($_) == 7 &&
         substr($_, 0, 3) eq substr($_, 3, 3); # ignore NOPs
 print;
}

Run the hcutf8.txt file through this filter to clean it up a bit.

Now we can write our "traditional to simplified" dictionary.

class Trad2Simp
{
public:
 Trad2Simp();
 WCHAR Map(WCHAR chTrad) const { return _rgwch[chTrad]; }

private:
 WCHAR _rgwch[65536]; // woohoo!
};

Trad2Simp::Trad2Simp()
{
 ZeroMemory(_rgwch, sizeof(_rgwch));

 MappedTextFile mtf(TEXT("hcutf8.txt"));
 const CHAR* pchBuf = mtf.Buffer();
 const CHAR* pchEnd = pchBuf + mtf.Length();
 while (pchBuf < pchEnd) {
  const CHAR* pchCR = std::find(pchBuf, pchEnd, '\r');
  int cchBuf = (int)(pchCR - pchBuf);
  WCHAR szMap[80];
  DWORD cch = MultiByteToWideChar(CP_UTF8, 0, pchBuf, cchBuf,
                                  szMap, 80);
  if (cch > 1) {
   // first character is the simplified form; the rest are
   // the traditional characters that map to it
   WCHAR chSimp = szMap[0];
   for (DWORD i = 1; i < cch; i++) {
    if (szMap[i] != chSimp) {
     _rgwch[szMap[i]] = chSimp;
    }
   }
  }
  // advance past the line even if the conversion produced nothing;
  // otherwise a blank or malformed line would loop forever
  pchBuf = std::find(pchCR, pchEnd, '\n') + 1;
 }
 _rgwch[0x9EBC] = 0x4E48; // special case: force 麼 (U+9EBC) to 么 (U+4E48)
}

We read the file one line at a time, convert it from UTF-8, and for each nontrivial mapping, record it in our dictionary. At the end, we do our little 么 special-case patch-up.

Next time, we'll use this mapping table to generate simplified Chinese characters into our dictionary.


Comments (10)
  1. Mat says:

    Just out of interest, is it easy for a native speaker to distinguish between similar looking intricate characters when written so small?

    麼 and 麽 look almost the same to my untrained eye.

  2. Carlos says:

    "is it easy for a native speaker to distinguish between similar looking intricate characters when written so small?"

    I don’t read any ideographic languages, but I do know that the default font size in Japanese Windows is larger than in American versions so the characters are easily distinguished, even by me.

    And I’d guess that context plays a large part when reading. It certainly does in English.

  3. Stephen Sprunk says:

    Why do you go through the hassle of converting when the UTF-8 dictionary file lists both forms? Wouldn’t it be simpler to store each during parsing with a link between them?

  4. Mike Dunn says:

    The default system font size for CJK is 9 pt; 8 pt is too small for hanzi to be easily readable.

    What does the "rg" Hungarian prefix mean? That one has always stumped me.

  5. ryanmy says:

    rg is "range" — it almost always means a pointer to the start of an array. i.e. "rgch" is a ptr to an array of chars, and "pch" is a ptr to somewhere inside that array.

  6. Mihai says:

    This is a very complex thing to do and I think you should make it clear this is just a toy.

    Here is a very interesting link: http://www.cjk.org/cjk/c2c/c2cbasis.htm

    Notes and stories

    1. Bad conversions are easy for natives to identify. I have worked more than 7 years in localization. During this time, I tried several conversion tools. At one point I was comparing the MS conversion (from Office) against the "Chinese Partner" conversion. I liked the MS one because it could convert RTF/DOC files; "Chinese Partner" handled only text. Our Chinese lead translator received RTF files and had no idea what I was using, but he was always able to tell me "you are using the bad converter again" every time I used the MS tool.

    2. "an array of 65536 WCHARs" = Unicode covers more than 64K code points (up to 0x10FFFF), and a big chunk beyond the BMP (the first plane of 64K) is exactly for Chinese. So you have to deal with surrogates to get this right.

    3. The font is important. During Unicode unification, many TC/SC characters were unified, with the differences considered "glyph differences". But if one uses an SC font for TC text, a native will instantly ask "why do you use this SC character?" (readers care about shape, not about Unicode code points). Example: the ideographic dot is at the baseline or centered. Example: the "bone" character (U+9AA8) even changes stroke count (compare it in MingLiU and SimSun).

    4. There are also vocabulary differences, not only glyph differences. Same with US and UK English.

  7. Cheong says:

    Conversion between Traditional Chinese (CHT) and Simplified Chinese (CHS) is always difficult. To make the issue more complicated, there are 3 or 4 officially announced versions of CHS (with the last one so badly made as to be unusable: they attempted to use a single simplified character to mean a whole word, for example a "門" with a "T" in the middle to mean "question", which is an insane thing to do and brings more harm than good). A good converter should be able to deal with the differences; however, being able to handle version 2 or 3 (I can't remember which) is adequate for most situations.

    Converting from CHT to CHS is simple and a good place to start, while converting from CHS to CHT is impossible (with good results) without a 3-pass conversion algorithm.

  8. Norman Diamond says:

    In Japanese copies of Chinese characters there are a number of cases where characters were miscopied and then both the common Japanese form and the originally correct Chinese form are accepted as alternates. I didn’t notice that the same kind of miscopying might have happened inside China and there might be multiple accepted alternates just among the original Chinese forms. But either way, when seeing the three characters in this posting, it looks like two of them were traditional alternates and one is a simplified form. I mean, why make a decision at a governmental level to make a minor change and call it a simplification, at the same time as making a major change and actual simplification for the same character?

    Or maybe one government made a big simplification and another one or two made a lesser simplification? In that case we need to abandon the terms "simplified" and "traditional" and use the names of countries or regions to designate which kind of Chinese is in use. I guess everyone agrees that Hong Kong is a region and Singapore is a country, but does the Republic of China allow the sale of products that mention the existence of the People's Republic of China, and vice versa? Otherwise maybe just "Chinese-1", "Chinese-2", etc. Oh no, who gets the higher numbers?

  9. Norman Diamond says:

    Sorry I forgot to add this to my previous comment:

    Monday, July 11, 2005 2:00 PM by Mihai

    > I liked the MS one, because it was able to

    > convert RTF/DOC files.

    […]

    > But he was always able to tell me "you

    > are using the bad converter again"

    The second time I experimented with OpenOffice, it could read Japanese .doc files (from which version of Microsoft Word I forgot) but couldn’t read Japanese .rtf files. This seems backwards since .rtf is a known simple format, but that was the case. Then I even figured out that a previous correspondent must have been using OpenOffice, because I had originally sent an .rtf file and a request came back for a .doc file because the .rtf had been unreadable.

  10. Stephen asks via the Contact link: I’m making a program doing a Traditional/Simplified Chinese conversion

Comments are closed.


*DISCLAIMER: I DO NOT OWN THIS CONTENT. If you are the owner and would like it removed, please contact me. The content herein is an archived reproduction of entries from Raymond Chen's "Old New Thing" Blog (most recent link is here). It may have slight formatting modifications for consistency and to improve readability.

WHY DID I DUPLICATE THIS CONTENT HERE? Let me first say this site has never had anything to sell and has never shown ads of any kind. I have nothing monetarily to gain by duplicating content here. Because I had made my own local copy of this content throughout the years, for ease of using tools like grep, I decided to put it online after I discovered some of the original content previously and publicly available, had disappeared approximately early to mid 2019. At the same time, I present the content in an easily accessible theme-agnostic way.

The information provided by Raymond's blog is, for all practical purposes, more authoritative on Windows Development than Microsoft's own MSDN documentation and should be considered supplemental reading to that documentation. The wealth of missing details provided by this blog that Microsoft could not or did not document about Windows over the years is vital enough, many would agree an online "backup" of these details is a necessary endeavor. Specifics include:
