Date: | July 11, 2005 / year-entry #186 |
Tags: | code |
Orig Link: | https://blogs.msdn.microsoft.com/oldnewthing/20050711-14/?p=34973 |
Comments: | 10 |
Summary: | One step we had glossed over in our haste to get something interesting on the screen in our Chinese/English dictionary program was the conversion from traditional to simplified Chinese characters. The format of the hcutf8.txt file is a series of lines, each of which is a UTF-8 encoded string consisting of a simplified Chinese character... |
One step we had glossed over in our haste to get something interesting on the screen in our Chinese/English dictionary program was the conversion from traditional to simplified Chinese characters.
The format of the hcutf8.txt file is a series of lines, each of which is a UTF-8 encoded string consisting of a simplified Chinese character followed by its traditional equivalents. (Note also that in real life, the mapping is more complicated than a character-for-character substitution, but I'm willing to forgo that level of complexity because this is just for my personal use and people will have realized I'm not a native speaker long before I get caught up in language subtleties like that.)
One could try to work out a fancy data structure to represent this mapping table compactly, but it turns out that simple is better here: an array of 65536 WCHARs, indexed by the traditional character. It is important to know when a simple data structure is better than a complex one.
The hcutf8.txt file needs a little cleanup before we use it. This perl script skips the uninteresting preamble and drops the no-op mappings:

```perl
#!perl
$_ = <> until /^# Start zi/; # ignore uninteresting characters
while (<>) {
  s/\r//g;
  next if length($_) == 7 &&
          substr($_, 0, 3) eq substr($_, 3, 3); # ignore NOPs
  print;
}
```
Run the hcutf8.txt file through this filter to clean it up. Now we can write our "traditional to simplified" dictionary.

```cpp
class Trad2Simp
{
public:
  Trad2Simp();
  WCHAR Map(WCHAR chTrad) const { return _rgwch[chTrad]; }

private:
  WCHAR _rgwch[65536]; // woohoo!
};

Trad2Simp::Trad2Simp()
{
  ZeroMemory(_rgwch, sizeof(_rgwch));
  MappedTextFile mtf(TEXT("hcutf8.txt"));
  const CHAR* pchBuf = mtf.Buffer();
  const CHAR* pchEnd = pchBuf + mtf.Length();
  while (pchBuf < pchEnd) {
    const CHAR* pchCR = std::find(pchBuf, pchEnd, '\r');
    int cchBuf = (int)(pchCR - pchBuf);
    WCHAR szMap[80];
    DWORD cch = MultiByteToWideChar(CP_UTF8, 0, pchBuf, cchBuf,
                                    szMap, 80);
    if (cch > 1) {
      WCHAR chSimp = szMap[0];
      for (DWORD i = 1; i < cch; i++) {
        if (szMap[i] != chSimp) {
          _rgwch[szMap[i]] = chSimp;
        }
      }
    }
    pchBuf = std::find(pchCR, pchEnd, '\n') + 1;
  }
  _rgwch[0x9EBC] = 0x4E48;
}
```

We read the file one line at a time, convert it from UTF-8, and for each nontrivial mapping, record it in our dictionary. At the end, we do our little 么 special-case patch-up. Next time, we'll use this mapping table to generate simplified Chinese characters into our dictionary.
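The class above depends on `MappedTextFile` and Windows-specific types, so it can't be run standalone. As a self-contained sketch of the same flat-array technique, here is a portable version (using `char16_t` in place of `WCHAR`, with two hard-coded sample mappings standing in for the data parsed from hcutf8.txt):

```cpp
#include <cassert>
#include <vector>

// Minimal, portable sketch of the flat-array mapping idea: a 65536-entry
// array indexed by the traditional character's BMP code point. The two
// sample entries are real mappings: U+9EBC (麼) -> U+4E48 (么), the
// article's special-case patch-up, and U+9580 (門) -> U+95E8 (门).
class Trad2SimpSketch {
public:
    Trad2SimpSketch() : rgwch_(65536, 0) {
        rgwch_[0x9EBC] = 0x4E48; // 麼 -> 么
        rgwch_[0x9580] = 0x95E8; // 門 -> 门
    }
    // Returns the simplified form, or 0 if no mapping is recorded.
    char16_t Map(char16_t chTrad) const { return rgwch_[chTrad]; }
private:
    std::vector<char16_t> rgwch_; // 128KB: simple beats clever here
};
```

The lookup is a single array index, which is the whole point of choosing the "dumb" structure: no hashing, no tree traversal, just `rgwch_[ch]`.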
Comments (10)
Just out of interest, is it easy for a native speaker to distinguish between similar looking intricate characters when written so small?
麼 and 麽 look almost the same to my untrained eye.
"is it easy for a native speaker to distinguish between similar looking intricate characters when written so small?"
I don’t read any ideographic languages, but I do know that the default font size in Japanese Windows is larger than in American versions so the characters are easily distinguished, even by me.
And I’d guess that context plays a large part when reading. It certainly does in English.
Why do you go through the hassle of converting when the UTF-8 dictionary file lists both forms? Wouldn’t it be simpler to store each during parsing with a link between them?
The default system font size in CJK is 9 pt; 8 pt is too small for hanzi to be easily readable.
What does the "rg" Hungarian prefix mean? That one has always stumped me.
rg is "range" — it almost always means a pointer to the start of an array. i.e. "rgch" is a ptr to an array of chars, and "pch" is a ptr to somewhere inside that array.
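The naming convention described in the answer above can be shown in a tiny (made-up) example, where `rgch` is the array itself, `pch` walks through it, and `cch` counts characters:

```cpp
#include <cassert>
#include <cstring>

// Illustration of the Hungarian prefixes discussed above (the buffer
// contents are hypothetical): rgch = "range of chars" (the array),
// pch = "pointer to char" (points into the array), cch = count of chars.
int CountSpaces() {
    char rgch[] = "hello world";          // rgch: the array itself
    int cch = (int)std::strlen(rgch);     // cch: count of characters
    int cSpaces = 0;
    for (char* pch = rgch; pch < rgch + cch; ++pch) { // pch: walks rgch
        if (*pch == ' ')
            ++cSpaces;
    }
    return cSpaces;
}
```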
This is a very complex thing to do and I think you should make it clear this is just a toy.
Here is a very interesting link: http://www.cjk.org/cjk/c2c/c2cbasis.htm
Notes and stories
1. Bad conversions are easy for natives to identify. I have worked more than 7 years in localization. During this time, I have tried several conversion tools. At some point I was comparing the MS conversion (from Office) vs. the "Chinese Partner" conversion. I liked the MS one, because it was able to convert RTF/DOC files; "Chinese Partner" handled only text. Our Chinese lead translator received RTF files and had no idea what I was using. But he was always able to tell me "you are using the bad convertor again" every time I used the MS tool.
2. "an array of 65536 WCHARs" — Unicode covers more than 64K code points (up to 0x10FFFF), and a big chunk beyond the BMP (the first plane of 64K) is exactly for Chinese. So you have to deal with surrogates to get this right.
3. The font is important. During Unicode unification, many TC/SC characters were unified, with the differences considered "glyph differences." But if one uses an SC font for TC text, a native will instantly ask "why do you use this SC character?" (readers care about shape, not about Unicode code points). Example: the ideographic dot is at the baseline or centered. Example: the "bone" character (U+9AA8) changes even the stroke count (compare it in MingLiU and SimSun).
4. There are also vocabulary differences, not only glyph differences. Same with US and UK English.
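Point 2 above can be made concrete: in UTF-16, characters beyond the BMP arrive as surrogate pairs, and a 65536-entry table simply has no slot for them. This small helper (an illustrative sketch, not from the article) shows the standard arithmetic for combining a pair into the full code point; handling such characters would require a different lookup structure than a flat `WCHAR` array:

```cpp
#include <cassert>
#include <cstdint>

// Combine a UTF-16 surrogate pair into its full Unicode code point.
// hi must be a high surrogate in [0xD800, 0xDBFF],
// lo must be a low surrogate in [0xDC00, 0xDFFF].
// E.g. the pair <0xD840, 0xDC00> encodes U+20000, the first character
// of CJK Unified Ideographs Extension B.
uint32_t CombineSurrogates(uint16_t hi, uint16_t lo) {
    return 0x10000u + ((uint32_t)(hi - 0xD800) << 10) + (lo - 0xDC00);
}
```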
Conversion between Traditional Chinese (CHT) and Simplified Chinese (CHS) is always difficult. To make the issue more complicated, there are 3 or 4 versions of officially announced CHS (with the last one marked unusable: they attempted to use one simplified character to mean a whole word, for example using "T" in the middle of "門" to mean "question", and that's an insane thing to do as it brings more evil than good). A good convertor should be capable of dealing with the differences. However, being able to handle version 2 or 3 (I can't remember which) will be adequate for most situations.
Converting from CHT to CHS is simple and good to start with, while converting from CHS to CHT is impossible (with good results) without a 3-pass conversion algorithm.
In Japanese copies of Chinese characters there are a number of cases where characters were miscopied and then both the common Japanese form and the originally correct Chinese form are accepted as alternates. I didn’t notice that the same kind of miscopying might have happened inside China and there might be multiple accepted alternates just among the original Chinese forms. But either way, when seeing the three characters in this posting, it looks like two of them were traditional alternates and one is a simplified form. I mean, why make a decision at a governmental level to make a minor change and call it a simplification, at the same time as making a major change and actual simplification for the same character?
Or maybe one government made a big simplification and another one or two made a lesser simplification? In that case we need to abandon the terms "simplified" and "traditional" and use the names of countries or regions to designate which kind of Chinese is in use. I guess everyone agrees that Hong Kong is a region and Singapore is a country, but does the Republic of China allow sale of products that mention the existence of the People's Republic of China, and vice versa? Otherwise maybe just "Chinese-1", "Chinese-2", etc. Oh no, who gets the higher numbers?
Sorry I forgot to add this to my previous comment:
Monday, July 11, 2005 2:00 PM by Mihai
> I liked the MS one, because it was able to
> convert RTF/DOC files.
[…]
> But he was always able to to tell me "you
> are using the bad convertor again"
The second time I experimented with OpenOffice, it could read Japanese .doc files (from which version of Microsoft Word I forgot) but couldn’t read Japanese .rtf files. This seems backwards since .rtf is a known simple format, but that was the case. Then I even figured out that a previous correspondent must have been using OpenOffice, because I had originally sent an .rtf file and a request came back for a .doc file because the .rtf had been unreadable.
Stephen asks via the Contact link: I’m making a program doing a Traditional/Simplified Chinese conversion