Converting from traditional to simplified Chinese, part 1: Loading the dictionary

Date: July 11, 2005 / year-entry #186
Tags: code
Orig Link: https://blogs.msdn.microsoft.com/oldnewthing/20050711-14/?p=34973
Comments: 10
Summary:One step we had glossed over in our haste to get something interesting on the screen in our Chinese/English dictionary program was the conversion from traditional to simplified Chinese characters. The format of the hcutf8.txt file is a series of lines, each of which is a UTF-8 encoded string consisting of a simplified Chinese character...

One step we had glossed over in our haste to get something interesting on the screen in our Chinese/English dictionary program was the conversion from traditional to simplified Chinese characters.

The format of the hcutf8.txt file is a series of lines, each of which is a UTF-8 encoded string consisting of a simplified Chinese character followed by its traditional equivalents. Often, multiple traditional characters map to a single simplified character. Much more rarely—only twice in our data set—multiple simplified characters map to a single traditional character. Unfortunately, one of the cases is the common syllable 麼, which has two simplifications, either 么 or 麽, the first of which is far more productive. We'll have to keep an eye out for that one.

(Note also that in real life, the mapping is more complicated than a character-for-character substitution, but I'm willing to forgo that level of complexity because this is just for my personal use, and people will have realized I'm not a native speaker long before I get caught up in language subtleties like that.)

One could try to work out a fancy data structure to represent this mapping table compactly, but it turns out that simple is better here: an array of 65536 WCHARs, each holding the corresponding simplification. Most of the array will lie unused, since the characters we are interested in lie in the range U+4E00 to U+9FFF. Consequently, the active part of the table is only about 40KB, which fits easily inside the L2 cache.

It is important to know when a simple data structure is better than a complex one.

The hcutf8.txt file contains a lot of fluff that we aren't interested in. Let's strip that out ahead of time so that we don't waste our time parsing it at run-time.

#!perl
$_ = <> until /^# Start zi/; # ignore uninteresting characters
while (<>) {
 s/\r//g;
 next if length($_) == 7 &&
         substr($_, 0, 3) eq substr($_, 3, 3); # ignore NOPs
 print;
}

Run the hcutf8.txt file through this filter to clean it up a bit.

Now we can write our "traditional to simplified" dictionary.

class Trad2Simp
{
public:
 Trad2Simp();
 WCHAR Map(WCHAR chTrad) const { return _rgwch[chTrad]; }

private:
 WCHAR _rgwch[65536]; // woohoo!
};

Trad2Simp::Trad2Simp()
{
 ZeroMemory(_rgwch, sizeof(_rgwch));

 MappedTextFile mtf(TEXT("hcutf8.txt"));
 const CHAR* pchBuf = mtf.Buffer();
 const CHAR* pchEnd = pchBuf + mtf.Length();
 while (pchBuf < pchEnd) {
  const CHAR* pchCR = std::find(pchBuf, pchEnd, '\r');
  int cchBuf = (int)(pchCR - pchBuf);
  WCHAR szMap[80];
  DWORD cch = MultiByteToWideChar(CP_UTF8, 0, pchBuf, cchBuf,
                                  szMap, 80);
  if (cch > 1) {
   // first character is the simplified form; the rest are
   // the traditional characters that map to it
   WCHAR chSimp = szMap[0];
   for (DWORD i = 1; i < cch; i++) {
    if (szMap[i] != chSimp) {
     _rgwch[szMap[i]] = chSimp;
    }
   }
  }
  // advance past the line even if the conversion produced nothing;
  // otherwise a blank or malformed line would loop forever
  pchBuf = std::find(pchCR, pchEnd, '\n') + 1;
 }
 _rgwch[0x9EBC] = 0x4E48; // special case: force 麼 (U+9EBC) to 么 (U+4E48)
}

We read the file one line at a time, convert it from UTF-8, and for each nontrivial mapping, record it in our dictionary. At the end, we do our little 么 special-case patch-up.

Next time, we'll use this mapping table to generate simplified Chinese characters into our dictionary.


Comments (10)
  1. Mat says:

    Just out of interest, is it easy for a native speaker to distinguish between similar looking intricate characters when written so small?

    麼 and 麽 look almost the same to my untrained eye.

  2. Carlos says:

    "is it easy for a native speaker to distinguish between similar looking intricate characters when written so small?"

    I don’t read any ideographic languages, but I do know that the default font size in Japanese Windows is larger than in American versions so the characters are easily distinguished, even by me.

    And I’d guess that context plays a large part when reading. It certainly does in English.

  3. Stephen Sprunk says:

    Why do you go through the hassle of converting when the UTF-8 dictionary file lists both forms? Wouldn’t it be simpler to store each during parsing with a link between them?

  4. Mike Dunn says:

    The default system font size for CJK is 9 pt; 8 pt is too small for hanzi to be easily readable.

    What does the "rg" Hungarian prefix mean? That one has always stumped me.

  5. ryanmy says:

    rg is "range" — it almost always means a pointer to the start of an array. i.e. "rgch" is a ptr to an array of chars, and "pch" is a ptr to somewhere inside that array.

  6. Mihai says:

    This is a very complex thing to do and I think you should make it clear this is just a toy.

    Here is a very interesting link: http://www.cjk.org/cjk/c2c/c2cbasis.htm

    Notes and stories

    1. Bad conversions are easy for natives to identify. I have worked more than 7 years in localization. During this time, I tried several conversion tools. At one point I was comparing the MS conversion (from Office) against the "Chinese Partner" conversion. I liked the MS one because it could convert RTF/DOC files; "Chinese Partner" handled only text. Our Chinese lead translator received RTF files and had no idea what I was using, but he was always able to tell me "you are using the bad converter again" every time I used the MS tool.

    2. "an array of 65536 WCHARs" = Unicode covers more than 64K code points (up to 0x10FFFF), and a big chunk beyond the BMP (the first plane of 64K) is exactly for Chinese. So you have to deal with surrogates to get this right.

    3. The font is important. During Unicode unification, many TC/SC characters were unified, with the differences considered "glyph differences". But if one uses an SC font for TC text, a native will instantly ask "why do you use this SC character?" (readers care about shape, not about Unicode code points). Example: the ideographic dot is at the baseline or centered. Example: the "bone" character (U+9AA8) even changes stroke count (compare it in MingLiU and SimSun).

    4. There are also vocabulary differences, not only glyph differences. Same with US and UK English.

  7. Cheong says:

    Conversion between Traditional Chinese (CHT) and Simplified Chinese (CHS) is always difficult. To make the issue more complicated, there are 3 or 4 officially announced versions of CHS (with the last one so badly made as to be unusable: they attempted to use a single simplified character to mean a whole word, for example a "門" with a "T" in the middle to mean "question", which is an insane thing to do and brings more harm than good). A good converter should be able to deal with the differences; however, being able to handle version 2 or 3 (I can't remember which) is adequate for most situations.

    Converting from CHT to CHS is simple and a good place to start, while converting from CHS to CHT is impossible (with good results) without a 3-pass conversion algorithm.

  8. Norman Diamond says:

    In Japanese copies of Chinese characters there are a number of cases where characters were miscopied and then both the common Japanese form and the originally correct Chinese form are accepted as alternates. I didn’t notice that the same kind of miscopying might have happened inside China and there might be multiple accepted alternates just among the original Chinese forms. But either way, when seeing the three characters in this posting, it looks like two of them were traditional alternates and one is a simplified form. I mean, why make a decision at a governmental level to make a minor change and call it a simplification, at the same time as making a major change and actual simplification for the same character?

    Or maybe one government made a big simplification and another one or two made a lesser simplification? In that case we need to abandon the terms "simplified" and "traditional" and use the names of countries or regions to designate which kind of Chinese is in use. I guess everyone agrees that Hong Kong is a region and Singapore is a country, but does the Republic of China allow the sale of products that mention the existence of the People's Republic of China, and vice versa? Otherwise maybe just "Chinese-1", "Chinese-2", etc. Oh no, who gets the higher numbers?

  9. Norman Diamond says:

    Sorry I forgot to add this to my previous comment:

    Monday, July 11, 2005 2:00 PM by Mihai

    > I liked the MS one, because it was able to

    > convert RTF/DOC files.

    […]

    > But he was always able to tell me "you

    > are using the bad converter again"

    The second time I experimented with OpenOffice, it could read Japanese .doc files (from which version of Microsoft Word I forgot) but couldn’t read Japanese .rtf files. This seems backwards since .rtf is a known simple format, but that was the case. Then I even figured out that a previous correspondent must have been using OpenOffice, because I had originally sent an .rtf file and a request came back for a .doc file because the .rtf had been unreadable.

  10. Stephen asks via the Contact link: I’m making a program doing a Traditional/Simplified Chinese conversion

Comments are closed.


*DISCLAIMER: I DO NOT OWN THIS CONTENT. If you are the owner and would like it removed, please contact me. The content herein is an archived reproduction of entries from Raymond Chen's "Old New Thing" Blog (most recent link is here). It may have slight formatting modifications for consistency and to improve readability.

WHY DID I DUPLICATE THIS CONTENT HERE? Let me first say this site has never had anything to sell and has never shown ads of any kind. I have nothing monetarily to gain by duplicating content here. Because I had made my own local copy of this content throughout the years, for ease of using tools like grep, I decided to put it online after I discovered some of the original content previously and publicly available, had disappeared approximately early to mid 2019. At the same time, I present the content in an easily accessible theme-agnostic way.

The information provided by Raymond's blog is, for all practical purposes, more authoritative on Windows Development than Microsoft's own MSDN documentation and should be considered supplemental reading to that documentation. The wealth of missing details provided by this blog that Microsoft could not or did not document about Windows over the years is vital enough, many would agree an online "backup" of these details is a necessary endeavor. Specifics include:
