Date: | May 31, 2004 / year-entry #214 |
Tags: | history |
Orig Link: | https://blogs.msdn.microsoft.com/oldnewthing/20040531-00/?p=39103 |
Comments: | 14 |
Summary: | Reader Ben Hutchings wanted to know why the 8-bit codepage is called "ANSI" when it isn't actually ANSI. But instead of saying, "Oh well, some things mortals were never meant to know," he went and dug up the answer himself. A quick Google for Windows ANSI misnomer found me exactly what I was looking for... |
Reader Ben Hutchings wanted to know why the 8-bit codepage is called "ANSI" when it isn't actually ANSI. But instead of saying, "Oh well, some things mortals were never meant to know," he went and dug up the answer himself.
Comments (14)
I remember that thread! In it, I explained in greater detail the difference between ISO-8859-1 and Windows Latin 1 (a.k.a. Windows-1252, or CP1252) — although I neglected to mention why the latter is commonly called "ANSI":
http://blogs.msdn.com/oldnewthing/archive/2004/03/19/92648.aspx
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/gdi/devcons_1t10.asp
GetStockObject is missing NULL_PEN in the table for some reason.
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/shellcc/platform/shell/reference/functions/dragqueryfile.asp
DragQueryFile: In the remarks, "Note that the index variable itself returns unchanged, and will therefore remain 0xFFFFFFFF". Duh, it’s passed by value.
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/shellcc/platform/commctls/updown/updown.asp
It says "the full 32-bit range" on the bottom of the page but then it lists -0x7FFFFFFF to +0x7FFFFFFF. The full 32-bit range is actually -0x80000000 to +0x7FFFFFFF.
I think it’s time to make an "Unrelated comments" entry so people won’t have to hijack other entries…
NULL_PEN: Odd indeed.
DragQF: And yet people complain when the documentation doesn’t state the obvious.
UpDown: I’ll have to check what the true range is.
The ‘true range’ will depend on whether your processor does one’s- or two’s-complement arithmetic (although everything Windows currently runs on is two’s-complement). One’s complement has the odd property that you can actually represent -0 (it has the bit pattern 0xFFFFFFFF for a 32-bit number).
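A small sketch of the two ranges being discussed, plus the negative-zero bit pattern. The function names are made up for illustration, and Python integers are unbounded, so the 32-bit width is simulated with masks:

```python
BITS = 32

def twos_complement_range(bits):
    """Smallest and largest values of a signed two's-complement integer."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1

def ones_complement_range(bits):
    """One's complement is symmetric; the spare pattern is negative zero."""
    top = (1 << (bits - 1)) - 1
    return -top, top

def ones_complement_negate(pattern, bits):
    """Negating in one's complement flips every bit of the stored pattern."""
    return ~pattern & ((1 << bits) - 1)

lo, hi = twos_complement_range(BITS)
print(hex(lo), hex(hi))                       # -0x80000000 0x7fffffff
print(hex(ones_complement_negate(0, BITS)))   # 0xffffffff, i.e. "-0"
```

The one's-complement range works out to -0x7FFFFFFF through +0x7FFFFFFF, which happens to match the limits quoted from the up-down control page.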
For a bonus point – why is Windows’ use of the term Unicode also a misnomer?
My answer: Unicode refers to an abstract, logical coding of characters and character components. The physical two-byte-code-unit encoding used by Windows NT 4.0 and earlier is UCS-2 (Universal Character Set encoded in units of 2 bytes), while that used by Windows 2000 and later is UTF-16 (Unicode Transformation Format, 16-bit). The difference is that UTF-16 introduces surrogates for characters whose abstract code is greater than U+FFFF: a surrogate pair uses two encoding units of 16 bits each to represent a single Unicode code point.
When Windows/MSDN documentation refers to Unicode, UCS-2 or UTF-16 is almost always the meaning intended. Windows doesn’t appear to support UTF-32/UCS-4 as a possible encoding.
To simplify (!) things, I refer to the traditional encodings as byte-oriented character sets – because there are characters encoded which only require one byte in the encoding. UTF-16 is a WORD-oriented encoding because each character requires a multiple of 2 bytes to encode (either a single 2-byte code encoding a single character, or two 2-byte codes making up a surrogate pair).
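To make the "one unit or two" point concrete, here is a minimal sketch of how a code point maps to UTF-16 code units. The helper name `to_utf16_units` is hypothetical; the arithmetic is the standard surrogate-pair calculation, and the last line cross-checks it against Python's own encoder:

```python
def to_utf16_units(code_point):
    """Return the list of 16-bit code units that encode one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are reserved")
    if code_point < 0x10000:
        return [code_point]               # fits in a single unit
    offset = code_point - 0x10000         # 20 bits left to split in half
    lead = 0xD800 + (offset >> 10)        # lead (high) surrogate: top 10 bits
    trail = 0xDC00 + (offset & 0x3FF)     # trail (low) surrogate: low 10 bits
    return [lead, trail]

units = to_utf16_units(0x1D11E)           # MUSICAL SYMBOL G CLEF
print([hex(u) for u in units])            # ['0xd834', '0xdd1e']

# Cross-check against the built-in big-endian UTF-16 encoder.
assert chr(0x1D11E).encode("utf-16-be") == b"\xd8\x34\xdd\x1e"
```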
In Windows documentation you’ll also see the terms SBCS, DBCS and MBCS (single-byte character set, double-byte character set, multi-byte character set). DBCS is really a misnomer because most DBCS sets have some characters encoded with a single byte. MBCS is a covering term for SBCS and DBCS.
Oh boy. At least with UCS-2 you knew that one base unit = one code point. I suppose it doesn’t make that much of a difference: nowhere does Unicode guarantee that one code point is one glyph or one basic linguistic concept. Sure, now you can represent every language at once, but you still have to worry about slicing things. String handling sucks.
and fwiw, C++ doesn’t guarantee that -0x80000000 will be in the range of a 32-bit integer either.
And irritatingly, Windows is quite happy to label its Windows character sets as "ascii" or "iso-8859-1" in things like email messages, or in web pages (both served by a web server, or submitted by a web browser).
This makes non-Microsoft OSes show ?’s or square boxes all over the place as they encounter invalid characters.
Perry: Microsoft’s applications used to do that but they now seem to be quite consistent in using the correct names like "windows-1252". You can also choose whether the standard or proprietary encoding is used: "Western European (ISO)" is ISO 8859-1 whereas "Western European (Windows)" is code page 1252.
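A quick way to see why the label matters, as a sketch using Python's built-in codecs (`cp1252` and `latin-1` are the Python names for code page 1252 and ISO 8859-1). The sample bytes are just an illustration:

```python
# "Hi" wrapped in curly quotes, as a Windows-1252 application would store it.
raw = bytes([0x93, 0x48, 0x69, 0x94])

as_cp1252 = raw.decode("cp1252")    # curly quotes U+201C and U+201D
as_latin1 = raw.decode("latin-1")   # C1 control characters U+0093 and U+0094

# Collect every byte value on which the two code pages disagree.
# errors="replace" turns the few bytes cp1252 leaves undefined into U+FFFD.
differing = [
    b for b in range(256)
    if bytes([b]).decode("cp1252", errors="replace") != bytes([b]).decode("latin-1")
]
print(hex(differing[0]), hex(differing[-1]), len(differing))  # 0x80 0x9f 32
```

Everything outside 0x80 to 0x9F decodes identically, which is why mislabelled text mostly looks fine until a curly quote or a euro sign turns up.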
At least with the UTF series, you can walk a string backwards. With UTF-8 (encoded using the canonical representation; it is possible to encode characters illegally) you can tell whether a unit represents a single code point, or a trailing byte, or the lead byte of a two-, three- or four-unit encoding. Single-byte code points are always under 0x80, trail bytes are between 0x80 and 0xBF, lead bytes of a two-byte encoding are between 0xC2 and 0xDF, three-byte between 0xE0 and 0xEF and four-byte between 0xF0 and 0xF7 (although only 0xF0 to 0xF4 can occur in valid text, since Unicode stops at U+10FFFF). The pattern is basically (binary):
0xxxxxxx = single byte
10xxxxxx = trail byte
110xxxxx = two-byte lead byte
1110xxxx = three-byte lead byte
11110xxx = four-byte lead byte
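The bit patterns above are enough to step backwards through a buffer. A minimal sketch, assuming the input is already well-formed UTF-8 (the function names are invented for illustration, and `classify` looks only at the bit pattern, so it does not reject overlong leads such as 0xC0):

```python
def classify(byte):
    """Classify one UTF-8 byte purely by its leading bits."""
    if byte < 0x80:
        return "single"   # 0xxxxxxx
    if byte < 0xC0:
        return "trail"    # 10xxxxxx
    if byte < 0xE0:
        return "lead2"    # 110xxxxx
    if byte < 0xF0:
        return "lead3"    # 1110xxxx
    if byte < 0xF8:
        return "lead4"    # 11110xxx
    return "invalid"

def previous_char_start(data, index):
    """Index where the character that ends just before `index` begins."""
    i = index - 1
    while i > 0 and classify(data[i]) == "trail":
        i -= 1            # skip trail bytes until a lead or single byte
    return i

def walk_backwards(data):
    """Yield the characters of well-formed UTF-8 `data`, last one first."""
    end = len(data)
    while end > 0:
        start = previous_char_start(data, end)
        yield data[start:end].decode("utf-8")
        end = start

sample = "a\u00e9\u20ac\U0001d11e".encode("utf-8")   # 1, 2, 3 and 4 byte forms
print([hex(ord(c)) for c in walk_backwards(sample)])
# ['0x1d11e', '0x20ac', '0xe9', '0x61']
```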
UTF-16 uses the values 0xD800 – 0xDBFF for the leading surrogate unit and 0xDC00 – 0xDFFF for the trailing unit. These values are reserved in the logical encoding.
Meh. Most MBCS that I know of use one or two bytes for each character and you can at least identify lead bytes. It’s only marginally more difficult to walk a string backwards in that case. And you need to scan all your strings ahead of time to make sure they’re valid UTF before you can really take advantage of it.
Wait, no, I guess you usually can’t… You can distinguish a lead byte from ascii, but not necessarily from a trail byte. :/
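Shift-JIS is one concrete case of this ambiguity. The commenter did not name an encoding, so picking Shift-JIS here is an assumption, and `is_sjis_lead` is a hypothetical helper covering the usual lead-byte ranges:

```python
def is_sjis_lead(byte):
    """True if the byte falls in a Shift-JIS lead-byte range."""
    return 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC

small_a = "\u3041".encode("shift_jis")   # HIRAGANA LETTER SMALL A
so = "\u30bd".encode("shift_jis")        # KATAKANA LETTER SO

# The trail byte of the first character sits inside the lead-byte range,
# so a backwards scan that lands on it cannot tell which role it plays.
print(small_a.hex(), is_sjis_lead(small_a[1]))   # 829f True

# Trail bytes can also collide with ASCII: this one is a backslash.
print(so.hex(), so[1] == ord("\\"))              # 835c True
```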
Because it once was, though no longer is.