Date: | March 19, 2004 / year-entry #107 |
Tags: | history |
Orig Link: | https://blogs.msdn.microsoft.com/oldnewthing/20040319-00/?p=40173 |
Comments: | 12 |
Summary: | As we learned in a previous entry, string resources group strings into bundles of 16, each Unicode string in the bundle prefixed by a 16-bit length. Why does the Resource Compiler complain about strings longer than 255 characters? This is another leftover from 16-bit Windows. Back in the Win16 days, string resources were also grouped... |
This is another leftover from 16-bit Windows. Back in the Win16 days, string resources were also grouped into bundles of 16, but the strings were in ANSI, not Unicode, and the prefix was only an 8-bit value. And 255 is the largest length you can encode in an 8-bit value. If your 32-bit DLL contains strings longer than 255 characters, then 16-bit programs would be unable to read those strings. This is largely irrelevant nowadays, but the warning remained in the Resource Compiler for quite some time. It appears to be gone now. Good riddance. |
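The two prefix widths described above can be sketched in a few lines of Python. This is an illustrative model only, not the Resource Compiler's actual code: the function names are hypothetical, and the little-endian byte order, the padding of short bundles with empty strings, and counting the 16-bit length in UTF-16 code units are assumptions made for the sketch.

```python
import struct

BUNDLE_SIZE = 16  # strings per bundle, as described in the post


def _pad(strings):
    """Pad a short list with empty strings so the bundle has 16 slots (assumption)."""
    if len(strings) > BUNDLE_SIZE:
        raise ValueError("a bundle holds at most 16 strings")
    return list(strings) + [""] * (BUNDLE_SIZE - len(strings))


def pack_win16_bundle(strings, encoding="cp1252"):
    """Hypothetical Win16-style bundle: 8-bit byte count, then ANSI bytes."""
    out = bytearray()
    for s in _pad(strings):
        data = s.encode(encoding)
        if len(data) > 255:
            # 255 is the largest value an 8-bit prefix can hold.
            raise ValueError("string needs %d bytes; limit is 255" % len(data))
        out.append(len(data))
        out += data
    return bytes(out)


def pack_win32_bundle(strings):
    """Hypothetical Win32-style bundle: 16-bit length, then UTF-16 data."""
    out = bytearray()
    for s in _pad(strings):
        data = s.encode("utf-16-le")
        units = len(data) // 2  # length in 16-bit code units (assumption)
        if units > 0xFFFF:
            raise ValueError("string too long for a 16-bit prefix")
        out += struct.pack("<H", units)
        out += data
    return bytes(out)
```

In this model a 256-character string packs without complaint in the 16-bit-prefix layout, while the 8-bit-prefix layout has no way to record its length, which is the situation the old warning was about.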
Comments (12)
Why do Microsoft employees call the system’s single-byte or multi-byte character encoding "ANSI" when it is never (AFAIK) an ANSI standard encoding?
I always call ’em ASCII….
Sorry, I misread your post. I thought you were just referring to single byte.
Actually, Windows Latin 1 (a.k.a. Windows-1252, or CP1252) is different from ISO-8859-1; it contains 27 printable glyphs in the 80h-to-9Fh range, where ISO-8859-1 contains nonprintable control codes.
The characters defined in ISO-8859-1 correspond exactly to the first 256 Unicode code points, while the Windows Latin 1 characters in the 80h-to-9Fh range correspond to Unicode code points scattered all over the place.
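The difference between the two encodings in the 80h-to-9Fh range can be checked with Python's standard codecs. This is a rough sketch that assumes Python's `cp1252` and `latin-1` codecs are faithful stand-ins for Windows-1252 and ISO-8859-1. The helper name is hypothetical.

```python
def decode_byte(value, encoding):
    """Decode one byte, returning None if the codec leaves it unassigned."""
    try:
        return bytes([value]).decode(encoding)
    except UnicodeDecodeError:
        return None


HIGH_RANGE = range(0x80, 0xA0)  # the 80h-to-9Fh range from the comment

# ISO-8859-1: every byte maps to the Unicode code point of the same value.
latin1_is_identity = all(
    decode_byte(b, "latin-1") == chr(b) for b in range(256)
)

# Windows-1252: collect the bytes in the range that the codec assigns.
cp1252_assigned = {}
for b in HIGH_RANGE:
    ch = decode_byte(b, "cp1252")
    if ch is not None:
        cp1252_assigned[b] = ch

assigned_count = len(cp1252_assigned)
euro_code_point = ord(cp1252_assigned[0x80])        # byte 80h
trademark_code_point = ord(cp1252_assigned[0x99])   # byte 99h
```

With Python's codec tables, `assigned_count` comes out to 27, and bytes such as 80h and 99h land on U+20AC and U+2122, far outside the first 256 code points.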
Actually we typically call them ANSI because the actual interpretation of CHAR * strings is subject to CP_ACP ("the ANSI code page").
Why is it called the ANSI code page? I dunno. I feel fortunate to have avoided the whole 16-bit era myself except for a few questions when I interviewed.
ASCII is a 7-bit character set; and most code assumes that codes in the range 0-127 in a MBCS environment are the ASCII equivalents. I think that this assumption is so widely distributed that this is probably why we don’t have very good MBCS support for encodings where this assumption is not true.
> Why is it called the ANSI code page?
Why *they* are called the ANSI code page, better to say, because there isn’t one fixed “ANSI” code page, there are lots, and it depends on the locale.
Base note:
> If your 32-bit DLL contains strings longer
> than 255 characters, then 16-bit programs
> would be unable to read those strings.
You mean 255 bytes. Depending on the actual characters, anywhere from 128 to 255 of them might be too many for a 16-bit program (when using these APIs). Microsoft still confuses "character" with "byte" too often. Now wait right there, you’re not getting off that easily.
3/19/2004 12:45 PM Steve Sheppard [MSFT]:
> I always call ’em ASCII….
Only one ANSI code page is ASCII. The other ANSI code pages are not ASCII. I hope Mr. Chen gives you a stern lecture as soon as he finishes giving himself one.
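The byte-versus-character point raised in this comment can be illustrated with a small Python sketch. It assumes Python's `cp932` codec as a stand-in for ANSI code page 932, and the helper name is hypothetical.

```python
PREFIX_LIMIT = 255  # largest length an 8-bit prefix can encode


def fits_8bit_prefix(text, encoding):
    """True if the encoded form of text is at most 255 bytes long."""
    return len(text.encode(encoding)) <= PREFIX_LIMIT


single_byte_text = "a" * 255        # one byte per character in cp932
double_byte_text = "\u3042" * 128   # hiragana A, two bytes per character

ascii_fits = fits_8bit_prefix(single_byte_text, "cp932")
kana_fits = fits_8bit_prefix(double_byte_text, "cp932")
shorter_kana_fits = fits_8bit_prefix("\u3042" * 127, "cp932")
```

Here 255 single-byte characters fit, 127 double-byte characters fit (254 bytes), and 128 double-byte characters do not (256 bytes), which matches the "anywhere from 128 to 255" range mentioned above.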
The term "ANSI" is commonly used to refer specifically to the Windows Latin 1 code page, which is also known as Windows-1252 and CP1252. It is not to be confused with ISO-8859-1, which, as I mentioned in an earlier post, contains more control codes but fewer printable characters.
It should also be noted that many Windows code pages are indeed ASCII — or, to be more precise, supersets of ASCII. ASCII is a 7-bit character set, and many Windows code pages (including Latin 1 "ANSI") simply supplement ASCII by adding an eighth bit and up to 128 additional characters.
3/21/2004 6:12 PM J. Edward Sanchez:
> The term "ANSI" is commonly used to refer
> specifically to the Windows Latin 1 code page
All through MSDN, the term "ANSI code pages" refers to all ANSI code pages.
> It should also be noted that many Windows
> code pages are indeed ASCII — or, to be
> more precise, supersets of ASCII.
Yup. Many are. Also, many aren’t.
Of those that aren’t, many come close. Here’s one example: Among all the one-byte and two-byte characters of ANSI code page 932, 126 of the values are officially the same as ASCII values, and one more value is practically the same (no one minds that it displays as a tilde even though officially it’s an overline). For practical purposes only one of the values below 127, and all of the values in several ranges between 128 and 65535, are wildly different from ASCII.
Meanwhile, "ANSI" doesn’t mean "ANSI code page 437" or "ANSI code page 850" or whichever you had in mind; ANSI code pages still mean all ANSI code pages.
Norman: Code pages 437 and 850 are IBM code pages and can be the "OEM code page" on some machines. If I understand correctly, the "OEM code page" is the one that the BIOS uses and that DOS and NT consoles use by default.
So far as I know ISO 8859-1 has nothing to do with ANSI – it is based on the DEC Multinational Character Set and Roman Czyborra says it was originally standardised by ECMA.
ASCII was of course an ANSI (or ASA as it was back then) standard, and Windows "ANSI" code pages are based on ASCII, but then so are the OEM code pages, so that doesn’t explain it either.
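The claim that both the "ANSI" and the OEM code pages build on ASCII can be spot-checked for one pair of them. The sketch below assumes Python's `cp1252` and `cp437` codecs match the Windows and IBM tables. It says nothing about other code pages.

```python
def decode_byte(value, encoding):
    """Decode a single byte value with the given codec."""
    return bytes([value]).decode(encoding)


# Below 80h, both code pages agree with ASCII in Python's tables.
low_half_matches_ascii = all(
    decode_byte(b, "cp1252") == decode_byte(b, "cp437") == chr(b)
    for b in range(0x80)
)

# Above 7Fh they diverge: the same letter gets different byte values.
e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
ansi_bytes = e_acute.encode("cp1252")
oem_bytes = e_acute.encode("cp437")
```

In this sketch the lower 128 values are identical in both code pages, while a character such as U+00E9 is encoded as E9h in code page 1252 and 82h in code page 437.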