|Date:||March 24, 2004 / year-entry #113|
|Summary:||David Cumps discovered that certain text files come up strange in Notepad. The reason is that Notepad has to edit files in a variety of encodings, and when its back is against the wall, sometimes it's forced to guess. Here's the file "Hello" in various encodings: 48 65 6C 6C 6F This is the traditional ANSI...|
The reason is that Notepad has to edit files in a variety of encodings, and when its back is against the wall, sometimes it's forced to guess.
Here's the file "Hello" in various encodings:
Notice that the UTF-7 BOM is just the ASCII string "+/v8-", which is difficult to distinguish from a regular file that happens to begin with those five characters (odd as they may be).
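The byte sequences for "Hello" can be regenerated with a few lines of Python. This is only a sketch: the labels and the exact set of encodings are assumptions, and "ANSI" is assumed here to mean Windows-1252, which may not match the code page on a given machine.

```python
import codecs

TEXT = "Hello"

# Labels are illustrative. "ANSI" is assumed to be Windows-1252 (cp1252).
ENCODED = {
    "ANSI": TEXT.encode("cp1252"),
    "Unicode (little-endian), no BOM": TEXT.encode("utf-16-le"),
    "Unicode (little-endian), BOM": codecs.BOM_UTF16_LE + TEXT.encode("utf-16-le"),
    "Unicode (big-endian), BOM": codecs.BOM_UTF16_BE + TEXT.encode("utf-16-be"),
    "UTF-8, BOM": codecs.BOM_UTF8 + TEXT.encode("utf-8"),
    # The UTF-7 prefix is the plain ASCII string "+/v8-" named in the text.
    "UTF-7, BOM": b"+/v8-" + TEXT.encode("utf-7"),
}


def hex_dump(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return data.hex(" ").upper()


for label, data in ENCODED.items():
    print(f"{hex_dump(data):<36} {label}")
```

The UTF-7 row consists entirely of printable ASCII bytes, which is why that prefix is so easy to confuse with ordinary text.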
The encodings that do not have special prefixes and which are still supported by Notepad are the traditional ANSI encoding (i.e., "plain ASCII") and the Unicode (little-endian) encoding with no BOM. When faced with a file that lacks a special prefix, Notepad is forced to guess which of those two encodings the file actually uses. The function that does this work is IsTextUnicode, which studies a chunk of bytes and does some statistical analysis to come up with a guess.
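The overall shape of that decision can be sketched as follows. This is not how IsTextUnicode works internally: the function name `guess_encoding` and the zero-byte test are hypothetical stand-ins for the real statistical analysis, chosen only to show where the guessing happens.

```python
import codecs

# Prefixes checked first; a file with one of these needs no guessing.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (b"+/v8-", "utf-7"),  # plain ASCII, so this match can be a false positive
]


def guess_encoding(data: bytes) -> str:
    """Hypothetical detector: prefix check, then a crude two-way guess."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name

    # No prefix: choose between ANSI and little-endian Unicode without a BOM.
    # Assumed heuristic: mostly-Latin UTF-16LE text has zero in the high byte
    # of most code units, so count zeros at the odd offsets.
    if len(data) >= 2 and len(data) % 2 == 0:
        high_bytes = data[1::2]
        if high_bytes.count(0) * 2 > len(high_bytes):
            return "utf-16-le"
    return "ansi"


print(guess_encoding(b"Hello"))
print(guess_encoding("Hello".encode("utf-16-le")))
```

Any heuristic of this kind works from a sample of bytes, so the fewer bytes it sees, the less evidence it has to go on.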
And as the documentation notes, "Absolute certainty is not guaranteed." Short strings are most likely to be misdetected.
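To see why short strings are ambiguous, note that the same handful of bytes can be a valid file under both candidate encodings. The snippet below is an illustration only, with "ANSI" again assumed to be Windows-1252; it shows that both readings decode cleanly and says nothing about which one IsTextUnicode would actually pick.

```python
# Six bytes of ANSI text: 48 65 6C 6C 6F 21
data = "Hello!".encode("cp1252")

# Read as ANSI, the bytes are the original string.
as_ansi = data.decode("cp1252")

# Read as little-endian Unicode, each byte pair becomes one code unit,
# and the decode succeeds without error.
as_utf16 = data.decode("utf-16-le")

print(as_ansi)
print([f"U+{ord(ch):04X}" for ch in as_utf16])
```

Both decodes succeed, so a detector has only the statistical plausibility of each result to go on, and six bytes is not much of a sample.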
[Raymond is currently on vacation; this message was pre-recorded.]