|Date:||April 17, 2007 / year-entry #133|
|Summary:||About every ten months, somebody new discovers the Notepad file encoding problem. Let's see what else there is to say about it. First of all, can we change Notepad's detection algorithm? The problem is that there are a lot of different text files out there. Let's look just at the ones that Notepad supports. 8-bit...|
About every ten months, somebody new discovers the Notepad file encoding problem. Let's see what else there is to say about it.
First of all, can we change Notepad's detection algorithm? The problem is that there are a lot of different text files out there. Let's look just at the ones that Notepad supports.
If a BOM is found, then life is easy, since the BOM tells you what encoding the file uses. The problem is when there is no BOM. Now you have to guess, and when you guess, you can guess wrong. For example, consider this file:
Depending on which encoding you assume, you get very different results.
Okay, so this file can be interpreted in four different ways.
Are you going to use the "try to guess" algorithm from
Some people might say that the rule should be "All files without a BOM are 8-bit ANSI." In that case, you're going to misinterpret all the files that use UTF-8 or UTF-16 and don't have a BOM. Note that the Unicode standard even advises against using a BOM for UTF-8, so you're already throwing out everybody who follows the recommendation.
Okay, given that the Unicode folks recommend against using a BOM for UTF-8, maybe your rule is "All files without a BOM are UTF-8." Well, that messes up all 8-bit ANSI files that use characters above 127.
Maybe you're willing to accept that ambiguity, and use the rule, "If the file looks like valid UTF-8, then use UTF-8; otherwise use 8-bit ANSI, but under no circumstances should you treat the file as UTF-16LE or UTF-16BE." In other words, "never auto-detect UTF-16". First, you still have ambiguous cases, like the file above, which could be either 8-bit ANSI or UTF-8. And second, you are going to be flat-out wrong when you run into a Unicode file that lacks a BOM, since you're going to misinterpret it as either UTF-8 or (more likely) 8-bit ANSI. You might decide that programs that generate UTF-16 files without a BOM are broken, but that doesn't mean that they don't exist. For example,
cmd /u /c dir >results.txt
This generates a UTF-16LE file without a BOM.
If you poke around your Windows directory, you'll probably
find other Unicode files without a BOM.
(For example, I found
The point is that no matter how you decide to resolve the ambiguity, somebody will win and somebody else will lose. And then people can start experimenting with the "losers" to find one that makes your algorithm look stupid for choosing "incorrectly".
<-- Back to Old New Thing Archive Index