The Notepad file encoding problem, redux

Date: April 17, 2007 / year-entry #133
Tags: other
Orig Link: https://blogs.msdn.microsoft.com/oldnewthing/20070417-00/?p=27223
Comments: 65
Summary:About every ten months, somebody new discovers the Notepad file encoding problem. Let's see what else there is to say about it. First of all, can we change Notepad's detection algorithm? The problem is that there are a lot of different text files out there. Let's look just at the ones that Notepad supports. 8-bit...

About every ten months, somebody new discovers the Notepad file encoding problem. Let's see what else there is to say about it.

First of all, can we change Notepad's detection algorithm? The problem is that there are a lot of different text files out there. Let's look just at the ones that Notepad supports.

  • 8-bit ANSI (of which 7-bit ASCII is a subset). These have no BOM; they just dive right in with bytes of text. They are also probably the most common type of text file.

  • UTF-8. These usually begin with a BOM but not always.

  • Unicode big-endian (UTF-16BE). These usually begin with a BOM but not always.

  • Unicode little-endian (UTF-16LE). These usually begin with a BOM but not always.
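
Each of the Unicode flavors above announces itself, when it does, with a byte order mark at the start of the file, so BOM detection is just a comparison of the first few bytes. Here is a minimal sniffing sketch in C (my own illustration, not Notepad's actual code):

#include <stddef.h>

/* Classify a buffer by its BOM, if it has one. A sketch of the
   general technique; the enum names are mine. */
typedef enum { ENC_UNKNOWN, ENC_UTF8, ENC_UTF16LE, ENC_UTF16BE } Encoding;

Encoding SniffBom(const unsigned char *p, size_t cb)
{
    if (cb >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return ENC_UTF8;     /* U+FEFF encoded as UTF-8 */
    if (cb >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return ENC_UTF16LE;  /* U+FEFF stored low byte first */
    if (cb >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return ENC_UTF16BE;  /* U+FEFF stored high byte first */
    return ENC_UNKNOWN;      /* no BOM; now you have to guess */
}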

If a BOM is found, then life is easy, since the BOM tells you what encoding the file uses. The problem is when there is no BOM. Now you have to guess, and when you guess, you can guess wrong. For example, consider this file:

D0 AE

Depending on which encoding you assume, you get very different results.

  • If you assume 8-bit ANSI (with code page 1252), then the file consists of the two characters U+00D0 U+00AE, or "Ð®". Sure, this looks strange, but maybe it's part of the word VATNIÐ® which might be the name of an Icelandic hotel.

  • If you assume UTF-8, then the file consists of the single Cyrillic character U+042E, or "Ю".

  • If you assume Unicode big-endian, then the file consists of the Korean Hangul syllable U+D0AE, or "킮".

  • If you assume Unicode little-endian, then the file consists of the Korean Hangul syllable U+AED0, or "껐".
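
You can watch all four readings fall out of the same two bytes with a few lines of Win32 C (a quick sketch of my own, for illustration):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    const unsigned char file[] = { 0xD0, 0xAE };
    WCHAR wide[4];
    int n;

    /* 8-bit ANSI, code page 1252: two characters, U+00D0 U+00AE */
    n = MultiByteToWideChar(1252, 0, (const char *)file, 2, wide, 4);
    printf("CP1252:   %d chars, U+%04X U+%04X\n", n, wide[0], wide[1]);

    /* UTF-8: D0 AE is a single valid two-byte sequence, U+042E */
    n = MultiByteToWideChar(CP_UTF8, 0, (const char *)file, 2, wide, 4);
    printf("UTF-8:    %d char,  U+%04X\n", n, wide[0]);

    /* UTF-16LE: low byte first, so the lone code unit is 0xAED0 */
    printf("UTF-16LE: 1 char,  U+%04X\n", file[0] | (file[1] << 8));

    /* UTF-16BE: high byte first, so the lone code unit is 0xD0AE */
    printf("UTF-16BE: 1 char,  U+%04X\n", (file[0] << 8) | file[1]);
    return 0;
}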

Okay, so this file can be interpreted in four different ways. Are you going to use the "try to guess" algorithm from IsTextUnicode? (Michael Kaplan has some thoughts on this subject.) If so, then you are right where Notepad is today. Notice that all four interpretations are linguistically plausible.
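
For reference, a caller consults that heuristic roughly as follows. This is a minimal sketch; which flags Notepad actually passes is my assumption, not documented behavior:

#include <windows.h>

/* Ask the system heuristic whether a buffer looks like UTF-16 text.
   On input the flags select which tests to run; on output they
   report which tests passed. */
BOOL LooksLikeUtf16(const void *buffer, int cb)
{
    INT flags = IS_TEXT_UNICODE_UNICODE_MASK | IS_TEXT_UNICODE_REVERSE_MASK;
    return IsTextUnicode(buffer, cb, &flags);
}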

Some people might say that the rule should be "All files without a BOM are 8-bit ANSI." In that case, you're going to misinterpret all the files that use UTF-8 or UTF-16 and don't have a BOM. Note that the Unicode standard even advises against using a BOM for UTF-8, so you're already throwing out everybody who follows the recommendation.

Okay, given that the Unicode folks recommend against using a BOM for UTF-8, maybe your rule is "All files without a BOM are UTF-8." Well, that messes up all 8-bit ANSI files that use characters above 127.

Maybe you're willing to accept that ambiguity, and use the rule, "If the file looks like valid UTF-8, then use UTF-8; otherwise use 8-bit ANSI, but under no circumstances should you treat the file as UTF-16LE or UTF-16BE." In other words, "never auto-detect UTF-16". First, you still have ambiguous cases, like the file above, which could be either 8-bit ANSI or UTF-8. And second, you are going to be flat-out wrong when you run into a Unicode file that lacks a BOM, since you're going to misinterpret it as either UTF-8 or (more likely) 8-bit ANSI. You might decide that programs that generate UTF-16 files without a BOM are broken, but that doesn't mean that they don't exist. For example,

cmd /u /c dir >results.txt

This generates a UTF-16LE file without a BOM. If you poke around your Windows directory, you'll probably find other Unicode files without a BOM. (For example, I found COM+.log.) These files still "worked" under the old IsTextUnicode algorithm, but now they are unreadable. Maybe you consider that an acceptable loss.
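
For concreteness, the "looks like valid UTF-8" test mentioned above amounts to checking the lead/trail byte structure. A sketch (mine, for illustration; a production validator would also reject overlong forms and surrogate code points):

#include <stddef.h>

/* Returns 1 if the buffer is structurally well-formed UTF-8. */
int LooksLikeUtf8(const unsigned char *p, size_t cb)
{
    size_t i = 0, j, trail;
    while (i < cb) {
        unsigned char b = p[i];
        if (b < 0x80) { i++; continue; }        /* plain ASCII */
        else if ((b & 0xE0) == 0xC0) trail = 1; /* 2-byte sequence */
        else if ((b & 0xF0) == 0xE0) trail = 2; /* 3-byte sequence */
        else if ((b & 0xF8) == 0xF0) trail = 3; /* 4-byte sequence */
        else return 0;                          /* invalid lead byte */
        if (i + trail >= cb) return 0;          /* truncated sequence */
        for (j = 1; j <= trail; j++)
            if ((p[i + j] & 0xC0) != 0x80) return 0; /* bad trail byte */
        i += trail + 1;
    }
    return 1;
}

Note that a pure-ASCII file passes this test trivially, and some ANSI files (like the D0 AE example above) pass it too, so "valid UTF-8" narrows the guess without eliminating it.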

The point is that no matter how you decide to resolve the ambiguity, somebody will win and somebody else will lose. And then people can start experimenting with the "losers" to find one that makes your algorithm look stupid for choosing "incorrectly".


Comments (65)
  1. James Risto says:

    Because I am too lazy to do the research, does anyone know why this causes Notepad to think it’s Unicode?

    http://www.hoax-slayer.com/bush-hid-the-facts-notepad.html

    [I thought I just spent two days discussing this. It’s guessing, and in this case, it guesses wrong. -Raymond]
  2. littleguru says:

    This might sound foolish, but why does the textbox not allow you to select the encoding in the context menu?

  3. Staffan says:

    My favorite notepad "bug" is that it (unlike write.exe) won’t recognize unix-style line endings…

  4. Rob H says:

    James Risto: The bytes happen to look a lot like UTF-16, presumably all from the same region of Unicode or close enough to fool the heuristic.

    Key features include, but are not limited to, the even number of characters and the fact that all of the spaces fall on odd characters.

  5. Puckdropper says:

    James, do the research and click on the first link:  "the Notepad file encoding problem"

    If you’re really curious, this will at least get you thinking.

  6. strik says:

    … and, not to forget, the "guessing" of Notepad which always guesses the file uses CR/LF, even if it does not.

  7. Nick says:

    One reason people are hesitant to save UTF-8 files with a BOM is that it breaks server-side includes (and partial pages included via PHP) in Internet Explorer (even 7).

    Of course, of all the major browsers, only IE chokes on the stray BOMs in the middle of the page (because they were at the beginning of the files that were server-side-included and ended up in the middle of the page).

    I predict that Microsoft will fix this some time between 2045 and the heat death of the universe.

  8. S says:

    If unsure – just ask the user and do _NOT_ try to use psychic powers.

    [Since you are almost never 100% sure, this boils down to “Always ask the user.” I bet you’d hate that. -Raymond]
  9. pcooper says:

    The problem with asking the user is that usually the user doesn’t know the first thing about character encoding systems.

  10. Adam says:

    Nick > Interesting – how does it "break" such things? A stray BOM in the page should just be interpreted as a ZWNJ. If it were in the middle of a long word that happened to fall near the end of a line, you’d get an unexpected word break, but I don’t see how that could be thought of as being particularly "broken". Moreover, I don’t see how you’d end up with two parts of a word in different files.

    Or does it output the "no glyph" box symbol for ZWNJs?

    Can you expand on this?

    But, the general problem is one of the main reasons why BOMs are discouraged. It breaks the semantics of being able to concatenate two text files together simply by appending one stream of bytes to another. You now have an extra character in the resultant stream that was not "in" the original files.

  11. BryanK says:

    pcooper — True, but depending on the dialog, they might not have to know anything about character encodings to answer it, either.

    If it’s possible to narrow the (large) number of possible encodings down to a smaller number (say two to five), it may be possible to design a dialog that would allow the user to choose an encoding from this smaller list.  Show them a preview of the text as it would display under the currently-selected encoding; then they can just switch between encodings until they find one that makes the text look right.

    I don’t think it’d be a good idea to require the user to go through this dialog whenever Notepad isn’t sure (because as Raymond says, it almost never is sure).  It’s probably a bad idea to require this dialog to come up at any point, in fact.  But it should be possible to have it as a right-click option or a menu item somewhere; that way when a file is guessed wrong, at least the user can override the wrong choice.

    (And to make it even more usable, it may be good to give the dialog an option for “save this choice with this file” if the chosen encoding will allow it — when the user chooses that, you’d add a BOM, or something, so there’s no ambiguity when reloading the file.  Or you’d add an alternate data stream to the file that acts as an encoding hint (which, since it isn’t a BOM, won’t affect concatenating files — though I haven’t thought much about that, so it may have other problems).  But if the user doesn’t have to keep bringing up this dialog every time they look at the file, that’d be good.)

    Of course maybe this is all too much complexity for a simple program like Notepad, too.

    [You already have the option to specify the encoding manually: Ctrl+O. -Raymond]
  12. Mihai says:

    I guess one of the traits of a nitpicker is that he does not realize he is a nitpicker. For him, the point is valid and important.

    So, even at the risk of being classified as a nitpicker, I will make two (unrelated) points:

    1. Not sure what "8-bit ANSI" is.

    In the Windows lingo "ANSI" means "the default system locale" (or "Language for Non-Unicode programs", in XP UI terminology).

    This includes double byte code pages like 932 (Japanese), 936 (Chinese Simplified), 949 (Korean), and 950 (Chinese Traditional).

    They can still be considered 8-bit if you consider the "coding unit" (Unicode terminology), but it is confusing.

    For this article, just "ANSI" should suffice.

    2. I agree that it is not really possible to make the detection smarter. But there are a couple of easy improvements:

      a. when I select "Save As…" and UTF-8, also give me a BOM option (I am in "help the application" mode anyway)

      b. when opening a UTF-8 no-BOM file and the encoding is properly detected, don’t add the BOM "just in case." It was probably no-BOM for a reason.

    I will not even try to suggest "if you cannot detect the encoding 100% reliably, ask the user" :-) It would be nice, but for this there are some free tools out there, and they are easy to find.

  13. Nick says:

    Adam–

    The IE behavior is varied and unpredictable. I’ve seen it cause stylesheets to refuse to load unless one manually refreshes the page with an F5 (regardless of whether the page is already cached). It can also cause bizarre encoding-vomit characters to be output, and sometimes extra line breaks. But the most annoying is the stylesheet issue.

    To be fair, some people say that PHP or the SSI on your server of choice should be able to detect the "first" include and strip the BOM from subsequent ones, but it seems like this would be hard to figure out, especially considering that output buffering may or may not be off (if output buffering is on, the order of includes is not necessarily the order they appear on the page).

    Side note:

    It should also be noted that there’s a lot of misinformation on the issue, especially since HTTP’s concept of expressing encoding is through "charset", when it’s specifying not a character set (HTML is Unicode) but instead an encoding. (I personally find the whole idea of referring to UTF-16 as "Unicode", as Notepad’s save dropdown does, terribly confusing, since it uses one name for both the encoding and the character set. I mean, you have "UTF-8, Unicode, Unicode big endian". What? They’re all Unicode. But, to MS’s credit, I do believe that’s standard practice.)

    Some linkage from the Google:

    http://neosmart.net/blog/2006/expression-web-designer-and-internet-explorer-think-again/

    http://drupal.org/node/64155

    http://copia.ogbuji.net/blog/2005-10-20/Today_s_XM

  14. Tyler Reddun says:

    It’s because of this sort of issue that the XML spec is very specific about how a file begins. An XML file must start with "<?xml" in the Latin character set. This means that if the file is lacking a BOM (or a MIME type with the encoding) you can read the first few bytes of the file and work out a few things: whether it’s an 8-bit or 16-bit encoding, and the byte order of a 16-bit encoding. The only thing you can’t learn is whether it’s UTF-8 or 8-bit ANSI.

    It’s supposed to assume UTF-8 (as that is the ‘correct’ 8-bit formatting), but many parsers will downshift to 8-bit ANSI if they find non-UTF-8 sequences in the file.

    Of course, none of this helps with the Notepad case; it’s just an interesting contrast.
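
    (A sketch of that first-bytes trick, assuming the document really does begin with "<?xml"; the function and its name are mine, for illustration:)

    #include <stddef.h>

    /* Guess the encoding family of an XML document from its first two
       bytes, assuming it begins with "<?xml". */
    const char *GuessXmlEncodingFamily(const unsigned char *p, size_t cb)
    {
        if (cb < 2) return "too short to tell";
        if (p[0] == 0x3C && p[1] == 0x3F) return "8-bit (UTF-8 or ANSI)"; /* "<?" */
        if (p[0] == 0x3C && p[1] == 0x00) return "UTF-16LE"; /* '<' then 0x00 */
        if (p[0] == 0x00 && p[1] == 0x3C) return "UTF-16BE"; /* 0x00 then '<' */
        return "unknown";
    }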

  15. Red Armstrong says:

    "If a BOM is found, then life is easy, since the BOM tells you what encoding the file uses."

    Actually that’s not entirely true; even if there "appears" to be a BOM, the file might be in one of several ANSI code pages that have characters that can look like a UTF-8 BOM.  You really don’t know, in the general case, unless you’re told what encoding it is.

    In practice you can just assume that if it looks like a BOM, it is a BOM.

  16. Gabe says:

    Although I don’t think it’s necessarily a good idea to implement this, I do believe I can think of a way to make the heuristic a bit "smarter".

    Unicode has a number of defined character ranges (some being for private use). The detector could go with the encoding that minimizes the out-of-range characters in the file. If there are "too many" invalid characters in the file, assume it’s 8-bit ANSI.
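
    (One way to approximate this idea with Win32, sketched by me; note that MB_ERR_INVALID_CHARS support varies by code page and Windows version. Ask each candidate conversion to fail on invalid sequences and prefer a candidate that converts cleanly:)

    #include <windows.h>

    /* Returns TRUE if the buffer converts under the given code page with
       no invalid byte sequences. Trying CP_UTF8 first and then the ANSI
       code page gives a crude "fewest invalid characters" test. */
    BOOL DecodesCleanly(UINT codePage, const char *buf, int cb)
    {
        return MultiByteToWideChar(codePage, MB_ERR_INVALID_CHARS,
                                   buf, cb, NULL, 0) != 0;
    }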

  17. Hadrien says:

    Usually I don’t mind making my text files Notepad-friendly by adding a BOM. As mentioned in comments above, what really upsets me is having to use U+000D U+000A instead of a single U+000A for line endings :-( I haven’t tried Vista; is it now corrected?

  18. John says:

    Sometimes I wonder how different it would be if 20 years ago everybody knew the problems we had today.

  19. Dusty says:

    The reason I use notepad is that I always thought that it only edited ANSI text.  Whenever I wanted to see the "bytes" of a file as characters, I would use notepad.  Kind of like a hex editor, but without the hex.  If I needed something that showed "pretty" text I would use WordPad instead.

  20. - says:

    Hadrien: Of course it is not corrected. Not many long-standing bugs have been fixed in Vista, to be honest :(

  21. cmov says:

    "you are going to be flat-out wrong when you run into a Unicode file that lacks a BOM, since you’re going to misinterpret it as either UTF-8 or (more likely) 8-bit ANSI"

    UTF-8 is also a Unicode encoding, right? So a UTF-8 file is also a Unicode file.

    Of course, the real problem here is with the handling of file types and file metadata. If a file type could be perfectly transported in a platform-independent way, this problem would not exist. Unfortunately, we have kludges like using the part after a period to judge the file type. A Content-Type header from HTTP is never saved along with the file, and it would be lost anyway when the file is transmitted to, for instance, an FTP server.

    Mac OS uses a data fork for the file and a resource fork to store metadata, but it too can’t be psychic when it receives a "text/plain" file over HTTP. It does, however, allow the user to rename the file to anything.

    Nautilus (and AFAIK Konqueror too) sniffs the mime type from the file contents itself, but it warns about a security risk when there’s a mismatch with the filename and refuses to open it until the name matches the actual type of file.

    Now about the problem with IE and a BOM in the middle of a page… I haven’t heard of it before, but it might be avoidable by doing something like this:

    <?php
    echo '<!--';
    include('something.inc.php');
    echo '-->';
    ?>

    Or try it with a javascript:

    <?php
    echo '<script type="text/javascript">
    var x = "';
    include('something.inc.php');
    echo '";
    </script>';
    ?>

    The theory, of course, is that the parser is in a different state when it reads the BOM. Inside a comment, it may just skip over anything but "-->" as quickly as possible.

  22. cmov says:

    One more thing, you listed UTF-8 and UTF-16 as supported Unicode files in Notepad. However, it supports UCS-2 as well. I have converted a UTF-8 file to UTF-8 with CRLF line endings using “flip”, and then converted it to UCS-2 with “iconv”. It shows up fine in Notepad, including the special characters. I did all that because I wanted to include the file in a program as a resource.

    [Um, UCS-2 is a subset of UTF-16, so there is obviously no need to call it out specifically. Hey, did you know that Notepad also supports 7-bit ASCII? -Raymond]
  23. Deus says:

    One solution that hasn’t been considered is to create a new file extension specifically for UTF8 or UCS text. You saw something similar to this during the BBS era, where text files with ANSI colour codes had the extension .ans while plain ASCII text used .asc.

    Anyway, the first thing most experienced developers who work with plain text do when setting up a system is to install a third-party editor, due to the deficiencies of Notepad. My personal fav is Notepad++ (http://notepad-plus.sourceforge.net/), a Scintilla-based editor which supports code folding, macros and even a hex editor.

  24. Mihai says:

    <<A content-type header from HTTP is never saved along the file>>

    In fact you can use

    <meta http-equiv=Content-Type content="text/html; charset=utf-8">

    in the head section.

    Which I would always recommend.

  25. The problem with LF line endings and Notepad is that Notepad is basically a window around the Win32 Edit control. The Edit control doesn’t support LF line endings and presumably it would give backwards compatibility headaches if it started to…

  26. cmov says:

    “Um, UCS-2 is a subset of UTF-16, so there is obviously no need to call it out specifically. Hey, did you know that Notepad also supports 7-bit ASCII?”

    I’m sorry I didn’t know UCS-2 is actually a subset of UTF-16. Thanks for mentioning it. No reason to act childish to me.

    @Mihai: That doesn’t help you when you save the file and try to open it.

    [Sorry, I assume people do some basic research before asking a question. If you’re going to complain about Notepad’s UCS-2 support, you probably should know what UCS-2 is. -Raymond]
  27. DWalker says:

    Raymond:  Thanks for pointing us to the "Mobile PC Team Blog" — but it’s defunct.

    There were 5 messages in September 2005, 1 in October 2005, and 1 in November 2005.  

    None since then.

    I don’t even see a way to contact the authors of the blog to ask questions — assuming they haven’t changed departments or companies since November 2005.

    I know it’s not YOUR blog, but if there’s no way for us to contact the authors of that blog, it’s kind of a useless resource.

  28. d.n.hotch says:

    "In fact you can use <meta … in the head section."

    But how would you recognize the META anyway? Hope for something ASCII-compatible? You might need the encoding before you can get the encoding. This is a chicken-and-egg problem.

  29. Adam says:

    Surely the problem is worse than you’re saying. If you have no idea what encoding the file is supposed to be in, why limit yourself to just Windows-1252 in the 8-bit ANSI case? The file could presumably be in *any* 8-bit character set where those two characters are defined.

  30. Tim Dawson says:

    There needs to be a metric from the IsTextUnicode (or preferably a replacement) API that gives the caller a measure of probability rather than a boolean value. If there’s a BOM, it’s pretty sure. If there’s a fair bit of text, it’s going to be pretty sure too. If it’s not sure (we want a probability score here, let’s say less than 70% for the sake of argument) then the user is prompted.

    And of course, no user wants to be prompted about code pages. My grandmother doesn’t know anything about code pages. You simply show some UI with a short preview of text for all the probable unicode options and it’s easy for them to choose which looks right by clicking on it.

    This is better than getting it wrong and displaying garbage, and allowing morons to come up with conspiracy theories.

  31. Puckdropper says:

    Now that Raymond pointed out in a comment that the Open File dialog (CTRL+O) in Notepad gives you an encoding option, I’m of the opinion that Notepad’s done the best that can be done.  Sure, giving metrics and such about how "Unicody" the file is sounds like a good idea, but there’s always going to be cases that rate highly and are the wrong choice.

  32. steveg says:

    If mixed file encodings are a regular problem while editing (programmatically it’s a whole different problem) then switch to another editor which does allow encodings to be switched easily. I use Notepad2, despite a couple of crash bugs; many others exist. I also replace Notepad.exe with Notepad2.exe.

    And people calling for Notepad to recognise all line endings 0x0A, 0x0D, 0x0A+0x0D, 0x0D+0x0A — what should Notepad do when a file contains mixed endings or new lines are added to the document? Should Save always change a file to DOS line endings? I have occasionally used Notepad to edit .exe files — which is one very good thing about Notepad: what comes in goes out verbatim. All of a sudden there are backwards compatibility issues…

  33. cmov says:

    "Sorry, I assume people do some basic research before asking a question."

    Yes, I’d assume that too. But I wasn’t asking anything. I was just pointing out something, and wasn’t aware that I had missed a memo.

    "If you’re going to complain about Notepad’s UCS-2 support, you probably should know what UCS-2 is."

    I’m not complaining at all! Why would I complain about notepad anyway? Just because I could? Maybe I could please you by trying to be a nitpicker, but I’m not bothering. As others have pointed out already, a coder always happily installs his/her favorite editor that makes notepad look like a clay tablet. Notepad is a fine clay tablet, there’s still a use for it sometimes.

  34. pcooper says:

    Really, the problem is that there’s not really a concept of a "plain text" file. *Any* file needs additional information (such as that given in the HTTP Content-Type header) in order for an application to know what the bytes are supposed to mean. Windows tries to encode that information in the file’s extension, and does a good job 99% of the time, but it’s really not enough unless we can get people to name their files file.txt.UTF-8, file.txt.windows-1252, and so on (and we add corresponding application support).

    A UTF-8 text file is a *different* type of file than a windows-1252 text file, just like they’re both different from an HTML or Microsoft Word file.

  35. Ben Bryant says:

    Amazing that Raymond does not even know that notepad auto-detects UTF-8 without a BOM – yes it actually runs through the file and if the non-ASCII byte sequences are valid UTF-8 then it knows it is UTF-8. This is a key point in this discussion that everyone seems completely unaware of. Saying notepad uses IsTextUnicode I think shows a lack of knowledge about how notepad works.

  36. Dean Harding says:

    Ben, where did Raymond say that it did not detect UTF-8 without a BOM?

    Saying that notepad does NOT use IsTextUnicode shows a lack of knowledge about how notepad (and indeed IsTextUnicode) works. IsTextUnicode is EXACTLY how notepad does its auto-detection.

  37. Adam says:

    Nick – that does sound bizarre, especially from the links posted. The oddest report of all is when it appears to output the characters that are the Windows-1252 interpretations of the UTF-8 BOM bytes – makes it sound like IE is interpreting /part/ of a UTF-8 stream as Win-1252?!?

    WRT the server-side removing the BOM, I’d say that it should strip a BOM (if present) from *all* of the files it reads. It should be sending a "Content-Type" HTTP header back with the correct charset for the rest of the page, so none of the BOMs will be necessary in the output.

    As for the charset/encoding distinction – most of the time it does not matter. A character encoding implicitly defines a character set; it is the set of characters expressible by the encoding. Similarly, a character set is nearly always defined in terms of a specific encoding – even Unicode is defined in terms of a specific encoding – UCS-4. Unicode is just different from most other charsets in that it defines a number of encodings that can express it.

    Also, HTML doesn’t have to be in unicode – it’s perfectly acceptable to store it in Windows-1252 and have "Content-type: text/html; charset=Windows-1252" as an HTTP header or HTML META tag. And Windows-1252 is as much a character set as it is an encoding.

  38. Nick says:

    Adam–

    Hmm. I think I do stand corrected as to my blanket "HTML is a Unicode charset" statement.

    I think what confuses the pants off me is that you can take an HTML file, save it as iso-8859-1, and then shove in an HTML entity reference relating to a Unicode code point. Like take some crazy character that’s not in iso-8859-1, like "Upwards Double Arrow" (&#8607;). 8607 corresponds to a Unicode code point that the browser looks up, but the actual &, #, 8, 6, 0, and 7 characters were transmitted to the browser in Windows Latin 1. That 8607 never changes regardless of the text file’s encoding because "that part" of HTML is Unicode. Right?

    That popping noise was my brain exploding.

  39. Miral says:

    If server-side-includes are printing BOMs into the output, then it’s the SSI code that’s broken, not whatever app is trying to parse it (although it should be able to tolerate it, under the "70% of the web is broken" rule).

    Imagine including three files, one of which is UTF-8, another UTF-16, and a third is ANSI with a Cyrillic codepage.  Whoever is constructing the output is responsible for ensuring the result is in a single unified format (so it can be specified in the headers), so they’ll have to perform the necessary character conversion before outputting it.

    In an ideal world, at least.  Of course in practice most don’t give a damn and just assume that everything is in the same encoding.  But that’s not technically correct behaviour.

  40. jondr says:

    OMG!  I just figured out why my JSP was not transforming an XML file (using either a DTD or Schema)!  Kept getting an error message about having data in the XML head.  Of course: the BOM.  GAH!

    The insanity when a dotNET nut is given the assignment to do Java!  

    So I guess the Xerces/Xalan XML parser methods in Java can’t deal with a real UTF-8 encoded XML with a BOM. That’s real convenient. Crikey, can’t even get it in to filer. And NetBeans 5.5 uses UTF-8 as default.

    Well, it is nearly 2am so I’m glad I took a break to come here.  Gotta get up in 4 hours. At least I solved the problem. Thanks. Just ignore this.

  41. Jonathan says:

    IE has its own heuristics for guessing, since a lot of pages neglect to specify their charset properly (in the Content-Type header or the equivalent <meta http-equiv=Content-Type…). It usually gets it right, even in the difficult case of Hebrew, where there are several charsets that support it – Hebrew-Windows, Hebrew-DOS, Hebrew-ISO-Logical, Hebrew-ISO-Visual <shudder>, and of course full-Unicode charsets like UTF-8.

    But when it doesn’t, most users have learned to change the encoding to everything with "Hebrew" in it until it looks OK. I’ve seen many otherwise non-sophisticated users accomplish this. I wish Notepad had this option (and not only during open).

    BTW, using "Unicode" when meaning "UTF-16 LE" is a historical remnant from the time when Unicode thought "16 bits should be enough for everyone!", which is about the formative years of Windows NT and Win32 (I think, don’t quote me on that).

  42. mfx says:

    "So I guess the Xerces/Xalan XML parser methods in Java can’t deal with a real UTF-8 encoded XML with a BOM."

    The Java XML parsers can cope with a BOM, but you have to present them with an unparsed byte sequence. Most likely, you have unwittingly converted the bytes into a String (using some erroneous auto-guessed encoding like ISO Latin-1), and presented the resulting string, which contained a two-character sequence 0x00FE 0x00FF (or 0x00FF 0x00FE) instead of a one-character preamble 0xFEFF (or 0xFFFE).

  43. Norman Diamond says:

    "You already have the option to specify the encoding manually: Ctrl+O."

    I do not.  The options presented are ANSI (Shift-JIS, in which a very small minority of the characters are 8 bits), and three varieties of Unicode.  If a file contains European characters then some versions of Word sometimes provide more possibilities and even provide previews, but Notepad doesn’t.  If I choose any Notepad option the result will be garbage.

    Why does Notepad call an API that has such a heavy bias towards guessing a usually wrong encoding?  Most files are Shift-JIS, a few are other ANSI encodings (some of which are 8-bit encodings), and a few are Unicode.  If there’s no obvious indicator then the first attempt should be the user’s ANSI code page, and if that doesn’t work then the second attempt should be the system’s ANSI code page, and Unicode should come after that.

    Tuesday, April 17, 2007 2:25 PM by John: "Sometimes I wonder how different it would be if 20 years ago everybody knew the problems we had today."

    Everybody did, except for one country.

  44. Doug says:

    "Yes, I’d assume that too. But I wasn’t asking anything. I was just pointing out something, and wasn’t aware that I had missed a memo."

    You must be new here.

  45. Nick Lamb says:

    To explain the references to HTML being Unicode earlier, the situation is that the HTML standard is (these days) explicitly Unicode. Thus when you send a Windows 1252 HTML file, the HTML standard considers that as a Unicode document which merely happens to be encoded as Windows 1252, just as other Unicode documents might be encoded as UTF-8 or UCS-2. There are a great many corner cases that are simplified by this assertion.

    The practical consequence is that you must program for the web using Unicode, because it’s virtually impossible to implement the standard correctly without doing so. Most web browsers and similar tools convert to their internal Unicode representation (usually UTF-8 or native endian UTF-16) during or soon after downloading.

  46. Ben Bryant says:

    IsTextUnicode is EXACTLY how notepad does its auto-detection

    Dean, what a completely false statement! As far as I can tell, if IsTextUnicode is used at all, it is ONLY used to detect the UTF-16 BOM. It doesn’t even detect a UTF-8 BOM! And it doesn’t have UTF-8 byte sequence auto-detection. None of the IsTextUnicode UTF-16 statistical likeness detection is used because that would give notepad a behavior that could be deemed inconsistent.

    All in all, notepad makes the right choices, but IsTextUnicode has little to do with it.

  47. Ben Bryant says:

    Oh, I see it actually uses IsTextUnicode for UTF-16 auto-detection, I was wrong, I missed the memo. All the testing I did was in relation to UTF-8 and I shouldn’t have assumed.

  48. Ben Bryant says:

    But why is there no discussion here of the UTF-8 auto-detection which is not from IsTextUnicode and is much more statistically reliable but still not 100%? I wonder if it is being removed in Vista.

    The same way you can trick notepad into thinking ‘this app can break’ is UTF-16, you can make it think ‘Â©’ is UTF-8. On a machine with a Windows-1252 locale, open notepad, paste Â© in and save. Re-open and you’ll find it is just ©. Save it and it will add the UTF-8 BOM (use file Properties to see it go from 2 to 5 bytes).

    I am amazed that notepad would make such a blunder as to use IsTextUnicode for UTF-16 auto-detection; it is so obviously unreliable that I would never have imagined the notepad team could be so dumb. You can specify the encoding on File Open for the unusual cases of UTF-16 without a BOM! Why would they do this?!!

    [I took extra time in the article to discuss this; please read it again. -Raymond]
  49. Ben Bryant says:

    Thanks for responding to say re-read. You did touch the important points, but I guess I’m adding:

    a) notepad’s UTF-8 auto-detection is not from IsTextUnicode. I guess you were referring to this auto-detection where you said "looks like valid UTF-8"

    b) UTF-8 auto-detection is *much* more reliable (despite the example I gave which breaks it), so it can be seriously considered, whereas UTF-16 auto-detection should not be even remotely considered, especially since the BOM is always recommended for UCS-2 and UTF-16.

  50. Dileepa says:

    In Windows XP, if I press Windows+TAB to ‘tab’ through the buttons on the taskbar and then press the ‘space’ key on one of the taskbar buttons, I see a big stripe (the entire length of the taskbar and about the same height) covering the taskbar. Is that a bug or a feature?

    (I don’t know if this happens in Vista non-aero mode too).

  51. Mihai says:

    @cmov: "That doesn’t help you when you save the file and try to open it."

    True. But as a nice bonus: the Visual Studio 2005 editor does. And it also saves the file according to the meta.

    @d.n.hotch: "Hope for something ASCII-compatible? You might need the encoding before you get the encoding."

    All the popular code pages are ASCII-compatible enough for you to recognize the meta. Exceptions are EBCDIC (and using EBCDIC in an HTML file is a WTF :-), and UTF-16/UTF-32

    For UTF16/32 the recommendation is to use a BOM: http://www.w3.org/TR/html401/charset.html#h-5.2.1.1

  52. Ctrl+O sux says:

    You can’t specify encoding when drag’n’dropping files on notepad!

  53. James Risto says:

    OldNewThing, with all due respect, I think you need to thicken your skin a bit. You are a public figure now. Those of us less knowledgeable just want to understand … some of us are NOT here to MS-bash … just MS-understand.

    [Oh great, now I’m a public figure. Remember when blogging was about just being yourself? How can I stop being a public figure? -Raymond]
  54. A few days ago, Raymond was talking about the Notepad file encoding problem, again. And the comments

  55. Private figure says:

    How can I stop being a public figure?

    Use a time machine and shoot yourself before you start this blog.

    [Nitpicker’s corner:  The above is not an order.  In fact it’s not even a serious answer.]

  56. James Risto says:

    My apologies if your questions were rhetorical, but if they were not … yes, somehow being senior in MS, and writing a book, your "inside knowledge" and sharing level, and "tight ship" blog have made you more than a casual blogger. Nobody wants to be a public figure (for long). It should not be, but what we want and reality are two different things. Feel free to ignore this, and if I am wrong or irrelevant then no loss.

  57. Jon Asbury says:

    My favorite notepad "bug" is that it (unlike write.exe) won’t recognize unix-style line endings…

    Or Mac line endings – CRs with no LF

Comments are closed.

