Why can’t you say </script> in a script block?

Date:June 5, 2006 / year-entry #188
Tags:code
Orig Link:https://blogs.msdn.microsoft.com/oldnewthing/20060605-00/?p=30983
Comments:    61
Summary:Because it ends the script block, of course. Duh, what's so hard about that? Because if you have script that generates script, you'll find yourself caught out if you're not careful. For example, you can't say document.write("<SCRIPT>blahblah</SCRIPT>"); in a script block because the HTML parser will see the </SCRIPT> and conclude that your script block...

Because it ends the script block, of course. Duh, what's so hard about that?

Because if you have script that generates script, you'll find yourself caught out if you're not careful. For example, you can't say

document.write("<SCRIPT>blahblah</SCRIPT>");

in a script block because the HTML parser will see the </SCRIPT> and conclude that your script block is over. In other words, the script block extends only as far as the first </SCRIPT> below, the one inside the string literal:

<SCRIPT>
document.write("<SCRIPT>blahblah</SCRIPT>");
</SCRIPT><!-- mismatched tag -->

The parser doesn't understand "quoted strings" or "comments" or anything like that. It just looks for the nine characters "<", "/", "S", "C", "R", "I", "P", "T", and ">". When it sees them, it decides that the script block is over and returns to HTML parsing.
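
Here is a rough sketch of that scan in JavaScript. It is an illustration only, not how any particular browser is implemented, and the function name findScriptEnd is made up for this example. Run it as a standalone file (under Node.js, say) and not from an inline script block, for reasons that should be obvious by now.

```javascript
// Hypothetical helper: return the index of the first end tag at or after
// contentStart, or -1 if there is none. It knows nothing about quoted
// strings or comments; it only looks for the characters of the end tag.
function findScriptEnd(html, contentStart) {
  const endTag = /<\/script>/gi;   // case-insensitive, like HTML tag names
  endTag.lastIndex = contentStart; // start searching inside the block
  const match = endTag.exec(html);
  return match ? match.index : -1;
}

const page =
  '<SCRIPT>\n' +
  'document.write("<SCRIPT>blahblah</SCRIPT>");\n' +
  '</SCRIPT>';

const contentStart = "<SCRIPT>".length;
const end = findScriptEnd(page, contentStart);

// What the script engine would be handed.
const scriptText = page.slice(contentStart, end);
// What the HTML parser goes back to treating as markup.
const leftover = page.slice(end + "</SCRIPT>".length);

console.log(JSON.stringify(scriptText));
console.log(JSON.stringify(leftover));
```

The script engine receives an unterminated string, and the tail of the document.write call plus the real end tag are left over as HTML.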

Why doesn't the parser understand quoted strings?

Well, in order to parse quoted strings, you have to be able to parse comments:

<SCRIPT>
/* unmatched quotation mark " ignored since it's in a comment */
</SCRIPT><!-- you might expect this to end the script block -->

But every language has a different comment syntax. JScript uses /* ... */ and //, Visual Basic uses ', Perl uses #, and so on. And even if you got comments figured out, you also would need to know how to parse quoted strings. Perl, for example, has a very large vocabulary for expressing quoted strings, from the simple "..." and '...' to the idiosyncratic qq:...:. And I lied about the JScript comment and quotation syntax; it's actually more complicated than I suggested:

/"//"</SCRIPT>is this inside or outside quotes?

That first quotation mark is itself quoted and does not count as a "beginning of quoted string" marker. And the // sequence is not a comment marker. The first slash in the // sequence ends the regular expression, and the second is a division operator.
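
To see how a quote-aware HTML parser could get this wrong, here is a minimal sketch of a scanner that assumes every quotation mark opens or closes a string. The helper name naiveInsideQuotes is hypothetical, and the code is meant to be run as a standalone file, not pasted into an inline script block.

```javascript
// Hypothetical naive scanner: toggles "inside a string" on every double
// quote, with no knowledge of regular expression literals.
function naiveInsideQuotes(source, position) {
  let inside = false;
  for (let i = 0; i < position; i++) {
    if (source[i] === '"') {
      inside = !inside;
    }
  }
  return inside;
}

const line = '/"//"</SCRIPT>is this inside or outside quotes?';
const tagAt = line.indexOf("</SCRIPT>");

// Two quotation marks precede the tag, so the naive scanner pairs them up.
console.log(naiveInsideQuotes(line, tagAt)); // false
```

The naive scanner pairs up the two quotation marks and reports that the tag is outside quotes. By the reading above, the first quotation mark belongs to the regular expression, so the second one opens a string and the tag is inside it.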

It would be unreasonable to expect the HTML parser to be able to understand every language both present and future. (At least not until clairvoyance has been perfected.)

<SCRIPT>
'is this a quoted string?'</SCRIPT>
Is this inside or outside the script block?
'<SCRIPT>' is this a new script block
or the continuation of the previous one?
</SCRIPT>

One "solution" would be to require all languages to conform to one of a fixed number of quotation and comment syntaxes. Never mind that not even JScript conforms to the basic syntax, as we saw above, thanks to the complicated quotation rules implied by regular expression shorthand. And do you really want all HTML parsers to understand Perl?

Another "solution" would be to have the language processor do the parsing and tell the HTML parser where the </SCRIPT> tag is. This has its own problems, however. First, it means that the HTML parser would still have to load the language parser even for DEFER script blocks, which sort of defeats one of the purposes of DEFER. Even worse, it means that a web page that used a language that the system didn't support would become unparseable:

<SCRIPT LANG="unknown-language">
Lorem ipsum dolor sit amet,
...

If a language parser were required to locate the end of the script block, it would be impossible to parse past this point.

So how do you work around this aspect of HTML parsing? You have to find an alternate way of expressing the string you want. Typically, this is done by breaking it up into two strings that you then reassemble:

document.write("<SCRIPT>blahblah</SCRI"+"PT>");
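
A quick check that the workaround changes only the source text and not the resulting string. The second form, which puts a backslash before the slash, is another way of keeping the literal end-tag sequence out of the source; in a JavaScript string literal, \/ is just /. This is a sketch to run as a standalone file, and the variable names are made up for the example.

```javascript
// Split form: the end tag never appears contiguously in the source text.
const split = "<SCRIPT>blahblah</SCRI" + "PT>";

// Escaped form: the backslash breaks up the "</" sequence in the source,
// but the string value still contains a plain slash.
const escaped = "<SCRIPT>blahblah<\/SCRIPT>";

console.log(split === escaped);           // true
console.log(split.endsWith("</SCRIPT>")); // true
```

Either way, the HTML parser never sees a complete end tag inside the script block, while the script still writes one.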

Comments (61)
  1. And of course the last is how the MySpace "worm" worked (it wasn’t a worm, it was a self-propagating cross-site-scripting exploit).

  2. Einars Lielmanis says:

    Or, alternatively, you could just hide the script in the HTML comments,

    <script type="text/javascript">

    <!--

    document.write('<script>alert("Rules!")</script>');

    -->

    </script>

    [That just swaps out one problem for another. Then the title of this article becomes “Why can’t you say --> in a script block?”: <script><!-- document.write("Click here -->"); … --></script> -Raymond]
  3. Oh please. Don’t post something like this if it’s wrong!

    In HTML it’s illegal to use the ‘<‘ character in plain text.
    You always need to escape it as ‘&lt;’. (I hope this survives
    the comment form). Just like in XML and XHTML. It has never been legal.
    And all HTML parsers are required to transcode it properly.

    So do use &lt;/SCRIPT>. Also use &lt;SCRIPT>
    for that matter, and use “if (5 &lt; 10) …”. Everything else
    is not HTML. It will work because the browsers know that HTML authors
    are ignorant, but as this is a non-standard fixup, you can’t rely on it
    working and maybe the fixup starts to work in a different way later on?
    Then don’t complain about these browsers all rendering your page in
    the wrong way …

    Nice issue with your headline though – the day all earth comes down
    in ashes, the sole thing surviving will be a wrongly double-escaped
    string ;-)

    [Fixed the headline. Stupid autoposter. I need to talk to the person who wrote it. Oh wait, that’s me. (And I fixed your < characters.) -Raymond]
  4. Adam says:

    Except that "<!--" isn’t valid javascript.

    If you want a safe alternative, use an external script file:

    <script type="text/javascript" src="script.js" />

  5. Yoz says:

    Just to add to the cavalcade of "except that" or "alternately" comments – document.write() is not just an outdated method, it actually won’t work on proper XHTML documents. See here for more info:

    http://ln.hixie.ch/?start=1091626816&count=1

    Instead, you should remove the need for writing </SCRIPT> by not actually writing the raw HTML in your code, and creating the node the proper DOM way:

    var scriptNode = document.createElement("script");

    scriptNode.setAttribute("type", "text/javascript");

    etc.

  6. BryanK says:

    Oops, my previous reply was to Einars Lielmanis.

    Other stuff posted since I started writing that:

    Adam:  Right idea, but your self-closing script tag doesn’t work in IE6SP1 (not sure on SP2).  For some reason, that browser requires two separate tags; otherwise it won’t "see" the script.  You can’t combine them like that, even though there’s no content and IIRC XML says you should be able to in that case.

    And Martin Probst: not all browsers actually parse entity references in script code, even though the content of the script tag is supposed to be "parsed CDATA".  I know some don’t change &amp;&amp; into && before passing it to the script engine, for instance.  I’m not sure about &lt; though — it may work.

    OTOH, external scripts "always" work.  If you need to handle events, hook them up in window.onload inside the external script, and unhook them in window.onunload if you need to do that to prevent a memory leak.

    Yoz:  Yep.  I’m not sure how to add content to that script tag in the DOM, though — perhaps:

    var txtNode = document.createTextNode("script code here");

    scriptNode.appendChild(txtNode);

    would work?  Never tried it (my scripts have always been "static"; I’ve never tried creating one from another script).

  7. Grant says:

    [So how do you work around this aspect of HTML parsing? You have to find an alternate way of expressing the string you want. Typically, this is done by breaking it up into two strings that you then reassemble:

    document.write("<SCRIPT>blahblah</SCRI"+"PT>");]

    Another solution, recommended by comp.lang.javascript, is to escape the forward slash in the </SCRIPT> tag:

    <script type="text/javascript">

    document.write('<script type="text/javascript">alert("hi");<\/script>');

    </script>

    It should work in all web browsers.

  8. James Hart says:

    According to the HTML language spec (http://www.w3.org/TR/html401/appendix/notes.html#notes-specifying-data), it’s not just </script> that should break out of a script element – it’s the sequence ‘</’. They give the following as an example of script that won’t work:

       <SCRIPT type="text/javascript">

         document.write ("<EM>This won’t work</EM>")

       </SCRIPT>

    In reality, this script actually -does- work, at least in IE6 and Firefox…

    What’s worrying, to my mind, though, is that following the absolute letter of the HTML spec, the character sequence ‘</’ is simply illegal in a <script> element. Bad news if you want to specify a Javascript regex literal that matches strings ending in a less-than character.

    In the course of investigating this I found out that the following actually appears to be syntactically valid Javascript, although it is, of course, semantically utter nonsense:

    <script>

    var a = / </i> 1;

    document.write (a);

    </script>

    Imagine you’re an XML-ish parser. That does look -awfully- like a well-formedness-breaking closing </i> tag nested inside that <script> element, doesn’t it..?

  9. Adam says:

    BryanK: Oh, thanks for that. Unfortunately, I can’t test for and work around every browser incompatibility there is, especially for browsers that aren’t available on my platform. The empty <script> tag works fine in Firefox and Konqueror, and is fine according to the w3c validator, so that’s good enough for me.

    I’m also able to really not care as my site is perfectly functional if you have javascript turned off (or just not available), so the worst that will happen for IE users is that they won’t get some of the non-content-related-but-flashy doodahs.

    I guess users of Explorer had better keep putting that pressure on their vendor to improve support for the spec if they want their doodahs though! :)

  10. CDATA is the way to go…

    Q: How do you embed ]]> in a script block?

    A: ]]&gt;

  11. BryanK says:

    Maurits: But if it’s CDATA, is the &gt; going to be parsed as an entity reference and replaced with the appropriate characters before being passed on to the script engine?  Based on what I know of XML, I don’t think it will be.

    Adam: Yeah, I wish I didn’t have to do that either; it’s several characters that I wouldn’t have to type every time I refer to a script.  Unfortunately for me, everyone in the company uses IE6 SP1 or SP2, and they’d get a little annoyed if the flashy stuff that I told them I was doing didn’t show up.  Even if it wasn’t required to use the site (and it isn’t), they’d still be annoyed.

    (Actually, I’m not sure whether <script /> works in the IE7 betas either.  Haven’t tried it.)

  12. Well, there’s this way too:

    http://en.wikipedia.org/wiki/CDATA_section

    <![CDATA[foo]]]]><![CDATA[>bar]]>

    becomes: foo]]>bar

  13. Sven Groot says:

    > var scriptNode = document.createElement("script");

    > scriptNode.setAttribute("type", "text/javascript");

    Actually, that also won’t work in XHTML, provided you did use the application/xhtml+xml MIME-type. To complicate matters further, you must use document.createElementNS and pass the proper XHTML namespace, document.createElement won’t work in a true XHTML-compliant browser.

    And it’s not necessary to use setAttribute, you can just do scriptNode.type = "text/javascript" (that’s a DOM Level 1 attribute).

    And BryanK, I expect using createTextNode should work for setting the script, although I have not actually tried this myself.

  14. BryanK says:

    Except that’s not going to work according to the XML standard:

    http://www.mit.edu/~ddcc/xhtmlref/text.html

    "Encasing scripts and style sheets in comment delimiters (<!-- -->) does not officially work. According to the W3C, the parser may remove all comments before passing the code onto the user agent. In addition, C-like languages, including Javascript, have a decrement operator ("--") that just happens to be the SGML comment delimiter."

    Also:

    "Interestingly, XML has a special construct designed to deal with the script and style sheet problem. Anything wrapped between "<![CDATA[" and "]]>" is treated as CDATA. Thus, using the same example, the fragment of code could be rewritten this way:

    <script type="text/javascript"> <![CDATA[ if (h && i) j(); ]]> </script>

    The problem with this solution is that not many browsers understand this syntax either. You might try wrapping the CDATA markers inside comments. (Use the comments of your scripting or style-sheet language, mind you. If you use the SGML-style comments, all sorts of nastiness may ensue.) The other problem is that if your script or style-sheet actually contains the sequence "]]>," you’re out of luck again."

    And the last quote from that page:

    "Lastly, your best solution may be just to use external scripts and style sheets, avoiding this whole big mess."

    Which is what I do.

    (This is also part of the "Unobtrusive Javascript" idea, which holds that putting *any* script code inside your HTML file is a mistake.  This is for the same reason that using *any* inline style attribute is a mistake — if you want to change the style (or the code), you’ll potentially have to edit all your HTML files, instead of the stylesheet (or script file).)

    <script type="text/ecmascript" src="blah.js"></script>

    works just fine.  ;-)

  15. Sebastian Redl says:

    Still various errors here. The HTML 4.01 spec says:



    Although the STYLE and SCRIPT elements use CDATA for their data model, for these elements, CDATA must be handled differently by user agents. Markup and entities must be treated as raw text and passed to the application as is. The first occurrence of the character sequence "</" (end-tag open delimiter) is treated as terminating the end of the element’s content. In valid documents, this would be the end tag for the element.

    Entities must be treated as raw text; &lt; in a script block is just &lt; to a script, and any browser replacing the entity is not conformant.

    To the HTML spec, at least. In XHTML, the content type of the script element is PCDATA, so there markup and entities do get parsed. Great, huh?

    I agree with previous comments: just make the script external.

  16. A/C says:

    Wouldn’t the parser start a new script block at the document.write("<SCRIPT>… and then end *that* one when it finds the </SCRIPT> ?

  17. Blog-a-Styx says:

    A nice little article I just read on a new blog I found this (rough) morning: Raymond’s…

  18. Jack V. says:

    If you were changing the script tag syntax, surely the simplest mod would be to be able to specify the closing tag, a la << in perl. Eg.

    <script close="THISISREALLYTHEEND">

    // </script> That didn’t matter

    THISISREALLYTHEEND>

    You might or might not specify where the close appears on the line, or if you can/must have a </script> after it as well.

    But the HTML parser only needs to know one new thing, and everyone who invents a language where THISISREALLYTHEEND is the assignment operator can just choose another string, maybe a multiline string.

  19. silkio says:

    To say that you can’t determine the language of the script and hence the comment/quoting style is pretty lame.

    Clearly IE makes assumptions about the script type ANYWAY, so why not just use that format (the assumed script language) to decide what quotes/comments are?

  20. Neil says:

    Don’t forget that each external script file is a synchronous network request that must be processed before parsing of your page can complete (document.write can’t be deferred).

  21. Adam says:

    Jack, Silkio: Consider applications that want to parse HTML that aren’t full-blown browsers. Or even browsers that don’t support scripting. Do you think all browsers should have to fully parse even JavaScript just to be able to find the end script tag, even if they’re not going to do anything with the script?

    HTML is NOT a programming language. It is a document markup language. A parser should be able to determine where the markup sections start and stop with /relative/ simplicity.

    On top of this, even according to the HTML specs (particular wording here taken from HTML 4.01 Appendix B.1 at http://www.w3.org/TR/html4/appendix/notes.html but most versions of HTML have a section like it) a user-agent should be able to handle markup it doesn’t recognise:

    * If a user agent encounters an element it does not recognize, it should try to render the element’s content.

    Of course, rendering content shouldn’t apply to elements in the <head> of a document, but such a user-agent should still be able to reliably find the end of the element it needs to ignore.

  22. Adam says:

    Neil: The external script file (as with external style sheet files) should be cacheable, even if the rest of your site isn’t. e.g. if it’s some kind of database-backed shopping site, etc…

    Using external script files can decrease your total bandwidth usage quite a bit, and may well speed up all page views to your site bar the first if the script(s) are large.

  23. Gabe says:

    Jack, in order for the "heredoc" (the name of the Unix shell feature that Perl includes) paradigm to work, it would need to have been included in HTML back in the early 1990s. Why?

    Well, an HTML parser that doesn’t know about scripts would never know to look for the sentinel at the end of the script. The only way for the naive parser to be able to find it is if the ‘close’ attribute were already defined as an option on every single element. That way a parser would automatically look for a ‘close’ attribute on every tag, whether it understands the tag or not.

    Similarly, a ‘render=false’ option would have been nice also. That way in-line scripts and stylesheets would have a way to indicate to downlevel browsers that they should not render the contents of their tags.

  24. silkio says:

    Adam:

    Like I said, if the browser chooses not to process the javascript, then there is no issue. The only confusion occurs where PART of your javascript is processed due to a script tag in the middle of it.

    I’m not asking non-"script" processing browsers to start processing it, I’m saying that if they DO process the script, and DO execute part of it, we have a right to be a little upset that they decided to be ignorant about the script tag embedded inside.

  25. Gabe says:

    Sorry silkio, but it still won’t work. Any browser that doesn’t know your scripting language won’t know when to stop parsing as script and when to start parsing as HTML again.

    A browser that doesn’t understand ANY scripts will render them all as text anyway, so it won’t care about whether the </SCRIPT> tag embedded in them should be rendered or not.

    However any browser (Lynx comes to mind) that knows about scripts will never want to render the content of the script block whether it knows how to parse the language or not. You want to be able to write:

    <script language="PerlScript">

    # this is a <script></script> block

    document.write(qq!<script>$script</script>!)

    </script>

    Unfortunately, only browsers that understand PerlScript know how to parse it properly. All others would show "block document.write(qq!!)". Since PerlScript is a pluggable script engine, my browser understands it but yours might not.

  26. Adam says:

    Silkio:

    But this isn’t just about what browsers that do understand javascript have to do.

    You can’t say "</script>" in a script block because browsers/parsers/applications that don’t understand scripts still have to be able to tell where the end of the script block is so they can process the rest of the page correctly.

    If you allow "</script>" in the script block in any form (either in a literal string, or in some other case) then all these other programs need to understand enough javascript to be able to spot a literal string, and a regular expression, and a comment, etc, etc, etc… Basically, they need to fully understand javascript in order to find the "real" end of the script.

    For this reason, you cannot allow "</script>" tags inside a script, in order for non-script processors to be able to understand HTML.

    Therefore, even browsers that do understand scripts cannot allow this either. If such browsers did support embedded "</script>" tags, people would write code that did that, test it in their browser, see that it worked as expected and assume it was fine. However, non-script-enabled browsers would just break, having found the "</script>" tag that they thought was the end and trying to process the rest of the script as HTML.

  27. silkio says:

    Gabe/Adam:

    What I’m saying is that if the browser processes the script at all, it can’t pretend that it doesn’t know how to find comments/text.

    That is to say, the following code:

    ============

    <script>

    alert(1);

    //</script>

    alert(2);

    </script>

    =============

    should produce either:

    – alert 1

    – alert 2

    or

    – NOTHING.

    i.e, the processor has already figured out what language it is, so why does it sit back and declare "oh no, i don’t know how to find comments, i’ll just end now."

    and about lynx; even though it’s textbased it still needs (or at least should) process script … obviously not all script is for visual purposes.

  28. Adam says:

    Silkio> "and about lynx; even though it’s textbased it still needs (or at least should) process script … obviously not all script is for visual purposes."

    Ah – now I see where our differences lie. :)

    This is the statement I disagree with. And I’m not convinced that it should be the case. IMO, an HTML parser should be able to parse HTML without having to be able to parse JavaScript too. Why do you think otherwise?

  29. Adam says:

    Also, not all HTML parsers are in web browsers.

    What about spiders, like googlebot? Should that have to be able to parse javascript so that it doesn’t think you have the text "alert(2);" in your web page?

  30. silkio says:

    Adam:

    I don’t think browsers should have to parse javascript or any script. What I’m saying, though, is that if they TRY, then they must assume a certain type of script to do so.

    IE will assume javascript, if it’s not specified.

    For example, the following won’t pop up a message box in IE unless "type='vbscript'" is specified:

    ============

    <script>

    MsgBox("1")

    </script>

    ============

    I still think you are missing my main point … that if the browser DOES try and guess the language (which IE clearly does) then it’s a lie to say you don’t know how to resolve comments and strings.

    Spiders will need to process javascript anyway, but if they don’t, that’s totally fine, they can just find the </script> block where it lies. Other script-aware HTML parsers have no programmatic excuse for acting so ignorant.

  31. Adam says:

    Silkio> "Spiders will need to process javascript anyway …"

    Why? Please explain the logic underlying that conclusion.

    Silkio> "…but if they don’t, that’s totally fine, they can just find the </script> block where it lies."

    Huh? But if you have an embedded </script> tag, that’s the one that non-javascript-aware HTML parsers will hit. They can’t find the proper end of the script block. That’s the whole point! That’s why embedded </script> tags must be disallowed.

  32. Adam says:

    I give up.

    I’ll just say that I’d hate to try to write an HTML parser one weekend in a world where you controlled the HTML standards. :)

  33. Ross Bemrose says:

    "Consider the page consisting of:

    <script language="javascript">document.location = ‘realhomepage.html’;</script>"

    You don’t need Javascript to do redirects.  Use either an HTTP header or a <meta> tag to redirect instead.

  34. silkio says:

    Adam:

    :|

    > Silkio> "Spiders will need to process

    > javascript anyway …"

    >

    > Why? Please explain the logic underlying that

    > conclusion.

    Consider the page consisting of:

    <script language="javascript">document.location = ‘realhomepage.html’;</script>

    my point was not "All bots should process javascript so that they can discover valid ending script tags" it was "a smart bot will process javascript [which isn’t really relevant to our discussion anyway]".

    > Silkio> "…but if they don’t, that’s totally fine, they can just find the </script> block where it lies."

    >

    > Huh? But if you have an embedded </script>

    > tag, that’s the one that non-javascript-aware

    >  HTML parsers will hit. They _can’t_ find the

    >  proper end of the script block. That’s the

    > whole point! That’s why embedded </script>

    > tags _must_ be disallowed.

    Because different parsers may see different html output based on what scripting language they support? Sure, I agree. That seems like an okay reason to disallow it … not great, but …

    All I was initially trying to say is that Raymond’s comment that he doesn’t know HOW to find the script tag is not true. IE *can* figure out if it’s a valid ending tag if it wanted to, but it doesn’t want to. It’s not a programming problem (like Raymond was trying to say) it’s a logical one … :)

  35. silkio says:

    Ross: Yes, but that’s not at all the point.

    Adam: :) Come on now, I’m really not trying to say that all parsers
    have to implement script-parsing, I am just saying that if they do,
    they should be able to detect a script tag.

    [Where is the /script tag in the following:

    <SCRIPT LANGUAGE="leandra">
    .</SCRIPT>'</SCRIPT>"</SCRIPT>;(</SCRIPT>)
    </SCRIPT>#</SCRIPT>
    </SCRIPT>#</SCRIPT>

    How would you write the parser that figures out which of those “</SCRIPT>”s is the real /SCRIPT tag? -Raymond]

  36. silkio says:

    Raymond: What do you mean? I don’t know the “leandra” language, so I would take the first one.

    If I did know the leandra language, I would try and parse that script
    using its grammar. If I failed (i.e. invalid token or something), I’d
    take the next /script tag at the index of where I failed. If I passed,
    I’d know where it ends and also take the next /script tag.

    ?

    [And that’s my point. You don’t know leandra. The browser doesn’t know leandra. How should it know which is the real /script tag? -Raymond]
  37. silkio says:

    If it doesn’t know leandra it takes the first tag.

    This is what I’ve been saying all along. If you can’t/don’t want to process the language, then it doesn’t need to try.

    [In other words, you are okay with a web page parsing in two completely different ways depending on whether the browser supports the leandra language. Good luck writing an HTML validator! And if the browser does support it, it’ll have to load the leandra parser in order to find the correct end tag, which completely ruins the DEFER attribute. Wait a second, I’m just repeating myself. This is exactly what I wrote in the main article! -Raymond]
  38. silkio says:

    Raymond:

    In the OP you said it wasn’t possible. Clearly it is possible; that’s all I was trying to say.

    I’ve never used the “DEFER” attribute so I’ll take your word that it will become useless.

    I don’t mind the way it currently works, [not that I’d expect a
    change if I didn’t] and I don’t necessarily think that it would be a bad
    thing to ask the language for the ending /script tag. It seems to be
    the most intuitive.

    Either way, all I was trying to point out is that your OP is wrong …

    [Allow me to clarify for nitpickers: It’s not possible without sacrificing obvious principles like “It should be possible to write HTML such that every browser will agree on how it is parsed.” Because if you lose that, then how can you write HTML? -Raymond]
  39. AC says:

    Silkio: Are you under some kind of delusion that javascript is either part of the HTML spec, or that HTML has some kind of special relationship with javascript that it doesn’t have with other scripting languages (such as leandra)?

  40. silkio says:

    AC: no.

  41. silkio says:

    AC.

    As I said in my earlier posts, I don’t think this is the best way for the parsers to operate; I was only commenting on Raymond’s wording in the OP.

    To answer your question though; the processing of javascript can change the document that the parsers parse anyway, so what’s the big deal?

    Both parsers are correct; one just implements JS and is able to understand the document better.

    [I stand by my original statement: “If a language parser were required to locate the end of the script block, it would be impossible to parse past this point.” If a leandra parser is required in order to parse past the <script language="leandra"> block and you don’t have a leandra parser, then you can’t parse any further. The statement is practically a tautology. -Raymond]
  42. silkio says:

    Raymond: What’s wrong with this process:

    – get script language
    – if known lang, request /script index
    – if unknown lang, /script index = first from this point

    ..?

    (Other than the fact it will lead to a different parsing structure depending on known languages.)

    [That’s like asking, “What’s wrong with this key aside from the fact that it doesn’t open the lock?” -Raymond]
  43. AC says:

    OK. If that particular script does nothing other than execute those two alerts, and does not change the document in any way, and assuming that the rest of the document is otherwise valid, would you say that the document should be considered valid HTML 4?

    If you think the document is valid HTML 4, how can you claim that the non-js-aware parser is correct if it misidentifies the document as being invalid?

    (And as an aside – what would be a good way for the non-js-aware parser to operate?)

  44. silkio says:

    Raymond:

    Ok. We disagree then.

    AC: As I said, this method has issues, but it’s not impossible. Clearly the fact that the tags you choose to use are up to the script will then change the way your document validates. I don’t actually think it’s bad that document validation depends on script; browsers interpret script as well, and you can currently easily make a valid document invalid with the simplest bit of scripting.

  45. AC says:

    Humour me. This exact document:

    <!DOCTYPE HTML PUBLIC

       "-//W3C//DTD HTML 4.01//EN"

       "http://www.w3.org/TR/html4/strict.dtd">

    <html>

    <head>

    <title>test</title>

    <script>

    alert(1);

    //</script>

    alert(2);

    </script>

    </head>

    <body>

    <p>test</p>

    </body>

    </html>

    Should it be considered valid HTML? Yes or no?

  46. AC says:

    Doh. Missed ‘type="text/javascript"’ attribute from the <script> tag. Pretend it’s there.

  47. silkio says:

    AC: Well it depends if the validator implements the javascript language, doesn’t it :)

    I am the validator, and I happen to understand javascript, so yes, I will call that document valid.

  48. AC says:

    Therefore, are you saying that an HTML parser that correctly follows every last paragraph of the HTML spec, but is not js-aware, is an incorrect HTML parser[0]?

    Don’t you think there’s a slight internal inconsistency with that?

    (Have you ever been to Milliways restaurant?)

    [0] Or do you think that the correctness of a parser can be measured other than by whether it follows the spec it is implementing, and whether it can determine if a document also follows that spec?

  49. GregM says:

    No, it doesn’t depend if the validator implements the javascript language.

    The definition of HTML 4.01 (as defined in <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">) does not understand javascript.

    Therefore, by definition, this is not a valid HTML document.  Period.  End of story.  No amount of asserting that the parser understands multiple languages will make it a valid 4.01 document.

    It may be a valid Silkio-browser document, but it is not a valid HTML 4.01 document.

  50. silkio says:

    Guys:

    You’re not getting it. I’ve said in almost every post of mine that this method has issues. I agree that a non-js-parsing validator will see this document as invalid. That’s what I said would happen.

    I even said I don’t think this is a great way of doing things; but on the other hand, is it so bad that scripting languages – something which can CURRENTLY affect the structure of a doc – become a part of the document validation process? I don’t think so.

    OF COURSE this method does not meet the current HTML 4.01 spec.

    To clarify: If “silkio-validating” was implemented, the spec would need to be changed.

    [Today, script cannot affect the parsing of a document. Sure, running the script may modify the document, but that’s not at issue here. The issue here is parsing. And if parsing is dependent on something outside the HTML author’s control, how can an HTML author write markup with any confidence? -Raymond]
  51. silkio says:

    Raymond: Right.

    The suggestion here is that "parsing" of a html document should include executing and processing of the script areas.

    (Yes, there are issues with this. Putting it lightly…)

  52. AC says:

    HOW?

    I’m trying to be hypothetical and move away from the current spec into a "how would you have this work then" kind of vibe, and trying to understand where you’re coming from, but…

    How would you change the spec so, such that a validating parser could possibly be written without being internally inconsistent?

  53. silkio says:

    AC:

    By having the spec include a spec for processing javascript? But then you’d have to have specs for all scripting languages …

    So the spec could just say : "execute and process script to get actual page structure".

    Then each validator/"parser" can implement as many scripting languages as it likes, and validate away, meeting the spec!

    If it doesn’t have any scripting languages, then that’s fine, it can’t get the desired page structure and will call the doc invalid due to unknown script. [of course it can still display as much of the doc as it could].

    Obviously all hypothetical, and written on the fly, so give me a break if I made a typo/omission :)

    Hopefully it’s clear …

  54. AC says:

    Oh.

    So in that case, for the following HTML that you posted:

    <script>

    alert(1);

    //</script>

    alert(2);

    </script>

    You think that:

    1) A js-aware HTML parser should see the first </script> as a comment, the "alert(2);" as part of the script, and the second </script> as the end-of-script tag.

    2) A non-js-aware HTML parser should see the first </script> as the end-of-script, the "alert(2);" as text in the enclosing element (should be the enclosing head element, where text is not allowed), and the second </script> as an invalid close tag that doesn’t match an open tag.

    Is that correct? Two validating HTML parsers, which both fully support the HTML 4 spec, should be able to treat the same HTML document differently, and for one to find the document valid, and the other to find it broken?

    Which parser is correct? Both? Is one more correct than the other? If one *is* more correct than the other, why should the "less correct" parser not need to be fixed to be "more correct"?
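
    To make the two readings above concrete, here is a minimal sketch in JavaScript. Both function names are hypothetical, and the "js-aware" scanner is an assumption for illustration only: it skips // comments, /* */ comments and quoted strings, and ignores regular-expression literals and everything else a real JavaScript parser would have to handle.

```javascript
// Reading 2 (non-js-aware): the script body ends at the first "</script",
// in any letter case, no matter what surrounds it.
function naiveEnd(body) {
  const m = /<\/script/i.exec(body);
  return m ? m.index : -1;
}

// Reading 1 (hypothetical js-aware scanner, a simplification for illustration):
// skip comments and quoted strings before looking for "</script".
function awareEnd(body) {
  let i = 0;
  while (i < body.length) {
    const two = body.slice(i, i + 2);
    if (two === "//") {
      // Line comment: skip to the character after the next newline.
      const nl = body.indexOf("\n", i);
      i = nl === -1 ? body.length : nl + 1;
    } else if (two === "/*") {
      // Block comment: skip past the closing "*/".
      const end = body.indexOf("*/", i + 2);
      i = end === -1 ? body.length : end + 2;
    } else if (body[i] === '"' || body[i] === "'") {
      // Quoted string: skip to the matching quote, honoring backslash escapes.
      const quote = body[i];
      i++;
      while (i < body.length && body[i] !== quote) {
        if (body[i] === "\\") i++; // skip the escaped character
        i++;
      }
      i++; // step past the closing quote
    } else if (/^<\/script/i.test(body.slice(i, i + 8))) {
      return i;
    } else {
      i++;
    }
  }
  return -1;
}

// The text that follows the opening <script> tag in the example above.
const body = "\nalert(1);\n//</script>\nalert(2);\n</script>";
console.log(naiveEnd(body)); // 13
console.log(awareEnd(body)); // 33
```

    On this input the two rules pick different end points (offset 13 versus offset 33), so the two parsers disagree about where the script stops and the HTML resumes.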

  55. silkio says:

    AC: [if you can still read after that explosion].

    The spec isn’t changed so that it’s impossible to reliably parse. You can still reliably find the script blocks of a document. All you do is write your parser so that it does NOT implement a scripting language, and simply goes for the first </script> it finds.

    If you DO choose to write a parser that understands a given script language, your parser just must request the ending script tag from the appropriate language parser.

    I don’t see how this suggests a spec which is impossible to reliably parse.

  56. silkio says:

    BryanK:

    Different “types” of parsers will get different results. Javascript-enabled parsers will get better results.

    Think about what the spec is trying to do. Scripts have a special ability in HTML; they can modify the code that is parsed. They can create more, or delete some, or change parts. This affects the end result to the user, and the reason we even have validators is so that the end user sees the same thing everywhere [in all browsers].

    The point is, by processing/parsing the SCRIPT of a document, you can gain a better understanding of it.

    This can only be a good thing.

    You are saying it’s bad because a given validator understands a given doc better than another; I say, what’s wrong with that?

    It’s hardly a crime to have a tool that gives more accurate results!

    [What about a language where it is legal to just say “</script>” outside of quotation marks? The parser for that language would never find the end of the script block since any time it saw “</script>” it would say, “Oh, yeah, that’s legal in my language. I’m still parsing.” -Raymond]
  57. AC says:

    So, you are actually saying that you want to change the spec so that it’s impossible to write a parser that can reliably tell which parts of the document are _not_ script, and therefore have _no way_ of knowing which parts of the document it is even _meant_ to process.

    No, I can make that simpler. You are actually saying that you want to change the spec so that HTML is impossible to reliably parse.

    And you don’t think that just having to break up "</script>" in a string literal or comment is the simpler and more sensible thing to do? Even though you’re smart enough to not speak in l33t, to write with correct grammar and basically to be all articulate and everything?

    But…, but…, but…, *head explodes*
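
    For reference, a minimal sketch of what "breaking up" the sequence looks like, assuming the naive rule that the block ends at the first "</script" in any letter case. The variable names are hypothetical, and the sample is meant to be run on its own (for example under Node), not pasted into a script block, since its first source string deliberately contains the troublesome sequence.

```javascript
// The naive end-of-script rule: the first "</script", in any letter case.
const endOfScript = /<\/script/i;

// Three ways of writing the same statement, held as raw source text.
// String.raw keeps the backslash in the second one exactly as typed.
const unsafeSource  = String.raw`document.write("<script>blahblah</script>");`;
const escapedSource = String.raw`document.write("<script>blahblah<\/script>");`;
const splitSource   = String.raw`document.write("<script>blahblah</scr" + "ipt>");`;

console.log(endOfScript.test(unsafeSource));  // true: the block would end inside the string
console.log(endOfScript.test(escapedSource)); // false
console.log(endOfScript.test(splitSource));   // false

// At run time the escaped and split forms still produce the intended string.
const escapedValue = "<script>blahblah<\/script>";
const splitValue = "<script>blahblah</scr" + "ipt>";
console.log(escapedValue === splitValue); // true
```

    Either form keeps the HTML parser from seeing the sequence, while the script still writes out the same text.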

  58. David Conrad says:

    "You’re not getting it. I’ve said in almost every post of mine that this method has issues."

    I get it. You’re saying that this method is both broken and not-broken. But I think it’s actually broken OR not-broken. Specifically, broken.

  59. BryanK says:

    You may be able to "reliably" (though I’m not sure how you can justify that part) find the script block.  But you *cannot* reliably find the end.  You can find *an* end, but if it’s not always *the* *same* end, then you do *not* have a reliable parser.

    IMO, in order to "reliably" parse anything, you need to always get the same parse tree, no matter what optional (or "extra") parts of the spec you support.  If you don’t, then the spec has problems.

    Note that I’m not talking about one parser always getting the same result, I’m talking about *all* (compliant) parsers always getting the same result as each other.

  60. BryanK says:

    What’s wrong with that is that there’s no way to *reliably* come up with HTML that includes script code (reliably as in: people always see the same thing before the script runs).  Today, people see different "end" web pages depending on whether they have JS enabled or not, yes.  But if your version of HTML existed, people that didn’t have JS enabled would (potentially) see a bunch of gobbledygook code in the middle of the HTML, which may also include various other elements that *weren’t* supposed to be output.

    How is it better to (1) not have the script be able to modify the document, *and* *also* (2) see a bunch of text you don’t understand and don’t care about?  That goes against every "fail gracefully" maxim in existence.  JS is supposed to fail gracefully if the user-agent doesn’t understand it; so is CSS.

    > It’s hardly a crime to have a tool that gives more accurate results!

    No, but it is a crime to artificially cripple tools just because they don’t understand the language you used.

    My basic axiom is:  A document should be either valid all the time, or invalid all the time.  You *can’t* make a document’s validity depend on how well the validator understands script.  The whole *point* of validation is to give your document a prayer of showing up the same way regardless of browser (as long as the browser complies with the standard).  If you suddenly introduce some feature that makes random script text show up in some browsers, you’ve broken that.

  61. It’s all very well to put effects everywhere… But if it makes a site very slow, is the Java worth the lag? While redesigning my site, I furiously scratched my head for you.

Comments are closed.


*DISCLAIMER: I DO NOT OWN THIS CONTENT. If you are the owner and would like it removed, please contact me. The content herein is an archived reproduction of entries from Raymond Chen's "Old New Thing" Blog (most recent link is here). It may have slight formatting modifications for consistency and to improve readability.
