Date: | June 5, 2006 / year-entry #188 |
Tags: | code |
Orig Link: | https://blogs.msdn.microsoft.com/oldnewthing/20060605-00/?p=30983 |
Comments: | 61 |
Summary: | Because it ends the script block, of course. Duh, what's so hard about that? Because if you have script that generates script, you'll find yourself caught out if you're not careful. For example, you can't say document.write("<SCRIPT>blahblah</SCRIPT>"); in a script block because the HTML parser will see the </SCRIPT> and conclude that your script block...
Because it ends the script block, of course. Duh, what's so hard about that?

Because if you have script that generates script, you'll find yourself caught out if you're not careful. For example, you can't say

document.write("<SCRIPT>blahblah</SCRIPT>");

in a script block because the HTML parser will see the </SCRIPT> and conclude that your script block is over:

<SCRIPT>
document.write("<SCRIPT>blahblah</SCRIPT>");
</SCRIPT><!-- mismatched tag -->

The parser doesn't understand "quoted strings" or "comments" or anything like that. It just looks for the nine characters "<", "/", "S", "C", "R", "I", "P", "T", and ">". When it sees them, it decides that the script block is over and returns to HTML parsing.

Why doesn't the parser understand quoted strings? Well, in order to parse quoted strings, you have to be able to parse comments:

<SCRIPT>
/* unmatched quotation mark " ignored since it's in a comment */
</SCRIPT><!-- you might expect this to end the script block -->
But every language has a different comment syntax. And JScript uses regular expression shorthand that complicates its quotation rules:

<SCRIPT>
/"//"</SCRIPT> is this inside or outside quotes?

That first quotation mark is itself quoted and does not count as
a "beginning of quoted string" marker.
It would be unreasonable to expect the HTML parser to be able to understand every language both present and future. (At least not until clairvoyance has been perfected.)

<SCRIPT>
'is this a quoted string?'</SCRIPT> Is this inside or outside the script block?
'<SCRIPT>' is this a new script block or the continuation of the previous one?
</SCRIPT>

One "solution" would be to require all languages to conform to one of a fixed number of quotation and comment syntaxes. Never mind that not even JScript conforms to the basic syntax, as we saw above, thanks to the complicated quotation rules implied by regular expression shorthand. And do you really want all HTML parsers to understand perl?
Another "solution" would be to have the language processor do the parsing and tell the HTML parser where the script block ends. But the language processor might not even be installed, and loading it just to find the end of the script block would defeat the purpose of the DEFER attribute:

<SCRIPT LANG="unknown-language">
Lorem ipsum dolor sit amet, ...

If a language parser were required to locate the end of the script block, it would be impossible to parse past this point.

So how do you work around this aspect of HTML parsing? You have to find an alternate way of expressing the string you want. Typically, this is done by breaking it up into two strings that you then reassemble:

document.write("<SCRIPT>blahblah</SCRI"+"PT>");
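To make the workaround concrete, here is a minimal sketch (the emitted script body is made up for illustration); because the close tag is assembled at run time from two halves, this outer block never contains the nine characters "</SCRIPT>":

<SCRIPT>
// Generate a nested script block at run time. The closing tag is split
// so that this outer block does not contain a literal "</SCRIPT>".
document.write("<SCRIPT>alert('generated');</SCRI" + "PT>");
</SCRIPT>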
Comments (61)
Comments are closed.
And of course the last is how the MySpace "worm" worked (it wasn't a worm, it was a self-propagating cross-site-scripting exploit).
Or, alternatively, you could just hide the script in the HTML comments,
<script type="text/javascript">
<!--
document.write('<script>alert("Rules!")</script>');
-->
</script>
[That just replaces one problem with another. Then the title of this article becomes "Why can't you say
--> in a script block?":
<script><!-- document.write("Click here -->"); … --></script>
-Raymond]
Oh please. Don't post something like this if it's wrong!
In HTML it's illegal to use the '<' character in plain text.
You always need to escape it as '&lt;'. (I hope this survives
the comment form). Just like in XML and XHTML. It has never been legal.
And all HTML parsers are required to transcode it properly.
So do use &lt;/SCRIPT&gt;. Also use &lt;SCRIPT&gt;
for that matter, and use "if (5 &lt; 10) …". Everything else
is not HTML. It will work because the browsers know that HTML authors
are ignorant, but as this is a non-standard fixup, you can't rely on it
working, and maybe the fixup starts to work in a different way later on?
Then don't complain about these browsers all rendering your page in
the wrong way …
Nice issue with your headline though – the day all earth comes down
in ashes, the sole thing surviving will be a wrongly double-escaped
string ;-)
Except that "<!--" isn't valid javascript.
If you want a safe alternative, use an external script file:
<script type="text/javascript" src="script.js" />
Just to add to the cavalcade of "except that" or "alternately" comments – document.write() is not just an outdated method, it actually won’t work on proper XHTML documents. See here for more info:
http://ln.hixie.ch/?start=1091626816&count=1
Instead, you should remove the need for writing </SCRIPT> by not actually writing the raw HTML in your code, and creating the node the proper DOM way:
var scriptNode = document.createElement("script");
scriptNode.setAttribute("type", "text/javascript");
etc.
Oops, my previous reply was to Einars Lielmanis.
Other stuff posted since I started writing that:
Adam: Right idea, but your self-closing script tag doesn’t work in IE6SP1 (not sure on SP2). For some reason, that browser requires two separate tags; otherwise it won’t "see" the script. You can’t combine them like that, even though there’s no content and IIRC XML says you should be able to in that case.
And Martin Probst: not all browsers actually parse entity references in script code, even though the content of the script tag is supposed to be "parsed CDATA". I know some don't change &amp;&amp; into && before passing it to the script engine, for instance. I'm not sure about &lt; though — it may work.
OTOH, external scripts "always" work. If you need to handle events, hook them up in window.onload inside the external script, and unhook them in window.onunload if you need to do that to prevent a memory leak.
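As a minimal sketch of that pattern (the element id, handler name, and file name below are all made up), the external file might look like this:

// script.js: pull it in with an ordinary <script src="script.js"> element in the page's head
function handleClick() {
    alert("clicked");
}

window.onload = function () {
    // Hook up the handler once the document has been parsed.
    document.getElementById("myButton").onclick = handleClick;
};

window.onunload = function () {
    // Unhook it again to avoid the memory leak mentioned above.
    document.getElementById("myButton").onclick = null;
};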
Yoz: Yep. I’m not sure how to add content to that script tag in the DOM, though — perhaps:
var txtNode = document.createTextNode("script code here");
scriptNode.appendChild(txtNode);
would work? Never tried it (my scripts have always been "static"; I’ve never tried creating one from another script).
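Presumably something like the following sketch is what's intended; it is untested here, and older IE versions reject appendChild on script elements (they want the script's text property set instead), so treat it as an assumption rather than a recipe:

var scriptNode = document.createElement("script");
scriptNode.setAttribute("type", "text/javascript");
var txtNode = document.createTextNode("alert('script code here');");
scriptNode.appendChild(txtNode);   // older IE may throw here
document.getElementsByTagName("head")[0].appendChild(scriptNode);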
[So how do you work around this aspect of HTML parsing? You have to find an alternate way of expressing the string you want. Typically, this is done by breaking it up into two strings that you then reassemble:
document.write("<SCRIPT>blahblah</SCRI"+"PT>");]
Another solution, recommended by comp.lang.javascript, is to escape the forward slash in the </SCRIPT> tag:
<script type="text/javascript">
document.write('<script type="text/javascript">alert("hi");<\/script>');
</script>
It should work in all web browsers. (To the script engine, "<\/script>" is the same string as "</script>", but the HTML parser never sees a literal "</" inside the block.)
According to the HTML language spec (http://www.w3.org/TR/html401/appendix/notes.html#notes-specifying-data), it’s not just </script> that should break out of a script element – it’s the sequence ‘</’. They give the following as an example of script that won’t work:
<SCRIPT type="text/javascript">
document.write ("<EM>This won’t work</EM>")
</SCRIPT>
In reality, this script actually -does- work, at least in IE6 and Firefox…
What’s worrying, to my mind, though, is that following the absolute letter of the HTML spec, the character sequence ‘</’ is simply illegal in a <script> element. Bad news if you want to specify a Javascript regex literal that matches strings ending in a less-than character.
In the course of investigating this I found out that the following actually appears to be syntactically valid Javascript, although it is, of course, semantically utter nonsense:
<script>
var a = / </i> 1;
document.write (a);
</script>
Imagine you’re an XML-ish parser. That does look -awfully- like a well-formedness-breaking closing </i> tag nested inside that <script> element, doesn’t it..?
BrianK: Oh, thanks for that. Unfortunately, I can’t test for and work around every browser incompatibility there is, especially for browsers that aren’t available on my platform. The empty <script> tag works fine in Firefox and Konqueror, and is fine according to the w3c validator, so that’s good enough for me.
I’m also able to really not care as my site is perfectly functional if you have javascript turned off (or just not available), so the worst that will happen for IE users is that they won’t get some of the non-content-related-but-flashy doodahs.
I guess users of Explorer had better keep putting that pressure on their vendor to improve support for the spec if they want their doodahs though! :)
CDATA is the way to go…
Q: How do you embed ]]> in a script block?
A: ]]&gt;
Maurits: But if it's CDATA, is the &gt; going to be parsed as an entity reference and replaced with the appropriate characters before being passed on to the script engine? Based on what I know of XML, I don't think it will be.
Adam: Yeah, I wish I didn’t have to do that either; it’s several characters that I wouldn’t have to type every time I refer to a script. Unfortunately for me, everyone in the company uses IE6 SP1 or SP2, and they’d get a little annoyed if the flashy stuff that I told them I was doing didn’t show up. Even if it wasn’t required to use the site (and it isn’t), they’d still be annoyed.
(Actually, I’m not sure whether <script /> works in the IE7 betas either. Haven’t tried it.)
Well, there’s this way too:
http://en.wikipedia.org/wiki/CDATA_section
<![CDATA[foo]]]]><![CDATA[>bar]]>
becomes: foo]]>bar
Except that’s not going to work according to the XML standard:
http://www.mit.edu/~ddcc/xhtmlref/text.html
"Encasing scripts and style sheets in comment delimiters (<!– –>) does not officially work. According to the W3C, the parser may remove all comments before passing the code onto the user agent. In addition, C-like languages, including Javascript, have a decrement operator ("–") that just happens to be the SGML comment delimiter."
Also:
"Interestingly, XML has a special construct designed to deal with the script and style sheet problem. Anything wrapped between "<![CDATA[" and "]]>" is treated as CDATA. Thus, using the same example, the fragment of code could be rewritten this way:
<script type="text/javascript"> <![CDATA[ if (h && i) j(); ]]> </script>
The problem with this solution is that not many browsers understand this syntax either. You might try wrapping the CDATA markers inside comments. (Use the comments of your scripting or style-sheet language, mind you. If you use the SGML-style comments, all sorts of nastiness may ensue.) The other problem is that if your script or style-sheet actually contains the sequence "]]>," you're out of luck again."
And the last quote from that page:
"Lastly, your best solution may be just to use external scripts and style sheets, avoiding this whole big mess."
Which is what I do.
(This is also part of the "Unobtrusive Javascript" idea, which holds that putting *any* script code inside your HTML file is a mistake. This is for the same reason that using *any* inline style attribute is a mistake — if you want to change the style (or the code), you’ll potentially have to edit all your HTML files, instead of the stylesheet (or script file).)
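For reference, the "CDATA markers inside comments" trick that the quoted page alludes to is usually written like this sketch (reusing the quoted example's condition); the script-language comments hide the markers from engines that treat the block as plain script, while an XML parser still sees a CDATA section:

<script type="text/javascript">
//<![CDATA[
if (h && i) j();
//]]>
</script>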
<script type="text/ecmascript" src="blah.js"></script>
works just fine. ;-)
Still various errors here. The HTML 4.01 spec says:
—
Although the STYLE and SCRIPT elements use CDATA for their data model, for these elements, CDATA must be handled differently by user agents. Markup and entities must be treated as raw text and passed to the application as is. The first occurrence of the character sequence "</" (end-tag open delimiter) is treated as terminating the end of the element’s content. In valid documents, this would be the end tag for the element.
—
Entities must be treated as raw text; &lt; in a script block is just &lt; to a script, and any browser replacing the entity is not conformant.
To the HTML spec, at least. In XHTML, the content type of the script element is PCDATA, so there markup and entities do get parsed. Great, huh?
I agree with previous comments: just make the script external.
Wouldn’t the parser start a new script block at the document.write("<SCRIPT>… and then end *that* one when it finds the </SCRIPT> ?
A nice little article that I just read on a new blog I found this (rough) morning: Raymond's…
If you were changing the script tag syntax, surely the simplest mod would be to be able to specify the closing tag, a la << in perl. Eg.
<script close="THISISREALLYTHEEND">
// </script> That didn’t matter
THISISREALLYTHEEND>
You might or might not specify where the close appear on the line, or if you can/must have a </script> after it as well.
But the HTML parser only needs to know one new thing, and everyone who invents a language where THISISREALLYTHEEND is the assignment operator can just choose another string, maybe a multiline string.
To say that you can’t determine the language of the script and hence the comment/quoting style is pretty lame.
Clearly IE makes assumptions about the script type ANYWAY, so why not just use that format (the assumed script language) to decide what quotes/comments are?
Don’t forget that each external script file is a synchronous network request that must be processed before parsing of your page can complete (document.write can’t be deferred).
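For scripts that don't use document.write, the HTML 4 DEFER attribute is the standard hint that the parser need not stall on the request; a sketch, with a made-up filename:

<script type="text/javascript" src="extras.js" defer="defer"></script>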
Jack, Silkio: Consider applications that want to parse HTML that aren’t full-blown browsers. Or even browsers that don’t support scripting. Do you think all browsers should have to fully parse even JavaScript just to be able to find the end script tag, even if they’re not going to do anything with the script?
HTML is NOT a programming language. It is a document markup language. A parser should be able to determine where the markup sections start and stop with /relative/ simplicity.
On top of this, even according to the HTML specs (particular wording here taken from HTML 4.01 Appendix B.1 at http://www.w3.org/TR/html4/appendix/notes.html but most versions of HTML have a section like it) a user-agent should be able to handle markup it doesn’t recognise:
* If a user agent encounters an element it does not recognize, it should try to render the element’s content.
Of course, rendering content shouldn’t apply to elements in the <head> of a document, but such a user-agent should still be able to reliably find the end of the element it needs to ignore.
Neil: The external script file (as with external style sheet files) should be cacheable, even if the rest of your site isn’t. e.g. if it’s some kind of database-backed shopping site, etc…
Using external script files can decrease your total bandwidth usage quite a bit, and may well speed up all page views to your site bar the first if the script(s) are large.
Jack, in order for the "heredoc" (the name of the Unix shell feature that Perl includes) paradigm to work, it would need to have been included in HTML back in the early 1990s. Why?
Well, an HTML parser that doesn’t know about scripts would never know to look for the sentinel at the end of the script. The only way for the naive parser to be able to find it is if the ‘close’ attribute were already defined as an option on every single element. That way a parser would automatically look for a ‘close’ attribute on every tag, whether it understands the tag or not.
Similarly, a ‘render=false’ option would have been nice also. That way in-line scripts and stylesheets would have a way to indicate to downlevel browsers that they should not render the contents of their tags.
Adam:
Like I said, if the browser chooses not to process the javascript, then there is no issue. The only confusion occurs where PART of your javascript is processed due to a script tag in the middle of it.
I’m not asking non-"script" processing browsers to start processing it, I’m saying that if they DO process the script, and DO execute part of it, we have a right to be a little upset that they decided to be ignorant about the script tag embedded inside.
Sorry silkio, but it still won’t work. Any browser that doesn’t know your scripting language won’t know when to stop parsing as script and when to start parsing as HTML again.
A browser that doesn’t understand ANY scripts will render them all as text anyway, so it won’t care about whether the </SCRIPT> tag embedded in them should be rendered or not.
However any browser (Lynx comes to mind) that knows about scripts will never want to render the content of the script block whether it knows how to parse the language or not. You want to be able to write:
<script language="PerlScript">
# this is a <script></script> block
document.write(qq!<script>$script</script>!)
</script>
Unfortunately, only browsers that understand PerlScript know how to parse it properly. All others would show "block document.write(qq!!)". Since PerlScript is a pluggable script engine, my browser understands it but yours might not.
Silkio:
But this isn’t just about what browsers that do understand javascript have to do.
You can’t say "</script>" in a script block because browsers/parsers/applications that don’t understand scripts still have to be able to tell where the end of the script block is so they can process the rest of the page correctly.
If you allow "</script>" in the script block in any form (either in a literal string, or in some other case) then all these other programs need to understand enough javascript to be able to spot a literal string, and a regular expression, and a comment, etc, etc, etc… Basically, they need to fully understand javascript in order to find the "real" end of the script.
For this reason, you cannot allow "</script>" tags inside a script, in order for non-script processors to be able to understand HTML.
Therefore, even browsers that do understand scripts cannot allow this either. If such browsers did support embedded "</script>" tags, people would write code that did that, test it in their browser, see that it worked as expected and assume it was fine. However, non-script-enabled browsers would just break, having found the "</script>" tag that they thought was the end and trying to process the rest of the script as HTML.
Gabe/Adam:
What I’m saying is that if the browser processes the script at all, it can’t pretend that it doesn’t know how to find comments/text.
That is to say, the following code:
============
<script>
alert(1);
//</script>
alert(2);
</script>
=============
should produce either:
– alert 1
– alert 2
or
– NOTHING.
i.e, the processor has already figured out what language it is, so why does it sit back and declare "oh no, i don’t know how to find comments, i’ll just end now."
and about lynx; even though it’s textbased it still needs (or at least should) process script … obviously not all script is for visual purposes.
Silkio> "and about lynx; even though it’s textbased it still needs (or at least should) process script … obviously not all script is for visual purposes."
Ah – now I see where our differences lie. :)
This is the statement I disagree with. And I'm not convinced why it should be the case. IMO, an HTML parser should be able to parse HTML without having to be able to parse JavaScript too. Why do you think otherwise?
Also, not all HTML parsers are in web browsers.
What about spiders, like googlebot? Should that have to be able to parse javascript so that it doesn’t think you have the text "alert(2);" in your web page?
Adam:
I don’t think browsers should have to parse javascript or any script. What I’m saying, though, is that if they TRY, then must assume a certain type of script to do so.
IE will assume javascript, if it’s not specified.
For example, the following won’t popup a message box in IE unless "type=’vbscript’" is specified:
============
<script>
MsgBox("1")
</script>
============
I still think you are missing my main point … that if the browser DOES try and guess the language (which IE clearly does) then it’s a lie to say you don’t know how to resolve comments and strings.
Spiders will need to process javascript anyway, but if they don’t, that’s totally fine, they can just find the </script> block where it lies. Other script-aware HTML parsers have no programmatic excuse for acting so ignorant.
Silkio> "Spiders will need to process javascript anyway …"
Why? Please explain the logic underlying that conclusion.
Silkio> "…but if they don’t, that’s totally fine, they can just find the </script> block where it lies."
Huh? But if you have an embedded </script> tag, that’s the one that non-javascript-aware HTML parsers will hit. They can’t find the proper end of the script block. That’s the whole point! That’s why embedded </script> tags must be disallowed.
I give up.
I’ll just say that I’d hate to try to write an HTML parser one weekend in a world where you controlled the HTML standards. :)
"Consider the page consisting of:
<script language="javascript">document.location = ‘realhomepage.html’;</script>"
You don’t need Javascript to do redirects. Use either an HTTP header or a <meta> tag to redirect instead.
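A sketch of both alternatives, reusing the realhomepage.html target from the example being quoted (the host name below is a placeholder): the meta tag goes in the document's head, while the HTTP approach sends a redirect status and a Location header before any HTML at all.

<meta http-equiv="refresh" content="0; url=realhomepage.html">

HTTP/1.1 302 Found
Location: http://example.com/realhomepage.html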
Adam:
:|
> Silkio> "Spiders will need to process
> javascript anyway …"
>
> Why? Please explain the logic underlying that
> conclusion.
Consider the page consisting of:
<script language="javascript">document.location = ‘realhomepage.html’;</script>
my point was not "All bots should process javascript so that can discover valid ending script tags" it was "a smart bot will process javascript [which isn’t really relevant to our discussion anyway]".
> Silkio> "…but if they don't, that's totally
> fine, they can just find the </script> block
> where it lies."
>
> Huh? But if you have an embedded </script>
> tag, that’s the one that non-javascript-aware
> HTML parsers will hit. They _can’t_ find the
> proper end of the script block. That’s the
> whole point! That’s why embedded </script>
> tags _must_ be disallowed.
Because different parsers may see different html output based on what scripting language they support? Sure, I agree. That seems like an okay reason to disallow it … not great, but …
All I was initially trying to say is that Raymond’s comment that he doesn’t know HOW to find the script tag is not true. IE *can* figure out if it’s a valid ending tag if it wanted to, but it doesn’t want to. It’s not a programming problem (like Raymond was trying to say) it’s a logical one … :)
Ross: Yes, but that’s not at all the point.
Adam: :) Come on now, I’m really not trying to say that all parsers
have to implement script-parsing, I am just saying that if they do,
they should be able to detect a script tag.
How would you write the parser that figures out which of those “</SCRIPT>”s is the real /SCRIPT tag? -Raymond]
Raymond: What do you mean? I don’t know “leandre” language, so I would take the first one.
If I did know leandra language, I would try and parse that script
using its grammar. If I failed (i.e. invalid token or something), I'd
take the next /script tag at the index of where i failed. if I passed,
I’d know where it ends and also take the next /script tag.
…
?
If it doesn’t know leandra it takes the first tag.
This is what I’ve been saying all along. If you can’t/dont want to process the language, then it doesn’t need to try.
[In other words, you are okay with a web page parsing in two completely
different ways depending on whether the browser supports the leandra
language. Good luck writing an HTML validator! And if the browser does support it, it'll have to load the leandra parser in order to find the correct end tag, which completely ruins the DEFER attribute. Wait a second, I'm just repeating myself. This is exactly what I wrote in the main article! -Raymond]
Raymond:
In the OP you said it wasn’t possible. Clearly it is possible; that’s all I was trying to say.
I’ve never used the “DEFER” attribute so I’ll take your word that it will become useless.
I don't mind the way it currently works, [not that I'd expect a
change if I didn't] and I don't necessarily think that it would be a bad
thing to ask the language for the ending /script tag. It seems to be
the most intuitive.
Either way, all I was trying to point out is that your OP is wrong …
Silkio: Are you under some kind of delusion that javascript is either part of the HTML spec, or that HTML has some kind of special relationship with javascript that it doesn’t have with other scripting languages (such as leandre)?
AC: no.
AC.
As I said in my earlier posts, I don't think this is the best way for the parsers to operate; I was only commenting on Raymond's wording in the OP.
–
To answer your question though; the processing of javascript can change the document that the parsers parse anyway, so what’s the big deal?
Both parsers are correct; one just implements JS and is able to understand the document better.
Raymond: What’s wrong with this process:
– get script language
– if known lang, request /script index
– if unknown lang, /script index = first from this point
..?
(Other then the fact it will lead to different parsing structure depending on known languages.)
OK. If that particular script does nothing other than execute those two alerts, and does not change the document in any way, and assuming that the rest of the document is otherwise valid, would you say that the document should be considered valid HTML 4?
If you think the document is valid HTML 4, how can you claim that the non-js-aware parser is correct if it misidentifies the document as being invalid?
(And as an aside – what would be a good way for the non-js-aware parser to operate?)
Raymond:
Ok. We disagree then.
AC: As I said, this method has issues, but it's not impossible. Clearly the fact that the tags you choose to use are up to the script will then change the way your document validates. I don't actually think it's bad that document validation depends on script; browsers interpret script as well, and you can currently easily make a valid document invalid with the simplest bit of scripting.
Humour me. This exact document:
<!DOCTYPE HTML PUBLIC
"-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>test</title>
<script>
alert(1);
//</script>
alert(2);
</script>
</head>
<body>
<p>test</p>
</body>
</html>
Should it be considered valid HTML? Yes or no?
Doh. Missed ‘type="text/javascript"’ attribute from the <script> tag. Pretend it’s there.
AC: Well it depends if the validator implements the javascript language, doesn’t it :)
I am the validator, and I happen to understand javascript, so yes, I will call that document valid.
Therefore, are you saying that an HTML parser that correctly follows every last paragraph of the HTML spec, but is not js-aware, is an incorrect HTML parser[0]?
Don’t you think there’s a slight internal inconsistency with that?
(Have you ever been to Milliways restaurant?)
[0] Or do you think that the correctness of a parser can be measured other than by whether it follows the spec it is implementing, and whether it can determine if a document also follows that spec?
No, it doesn’t depend if the validator implements the javascript language.
The definition of HTML 4.01 (as defined in (<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">) does not understand javascript.
Therefore, by definition, this is not a valid HTML document. Period. End of story. No amount of asserting that the parser understands multiple languages will make it a valid 4.01 document.
It may be a valid Silkio-browser document, but it is not a valid HTML 4.01 document.
Guys:
You're not getting it. I've said in almost every post of mine that this method has issues. I agree that a non-js-parsing validator will see this document as invalid. That's what I said would happen.
I even said I don't think this is a great way of doing things; but on the other hand, is it so bad that scripting languages – something which can CURRENTLY affect the structure of a doc – become a part of the document validation process? I don't think so.
OF COURSE this method does not meet the current HTML 4.01 spec.
To clarify: If “silkio-validating” was implemented, the spec would need to be changed.
Raymond: Right.
The suggestion here is that "parsing" of an HTML document should include executing and processing of the script areas.
(Yes, there are issues with this. Putting it lightly…)
HOW?
I’m trying to be hypothetical and move away from the current spec into a "how would you have this work then" kind of vibe, and trying to understand where you’re coming from, but…
How would you change the spec, such that a validating parser could possibly be written without being internally inconsistent?
AC:
By having the spec include a spec for processing javascript? But then you’d have to have specs for all scripting languages …
So the spec could just say : "execute and process script to get actual page structure".
Then each validator/"parser" can implement as many scripting languages as it likes, and validate away, meeting the spec!
If it doesn't have any scripting languages, then that's fine, it can't get the desired page structure and will call the doc invalid due to unknown script. [of course it can still display as much of the doc as it could].
Obviously all hypothetical, and written on the fly, so give me a break if I made a typo/omission :)
Hopefully it’s clear …
Oh.
So in that case, for the following HTML that you posted:
<script>
alert(1);
//</script>
alert(2);
</script>
You think that:
1) A js-aware HTML parser should see the first </script> as a comment, the "alert(2);" as part of the script, and the second </script> as the end-of-script tag.
2) A non-js-aware HTML parser should see the first </script> as the end-of-script, the "alert(2);" as text in the enclosing element (should be the enclosing head element, where text is not allowed), and the second </script> as an invalid close tag that doesn’t match an open tag.
Is that correct? Two validating HTML parsers, which both fully support the HTML 4 spec, should be able to treat the same HTML document differently, and for one to find the document valid, and the other to find it broken?
Which parser is correct? Both? Is one more correct than the other? If one *is* more correct than the other, why should the "less correct" parser not need to be fixed to be "more correct"?
AC: [if you can still read after that explosion].
The spec isn't changed so that it's impossible to reliably parse. You can still reliably find the script blocks of a document. All you do is write your parser so that it does NOT implement a scripting language, and simply goes for the first /script it finds.
If you DO choose to write a parser that understands a given script language, your parser just must request the ending script tag from the appropriate language parser.
I don’t see how this suggests a spec which is impossible to reliably parse.
BryanK:
Different "types" of parsers will get different results. Javascript-enabled parsers will get better results.
–
Think about what the spec is trying to do. Scripts have a special
ability in HTML; they can modify the code that is parsed. They can
create more, or delete some, or change parts. This affects the end
result to the user, and the reason we even have validators is so that
the end user sees the same thing everywhere [in all browsers].
The point is, by processing/parsing the SCRIPT of a document, you can gain a better understanding of it.
This can only be a good thing.
You are saying it's bad because a given validator understands a given doc better than another; I say, what's wrong with that?
It’s hardly a crime to have a tool that gives more accurate results!
So, you are actually saying that you want to change the spec so that it’s impossible to write a parser that can reliably tell which parts of the document are _not_ script, and therefore have _no way_ of knowing which parts of the document it is even _meant_ to process.
No, I can make that simpler. You are actually saying that you want to change the spec so that HTML is impossible to reliably parse.
And you don’t think that just having to break up "</script>" in a string literal or comment is the simpler and more sensible thing to do? Even though you’re smart enough to not speak in l33t, to write with correct grammar and basically to be all articulate and everything?
But…, but…, but…, *head explodes*
"You’re not getting it. I’ve said in almost every post of mine that this method has issues."
I get it. You’re saying that this method is both broken and not-broken. But I think it’s actually broken OR not-broken. Specifically, broken.
You may be able to "reliably" (though I’m not sure how you can justify that part) find the script block. But you *cannot* reliably find the end. You can find *an* end, but if it’s not always *the* *same* end, then you do *not* have a reliable parser.
IMO, in order to "reliably" parse anything, you need to always get the same parse tree, no matter what optional (or "extra") parts of the spec you support. If you don’t, then the spec has problems.
Note that I’m not talking about one parser always getting the same result, I’m talking about *all* (compliant) parsers always getting the same result as each other.
What’s wrong with that is that there’s no way to *reliably* come up with HTML that includes script code (reliably as in: people always see the same thing before the script runs). Today, people see different "end" web pages depending on whether they have JS enabled or not, yes. But if your version of HTML existed, people that didn’t have JS enabled would (potentially) see a bunch of gobbledeygook code in the middle of the HTML, which may also include various other elements that *weren’t* supposed to be output.
How is it better to (1) not have the script be able to modify the document, *and* *also* (2) see a bunch of text you don’t understand and don’t care about? That goes against every "fail gracefully" maxim in existence. JS is supposed to fail gracefully if the user-agent doesn’t understand it; so is CSS.
> It’s hardly a crime to have a tool that gives more accurate results!
No, but it is a crime to artificially cripple tools just because they don’t understand the language you used.
My basic axiom is: A document should be either valid all the time, or invalid all the time. You *can’t* make a document’s validity depend on how well the validator understands script. The whole *point* of validation is to give your document a prayer of showing up the same way regardless of browser (as long as the browser complies with the standard). If you suddenly introduce some feature that makes random script text show up in some browsers, you’ve broken that.
It's all very nice to put effects everywhere… But if it makes a site very slow, is the JavaScript worth the lag? While redesigning my site, I scratched my head furiously for you.