Date: | September 16, 2004 / year-entry #339 |
Tags: | code |
Orig Link: | https://blogs.msdn.microsoft.com/oldnewthing/20040916-00/?p=37853 |
Comments: | 8 |
Summary: | The RegexOptions.ECMAScript flag changes the behavior of .NET regular expressions. One of the changes I had discussed earlier was with respect to matching digits. For those who want to know more, a summary of the differences is documented in MSDN under the devious title "ECMAScript vs. Canonical Matching Behavior". Apparently some people had trouble finding... |
Apparently some people had trouble finding that page, so I figured I'd point to it explicitly.
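As a rough illustration of the digit-matching difference, here is a minimal sketch run in a JavaScript engine rather than in .NET. It assumes a runtime that supports Unicode property escapes (the `u` flag with `\p{...}`); `fullMatch` is a hypothetical helper name, not part of any library. In ECMAScript, `\d` matches only the ASCII digits 0 through 9, while the Unicode category `Nd` also covers digits from other scripts.

```typescript
// Hypothetical helper: true if the entire input matches the pattern.
function fullMatch(pattern: string, input: string, flags: string = ""): boolean {
  return new RegExp("^(?:" + pattern + ")$", flags).test(input);
}

const asciiThree = "3";
const arabicIndicThree = "\u0663"; // ARABIC-INDIC DIGIT THREE, Unicode category Nd

// ECMAScript \d is exactly [0-9].
console.log(fullMatch("\\d", asciiThree));                // true
console.log(fullMatch("\\d", arabicIndicThree));          // false

// Asking for the Unicode decimal-digit category is a separate, explicit request.
console.log(fullMatch("\\p{Nd}", arabicIndicThree, "u")); // true
```

This shows the ECMAScript side only. How .NET behaves with and without RegexOptions.ECMAScript is covered by the MSDN page mentioned above.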
Comments (8)
The ECMAScript rules are presumably there in order to be compliant with the ECMAScript standard. You’ll have to ask them why their behavior is as it is. (I suspect it’s for unix compatibility.)
I don’t recall the unixy tools (well, Perl) treating "\7" as "7" if there is no seventh backreference. Then again, I never ran into a situation where I did not know exactly how many backreferences there were. Regexps that are more than 80 characters or so are very hard to maintain.
Ah, a google for [perl regexp backreference] yields:
http://www-2.cs.cmu.edu/People/rgs/pl-regex.html
"Within the pattern, \10, \11, etc. refer back to substrings if there have been at least that many left parens before the backreference. Otherwise (for backward compatibility) \10 is the same as \010, a backspace, and \11 the same as \011, a tab. And so on. (\1 through \9 are always backreferences.)"
But ECMAScript was an attempt to standardize JavaScript during the DOM explosions and JScript/JavaScript battles. Just because JavaScript regexps were similar to Perl (the most regexp-laden language I can think of) doesn’t mean ECMA tried to match Perl. The ECMAScript specification is pretty coarse reading, even for one used to reading specs…
I was referring to classic sed and awk. Perl’s regex rules are "modern" by comparison.
I’m not sure if treating \n as n when there is no nth grouping, instead of raising an error, is right or wrong, but it doesn’t really bother me. I expect to have to debug my regexps, and I think that seeing n where I expected the nth backreference would be pretty obvious.
The goal of ECMAScript is to get through the user’s program unless there is absolutely no way to figure out what’s going on. That’s an essential characteristic shared by many languages designed for scripty tasks. If you find yourself wanting more robust error handling, you may be using the wrong tool for the job.
Raymond: Did awk (or sed) support backreferences? I did not remember them from awk… but it’s been over a decade for me. Awk had $1 and the like, but I didn’t recall them in the regexps.
Nicholas: I don’t care so much about robust error handling in ECMAScript, but I would rather it throw an error than hide one. Debugging JavaScript is… not fun.
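For reference, the fallback being debated can be checked in an actual JavaScript engine. This is a minimal sketch of what the ECMAScript standard's legacy (non-`u`) regex grammar does, which is not necessarily identical to .NET's RegexOptions.ECMAScript mode; `tryTest` is a hypothetical helper name. Patterns are built with the `RegExp` constructor so that the escapes are interpreted at run time.

```typescript
// Hypothetical helper: compile the pattern and test the input.
// A pattern that fails to compile is reported as "error" instead of throwing.
function tryTest(pattern: string, input: string, flags: string = ""): boolean | "error" {
  try {
    return new RegExp(pattern, flags).test(input);
  } catch {
    return "error";
  }
}

// Group 1 exists, so \1 is a backreference.
console.log(tryTest("(a)\\1", "aa"));      // true

// No seventh group: in legacy mode \7 quietly becomes an octal escape
// (character code 7, BEL), so it does not match a literal "7".
console.log(tryTest("\\7", "7"));          // false
console.log(tryTest("\\7", "\u0007"));     // true

// \8 is not a valid octal digit, so it falls back to a literal "8".
console.log(tryTest("\\8", "8"));          // true

// With the "u" flag the same pattern is rejected outright.
console.log(tryTest("\\7", "7", "u"));     // "error"
```

So a JavaScript engine gives you both worlds: the legacy grammar hides the mistake, and the `u` flag turns a dangling backreference into a SyntaxError at compile time.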
That is indeed a devious title. "Canonical" invokes images of canon, or accepted protocol. Not to bash, but .net’s regexp implementation is by no means canon. (it’s closer to the average than I expected, though: bonus!)
The ECMAScript behaviour of "use backreferences if defined, otherwise literal" does seem odd. Why would you write "\7" when you wanted a literal seven? Seems like an easy way to hide errors.
On the other hand, I don’t like the "canonical" "\nnn" rule. A 400th backreference? I guess it’s nice not having to have hard limits, but I’m a regexp freak, and I have never needed more than five. Once you have that many you’re better off (for readability) using a procedure to assist regexps in splitting things apart.
Thanks for the link. I’m not doing much .net now, but even knowing that a little thing like this exists somewhere will make it much easier to find when I *do* need it.