Date: | October 21, 2014 / year-entry #249 |
Tags: | code |
Orig Link: | https://blogs.msdn.microsoft.com/oldnewthing/20141021-00/?p=43803 |
Comments: | 30 |
Summary: | Before you ask a question about regular expressions, you should make sure you and your audience agree on which regular expression language you are talking about. Here is a handy table of which features are supported by which regular expression language. You can use that table to solve this customer's problem: I have a regular... |
Before you ask a question about regular expressions, you should make sure you and your audience agree on which regular expression language you are talking about. Here is a handy table of which features are supported by which regular expression language. You can use that table to solve this customer's problem:
Comments (30)
Without even looking, my guess: findstr either does not support (negative) lookahead, or does not support that particular syntax for it.
/C:string Uses specified string as a literal search string.
@Ed: /r: Uses search strings as regular expressions
Your handy table doesn't list findstr specifically, but TechNet does:
technet.microsoft.com/…/bb490907.aspx
Why on earth isn't there a switch for findstr to use Microsoft's ECMA implementation, which would have a lot higher penetration? (I can understand it not being the default, because that could break things. I guess the 12 lines of code to initialize the COM object and call through to it would kill the exe size?) But that's neither here nor there.
findstr apparently doesn't support grouping, capture or negate — it looks like it only supports character ranges and wildcards.
This frustrates me to no end. So many tools support regular expressions and just about all of them are different. It's a mystery why there isn't a standard library that was settled on and used everywhere by now. It's not like the problem is heavily coupled to a particular platform. A straight C implementation of a standard library would work fine anywhere you would want to use it.
Tim: it's not such a big mystery to me. I bet it's simply because by the time people realized how annoying all the incompatible syntax was, it was too late to do much about it without breaking things. On the bright side: at least the list is no longer growing much, as people these days tend to indeed pick one of the popular ones and implement that. Sometimes they even manage to implement it correctly!
@Tim: Lots of open standards have this problem (cf. Markdown). It's endemic.
http://xkcd.com/927/
Kinda stretches the definition of "regular", doesn't it? So many irregular regular expression parsers…
@Tim: There is a standard library for it – the POSIX.2 additions to the C standard library[0][1]. The NT family of Windows OSs has been POSIX.2 compliant for years, I heard somewhere?
Alternatively, there's std::basic_regex in the C++ standard library[2], which handles the ECMAScript, basic POSIX, extended POSIX, awk, grep and egrep families. They were introduced in std::tr1 in 2007, so any modern standards-compliant C++ implementation should have them by now.
/snark
Alternatively, while it's not a de-jure standard, PCRE[3] is a straight C implementation of Perl's regex syntax, and is popular, available for most platforms (including Windows), and in wide use by a large number of applications. It's in the dependency chain of Chrome and Firefox (along with dozens of other packages) on my system, so you may have a copy already installed.
[0] linux.die.net/…/regex
[1] linux.die.net/…/regex
[2] en.cppreference.com/…/regex
[3] http://www.pcre.org/
See, they're just getting too fancy…the pipeline is your friend:
findstr /r /c:"a(?!.*b)" file.txt
findstr a file.txt | findstr /V ab
"a not followed anywhere by a b": that sounds like "a and b were sitting on the pipe", with similar consequences.
@RJB: findstr a file.txt | findstr /V /r /c:"a.*b"
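The two approaches above are not quite interchangeable. Here is a minimal Python sketch that compares them, assuming Python's re module as a stand-in for a lookahead-capable engine and emulating the findstr pipeline with two list filters. The sample lines and the function names are hypothetical and not from the original post.

```python
import re

# Hypothetical sample lines, invented for illustration.
LINES = ["a", "ab", "ba", "aba", "xyz", "a then b", "b then a"]

# The lookahead pattern quoted in the thread.
LOOKAHEAD = re.compile(r"a(?!.*b)")


def lookahead_filter(lines):
    """Keep lines that contain some 'a' with no 'b' anywhere after it."""
    return [line for line in lines if LOOKAHEAD.search(line)]


def pipeline_filter(lines):
    """Emulate: findstr a file.txt | findstr /V /r /c:"a.*b"."""
    # First stage: keep lines containing an 'a'.
    with_a = [line for line in lines if "a" in line]
    # Second stage (/V): drop lines where any 'a' is followed by a 'b'.
    return [line for line in with_a if not re.search(r"a.*b", line)]


if __name__ == "__main__":
    print("lookahead:", lookahead_filter(LINES))
    print("pipeline: ", pipeline_filter(LINES))
```

On these samples the two filters disagree only on "aba": the lookahead accepts it because the final a has no b after it, while the pipeline rejects it because the first a does.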
@dave,
And, in contrast to most regexp languages: Snobol is _readable_!
I must admit that I never expected to find anyone defending Snobol in the year 2014. Since I learned it as a student, more than 30 years ago (and came to love it), I have met fewer than a dozen people who know it at all. I have never seen it in a real, commercial application (but I still have the source code of a primitive "Liza" lying around, dated 1976, if I remember right). Once I made a proposal for an extension to Pascal for embedding Snobol-type expressions into that language, but it never got past the drawing board. That was in the early 1980s, a time when regexps were known only to a small fraction of programmers, so I had to defend both the idea of pattern matching and the value of incorporating it into an algorithmic language. The second part would have been difficult alone, even if people had been open to the first part.
Now that you remind me of it… I hope that I have preserved my old proposal. Maybe I should dig it up and see if I still can defend it…
We should all embed snobol4 interpreters instead. Now *there's* a pattern-matching language.
@J b, I find Regular Expressions to be fairly readable. I mean they can get pretty bad but if you don't go overboard and keep your problem specific they tend not to be too bad.
findstr /R /C:"^[^aA]*[aA][^bB]*$" file.txt
@Wear
That's only true for certain values of 'readable'
Please use the command shell specified by the Microsoft Common Engineering Criteria for these kinds of tasks. It reuses the regex parser from .NET, with no surprises.
@Wear The ^[^aA]* at the beginning of your regular expression is superfluous as the match will always start at the first A anyway.
Sweet link, thanks! Now I finally know what to look for when reviewing my coworkers’ GNU {B,E}RE for portability. (Though my BSD re have [[:<:]] and [[:>:]] extensions, which aren’t listed on that page, but simple to spot. http://www.mirbsd.org/…/re_format for reference.)
The current location of the 'flavors' page referenced above is http://www.regular-expressions.info/tools.html
@Neil @Myself Or probably I spoke too soon and misread the original problem as to find all lines that don't have an "a" followed by a "b".
@Kevin: There's actually now a Markdown specification and reference implementation called CommonMark http://commonmark.org/ that's intended to be the new standard Markdown. Though it's still not any kind of official standard — it's not ISO/IEC/IETF/ANSI/ECMA/etc., nor was it produced by the original creator of Markdown, John Gruber.
@Jürgen @Neil That was my reading as well. I guess it could be "There is at least one 'a' without 'b's after it", in that case the "^[^aA]*" isn't needed.
@RonO, umm sure, except it doesn't include the table which was the whole point of Raymond including it in the post in the first place.
I've run into the same issue of owning an existing product that was developed before a standard existed, and having to figure out how to get standard-compliant behavior without breaking things. It can be a difficult engineering problem. In the case of findstr, maybe it would have been simple to add another option. OTOH, I wouldn't be surprised if no one had touched the findstr code since long before any standardization work in this area. I've worked in groups with a strong aversion to making changes to old stable code. There is a theorem (old wives' tale?) that for every change in code, there is a non-zero probability of introducing a defect.
Bah. Snobol is for losers. Real programmers use <a href=http://www.cs.arizona.edu/…/a>. Snobol with real structured programming.
@Adam: Within 24 hours of it going up, its name was changed twice, both times in response to objections from Gruber. It no longer calls itself "Markdown." I've yet to see any evidence that it has "caught on" in any significant sense.
@Neil No, the ^[^aA]* is not superfluous. If you leave it out "[aA][^bB]*$" would give a match for "aba".
@Funny How: You're correct. My mistake. :/
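The disagreement over the ^[^aA]* prefix can be checked mechanically. The following is a rough sketch that uses Python's re.search as a stand-in for findstr's match-anywhere-in-the-line behavior (an assumption, since findstr's engine differs in other respects). The helper name is hypothetical.

```python
import re

# Pattern from the thread, with and without the leading ^[^aA]*.
ANCHORED = re.compile(r"^[^aA]*[aA][^bB]*$")
UNANCHORED = re.compile(r"[aA][^bB]*$")


def line_matches(pattern, line):
    """Return True if the pattern matches anywhere in the line."""
    return pattern.search(line) is not None


if __name__ == "__main__":
    for line in ["aba", "ba", "ab"]:
        print(line,
              "anchored:", line_matches(ANCHORED, line),
              "unanchored:", line_matches(UNANCHORED, line))
```

With the prefix, the match is forced to start at the first a in the line, so "aba" is rejected. Without it, the engine is free to start at the last a, so "aba" is accepted. Which behavior is wanted depends on which reading of the original problem you take.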
@Wear,
I guess the main "philosophical" difference between regexp-variants and Snobol is that Snobol encourages an "algorithmic" approach: You are explicitly led towards the way the interpreter processes your pattern.
The regexp variants are more of the "predicate" approach, stating requirements to the result. People who love, say, XSLT and Prolog, are likely to love regexp unconditionally. In my student days, I took a course in Prolog, and later I have periodically been forced to work with XSLT. I never caught onto either. I have realized that my nature is the algorithmic way, rather than the predicative.
One great advantage of Snobol's algorithmic approach: You can very easily and readably construct a complex pattern piece by piece, gluing more complex pattern structures together from named, simpler components, analogous to the way you gradually build a complex data structure from lower-level components. That is one of the most essential readability aspects. Those predicative regexp-languages may try to offer a small piece of this functionality, but if at all provided, it is usually limited, and few if any make use of it.
The common way to write regexps is as one super-complex monolithic jungle. If you bring your digital machete to cut through the wilderness, you will most surely be chopping off some essential strand that keeps the whole thing alive… :-)
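For comparison, here is roughly what the piece-by-piece style can look like in a regexp language. This is only a sketch in Python, gluing named fragments together with ordinary string formatting. It is not Snobol and does not claim to match Snobol's pattern facilities. The timestamp format and all the names are hypothetical.

```python
import re

# Simple named components (all hypothetical, chosen for illustration).
YEAR = r"(?P<year>[0-9]{4})"
MONTH = r"(?P<month>0[1-9]|1[0-2])"
DAY = r"(?P<day>0[1-9]|[12][0-9]|3[01])"
HOUR = r"(?P<hour>[01][0-9]|2[0-3])"
MINUTE = r"(?P<minute>[0-5][0-9])"

# Larger components built from the simpler ones.
DATE = rf"{YEAR}-{MONTH}-{DAY}"
TIME = rf"{HOUR}:{MINUTE}"

# The full pattern: a date, a space or 'T', then a time.
TIMESTAMP = re.compile(rf"^{DATE}[ T]{TIME}$")


def parse_timestamp(text):
    """Return the named fields as a dict, or None if the text does not match."""
    match = TIMESTAMP.match(text)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_timestamp("2014-10-21 09:30"))
    print(parse_timestamp("2014-13-01 09:30"))
```

Each fragment can be read and tested on its own, which is the readability point being made above. The gluing happens in the host language, though, not in the regexp syntax itself.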