The great thing about regular expression languages is that there are so many to choose from!

Date:October 21, 2014 / year-entry #249
Tags:code
Orig Link:https://blogs.msdn.microsoft.com/oldnewthing/20141021-00/?p=43803
Comments:    30

Before you ask a question about regular expressions, you should make sure you and your audience agree on which regular expression language you are talking about.

Here is a handy table of which features are supported by which regular expression language.

You can use that table to solve this customer's problem:

I have a regular expression that works just fine when I test it in ⟨insert regular expression testing tool, like RegExr or RegexPlanet⟩, but I can't get it to work in real life.

C:\> findstr /r /c:"a(?!.*b)" file.txt
(prints no results!)
C:\>

My goal is to find all lines that contain an a not followed anywhere by a b.
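
For comparison, here is a minimal sketch of what the customer's pattern does in a flavor that does support negative lookahead. Python's `re` module is used purely for illustration (the customer never mentions Python), and the sample lines are invented stand-ins for file.txt.

```python
import re

# The customer's pattern: an "a" with no "b" anywhere after it on the line.
# (?!...) is a negative lookahead, which Python's re module supports.
pattern = re.compile(r"a(?!.*b)")

# Hypothetical sample lines standing in for file.txt.
lines = ["cat", "cab", "aba", "xyz", "banana"]

def lines_with_a_not_followed_by_b(lines):
    """Return the lines that contain at least one 'a' with no 'b' after it."""
    return [line for line in lines if pattern.search(line)]

matches = lines_with_a_not_followed_by_b(lines)
print(matches)  # ['cat', 'aba', 'banana']
```

Note that "aba" is reported: its first a is followed by a b, but its last a is not, and one qualifying a is enough.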


Comments (30)
  1. Kevin says:

    Without even looking, my guess: findstr either does not support (negative) lookahead, or does not support that particular syntax for it.

  2. Ed says:

    /C:string  Uses specified string as a literal search string.

  3. Phil says:

    @Ed: /r: Uses search strings as regular expressions

  4. Dave Bacher says:

    Your handy table doesn't list findstr specifically, but TechNet does:

    technet.microsoft.com/…/bb490907.aspx

    Why on earth there's not a switch for findstr to use Microsoft's ECMA implementation (I can understand it not being the default, because that could break things) — which would have a lot higher penetration (I guess the 12 lines of code to initialize the COM object and call through to it would kill the exe size?)…  but that's neither here nor there.

    findstr apparently doesn't support grouping, capture or negate — it looks like it only supports character ranges and wildcards.

  5. Tim says:

    This frustrates me to no end. So many tools support regular expressions and just about all of them are different. It's a mystery why there isn't a standard library that was settled on and used everywhere by now. It's not like the problem is heavily coupled to a particular platform. A straight C implementation of a standard library would work fine anywhere you would want to use it.

  6. Mark VY says:

    Tim: it's not such a big mystery to me.  I bet it's simply because by the time people realized how annoying all the incompatible syntax was, it was too late to do much about it without breaking things.  On the bright side: at least the list is no longer growing much, as people these days tend to indeed pick one of the popular ones and implement that.  Sometimes they even manage to implement it correctly!

  7. Kevin says:

    @Tim: Lots of open standards have this problem (cf. Markdown).  It's endemic.

    http://xkcd.com/927/

  8. DWalker says:

    Kinda stretches the definition of "regular", doesn't it?  So many irregular regular expression parsers…

  9. Karellen says:

    @Tim: There is a standard library for it – the POSIX.2 additions to the C standard library[0][1]. The NT family of Windows OSs has been POSIX.2 compliant for years, I heard somewhere?

    Alternatively, there's std::basic_regex in the C++ standard library[2], which handles the ECMAScript, basic POSIX, extended POSIX, awk, grep and egrep families. They were introduced in std::tr1 in 2007, so any modern standards-compliant C++ implementation should have them by now.

    /snark

    Alternatively, while it's not a de-jure standard, PCRE[3] is a straight C implementation of Perl's regex syntax, and is popular, available for most platforms (including Windows), and in wide use by a large number of applications. It's in the dependency chain of Chrome and Firefox (along with dozens of other packages) on my system, so you may have a copy already installed.

    [0] linux.die.net/…/regex

    [1] linux.die.net/…/regex

    [2] en.cppreference.com/…/regex

    [3] http://www.pcre.org/

  10. RJB says:

    See, they're just getting too fancy…the pipeline is your friend:

    findstr /r /c:"a(?!.*b)" file.txt

    findstr a file.txt | findstr /V ab

  11. Alex Cohn says:

    "a not followed anywhere by a b": that sounds like "a and b were sitting on the pipe", with similar consequences.

  12. EvilKiru says:

    @RJB: findstr a file.txt | findstr /V /r /c:"a.*b"

  13. j b says:

    @dave,

    And, in contrast to most regexp languages: Snobol is _readable_!

    I must admit that I never expected to find anyone defending Snobol in the year 2014. Since I learned it as a student, more than 30 years ago (and came to love it), I have met fewer than a dozen people who know it at all. I have never seen it in a real, commercial application (but I still have the source code of a primitive "Liza" lying around, dated 1976, if I remember right). Once I made a proposal for an extension to Pascal for embedding Snobol-type expressions into that language, but it never got past the drawing board. That was in the early 1980s, a time when regexps were known only to a small fraction of programmers, so I had to defend both the idea of pattern matching and the value of incorporating it into an algorithmic language. The second part would have been difficult alone, even if people had been open to the first part.

    Now that you remind me of it… I hope that I have preserved my old proposal. Maybe I should dig it up and see if I still can defend it…

  14. dave says:

    We should all embed snobol4 interpreters instead.  Now *there's* a pattern-matching language.

  15. Wear says:

    @J b, I find Regular Expressions to be fairly readable. I mean they can get pretty bad but if you don't go overboard and keep your problem specific they tend not to be too bad.

    findstr /R /C:"^[^aA]*[aA][^bB]*$" file.txt

  16. Steve says:

    @Wear

    That's only true for certain values of 'readable'

  17. 640k says:

    Please use the command shell specified by the Microsoft Common Engineering Criteria for this kind of task. It reuses the regex parser from .NET, with no surprises.

  18. Neil says:

    @Wear The ^[^aA]* at the beginning of your regular expression is superfluous as the match will always start at the first A anyway.

  19. mirabilos says:

    Sweet link, thanks! Now I finally know what to look for when reviewing my coworkers’ GNU {B,E}RE for portability. (Though my BSD re have [[:<:]] and [[:>:]] extensions, which aren’t listed on that page, but simple to spot. http://www.mirbsd.org/…/re_format for reference.)

  20. RonO says:

    The referenced 'flavors' page above is now located at http://www.regular-expressions.info/tools.html

  21. Jürgen says:

    @Neil @Myself Or probably I spoke too soon and misread the original problem as to find all lines that don't have an "a" followed by a "b".

  22. Adam Rosenfield says:

    @Kevin: There's actually now a Markdown specification and reference implementation called CommonMark http://commonmark.org/ that's intended to be the new standard Markdown.  Though it's still not any kind of official standard — it's not ISO/IEC/IETF/ANSI/ECMA/etc., nor was it produced by the original creator of Markdown, John Gruber.

  23. Wear says:

    @Jürgen @Neil That was my reading as well. I guess it could be "There is at least one 'a' without 'b's after it", in that case the "^[^aA]*" isn't needed.

  24. Funny How says:

    @RonO, umm sure, except it doesn't include the table which was the whole point of Raymond including it in the post in the first place.

  25. KC says:

    I've run into the same issue of owning an existing product that was developed before a standard existed, and having to figure out how to get standard-compliant behavior without breaking things.  It can be a difficult engineering problem.  In the case of findstr, maybe it would have been simple to add another option.  OTOH, I wouldn't be surprised if no one had touched the findstr code since long before any standardization work in this area.  I've worked in groups with a strong aversion to making changes to old stable code.  There is a theorem (old wives' tale?) that for every change in code, there is a non-zero probability of introducing a defect.

  26. Eric says:

    Bah.   Snobol is for losers.   Real programmers use http://www.cs.arizona.edu/… .  Snobol with real structured programming.

  27. Kevin says:

    @Adam: Within 24 hours of it going up, its name was changed twice, both times in response to objections from Gruber.  It no longer calls itself "Markdown."  I've yet to see any evidence that it has "caught on" in any significant sense.

  28. Jürgen says:

    @Neil No, the ^[^aA]* is not superfluous. If you leave it out "[aA][^bB]*$" would give a match for "aba".

  29. RonO says:

    @Funny How: You're correct. My mistake. :/

  30. j b says:

    @Wear,

    I guess the main "philosophical" difference between regexp-variants and Snobol is that Snobol encourages an "algorithmic" approach: You are explicitly led towards the way the interpreter processes your pattern.

    The regexp variants are more of the "predicate" approach, stating requirements to the result. People who love, say, XSLT and Prolog, are likely to love regexp unconditionally. In my student days, I took a course in Prolog, and later I have periodically been forced to work with XSLT. I never caught onto either. I have realized that my nature is the algorithmic way, rather than the predicative.

    One great advantage of Snobol's algorithmic approach: You can very easily and readably construct a complex pattern piece by piece, gluing more complex pattern structures together from named, simpler components, analogous to the way you gradually build a complex data structure from lower-level components. That is one of the most essential readability aspects. Those predicative regexp-languages may try to offer a small piece of this functionality, but if at all provided, it is usually limited, and few if any make use of it.

    The common way to write regexps is as one super-complex monolithic jungle. If you bring your digital machete to cut through the wilderness, you will most surely be chopping off some essential strand that keeps the whole thing alive… :-)
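
The workarounds proposed in comments 12, 15, 18 and 28 do not all answer the same question, and a small experiment makes the difference visible. The sketch below is in Python for illustration only: findstr itself is not invoked, Python's `re` stands in for findstr's matcher (the patterns use only character classes, `.`, `*`, `^` and `$`, which both understand), and the sample lines are invented.

```python
import re

# Hypothetical sample lines; "aba" is the interesting case.
lines = ["cat", "cab", "aba", "xyz", "banana", "CAT"]

def pipeline(lines):
    """Mimics: findstr a file.txt | findstr /V /r /c:"a.*b" (case-sensitive)."""
    with_a = [l for l in lines if re.search(r"a", l)]
    # /V keeps only the lines that do NOT match the pattern.
    return [l for l in with_a if not re.search(r"a.*b", l)]

def anchored(lines):
    """Mimics: findstr /R /C:"^[^aA]*[aA][^bB]*$" (the FIRST a has no b after it)."""
    return [l for l in lines if re.search(r"^[^aA]*[aA][^bB]*$", l)]

def unanchored(lines):
    """The shortened pattern: SOME a has no b after it."""
    return [l for l in lines if re.search(r"[aA][^bB]*$", l)]

print(pipeline(lines))    # ['cat', 'banana']
print(anchored(lines))    # ['cat', 'banana', 'CAT']
print(unanchored(lines))  # ['cat', 'aba', 'banana', 'CAT']
```

Only the unanchored pattern reports "aba", which is the same answer the customer's original lookahead would give in a flavor that supports it. The pipeline and the anchored pattern implement the stricter reading, "no a is followed by a b". The "CAT" difference comes from the explicit [aA] classes; the pipeline as written is case-sensitive.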



*DISCLAIMER: I DO NOT OWN THIS CONTENT. If you are the owner and would like it removed, please contact me. The content herein is an archived reproduction of entries from Raymond Chen's "Old New Thing" Blog. It may have slight formatting modifications for consistency and to improve readability.
