Regular expressions and the dreaded *? operator

Comments (10)

asdf says:

March 25, 2004 at 8:53 pm

That’s why I always use something like (?:".*?")>

ShowUsYour says:

March 25, 2004 at 8:56 pm
asdf says:

March 25, 2004 at 10:00 pm

Which of course didn’t apply to the thing you were talking about.
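A quick check of why the grouping makes no difference here. This is a minimal sketch using Python's `re` module (my choice of engine, not something named in the thread), with the sample string taken from the comment below:

```python
import re

text = '"hello"world">'

# Wrapping the quoted part in a non-capturing group changes nothing:
# the lazy .*? still expands until the trailing "> can match.
plain = re.search(r'".*?">', text).group()
grouped = re.search(r'(?:".*?")>', text).group()

print(plain)
print(grouped)
```

Both patterns return the whole string, because the group only affects how the pieces are bundled, not where the lazy quantifier is allowed to stop.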

Norman Diamond says:

March 25, 2004 at 10:24 pm

If *? really matched as few characters as necessary to make the pattern succeed, then the matched portion of

"hello"world">

would be

"world">

for either pattern

".*?">

or

"[^"]*">

The inclusion of

"hello

was due to left-handed greediness.

So ONLY smart people would be affected by this designer’s mistake. Dumb people wouldn’t be affected because we’ve used things like "[^"]*"> all our lives (or at least since the time Bell Labs started playing with a Honeywell computer) and we never learned about *?
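The "left-handed greediness" above can be seen by looking at where each match starts. A small sketch, assuming Python's `re` module as the engine (the variable names are mine):

```python
import re

text = '"hello"world">'

lazy = re.compile(r'".*?">')
negated = re.compile(r'"[^"]*">')

# The engine tries start positions left to right and keeps the first
# position from which the whole pattern can succeed.
lazy_span = lazy.search(text).span()        # succeeds from index 0
negated_span = negated.search(text).span()  # index 0 fails, index 6 works

# Forcing the lazy pattern to begin at the second quote gives the short match.
forced = lazy.search(text, 6).group()

print(lazy_span, negated_span, forced)
```

The lazy quantifier only minimizes the match length for a given start position; it never moves the start position to the right to find something shorter.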

Ben Hutchings says:

March 26, 2004 at 7:59 am

Matt: That’s what I was thinking, but everything after the "</div>" has to match as well, so the "(.*?)" could capture characters beyond the first "</div>". Some regex systems allow you to "commit" and disallow backtracking after matching up to some point in the regex, which you would want to do after the "</div>". In general you don’t have this option.
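To illustrate the point about the capture running past the first "</div>", here is a hedged sketch in Python. The sample markup and the trailing `<p>` requirement are made up for the example. The second pattern is not the "commit" feature itself but a substitute: a negative lookahead that forbids "</div>" inside the capture, which works in engines without atomic groups.

```python
import re

text = '<div>a</div><span>x</span><div>b</div><p>'

# Lazy capture: because <p> must follow, .*? is pushed past the first </div>.
lazy = re.search(r'<div>(.*?)</div><p>', text)
print(lazy.group(1))

# Each consumed character is checked first, so the capture can never
# contain </div>; the first div simply fails and the search moves on.
tempered = re.search(r'<div>((?:(?!</div>).)*)</div><p>', text)
print(tempered.group(1))
```

The lazy version captures a span covering both divs, while the lookahead version only matches the div that is really followed by `<p>`.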

Matt C. Wilson says:

March 26, 2004 at 8:17 am

True. I guess I was thinking about just matching <div> contents tag by tag. But a nested div would blow that up. It would be really nice if there were some kind of recursive operation that would allow you to say something like "give me everything inside <div></div>, with up to 2 nested <div></div> sequences." Is there such a beast, of which I am again unaware?

Then of course you have to pray that the html source is well-formed :)

Ben Hutchings says:

March 26, 2004 at 10:23 am

That sounds like a job for a relaxed HTML parser (such as IE and Moz use when they don’t see a DTD) and DOM. I don’t know that it’s possible to use just the parser and DOM from these without showing a page though.
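As one possible take on the parser idea, here is a sketch using `html.parser` from Python's standard library. That is an event-based, fairly tolerant parser rather than the browser parsers or a DOM mentioned above, and the `DivCollector` class and sample markup are hypothetical. It tracks nesting depth with a counter instead of a regex:

```python
from html.parser import HTMLParser


class DivCollector(HTMLParser):
    """Collects the text inside each outermost <div>, nested divs included."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # how many <div> elements are currently open
        self.current = []     # text pieces of the outermost div being read
        self.results = []     # one string per completed outermost div

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.current))
                self.current = []

    def handle_data(self, data):
        if self.depth > 0:
            self.current.append(data)


collector = DivCollector()
collector.feed("<div>outer <div>inner</div> tail</div><p>skip</p><div>second</div>")
collector.close()
print(collector.results)
```

Because the depth is an ordinary integer, a limit such as "at most 2 nested divs" could be enforced by checking `self.depth` in `handle_starttag`.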

foo says:

March 26, 2004 at 5:19 am

I use things like "[^"]*"> all the time. I know about *?, but never found it useful. IIRC, it’s called "lazy quantifier" in the Perl world.

Can somebody provide an interesting example when you would want to use *?, maybe because an equivalent regex with greedy quantifiers is more complex?
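One commonly cited candidate is a terminator longer than one character, where a negated character class no longer does the job on its own. The sketch below uses C-style comments as the example (my choice, not from the thread) and assumes Python's `re` module. The greedy-only pattern is a well-known hand-written alternative:

```python
import re

source = "a /* one */ b /* two **/ c"

# Lazy version: stop at the first */ after each /*.
lazy = re.findall(r'/\*.*?\*/', source, flags=re.DOTALL)

# Greedy-only version: the two-character terminator has to be
# spelled out by hand with character classes.
greedy = re.findall(r'/\*[^*]*\*+(?:[^/*][^*]*\*+)*/', source)

print(lazy)
print(greedy)
```

Both find the same two comments, but the lazy pattern is considerably easier to read and to get right.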

Matt C. Wilson says:

March 26, 2004 at 7:02 am

Count me in the ranks of the dumb then. I never knew *? existed. I always use [^"]* syntax.

I can see this being really helpful parsing html, as in

<div>(.*?)</div> as a search target.

Ben Hutchings says:

March 26, 2004 at 7:55 am

I’ve been scraping data off web pages and using ".*?" to skip over uninteresting bits that may contain many tags. I’ve since stopped doing that and changed the regexes as I get around to it, because it’s quite dangerous.

First, if you’re doing a global search to find multiple matches in a page, and the regex isn’t quite right, then "A.*?B" (where A and B are themselves regexes) can match the A part of one instance, fail to match the B part of that instance and so end up matching the B part of the next instance, so it captures a mixture of two records! (This is just a worse version of what Raymond pointed out.)

Second, a search using a regex with multiple lazy parts can take a very long time to fail, at least in Python, due to backtracking. (This can happen with multiple greedy parts but seems not to do so in general; I don’t understand the theory well enough to see why this might be.)
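The first danger, "A.*?B" capturing a mixture of two records, can be reproduced with a small sketch. The markup below is invented for illustration, and Python's `re` module is assumed. The second record has no `<i>` part, so the lazy gap runs on into the third record:

```python
import re

page = (
    "<li><b>apple</b> <i>1.00</i></li>"
    "<li><b>pear</b></li>"
    "<li><b>plum</b> <i>3.00</i></li>"
)

# Global search with a lazy gap between the two interesting parts.
pairs = re.findall(r'<b>(.*?)</b>.*?<i>(.*?)</i>', page)
print(pairs)

# One way to contain the damage: cut the page into records first,
# then match inside each record so nothing can leak across.
safer = []
for record in re.findall(r'<li>(.*?)</li>', page):
    m = re.search(r'<b>(.*?)</b>\s*<i>(.*?)</i>', record)
    if m:
        safer.append(m.groups())
print(safer)
```

The global search pairs "pear" with the price that belongs to "plum", and "plum" itself is lost. The per-record version skips the incomplete record instead.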


Date:	March 25, 2004 / year-entry #116
Tags:	code
Orig Link:	https://blogs.msdn.microsoft.com/oldnewthing/20040325-00/?p=40073
Comments:	10
Summary:	The regular expression *? operator means "Match as few characters as necessary to make this pattern succeed." But look at what happens when you mix it up a bit: ".*?" This pattern matches a quoted string containing no embedded quotes. This works because the first quotation mark starts the string, the .*? gobbles up everything...