Date: | April 23, 2010 / year-entry #119 |
Tags: | code |
Orig Link: | https://blogs.msdn.microsoft.com/oldnewthing/20100423-00/?p=14263 |
Comments: | 20 |
Summary: | A customer asked for help writing a regular expression that, in the customer's words, matched the string %1 when it appeared as a standalone word. Match No match %1 %1b :%1: x%1 One of the things that people often forget to do when asking a question is to describe the things that they tried and... |
A customer asked for help writing a regular expression that, in the customer's words, matched the string %1 when it appeared as a standalone word.

| Match | No match |
| %1 | %1b |
| :%1: | x%1 |
One of the things that people often forget to do when asking a question is to describe the things that they tried and what the results were. This is important information to include, because it saves the people who try to answer the question from wasting their time repeating the things that you already tried.
That last entry was just to make sure that the test app was working, a valuable step when chasing a problem: First, make sure the problem is where you think it is. If the ^..$ hadn't worked, then the problem would not have been with the regular expression but with some other part of the program.

"Is the \b operator broken?"

No, the \b operator is working just fine. The problem is that the \b operator doesn't do what you think it does.

For those not familiar with this notation, well, first you were probably confused by the \b in the original question and skipped the rest of this article. Anyway, \w matches A through Z (either uppercase or lowercase), a digit 0 through 9, or an underscore. (It's actually more complicated than that, but the above description is good enough for the current discussion.) By contrast, \W matches every other character. And in regular expression speak, a "word" is a maximal contiguous string of \w characters.

Finally, the \b operator matches the location between a \w and a \W, treating the beginning and end of the string as an invisible \W. I will stop mentioning the pretend \W at the ends of the string; just mentally insert them where applicable.

Okay, let's go back to the original regular expression of \b%1\b. Notice that the percent sign is not one of the things which is matched by \w. Therefore, in order for the \b that comes before it to match, the character before the percent sign must be a \w. That way, the \b comes between a \w and a \W. The pattern \b%1\b means "A percent sign which comes after a \w, followed by a 1 which comes before a \W."

Looking at it another way, the string %1 breaks down like this:

% \b 1 \b
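To see where the word boundaries actually fall, here is a minimal sketch using Python's `re` module. The article does not say which regular expression engine the customer used, so Python is an assumption here, and `boundary_positions` is a hypothetical helper written only for this illustration. Python's `\b` follows the same \w versus \W rule described above.

```python
import re

def boundary_positions(text):
    # Collect every index where the zero-width \b assertion matches.
    return [m.start() for m in re.finditer(r"\b", text)]

# Index 0 is before the %, index 1 is between % and 1, index 2 is the end.
positions = boundary_positions("%1")
print(positions)

# The customer's original pattern, tried against two of the sample strings.
original = re.compile(r"\b%1\b")
print(original.search("%1") is not None)   # standalone %1
print(original.search("x%1") is not None)  # %1 glued to a preceding letter
```

Under these assumptions the boundaries land at indices 1 and 2 but not at 0, so the original pattern misses the standalone %1 and instead accepts x%1, the opposite of what the customer wanted.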
There is a \b between the % and the 1 and another one between the 1 and the end of the string, but there is no \b before the percent sign, because that location has \W on both sides.

The question started off on the wrong foot: You are having trouble writing a regular expression that matches a word that begins with % because there are no words which begin with %. The percent sign is not a \w and therefore cannot be part of a word.

What the customer is looking for is something more like (?<!\w)%1\b, a regular expression which means a percent sign not preceded by a \w, followed by a 1 which comes before a \W.

The customer realized the mistake once it was pointed out. "I keep forgetting that I can't get % included in \w just because I want it to."

Michael Kaplan covered this same topic some time ago.
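As a quick check of the (?<!\w)%1\b suggestion, the sketch below runs it against the customer's four sample strings. It uses Python's `re` module, which is an assumption since the article does not name the engine; negative lookbehind syntax and support vary between implementations. The names `pattern`, `cases`, and `results` are illustrative only.

```python
import re

# A percent sign not preceded by a \w, then 1, then a word boundary.
pattern = re.compile(r"(?<!\w)%1\b")

# Expected outcomes taken from the customer's Match / No match table.
cases = {
    "%1": True,
    ":%1:": True,
    "%1b": False,
    "x%1": False,
}

# Record whether the pattern is found anywhere in each sample string.
results = {text: pattern.search(text) is not None for text in cases}

for text, matched in results.items():
    print(f"{text!r}: {'match' if matched else 'no match'}")
```

In this engine the results agree with the customer's table for all four strings. Whether that is the right definition of "standalone word" for other inputs, such as %1 wrapped in quotes, still depends on the customer's actual requirements.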
Comments (20)
As the old adage goes:
“Some people, when confronted with a problem, think "I know, I’ll use regular expressions." Now they have two problems.”
Was beaten to it by Marquess.. RegEx "work", but getting it right can be a non-trivial pain..
I meant \B%1\b of course.
The real problem is the common one of under-specification. Without a clearer definition of "standalone word" in the customer’s problem domain, any proposed solution is a crapshoot.
Every time I hear the word RegEx I reach for my gun.
I keep regexlib.com in my link bar just for such occasions. I do have to admit to loving regular expressions though.
RegEx is very powerful.. so much so that it is easy to fall into the trap of using it.
Not Microsoft bashing here, but there was a recent security exploit for Explorer involving its dynamic generation of RegEx expressions at runtime.
You read that correctly. The dynamic generation of RegEx expressions at runtime.
I will now repeat..
RegEx is very powerful.. so much so that it is easy to fall into the trap of using it.
Do they really want to match things like "-%1" but not "a%1"?
If so this works*: \B%1 where \B means "not a word boundary".
I suspect not, but I can’t make a guess as to their real requirements without knowing more about the situation.
* It works in Perl 5.8, anyway; every regular expression implementation seems to have their own rules.
"A customer asked for help writing a regular expression" – wow. There isn’t enough time in the day for that sort of helpless customer. Tell them to Google for it or to switch to Linux.
“Tell them to […] switch to Linux.”
Well, now they have *three* problems. How exactly does switching to Linux help with regexes?
"Therefore, in order for the b that comes before it to match, the character before the percent sign must be a W."
According to the expression "\b%1\b", % matches \W, so the character before the percent sign should be \w.
@Marquess – "How exactly does switching to Linux help…" – by freeing up the MS customer lines for people with useful questions.
Coming from Perl, what really puzzles me is why no one bothered to try /(^|\s)%1($|\s)/ (aaah, line noise!). Or is matching whitespace forbidden magic that should never be used under any circumstance?
Clovis: If this were paid support, then the easy questions are welcomed of course.
Btw, I’m not aware that I can ask this kind of question in Microsoft Support. I should have tried RegEx groups in programming forums first.
That said, I found that lots of RegEx groups are getting less traffic than they used to…
That works for one particular definition of "standalone word", but what if it’s wrapped in quotes?
Help is available:
http://xkcd.com/208/
nathan_works: The issue is that people use RegEx to solve problems that RegEx can’t solve (correctly.) For example, you can’t correctly validate an email address using RegEx. If your RegEx does anything more than just checking for an @ symbol, it’s probably far too strict and rejecting perfectly valid email addresses.
The "correct" RegEx for validating an email address is something like 3 pages long– I wish I could find the link. Writing it with traditional code is much easier, quicker, and more explainable to future maintainers.
Also remember that validating an email address also includes validating a domain name, another task which is… well… painful at best to do in RegEx alone.
Oh, I should also mention that if you accept local domain email addresses, even the @ is optional. :) In theory at least, my co-workers can email me using just "james.schend" without anything else, since we’re on the same domain.
Anyway, what it boils down to is stop rejecting email addresses with "+" in them. Huge pet peeve.
Same goes for parsing IPv4 addresses. It’s not 3 pages long, but it’s a good page or so of dense gobbledygook that’s unmaintainable.
Especially when you consider how many ways you can enter an IPv4 address.
See Raymond’s blog entry "How do I write a regular expression that matches an IPv4 dotted address?" http://blogs.msdn.com/oldnewthing/archive/2006/05/22/603788.aspx