Why can't I get my regular expression pattern to match words that begin with %?

Match    No match
%1       %1b
:%1:     x%1

Pattern   String   Result     Expected
\b%1\b    %1       No match   Match
\b%1\b    :%1:     No match   Match
\b%1\b    x%1      Match      No match
^..$      %1       Match
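
The results in the table above can be reproduced with a short script. This is only a sketch: it assumes Python's re module, since the customer's regular expression engine is not named, and the names CASES and run_cases are made up for illustration.

```python
import re

# (pattern, string) pairs taken from the table above.
CASES = [
    (r"\b%1\b", "%1"),
    (r"\b%1\b", ":%1:"),
    (r"\b%1\b", "x%1"),
    (r"^..$", "%1"),
]

def run_cases(cases=CASES):
    """Return 'Match' or 'No match' for each (pattern, string) pair."""
    return ["Match" if re.search(p, s) else "No match" for p, s in cases]

if __name__ == "__main__":
    for (pattern, string), result in zip(CASES, run_cases()):
        print(f"{pattern:8} {string:6} {result}")
```

In this engine, \b matches only where a \w character sits next to a non-\w character or the edge of the string. The percent sign is not a \w character, so the leading \b can succeed only when a letter, digit, or underscore comes right before the %.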

Comments (20)

Marquess says:

April 23, 2010 at 7:26 am

As the old adage goes:

“Some people, when confronted with a problem, think "I know, I’ll use regular expressions." Now they have two problems.”
nathan_works says:

April 23, 2010 at 9:30 am

Was beaten to it by Marquess.. RegEx "work", but getting it right can be a non-trivial pain..
Maurits [MSFT] says:

April 23, 2010 at 10:21 am

I meant \B%1\b of course.
Blake says:

April 23, 2010 at 12:55 pm

The real problem is the common one of under-specification. Without a clearer definition of "standalone word" in the customer’s problem domain, any proposed solution is a crapshoot.
HATE says:

April 23, 2010 at 1:27 pm

Every time I hear the word RegEx I reach for my gun.
arnshea says:

April 23, 2010 at 2:00 pm

I keep regexlib.com in my link bar just for such occasions. I do have to admit to loving regular expressions though.
Joseph Koss says:

April 23, 2010 at 4:08 pm

RegEx is very powerful.. so much so that it is easy to fall into the trap of using it.

Not Microsoft bashing here, but there was a recent security exploit for Explorer involving its dynamic generation of RegEx expressions at runtime.

You read that correctly. The dynamic generation of RegEx expressions at runtime.

I will now repeat..

RegEx is very powerful.. so much so that it is easy to fall into the trap of using it.
Maurits [MSFT] says:

April 23, 2010 at 10:19 am

Do they really want to match things like "-%1" but not "a%1"?

If so this works*: \B%1 where \B means "not a word boundary".

I suspect not, but I can’t make a guess as to their real requirements without knowing more about the situation.

* It works in Perl 5.8, anyway; every regular expression implementation seems to have its own rules.
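
For anyone who wants to try the "not a word boundary" idea, here is a minimal sketch. It assumes Python's re module rather than Perl 5.8, so results in other engines may differ, and the names LEADING_ONLY, BOTH_ENDS, and found are hypothetical.

```python
import re

# \B asserts "not a word boundary" in front of the percent sign.
LEADING_ONLY = re.compile(r"\B%1")    # the pattern as first posted
BOTH_ENDS = re.compile(r"\B%1\b")     # with the trailing \b from the correction

def found(pattern, text):
    """True if the compiled pattern matches anywhere in text."""
    return pattern.search(text) is not None

if __name__ == "__main__":
    for text in ["%1", ":%1:", "-%1", "a%1", "x%1", "%1b"]:
        print(f"{text:6} {found(LEADING_ONLY, text)!s:6} {found(BOTH_ENDS, text)}")
```

Under these assumptions the corrected pattern accepts %1 and :%1: and rejects x%1 and %1b, which lines up with the customer's four examples. Whether that is what the customer means by "standalone word" is still an open question.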
Clovis says:

April 24, 2010 at 1:26 am

"A customer asked for help writing a regular expression" – wow. There isn’t enough time in the day for that sort of helpless customer. Tell them to Google for it or to switch to Linux.
Marquess says:

April 24, 2010 at 2:45 am

“Tell them to […] switch to Linux.”

Well, now they have *three* problems. How exactly does switching to Linux help with regexes?
Artem says:

April 24, 2010 at 11:54 am

"Therefore, in order for the \b that comes before it to match, the character before the percent sign must be a \W."

According to the expression "\b%1\b", % matches \W, so the character before the percent sign should be \w.
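
The rule being discussed here can be spelled out by hand. The sketch below is an illustration, not any engine's actual implementation: it assumes Python, and is_word_char and is_word_boundary are hypothetical helpers that treat a position as a boundary when exactly one of its two neighbours is a \w character.

```python
import re

def is_word_char(ch):
    """True if ch is a single \\w character (letter, digit, or underscore)."""
    return re.fullmatch(r"\w", ch) is not None

def is_word_boundary(text, pos):
    """True if \\b would match just before text[pos].

    The edges of the string count as non-word characters.
    """
    before = pos > 0 and is_word_char(text[pos - 1])
    after = pos < len(text) and is_word_char(text[pos])
    return before != after

if __name__ == "__main__":
    # Position 1 is just before the percent sign in both strings.
    print(is_word_boundary("x%1", 1))   # x is \w, % is not
    print(is_word_boundary(":%1:", 1))  # neither : nor % is \w
```

Because % is never a \w character, the position just before it is a boundary only when the preceding character is a \w character.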
Clovis says:

April 25, 2010 at 12:32 am

@Marquess – "How exactly does switching to Linux help…" – by freeing up the MS customer lines for people with useful questions.
Kasper Henriksen says:

April 25, 2010 at 1:51 am

Coming from Perl, what really puzzles me is why no one bothered to try /(^|\s)%1($|\s)/ (aaah, line noise!). Or is matching whitespace forbidden magic that should never be used under any circumstance?
Cheong says:

April 25, 2010 at 10:12 pm

Clovis: If this were paid support, then easy questions would be welcome, of course.

Btw, I wasn’t aware that I could ask this kind of question in Microsoft Support. I would have tried RegEx groups in programming forums first.

That said, I’ve found that lots of RegEx groups are getting less traffic than they used to…
Maurits [MSFT] says:

April 26, 2010 at 7:38 am

why no one bothered to try /(^|\s)%1($|\s)/

That works for one particular definition of "standalone word", but what if it’s wrapped in quotes?
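
A quick sketch of that whitespace-delimited definition, assuming Python's re module in place of Perl; the names WHITESPACE_DELIMITED and is_standalone are made up for illustration.

```python
import re

# %1 must be preceded by start-of-string or whitespace,
# and followed by end-of-string or whitespace.
WHITESPACE_DELIMITED = re.compile(r"(^|\s)%1($|\s)")

def is_standalone(text):
    """True if text contains %1 delimited by whitespace or string edges."""
    return WHITESPACE_DELIMITED.search(text) is not None

if __name__ == "__main__":
    for text in ["%1", "copy %1 here", '"%1"', ":%1:", "x%1", "%1b"]:
        print(f"{text!r:16} {is_standalone(text)}")
```

Under this definition the quoted form and :%1: are both rejected, and the customer listed :%1: as a string that should match.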
Bulletmagnet says:

April 26, 2010 at 7:45 am

Help is available:

http://xkcd.com/208/
James Schend says:

April 26, 2010 at 8:57 am

nathan_works: The issue is that people use RegEx to solve problems that RegEx can’t solve (correctly.) For example, you can’t correctly validate an email address using RegEx. If your RegEx does anything more than just checking for an @ symbol, it’s probably far too strict and rejecting perfectly valid email addresses.

The "correct" RegEx for validating an email address is something like 3 pages long– I wish I could find the link. Writing it with traditional code is much easier, quicker, and more explainable to future maintainers.

Also remember that validating an email address also includes validating a domain name, another task which is… well… painful at best to do in RegEx alone.
James Schend says:

April 26, 2010 at 8:59 am

Oh, I should also mention that if you accept local domain email addresses, even the @ is optional. :) In theory at least, my co-workers can email me using just "james.schend" without anything else, since we’re on the same domain.

Anyway, what it boils down to is stop rejecting email addresses with "+" in them. Huge pet peeve.
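
To make the "just check for an @" position concrete, here is a deliberately permissive sketch in Python. The function name looks_like_email is hypothetical, and the rule it applies (something on each side of the last @) is an assumption for illustration, not a statement of what any mail standard requires.

```python
def looks_like_email(address):
    """Permissive check: non-empty text on both sides of the last @.

    Everything else, including "+" in the local part, is allowed.
    """
    local, sep, domain = address.rpartition("@")
    return bool(sep) and bool(local) and bool(domain)

if __name__ == "__main__":
    print(looks_like_email("james+blog@example.com"))
    print(looks_like_email("james.schend"))
```

Note that this sketch rejects the bare "james.schend" form, so it is already stricter than the local-domain case described above.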
Worf says:

April 26, 2010 at 10:18 pm

Same goes for parsing IPv4 addresses. It’s not 3 pages long, but it’s a good page or so of dense gobbledygook that’s unmaintainable.

Especially when you consider how many ways you can enter an IPv4 address.
Maurits [MSFT] says:

April 27, 2010 at 12:26 pm

parsing IPv4 addresses

See Raymond’s blog entry "How do I write a regular expression that matches an IPv4 dotted address?" http://blogs.msdn.com/oldnewthing/archive/2006/05/22/603788.aspx
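
One way to avoid a page of dense pattern is to let the regular expression handle only the shape and do the range check in ordinary code. The sketch below assumes Python and covers only the four-part dotted decimal form, not the other ways of writing an IPv4 address; DOTTED and is_dotted_ipv4 are hypothetical names.

```python
import re

# Shape only: four groups of one to three ASCII digits separated by dots.
DOTTED = re.compile(r"([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})")

def is_dotted_ipv4(text):
    """True if text is four dot-separated decimal numbers, each 0 to 255."""
    match = DOTTED.fullmatch(text)
    if match is None:
        return False
    # The numeric range check is done in code rather than in the pattern.
    return all(int(part) <= 255 for part in match.groups())

if __name__ == "__main__":
    for text in ["192.168.0.1", "256.1.1.1", "1.2.3"]:
        print(f"{text:12} {is_dotted_ipv4(text)}")
```

This sketch accepts leading zeros such as 01.2.3.4; whether that is desirable depends on the application.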

Date:	April 23, 2010 / year-entry #119
Tags:	code
Orig Link:	https://blogs.msdn.microsoft.com/oldnewthing/20100423-00/?p=14263
Comments:	20
Summary:	A customer asked for help writing a regular expression that, in the customer's words, matched the string %1 when it appeared as a standalone word. Match No match %1 %1b :%1: x%1 One of the things that people often forget to do when asking a question is to describe the things that they tried and...