Re: [Tutor] Regular expression - I

Albert-Jan Roskam Tue, 18 Feb 2014 12:05:14 -0800


_____________________________
> From: Steve Willoughby <st...@alchemy.com>
>To: Santosh Kumar <rhce....@gmail.com> 
>Cc: python mail list <tutor@python.org> 
>Sent: Tuesday, February 18, 2014 7:03 PM
>Subject: Re: [Tutor] Regular expression - I
> 
>
>Because the regular expression <H*> means “match an angle-bracket character, 
>zero or more H characters, followed by a close angle-bracket character” and 
>your string does not match that pattern.
>
>This is why it’s best to check that the match succeeded before going ahead to 
>call group() on the result (since in this case there is no result).
>
>
>On 18-Feb-2014, at 09:52, Santosh Kumar <rhce....@gmail.com> wrote:



You might also want to consider making it a non-greedy match. The explanation at
http://docs.python.org/2/howto/regex.html covers an example almost identical to
yours:

Greedy versus Non-Greedy
When repeating a regular expression, as in a*, the resulting action is to
consume as much of the pattern as possible.  This fact often bites you when
you’re trying to match a pair of balanced delimiters, such as the angle brackets
surrounding an HTML tag.  The naive pattern for matching a single HTML tag
doesn’t work because of the greedy nature of .*.
>>> s = '<html><head><title>Title</title>'
>>> len(s)
32
>>> print re.match('<.*>', s).span()
(0, 32)
>>> print re.match('<.*>', s).group()
<html><head><title>Title</title>
The RE matches the '<' in <html>, and the .* consumes the rest of
the string.  There’s still more left in the RE, though, and the > can’t
match at the end of the string, so the regular expression engine has to
backtrack character by character until it finds a match for the >.   The
final match extends from the '<' in <html> to the '>' in </title>, which isn’t 
what you want.
In this case, the solution is to use the non-greedy qualifiers *?, +?, ??, or 
{m,n}?, which match as little text as possible.  In the above
example, the '>' is tried immediately after the first '<' matches, and
when it fails, the engine advances a character at a time, retrying the '>' at 
every step.  This produces just the right result:
>>> print re.match('<.*?>', s).group()
<html>
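
To tie this back to Steve's point about checking the match before calling
group(), here is a minimal sketch. It reuses the string s from the HOWTO
excerpt, since the string from the original question is not quoted in this
message. The helper name first_tag is made up for illustration, and print()
is written as a function so the snippet also runs on Python 3.

```python
import re

s = '<html><head><title>Title</title>'

def first_tag(pattern, text):
    """Return the text matched at the start of `text`, or None if no match."""
    m = re.match(pattern, text)
    # re.match returns None on failure, so test before calling group()
    return m.group() if m is not None else None

greedy = first_tag(r'<.*>', s)       # .* runs to the last '>' in the string
non_greedy = first_tag(r'<.*?>', s)  # .*? stops at the first '>'
no_match = first_tag(r'<H*>', s)     # 'h' is neither 'H' nor '>', so no match

print(greedy)
print(non_greedy)
print(no_match)
```

Calling .group() directly on the third result would raise AttributeError,
because there is no match object to call it on.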
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
