_____________________________
>From: Steve Willoughby <st...@alchemy.com>
>To: Santosh Kumar <rhce....@gmail.com>
>Cc: python mail list <tutor@python.org>
>Sent: Tuesday, February 18, 2014 7:03 PM
>Subject: Re: [Tutor] Regular expression - I
>
>
>Because the regular expression <H*> means “match an angle-bracket character,
>zero or more H characters, followed by a close angle-bracket character” and
>your string does not match that pattern.
>
>This is why it’s best to check that the match succeeded before going ahead to
>call group() on the result (since in this case there is no result).
>
>
>On 18-Feb-2014, at 09:52, Santosh Kumar <rhce....@gmail.com> wrote:
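Steve's advice about checking the match before calling group() might look something like the sketch here. The sample string and the helper name first_group are made up for illustration, since the original string isn't shown in this message; only the pattern <H*> comes from the thread. It is written to run on both Python 2 and 3.

```python
import re

# Hypothetical sample text; the original poster's string is not shown here.
text = '<H1>heading</H1>'


def first_group(pattern, string):
    """Return the text matched at the start of string, or None if no match."""
    m = re.match(pattern, string)
    if m is None:
        # re.match returns None on failure, so calling m.group() here
        # would raise AttributeError.
        return None
    return m.group()


# Fails: after '<H' the next character is '1', not '>'.
print(first_group('<H*>', text))

# Succeeds: '<', two H characters, then '>'.
print(first_group('<H*>', '<HH>'))
```

The first call prints None instead of blowing up, which is the whole point of testing the result before using it.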
You also might want to consider making it a non-greedy match. The explanation http://docs.python.org/2/howto/regex.html covers an example almost identical to yours:

Greedy versus Non-Greedy

When repeating a regular expression, as in a*, the resulting action is to consume as much of the pattern as possible. This fact often bites you when you’re trying to match a pair of balanced delimiters, such as the angle brackets surrounding an HTML tag. The naive pattern for matching a single HTML tag doesn’t work because of the greedy nature of .*.

>>> s = '<html><head><title>Title</title>'
>>> len(s)
32
>>> print re.match('<.*>', s).span()
(0, 32)
>>> print re.match('<.*>', s).group()
<html><head><title>Title</title>

The RE matches the '<' in <html>, and the .* consumes the rest of the string. There’s still more left in the RE, though, and the > can’t match at the end of the string, so the regular expression engine has to backtrack character by character until it finds a match for the >. The final match extends from the '<' in <html> to the '>' in </title>, which isn’t what you want.

In this case, the solution is to use the non-greedy qualifiers *?, +?, ??, or {m,n}?, which match as little text as possible. In the above example, the '>' is tried immediately after the first '<' matches, and when it fails, the engine advances a character at a time, retrying the '>' at every step. This produces just the right result:

>>> print re.match('<.*?>', s).group()
<html>
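If you want to try the greedy and non-greedy versions side by side, a small sketch along these lines should do. It reuses the string s from the HOWTO excerpt; the variable names greedy, lazy and tags are just mine, and the findall call is an extra I added to show the non-greedy pattern picking out each tag in turn. It uses print() with a single argument so it runs on Python 2 and 3.

```python
import re

s = '<html><head><title>Title</title>'

# Greedy: .* grabs as much as it can, then backtracks to the last '>'.
greedy = re.match('<.*>', s)

# Non-greedy: .*? grabs as little as it can, stopping at the first '>'.
lazy = re.match('<.*?>', s)

# Both patterns match this string, but check anyway before using the result.
if greedy is not None and lazy is not None:
    print(greedy.span())   # covers the whole string
    print(lazy.span())     # covers only the first tag
    print(lazy.group())

# findall with the non-greedy pattern returns every tag separately.
tags = re.findall('<.*?>', s)
print(tags)
```

The non-greedy match stops at <html>, and findall gives back the four tags one by one rather than a single long match.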