[Tutor] Question regular expressions - the non-greedy pattern

2013-01-21 Thread Marcin Mleczko
Hello,

in the howto (http://docs.python.org/2/howto/regex.html#regex-howto)
there are code examples near the end of the page  (the non-greedy
pattern) I am referring to:

s = 'Title'
>>> len(s)
32
>>> print re.match('<.*>', s).span()
(0, 32)
>>> print re.match('<.*>', s).group()
Title

print re.match('<.*?>', s).group() #<- I'm referring to this


So far everything is fine.

Now I'm changing the input string to (adding an extra '<'):

s = '', s).group()
I would expect to get the same result



as I'm using the non-greedy pattern. What I get is

<

Did I get the concept of non-greedy wrong or is this really a bug?
I've treid this with
python -V
Python 2.7.3
on Win 7 64 Bit as well as Ubuntu 64 bit.

I'd be glad to here from you soon.

Thank's a lot for your effort.

best regards

Marcin
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Question regular expressions - the non-greedy pattern

2013-01-21 Thread Hugo Arts
On Mon, Jan 21, 2013 at 3:45 PM, Marcin Mleczko wrote:

> Now I'm changing the input string to (adding an extra '<'):
>
> s = '
> and evoking the last command again:
>
> print re.match('<.*?>', s).group()
> I would expect to get the same result
>
> 
>
> as I'm using the non-greedy pattern. What I get is
>
> <
>
> Did I get the concept of non-greedy wrong or is this really a bug?
>

No, this is not a bug. Note first that you are using re.match, which only
tries to match from the beginning of the string. If you want to match
anywhere inside the string, you should use re.search, which returns the
first match found. However even re.search will still return '<' since
that *is* a valid match of the regular expression  '<.*?>', and re.search
returns the first match it finds.

in essence, re.search first tries calling match(regex, s), then
match(regex, s[1:]), then match(regex, s[2:]) and so on and so on, moving
on one character at the time until the regular expression produces a match.
Since the regex produces a match on the first character, matching on the
second isn't even tried.

It is true that non-greedy matching will try to match the fewest number of
characters possible. However, it will not cause the regular expression
engine to backtrack, i.e. go back on parts of the pattern already matched
and match them elsewhere to try and see if that produces a shorter match.
If a greedy variant of a regex matches, then the non-greedy variant *will*
also match at the same place. The only difference is the length of the
result.

more generally, regexes can not parse HTML fully since they simply lack the
power. HTML is just not a regular language. If you want to parse arbitrary
HTML documents, or even sufficiently complex HTML documents you should get
a real HTML parser library (python includes one, check the docs). If you
just want to grab some data from HTML tags it's probably ok to use regexes
though, if you're careful.

HTH,
Hugo
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Question regular expressions - the non-greedy pattern

2013-01-21 Thread Walter Prins
Hi,



On 21 January 2013 14:45, Marcin Mleczko  wrote:

> Did I get the concept of non-greedy wrong or is this really a bug?


Hugo's already explained the essence of your problem, but just to
add/reiterate:

a) match() will match at the beginning of the string (first character) or
not at all.  As specified your regex does in fact match from the first
character as shown so the result is correct.  (Aside, "" in "<"
does not in fact match *from the beginning of the string* so is besides the
point for the match() call.)

b) Changing your regexp so that the body of the tag *cannot* contain "<",
and then using search() instead, will fix your specific case for you:

import re

s = '

Re: [Tutor] Question regular expressions - the non-greedy pattern

2013-01-21 Thread Steven D'Aprano

On 22/01/13 01:45, Marcin Mleczko wrote:


Now I'm changing the input string to (adding an extra '<'):

s = '', s).group()
I would expect to get the same result



as I'm using the non-greedy pattern. What I get is

<

Did I get the concept of non-greedy wrong or is this really a bug?



Definitely not a bug.


Your regex says:

"Match from the beginning of the string: less-than sign, then everything
up to the FIRST (non-greedy) greater-than sign."

So it matches the "<" at the beginning of the string, followed by the
"".


To get the result you are after, you could do this:

# Match two < signs, but only report from the second on
re.match('<(<.*?>)', s).group(1)


# Skip the first character
re.match('<.*?>', s[1:]).group()


# Don't match on < inside the <> tags
re.search('<[^<]*?>', s).group()


Notice that the last example must use re.search, not re.match,
because it does not match the beginning of the string.



By the way, you cannot parse general purpose HTML with a regular
expressions. You really should learn how to use Python's html
parsers, rather than trying to gerry-rig something that will do a
dodgy job.




--
Steven
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Question regular expressions - the non-greedy pattern

2013-01-21 Thread Marcin Mleczko
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hello Hugo, hello Walter,

first thank you very much for the quick reply.

The functions used here i.e. re.match() are taken directly form the
example in the mentioned HowTo. I'd rather use re.findall() but I
think the general interpretetion of the given regexp sould be nearly
the same in both functions.

So I'd like to neglect the choise of a particular function for a
moment a concentrate on the pure theory.
What I got so far:
in theory form s = '' would match '' '' '' ''
to achieve this the engine should:
1. walk forward along the text until it finds <
2. walk forward from that point until in finds >
3. walk backward form that point (the one of >) until it finds <
4. return the string between < from 3. and > from 2. as this gives the
least possible string between < and >

Did I get this right so far? Is this (=least possible string between <
and >), what non-greedy really translates to?

For some reason, I did not get so far the regexp engine in Python
omits step 3. and returns the string between < from 1. and > from 2.
resulting in '<'

Am I right? If so, is there an easily graspable reason for the engine
designers to implement it this way?

If I'm wrong, where is my fault?

Marcin

Am 21.01.2013 17:23, schrieb Walter Prins:
> Hi,
> 
> 
> 
> On 21 January 2013 14:45, Marcin Mleczko  > wrote:
> 
> Did I get the concept of non-greedy wrong or is this really a bug?
> 
> 
> Hugo's already explained the essence of your problem, but just to 
> add/reiterate:
> 
> a) match() will match at the beginning of the string (first
> character) or not at all.  As specified your regex does in fact
> match from the first character as shown so the result is correct.
> (Aside, "" in "<" does not in fact match *from the
> beginning of the string* so is besides the point for the match()
> call.)
> 
> b) Changing your regexp so that the body of the tag *cannot*
> contain "<", and then using search() instead, will fix your
> specific case for you:
> 
> import re
> 
> s = ' 
> matchobj = re.match(tag_regex, s) print "re.match() result:",
> matchobj # prints None since no match at start of s
> 
> matchobj = re.search(tag_regex, s) # prints something since regex
> matches at index 1 of string print "re.search() result:\n", print
> "span:", matchobj.span() print "group:", matchobj.group()
> 
> 
> Walter
> 
> 
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.17 (MingW32)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQEcBAEBAgAGBQJQ/cs4AAoJEDAt44dGkgj1CSUH/iT7b7jKafu8ugXGlNiLtISy
Abt6GcAZuwxeuokH7dna4FGA54x5BZzjrglu+VWrRJx8hsherL04Qt216V725Tpx
SN4IgLtK+AYAuhI73iBvyWK51vOTkWDzLrs6DYjNEWohw+n9QEtZVEkgMej/p760
6YDs8lbrHxVqUGiFTQr+vpCb6W85sOr+RlfkBsFibC3S17wRNVtaYWITc85I5Dfr
lLBh2kPzi9ITKPIFag4GRNzj1rWtp0NUGGAjyhmgijdl2GbiCLAGteJGoUvajOa1
889UuPItCi4zVJ5PJv0PDej8eD0ppd+k0rRHQK3SgaSgtTDgviGOvs3Ch4A9/Sk=
=Qo8U
-END PGP SIGNATURE-
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Question regular expressions - the non-greedy pattern

2013-01-21 Thread Steven D'Aprano

On 22/01/13 10:11, Marcin Mleczko wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hello Hugo, hello Walter,

first thank you very much for the quick reply.

The functions used here i.e. re.match() are taken directly form the
example in the mentioned HowTo. I'd rather use re.findall() but I
think the general interpretetion of the given regexp sould be nearly
the same in both functions.


Regular expressions are not just "nearly" the same, they are EXACTLY
the same, in whatever re function you call, with one exception:

re.match only matches at the start of the string.



So I'd like to neglect the choise of a particular function for a
moment a concentrate on the pure theory.
What I got so far:
in theory form s = '' would match


Incorrect. It will match

'<'
''
''
''


Why don't you try it and see?

py> s = ' import re
py> re.findall('<.*?>', s)
['<', '', '', '']


The re module is very stable. The above is what happens in every Python
version between *at least* 1.5 and 3.3.



to achieve this the engine should:
1. walk forward along the text until it finds<


Correct. That matches the first "<".



2. walk forward from that point until in finds>


Correct. That matches the first ">".

Since the regex has now found a match, it moves on to the next part
of the regex. Since this regex is now complete, it is done, and
returns what it has found.



3. walk backward form that point (the one of>) until it finds<


Regexes only backtrack on *misses*, not once they successfully find
a match. Once a regex has found a match, it is done.



4. return the string between<  from 3. and>  from 2. as this gives the
least possible string between<  and>


Incorrect.



Did I get this right so far? Is this (=least possible string between<
and>), what non-greedy really translates to?


No. The ".*" regex searches forward as far as possible; the ".*?" searches
forward as little as possible. They do not backtrack.

The only time a non-greedy regex will backtrack is if the greedy version
will backtrack. Since ".*" has no reason to backtrack, neither does ".*?".



For some reason, I did not get so far the regexp engine in Python
omits step 3. and returns the string between<  from 1. and>  from 2.
resulting in '<'

Am I right? If so, is there an easily graspable reason for the engine
designers to implement it this way?


Because that's the way regexes work. You would need to learn about
regular expression theory, which is not easy material. But you can start
here:

http://en.wikipedia.org/wiki/Regular_expression

and for more theoretical approach:

http://en.wikipedia.org/wiki/Chomsky_hierarchy
http://en.wikipedia.org/wiki/Regular_language
http://en.wikipedia.org/wiki/Regular_grammar

If you don't understand all the theory, don't worry, neither do I.



--
Steven
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Question regular expressions - the non-greedy pattern

2013-01-21 Thread Walter Prins
Hi Marcin,


On 21 January 2013 23:11, Marcin Mleczko  wrote:

> first thank you very much for the quick reply.
>
No problem...


> The functions used here i.e. re.match() are taken directly form the
> example in the mentioned HowTo. I'd rather use re.findall() but I
> think the general interpretetion of the given regexp sould be nearly
> the same in both functions.
>

... except that the results are fundamentally different due to the
different goals for the 2 functions: the one (match) only matches a regex
from the first character of a string.  (No conceptual "walking forward"
unless you've managed to match the string to a regex.)  The other (find),
matches the first possible match (conceptually walking the starting point
forward only as far as necessary to find a possible match.)


> So I'd like to neglect the choise of a particular function for a
> moment a concentrate on the pure theory.
> What I got so far:
> in theory form s = ' '<.*?>' would match '' '' '' ''
> to achieve this the engine should:
> 1. walk forward along the text until it finds <
> 2. walk forward from that point until in finds >
>

Here, conceptually the regex engines work for your original regex is
complete and it returns a match.


> 3. walk backward form that point (the one of >) until it finds <
>

No.  No further walking backward when you've already matched the regex.

4. return the string between < from 3. and > from 2. as this gives the
> least possible string between < and >
>

"Non greedy" doesn't imply the conceptually altering the starting point in
a backwards manner after you've already found a match.


> Did I get this right so far? Is this (=least possible string between <
> and >), what non-greedy really translates to?
>

No, as explained above.

Walter
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Question regular expressions - the non-greedy pattern

2013-01-21 Thread Lie Ryan

On 22/01/13 10:11, Marcin Mleczko wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hello Hugo, hello Walter,

first thank you very much for the quick reply.

The functions used here i.e. re.match() are taken directly form the
example in the mentioned HowTo. I'd rather use re.findall() but I
think the general interpretetion of the given regexp sould be nearly
the same in both functions.

So I'd like to neglect the choise of a particular function for a
moment a concentrate on the pure theory.
What I got so far:
in theory form s = '' would match '' '' '' ''
to achieve this the engine should:
1. walk forward along the text until it finds <
2. walk forward from that point until in finds >
3. walk backward form that point (the one of >) until it finds <
4. return the string between < from 3. and > from 2. as this gives the
least possible string between < and >

Did I get this right so far? Is this (=least possible string between <
and >), what non-greedy really translates to?


Nope, that's not how regex works. In particular, it does not do step 3, 
non-greedy regex do not walk backward to find the shortest string that 
matches.


What it does for the non-greedy pattern <.*?> is:

1. walk forward along the text until it finds <
2. if the next character matches . (any chars) then step one character 
forward, otherwise backtrack from where we left off in the previous step
3. if the next char is >, then we find a match, otherwise backtrack from 
where we left off in the previous step

4. return the string between < from 1 and > from 3

or alternatively

1. walk forward along the text until it finds <
2. walk forward from that point until in finds >
3. return the string between < from 1 and > from 2

___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Question regular expressions - the non-greedy pattern

2013-01-21 Thread Alan Gauld

Your understanding is flawed.

Try these actions and ponder their significance...

>>> s = '>> re.search('<.*?>',s).group()
'<'
>>> re.search('<.*>',s).group()
'>> s = '>> re.search('<.*>',s).group()
'<'
>>> re.search('<.*t',s).group()
'>> re.search('<.*?t',s).group()
'<>>


The non greediness applies to what follows the '?' not what precedes it.
That is, it swallows everything up until the last possible matching 
pattern by default. Non greedy only swallows up to the first matching 
pattern after the ?


HTH

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/

___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor