[Tutor] Question regular expressions - the non-greedy pattern
Hello,

in the howto (http://docs.python.org/2/howto/regex.html#regex-howto) there are code examples near the end of the page (the non-greedy pattern). I am referring to:

>>> s = '<html><head><title>Title</title>'
>>> len(s)
32
>>> print re.match('<.*>', s).span()
(0, 32)
>>> print re.match('<.*>', s).group()
<html><head><title>Title</title>
>>> print re.match('<.*?>', s).group()  #<- I'm referring to this
<html>

So far everything is fine.

Now I'm changing the input string to (adding an extra '<'):

>>> s = '<<html><head><title>Title</title>'
>>> print re.match('<.*?>', s).group()

I would expect to get the same result, as I'm using the non-greedy pattern. What I get is

<<html>

Did I get the concept of non-greedy wrong or is this really a bug?

I've tried this with

python -V
Python 2.7.3

on Win 7 64 bit as well as Ubuntu 64 bit.

I'd be glad to hear from you soon.

Thanks a lot for your effort.

Best regards,
Marcin

___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
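For readers who want to reproduce the behaviour described in this post, here is a minimal sketch in Python 3 syntax (the post itself uses Python 2.7.3, where print is a statement). The test string is assumed to be the 32-character one from the regex HOWTO; the names greedy, lazy, s2 and lazy2 are only illustrative.

```python
import re

# String assumed from the regex HOWTO example (32 characters long).
s = '<html><head><title>Title</title>'

greedy = re.match('<.*>', s)   # .* takes as many characters as it can
lazy = re.match('<.*?>', s)    # .*? stops at the first '>' it reaches

print(greedy.span(), greedy.group())
print(lazy.group())

# Same non-greedy pattern, but the string now starts with an extra '<'.
s2 = '<' + s
lazy2 = re.match('<.*?>', s2)
print(lazy2.group())
```

In the second case the match still begins at index 0, so the extra '<' is part of the result; the non-greedy qualifier only affects where the match ends.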
Re: [Tutor] Question regular expressions - the non-greedy pattern
Hello Hugo, hello Walter,

first, thank you very much for the quick reply.

The functions used here, i.e. re.match(), are taken directly from the example in the mentioned HowTo. I'd rather use re.findall(), but I think the general interpretation of the given regexp should be nearly the same in both functions.

So I'd like to neglect the choice of a particular function for a moment and concentrate on the pure theory.
What I got so far: in theory, from

s = '<<html><head><title>Title</title>'

'<.*?>' would match '<html>' '<head>' '<title>' '</title>'

To achieve this the engine should:
1. walk forward along the text until it finds <
2. walk forward from that point until it finds >
3. walk backward from that point (the one of >) until it finds <
4. return the string between < from 3. and > from 2., as this gives the shortest possible string between < and >

Did I get this right so far? Is this (= shortest possible string between < and >) what non-greedy really translates to?

For some reason that I have not grasped so far, the regexp engine in Python omits step 3. and returns the string between < from 1. and > from 2., resulting in '<<html>'

Am I right? If so, is there an easily graspable reason for the engine designers to implement it this way?

If I'm wrong, where is my mistake?

Marcin

On 21.01.2013 17:23, Walter Prins wrote:
> Hi,
>
> On 21 January 2013 14:45, Marcin Mleczko <marcin.mlec...@onet.eu> wrote:
>
> > Did I get the concept of non-greedy wrong or is this really a bug?
>
> Hugo's already explained the essence of your problem, but just to
> add/reiterate:
>
> a) match() will match at the beginning of the string (first
> character) or not at all. As specified, your regex does in fact
> match from the first character as shown, so the result is correct.
> (Aside: "<html>" in "<<html>" does not in fact match *from the
> beginning of the string*, so it is beside the point for the match()
> call.)
>
> b) Changing your regexp so that the body of the tag *cannot*
> contain "<", and then using search() instead, will fix your
> specific case for you:
>
> import re
>
> s = '<<html><head><title>Title</title>'
> tag_regex = '<[^<]*?>'
>
> matchobj = re.match(tag_regex, s)
> # prints None since no match at start of s
> print "re.match() result:", matchobj
>
> matchobj = re.search(tag_regex, s)
> # prints something since regex matches at index 1 of string
> print "re.search() result:\n",
> print "span:", matchobj.span()
> print "group:", matchobj.group()
>
> Walter
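Walter's suggestion can be tried in Python 3 roughly as follows. This is only a sketch: the test string is assumed to be the one discussed in the thread, and the variable names at_start and anywhere are illustrative.

```python
import re

# String assumed from the thread: the HOWTO example with an extra '<'.
s = '<<html><head><title>Title</title>'

# The tag body may contain anything except another '<'.
tag_regex = r'<[^<]*?>'

at_start = re.match(tag_regex, s)    # only tries index 0
anywhere = re.search(tag_regex, s)   # tries index 0, then 1, and so on

print("re.match() result:", at_start)
print("re.search() result:", anywhere.span(), anywhere.group())
```

At index 0 the pattern cannot succeed, because the character after the first '<' is another '<', which the class [^<] rejects. search() therefore moves on and succeeds at index 1.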
Re: [Tutor] Question regular expressions - the non-greedy pattern
Now I think I got it. Thanks a lot again.

Marcin

On 22.01.2013 12:00, tutor-requ...@python.org wrote:
>
> Message: 1
> Date: Tue, 22 Jan 2013 11:31:01 +1100
> From: Steven D'Aprano
> To: tutor@python.org
> Subject: Re: [Tutor] Question regular expressions - the non-greedy
>     pattern
> Message-ID: <50fdddc5.6030...@pearwood.info>
> Content-Type: text/plain; charset=UTF-8; format=flowed
>
> On 22/01/13 10:11, Marcin Mleczko wrote:
>> Hello Hugo, hello Walter,
>>
>> first, thank you very much for the quick reply.
>>
>> The functions used here, i.e. re.match(), are taken directly from the
>> example in the mentioned HowTo. I'd rather use re.findall(), but I
>> think the general interpretation of the given regexp should be nearly
>> the same in both functions.
>
> Regular expressions are not just "nearly" the same, they are EXACTLY
> the same, in whatever re function you call, with one exception:
>
> re.match only matches at the start of the string.
>
>> So I'd like to neglect the choice of a particular function for a
>> moment and concentrate on the pure theory.
>> What I got so far:
>> in theory, from s = '<<html><head><title>Title</title>'
>> '<.*?>' would match '<html>' '<head>' '<title>' '</title>'
>
> Incorrect. It will match
>
> '<<html>'
> '<head>'
> '<title>'
> '</title>'
>
> Why don't you try it and see?
>
> py> s = '<<html><head><title>Title</title>'
> py> import re
> py> re.findall('<.*?>', s)
> ['<<html>', '<head>', '<title>', '</title>']
>
> The re module is very stable. The above is what happens in every Python
> version between *at least* 1.5 and 3.3.
>
>> to achieve this the engine should:
>> 1. walk forward along the text until it finds <
>
> Correct. That matches the first "<".
>
>> 2. walk forward from that point until it finds >
>
> Correct. That matches the first ">".
>
> Since the regex has now found a match, it moves on to the next part
> of the regex. Since this regex is now complete, it is done, and
> returns what it has found.
>
>> 3. walk backward from that point (the one of >) until it finds <
>
> Regexes only backtrack on *misses*, not once they successfully find
> a match. Once a regex has found a match, it is done.
>
>> 4. return the string between < from 3. and > from 2., as this gives the
>> shortest possible string between < and >
>
> Incorrect.
>
>> Did I get this right so far? Is this (= shortest possible string
>> between < and >) what non-greedy really translates to?
>
> No. The ".*" regex searches forward as far as possible; the ".*?" searches
> forward as little as possible. They do not backtrack.
>
> The only time a non-greedy regex will backtrack is if the greedy version
> will backtrack. Since ".*" has no reason to backtrack, neither does ".*?".
>
>> For some reason that I have not grasped so far, the regexp engine in
>> Python omits step 3. and returns the string between < from 1. and >
>> from 2., resulting in '<<html>'
>>
>> Am I right? If so, is there an easily graspable reason for the engine
>> designers to implement it this way?
>
> Because that's the way regexes work. You would need to learn about
> regular expression theory, which is not easy material. But you can start
> here:
>
> http://en.wikipedia.org/wiki/Regular_expression
>
> and for a more theoretical approach:
>
> http://en.wikipedia.org/wiki/Chomsky_hierarchy
> http://en.wikipedia.org/wiki/Regular_language
> http://en.wikipedia.org/wiki/Regular_grammar
>
> If you don't understand all the theory, don't worry, neither do I.
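The point that the engine settles on the first start position that works, and only then applies "as little as possible", can be checked with a short Python 3 sketch. The test string is assumed to be the one from the thread, and the names lazy, first and second are illustrative. Pattern.match(string, pos) is used here only to force an attempt at a later start position.

```python
import re

# String assumed from the thread: the HOWTO example with an extra '<'.
s = '<<html><head><title>Title</title>'
lazy = re.compile(r'<.*?>')

# The scan stops at the first start position that can produce a match.
first = lazy.search(s)
print(first.span(), first.group())

# Starting one character later would give the shorter result, but a
# plain search never tries index 1 because index 0 already succeeded.
second = lazy.match(s, 1)
print(second.span(), second.group())

# findall repeats the same leftmost-first scan after each match.
print(lazy.findall(s))
```

So non-greedy means "shortest match from the chosen starting point", not "shortest match anywhere in the string".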
[Tutor] Question on regular expressions
Hello,

given this kind of string:

"start SomeArbitraryAmountOfText start AnotherArbitraryAmountOfText end"

a search string like r"start.*?end" would give me the entire string from the first "start" to "end":

"start SomeArbitraryAmountOfText start AnotherArbitraryAmountOfText end"

but I am interested only in the second part, between the 2nd "start" and the "end":

"start AnotherArbitraryAmountOfText end"

What would be the best, most clever way to search for that?

Or, even more general: how do I always extract the text between the last "start" and the "end" tag, assuming the entire text contains several "start" tags separated by an arbitrary amount of text before the "end" tag?

Any ideas?

Thank you in advance. ;-)

Marcin
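One possible approach, offered as a sketch and not taken from the thread, is to forbid a further "start" between the opening "start" and the "end" by means of a negative lookahead. The name innermost is illustrative, and re.DOTALL is an assumption for the case where the text spans several lines.

```python
import re

text = ("start SomeArbitraryAmountOfText "
        "start AnotherArbitraryAmountOfText end")

# After "start", consume one character at a time, but refuse any
# position where another "start" begins; then require "end".
innermost = re.compile(r"start(?:(?!start).)*?end", re.DOTALL)

m = innermost.search(text)
print(m.group())
```

The attempt beginning at the first "start" fails once it runs into the second "start", so the search moves on and the match begins at the last "start" before "end".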