Nicolas Fleury <nidoizo at yahoo.com> wrote:
> > ottrey at py.redsoft.be wrote:
> > >>> import re2
> > >>> buf='12 drummers drumming, 11 pipers piping, 10 lords a-leaping'
> > >>> regex='^((?P<verse>(?P<number>\d+) (?P<activity>[^,]+))(, )?)*$'
> > >>> pat2=re2.compile(regex)
> > >>> x=pat2.extract(buf)
> > >>> x
> >
> > {'verse': [{'number': '12', 'activity': 'drummers
> > drumming'}, {'number': '11', 'activity': 'pipers
> > piping'}, {'number': '10', 'activity': 'lords a-leaping'}]}
>
> Is a dictionary the good container or should another class be used?
> Because in the example the content of the "verse" group is lost,
> excluding its sub-groups. Something like a hierarchic MatchObject could
> provide access to both information, the sub-groups and the group itself.

Yes, very good point. Actually it ~is~ a container (that uses dict as its
base class). (I probably should add the following lines to the example.)

>>> type(x)
<class 're2._Match'>
>>> x._value
'12 drummers drumming, 11 pipers piping, 10 lords a-leaping'
>>> x.verse[0]._value
'12 drummers drumming'

Josiah Carlson jcarlson at uci.edu wrote:
> If one wanted to match the API of the re module, one should use
> pat2.findall(buf), which would return a list of 'hierarchical match
> objects'

Well, that would be something I'd want to discuss here, as I'm not sure
whether I actually ~want~ to match the API of the re module.

> Also, should it be limited to named groups?

I have given that some thought as well. Internally, un-named groups are
recursively given the names _group0, _group1, etc. as they are found, and
then those groups are recursively matched. In the final step the
resulting _Match object is compressed and those un-named groups are
discarded.

IMO, if you don't bother to name a group then you probably aren't going to
be interested in it anyway, so why keep a reference to it?

eg. If you only wanted to extract the numbers from those verses...

>>> regex='^(((?P<number>\d+) ([^,]+))(, )?)*$'
>>> pat2=re2.compile(regex)
>>> x=pat2.extract(buf)
>>> x
{'number': ['12', '11', '10']}

Before the compression stage the _Match object actually looked like this:

{'_group0':
  {'_value': '12 drummers drumming, 11 pipers piping, 10 lords a-leaping',
   '_group0': [
     {'_value': '12 drummers drumming, ',
      '_group1': ', ',
      '_group0': {'_value': '12 drummers drumming',
                  '_group1': 'drummers drumming',
                  'number': '12'}},
     {'_value': '11 pipers piping, ',
      '_group1': ', ',
      '_group0': {'_value': '11 pipers piping',
                  '_group1': 'pipers piping',
                  'number': '11'}},
     {'_value': '10 lords a-leaping',
      '_group0': {'_value': '10 lords a-leaping',
                  '_group1': 'lords a-leaping',
                  'number': '10'}}]}}

But the compression algorithm collected the named groups and brought them
to the surface, to return the much nicer looking:

{'number': ['12', '11', '10']}

NB. There are also a few other tricks up the sleeve of re2.
eg. It allows for named groups to be repeated in different branches of a
named group hierarchy, without the name redefinition error that the re
library will complain about.

eg.
>>> pat1=re2.compile(
...     '(?P<parents>(?P<mother>(?P<name>[\w ]+)),(?P<father>(?P<name>[\w ]+)))'
... )
>>> pat1.extract('Mum,Dad')
{'parents': {'father': {'name': 'Dad'}, 'mother': {'name': 'Mum'}}}

> I find the feature very interesting, but being used to live without it,
> I have difficulty evaluating its usefulness.

Yes - this is a good point too, because it ~is~ different from the re
library. re2 aims to do all that searching, grouping, iterating,
collecting and constructing work for you.

> However, it reminds me how much at first I found strange that only the
> last match was kept, so I think, FWIW, that on a purist point of vue the
> functionality would make sense in the stdlib in some way or another.

Actually that "last match only" confusion was part of the motivation for
writing it in the first place.
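
To make the "last match only" behaviour concrete, here is a small sketch using only the standard re module on the same buffer. The names last_only, verse_pat and verses are just illustrative, and the finditer loop is one common manual workaround, not anything re2 is claimed to do internally.

```python
import re

buf = '12 drummers drumming, 11 pipers piping, 10 lords a-leaping'
regex = r'^((?P<verse>(?P<number>\d+) (?P<activity>[^,]+))(, )?)*$'

# With the stdlib re module, a group inside a repetition only remembers
# the text from its final iteration.
m = re.match(regex, buf)
last_only = m.groupdict()
# -> {'verse': '10 lords a-leaping', 'number': '10',
#     'activity': 'lords a-leaping'}

# The usual manual workaround: match a single verse and iterate yourself.
verse_pat = re.compile(r'(?P<number>\d+) (?P<activity>[^,]+)')
verses = [v.groupdict() for v in verse_pat.finditer(buf)]
# -> [{'number': '12', 'activity': 'drummers drumming'},
#     {'number': '11', 'activity': 'pipers piping'},
#     {'number': '10', 'activity': 'lords a-leaping'}]
```

The second half is the kind of searching, iterating and collecting work that has to be written by hand each time.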

> For .verse[1] or .verse[2] to make sense, it implies that the pattern is
> something like...
> ((?P<verse>... )(?P<verse>...))
> ... which it isn't.

Good pickup! You've seen through my smoke and mirrors. ;-)
That list of verses was actually created in the compression stage.
(The stage that I failed to mention in my first post.)

ie. The regex was:

((?P<verse>(?P<number>\d+) (?P<activity>[^,]+))(, )?)*

Which returns an un-named list of verse groups. Something like:

{'_group0': [
  {'verse': {'number': '12', 'activity': 'drummers drumming'}},
  {'verse': {'number': '11', 'activity': 'pipers piping'}},
  {'verse': {'number': '10', 'activity': 'lords a-leaping'}}]}

But the compression algorithm discarded that '_group0' key and brought the
'verse' groups to the surface, then grouped them together in one 'verse'
list. ie. to make:

{'verse': [{'number': '12', 'activity': 'drummers drumming'},
           {'number': '11', 'activity': 'pipers piping'},
           {'number': '10', 'activity': 'lords a-leaping'}]}

> > Also, should it be limited to named groups?
>
> Probably not. I would suggest using matchobj.group(i) semantics to
> match the standard re module semantics, though only allow returning
> items in the current level of the hierarchy. That is, one could use
> x.verse.group(1) and get back '12', but x.group(1) would return '12
> pipers piping'

Actually, I ~would~ like to limit it to just named groups.
I reckon that if you're not going to bother naming a group, then you
probably have no interest in it.

I guess it's up for discussion how confusing this "new" way of thinking
could be and what drawbacks it might have.

Regards.

Chris.
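
For reference, the compression stage described above can be sketched roughly as follows. This is not the actual re2 code: the function name compress is hypothetical, it works on plain dicts, lists and strings rather than _Match objects, it drops every '_value' entry, and it ignores name collisions when hoisting. It only shows the idea of discarding '_groupN' keys and collecting named groups from a repeated group into one list.

```python
def compress(node):
    """Surface named groups and drop un-named '_groupN' wrappers.

    Illustrative sketch only; assumes the raw tree uses '_value' for the
    matched text and '_groupN' keys for un-named groups.
    """
    if isinstance(node, list):
        items = [compress(item) for item in node]
        if items and all(isinstance(item, dict) for item in items):
            # A repeated group: collect each named key into one list.
            merged = {}
            for item in items:
                for key, value in item.items():
                    merged.setdefault(key, []).append(value)
            return merged
        return items
    if isinstance(node, dict):
        result = {}
        for key, value in node.items():
            if key == '_value':
                continue  # matched text is not kept in this sketch
            child = compress(value)
            if key.startswith('_group'):
                # Un-named group: keep only the named groups found inside it.
                if isinstance(child, dict):
                    result.update(child)  # collisions not handled here
            else:
                result[key] = child
        return result
    return node  # plain matched string


raw = {'_group0': [
    {'verse': {'number': '12', 'activity': 'drummers drumming'}},
    {'verse': {'number': '11', 'activity': 'pipers piping'}},
    {'verse': {'number': '10', 'activity': 'lords a-leaping'}}]}

compressed = compress(raw)
# -> {'verse': [{'number': '12', 'activity': 'drummers drumming'},
#               {'number': '11', 'activity': 'pipers piping'},
#               {'number': '10', 'activity': 'lords a-leaping'}]}
```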