On Fri, Feb 13, 2009 at 10:20 AM, Paul McGuire <pt...@austin.rr.com> wrote:
> Pyparsing has a built-in helper called nestedExpr that fits neatly in with
> this data. Here is the whole script:
>
> from pyparsing import nestedExpr
>
> syntax_tree = nestedExpr()
> results = syntax_tree.parseString(st_data)
>
> from pprint import pprint
> pprint(results.asList())
>
>
> Prints:
>
> [[['S',
>    ['NP-SBJ-1',
>     ['NP', ['NNP', 'Rudolph'], ['NNP', 'Agnew']],
>     [',', ','],
>     ['UCP',
>      ['ADJP', ['NP', ['CD', '55'], ['NNS', 'years']], ['JJ', 'old']],
>      ['CC', 'and'],
>      ['NP',
>       ['NP', ['JJ', 'former'], ['NN', 'chairman']],
>       ['PP',
>        ['IN', 'of'],
>        ['NP',
>         ['NNP', 'Consolidated'],
>         ['NNP', 'Gold'],
>         ['NNP', 'Fields'],
>         ['NNP', 'PLC']]]]],
>     [',', ',']],
>    ['VP',
>     ['VBD', 'was'],
>     ['VP',
>      ['VBN', 'named'],
>      ['S',
>       ['NP-SBJ', ['-NONE-', '*-1']],
>       ['NP-PRD',
>        ['NP', ['DT', 'a'], ['JJ', 'nonexecutive'], ['NN', 'director']],
>        ['PP',
>         ['IN', 'of'],
>         ['NP',
>          ['DT', 'this'],
>          ['JJ', 'British'],
>          ['JJ', 'industrial'],
>          ['NN', 'conglomerate']]]]]]],
>    ['.', '.']]]]
>
> If you want to delve deeper into this, you could, since the content of the
> () groups is so regular. You in essence reconstruct nestedExpr in your own
> code, but you do get some increased control and visibility to the parsed
> content.
>
> Since this is a recursive syntax, you will need to use pyparsing's mechanism
> for recursion, which is the Forward class. Forward is sort of a "I can't
> define the whole thing yet, just create a placeholder" placeholder.
>
> syntax_element = Forward()
> LPAR,RPAR = map(Suppress,"()")
> syntax_tree = LPAR + syntax_element + RPAR
>
> Now in your example, a syntax_element can be one of 4 things:
> - a punctuation mark, twice
> - a syntax marker followed by one or more syntax_trees
> - a syntax marker followed by a word
> - a syntax tree
>
> Here is how I define those:
>
> marker = oneOf("VBD ADJP VBN JJ DT PP NN UCP NP-PRD "
>                "NP NNS NNP CC NP-SBJ-1 CD VP -NONE- "
>                "IN NP-SBJ S")
> punc = oneOf(", . ! ?")
>
> wordchars = printables.replace("(","").replace(")","")
>
> syntax_element << (
>     punc + punc |
>     marker + OneOrMore(Group(syntax_tree)) |
>     marker + Word(wordchars) |
>     syntax_tree )
>
> Note that we use '<<' operator to "inject" the definition of a
> syntax_element - we can't use '=' or we would get a different expression
> than the one we used to define syntax_tree.
>
> Now parse the string, and voila! Same as before.
>
> Here is the entire script:
>
> from pyparsing import (nestedExpr, Suppress, oneOf, Forward, OneOrMore,
>                        Word, printables, Group)
>
> syntax_element = Forward()
> LPAR,RPAR = map(Suppress,"()")
> syntax_tree = LPAR + syntax_element + RPAR
>
> marker = oneOf("VBD ADJP VBN JJ DT PP NN UCP NP-PRD "
>                "NP NNS NNP CC NP-SBJ-1 CD VP -NONE- "
>                "IN NP-SBJ S")
> punc = oneOf(", . ! ?")
>
> wordchars = printables.replace("(","").replace(")","")
>
> syntax_element << (
>     punc + punc |
>     marker + OneOrMore(Group(syntax_tree)) |
>     marker + Word(wordchars) |
>     syntax_tree )
>
> results = syntax_tree.parseString(st_data)
> from pprint import pprint
> pprint(results.asList())
>
> -- Paul
> _______________________________________________
> Tutor maillist - Tutor@python.org
> http://mail.python.org/mailman/listinfo/tutor
>

Thank you so much Paul, Kent, and Hoftkamp.
I was asking what the right tools were, and I got two fully-functional
scripts back. Much more than I had expected. I'm planning to use these
scripts instead of the Perl one. I've also started with PyParsing as it
seems to be a little easier to understand than PLY.

Thank you again,
--
لا أعرف مظلوما تواطأ الناس علي هضمه ولا زهدوا في إنصافه كالحقيقة.....محمد الغزالي
"No victim has ever been more repressed and alienated than the truth"

Emad Soliman Nawfal
Indiana University, Bloomington
--------------------------------------------------------
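P.S. For anyone reading this in the archive without pyparsing installed: the nested-list shape that nestedExpr returns can be approximated with a few lines of plain Python. This is only a rough sketch of the idea, not how pyparsing implements it. The function name parse_nested and the shortened sample string are my own inventions for illustration; the real st_data from earlier in the thread is longer.

```python
def parse_nested(text):
    """Turn a parenthesised string into nested lists of string tokens."""
    # Pad parentheses with spaces so split() yields them as separate tokens.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    stack = [[]]  # stack[0] collects the top-level results
    for tok in tokens:
        if tok == "(":
            stack.append([])  # open a new nested group
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            group = stack.pop()  # close the group and attach it to its parent
            stack[-1].append(group)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]


# Hypothetical, shortened sample in the same bracketed treebank style.
sample = "( (S (NP (NNP Rudolph) (NNP Agnew)) (. .)) )"
tree = parse_nested(sample)
print(tree)
```

Unlike the Forward-based grammar quoted above, this sketch does not check that markers or punctuation are valid; it only mirrors the bracket structure.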