On Thu, Feb 12, 2009 at 6:20 PM, Emad Nawfal (عماد نوفل)
<emadnaw...@gmail.com> wrote:
> Dear Tutors,
> I have syntax trees like the one below. I need to extract the membership
> information like which adjective belongs to which noun phrase, and so on. In
> short, I want to do something like this:
> http://ilk.uvt.nl/team/sabine/chunklink/README.html
> I already have the Perl script that does that, so I do not need a script. I
> just want to be able to do this myself. My question is: what tools do I need
> for this? Could you please give me pointers to where to start? I'll then try
> to do it myself, and ask questions when I get stuck.

I guess I'm in the mood for writing parsers this week :-)

Attached is a parser that uses PLY to parse the structure you provided. The result is a nested list structure with exactly the same structure as the original; it builds the list you would get by replacing all the () with [] and quoting the strings.

Also in the attachment is a function that walks the resulting tree to print a table similar to the one in the chunklink reference.

This is a pretty simple example of both PLY usage and recursive walking of a tree, if anyone wants to learn about either one. I hope I haven't taken all your fun :-)

Kent

PS to the list: I know, I'm setting a bad example. Usually we like to teach people to write Python, not write their programs for them.

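
To make the recursive walk concrete before reading the attachment, here is a minimal sketch on a tiny hand-built tree. It assumes the shape the attached grammar actions appear to produce: a word is `[tag, word]`, a phrase is `[tag, [child, ...]]` with its children grouped in one inner list, and the outermost untagged parentheses become a one-element list. The names `iter_leaves` and `sample` are made up for this illustration and are not part of the attachment.

```python
def iter_leaves(node, path=()):
    """Yield (tag, word, path) for each leaf, left to right."""
    if len(node) == 1:
        # Untagged wrapper such as the outermost "( ... )": visit its children.
        for child in node[0]:
            yield from iter_leaves(child, path)
    elif isinstance(node[1], list):
        # Tagged phrase: add its tag to the path and recurse into the children.
        for child in node[1]:
            yield from iter_leaves(child, path + (node[0],))
    else:
        # Leaf: a part-of-speech tag followed by the word itself.
        yield node[0], node[1], '/'.join(path)


# Hand-built equivalent of "( (NP (DT a) (NN dog)) )" under the assumed shape.
sample = [[['NP', [['DT', 'a'], ['NN', 'dog']]]]]
rows = list(iter_leaves(sample))
for tag, word, where in rows:
    print('%-8s %-8s %s' % (tag, word, where))
```

Yielding rows instead of printing inside the recursion is only a design choice for this sketch; it keeps the walk reusable if the table needs to go somewhere other than the screen.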
'''
Parse files in Penn Treebank II format
See http://ilk.uvt.nl/team/sabine/chunklink/README.html
and http://bulba.sdsu.edu/jeanette/thesis/PennTags.html

The output is a nested list structure directly corresponding to the structure
of the input, as if all the () were replaced with [] and the other text was quoted
'''

from pprint import pprint
from ply import lex, yacc

debug = 0

text = """
( (S
    (NP-SBJ-1
      (NP (NNP Rudolph) (NNP Agnew) )
      (, ,)
      (UCP
        (ADJP
          (NP (CD 55) (NNS years) )
          (JJ old) )
        (CC and)
        (NP
          (NP (JJ former) (NN chairman) )
          (PP (IN of)
            (NP (NNP Consolidated) (NNP Gold) (NNP Fields) (NNP PLC) ))))
      (, ,) )
    (VP (VBD was)
      (VP (VBN named)
        (S
          (NP-SBJ (-NONE- *-1) )
          (NP-PRD
            (NP (DT a) (JJ nonexecutive) (NN director) )
            (PP (IN of)
              (NP (DT this) (JJ British) (JJ industrial) (NN conglomerate) ))))))
    (. .) ))
"""

# Lexical tokens
tokens = (
    'TOKEN',
)

literals = "()"

# Regular expression rules for simple tokens
t_TOKEN = r'[^\(\)\s]+'

# A string containing ignored characters (spaces, tabs and line breaks)
t_ignore = ' \t\r\n'

# Error handling rule
def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
lexer = lex.lex()

def p_tree(p):
    '''tree : '(' TOKEN TOKEN ')'
            | '(' TOKEN list ')'
            | '(' list ')'
    '''
    if len(p) == 4:
        # '(' list ')': untagged wrapper around a list of trees
        p[0] = [p[2]]
    else:
        # Tagged node: [tag, word] or [tag, list of child trees]
        p[0] = [p[2], p[3]]

def p_list(p):
    '''list : tree
            | list tree
    '''
    if len(p) == 2:
        p[0] = [p[1]]
    else:
        p[0] = p[1] + [p[2]]

parser = yacc.yacc()

def list_leaf_nodes(tree, path = []):
    ''' Print all the leaves of tree inorder with their parent tags '''
    if len(tree) == 1:
        # Root node has only one element
        list_leaf_nodes(tree[0], [])
    else:
        if isinstance(tree[1], list):
            # Show child nodes
            itemPath = path + [tree[0]]
            for item in tree[1]:
                list_leaf_nodes(item, itemPath)
        else:
            # Show leaf node
            print('%-8s %-8s %s' % (tree[0], tree[1], format_path(path)))

def format_path(path):
    # Keep only the syntactic category, dropping suffixes such as -SBJ-1
    return '/'.join(item.split('-')[0] for item in path)

if __name__ == '__main__':
    tree = parser.parse(text, debug=debug)
    # pprint(tree)
    list_leaf_nodes(tree)
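
For anyone who wants to experiment without installing PLY, here is a rough hand-written recursive-descent sketch that should build the same nested lists as the grammar above. It is an alternative technique, not part of the attachment: the names `tokenize`, `parse_tree` and `_close` are invented for this illustration, and it raises `ValueError` on malformed input instead of reporting errors the way the PLY lexer and parser do.

```python
import re


def tokenize(text):
    # Each parenthesis is its own token; anything else is a run of
    # characters containing neither whitespace nor parentheses.
    return re.findall(r'[()]|[^()\s]+', text)


def _close(tokens, pos):
    # Check for the closing parenthesis and step past it.
    if pos >= len(tokens) or tokens[pos] != ')':
        raise ValueError("expected ')' at token %d" % pos)
    return pos + 1


def parse_tree(tokens, pos=0):
    """Parse one bracketed tree at tokens[pos]; return (node, next_pos)."""
    if pos >= len(tokens) or tokens[pos] != '(':
        raise ValueError("expected '(' at token %d" % pos)
    pos += 1

    # Optional tag right after the opening parenthesis.
    tag = None
    if pos < len(tokens) and tokens[pos] not in ('(', ')'):
        tag = tokens[pos]
        pos += 1

    # A tag followed by a plain token is a leaf: [tag, word].
    if tag is not None and pos < len(tokens) and tokens[pos] not in ('(', ')'):
        word = tokens[pos]
        return [tag, word], _close(tokens, pos + 1)

    # Otherwise collect one or more child trees.
    children = []
    while pos < len(tokens) and tokens[pos] == '(':
        child, pos = parse_tree(tokens, pos)
        children.append(child)
    if not children:
        raise ValueError("empty node before token %d" % pos)

    node = [children] if tag is None else [tag, children]
    return node, _close(tokens, pos)


small_tree, consumed = parse_tree(tokenize("( (NP (DT a) (NN dog)) )"))
print(small_tree)
```

The function returns the position after the tree it just read, which is what lets the loop over children pick up where the previous child stopped.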

_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor