On Thu, Feb 12, 2009 at 6:20 PM, Emad Nawfal (عماد نوفل)
<emadnaw...@gmail.com> wrote:
> Dear Tutors,
> I have syntax trees like the one below. I need to extract the membership
> information like which adjective belongs to which noun phrase, and so on. In
> short, I want to do something like this:
> http://ilk.uvt.nl/team/sabine/chunklink/README.html
> I already have the Perl script that does that, so I do not need a script. I
> just want to be able to do this myself. My question is: what tools do I need
> for this? Could you please give me pointers to where to start? I'll then try
> to do it myself, and ask questions when I get stuck.

I guess I'm in the mood for writing parsers this week :-)

Attached is a parser that uses PLY to parse the structure you
provided. The result is a nested list that closely mirrors the
original: roughly what you would get by replacing all the () with []
and quoting the strings, except that each node's children are grouped
in a list of their own.
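
To make the shape of that nested list concrete, here is a minimal
sketch that builds the same kind of structure without PLY. It is a
hand-rolled recursive-descent parser, not the attached PLY grammar;
the name parse_brackets is made up for this illustration, and it does
no error recovery on malformed input.

```python
import re

def parse_brackets(text):
    """Parse one bracketed tree into nested lists (hand-rolled sketch, not PLY)."""
    # A token is a single parenthesis or a run of characters that is
    # neither a parenthesis nor whitespace.
    tokens = re.findall(r'[()]|[^()\s]+', text)
    pos = 0

    def tree():
        nonlocal pos
        assert tokens[pos] == '('
        pos += 1
        node = []
        if tokens[pos] != '(':
            node.append(tokens[pos])      # the tag, e.g. NP
            pos += 1
        if tokens[pos] == '(':
            children = []
            while tokens[pos] == '(':
                children.append(tree())
            node.append(children)         # children grouped in their own list
        else:
            node.append(tokens[pos])      # the word under a leaf tag
            pos += 1
        assert tokens[pos] == ')'
        pos += 1
        return node

    return tree()

result = parse_brackets("(NP (JJ former) (NN chairman))")
print(result)
```

A leaf comes out as [tag, word], an inner node as [tag, [children]],
and a tag-less outer wrapper as a one-element list holding its
children.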

Also in the attachment is a function that walks the resulting tree to
print a table similar to the one in the chunklink reference.
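
As an illustration of the recursive walk, the sketch below collects
the rows instead of printing them. The function name leaf_rows and the
hand-written sample list are assumptions made for this example (the
sample is a trimmed fragment of the sentence in the attachment); only
the [tag, word] / [tag, [children]] shape is taken from the parser's
output.

```python
def leaf_rows(tree, path=()):
    """Return a (tag, word, path) tuple for every leaf, left to right."""
    if len(tree) == 1:
        # Tag-less root wrapper: its only element is the list of children
        rows = []
        for child in tree[0]:
            rows.extend(leaf_rows(child, path))
        return rows
    tag, rest = tree
    if isinstance(rest, list):
        # Inner node: extend the path with the tag minus any function
        # suffix (NP-SBJ-1 becomes NP), then recurse into each child
        child_path = path + (tag.split('-')[0],)
        rows = []
        for child in rest:
            rows.extend(leaf_rows(child, child_path))
        return rows
    # Leaf: rest is the word itself
    return [(tag, rest, '/'.join(path))]

# Hand-written sample in the shape the parser produces
sample = ['NP',
          [['NP', [['JJ', 'former'], ['NN', 'chairman']]],
           ['PP', [['IN', 'of'],
                   ['NP', [['NNP', 'Consolidated'], ['NNP', 'Gold']]]]]]]

rows = leaf_rows(sample)
for tag, word, where in rows:
    print('%-8s %-14s %s' % (tag, word, where))
```

The path column answers the membership question directly: the
adjective "former" sits under NP/NP, so it belongs to the inner noun
phrase.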

This is a pretty simple example of both PLY usage and recursive
walking of a tree, if anyone wants to learn about either one. I hope I
haven't taken all your fun :-)

Kent

PS to the list: I know, I'm setting a bad example. Usually we like to
teach people to write Python, not write their programs for them.
'''
Parse files in Penn Treebank II format
See http://ilk.uvt.nl/team/sabine/chunklink/README.html
and http://bulba.sdsu.edu/jeanette/thesis/PennTags.html
The output is a nested list structure corresponding to the
structure of the input, roughly as if all the () were replaced with []
and the other text was quoted; the children of each node are grouped
in their own sub-list
'''

from pprint import pprint
from ply import lex, yacc

debug = 0

text = """
( (S 
    (NP-SBJ-1 
      (NP (NNP Rudolph) (NNP Agnew) )
      (, ,) 
      (UCP 
        (ADJP 
          (NP (CD 55) (NNS years) )
          (JJ old) )
        (CC and) 
        (NP 
          (NP (JJ former) (NN chairman) )
          (PP (IN of) 
            (NP (NNP Consolidated) (NNP Gold) (NNP Fields) (NNP PLC) ))))
      (, ,) )
    (VP (VBD was) 
      (VP (VBN named) 
        (S 
          (NP-SBJ (-NONE- *-1) )
          (NP-PRD 
            (NP (DT a) (JJ nonexecutive) (NN director) )
            (PP (IN of) 
              (NP (DT this) (JJ British) (JJ industrial) (NN conglomerate) ))))))
    (. .) ))
"""

# Lexical tokens

tokens = (
   'TOKEN',
)

literals = "()"

# Regular expression rules for simple tokens
t_TOKEN = r'[^\(\)\s]+'

# A string containing ignored characters (spaces, tabs and line breaks)
t_ignore  = ' \t\r\n'

# Error handling rule
def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
lexer = lex.lex()


def p_tree(p):
    '''tree : '(' TOKEN TOKEN ')'
            | '(' TOKEN list ')'
            | '(' list ')'
    '''
    if len(p) == 4:
        p[0] = [p[2]]
    else:
        p[0] = [p[2], p[3]]

def p_list(p):
    '''list : tree
            | list tree
    '''
    if len(p) == 2:
        p[0] = [p[1]]
    else:
        p[0] = p[1] + [p[2]]
        
parser = yacc.yacc()

def list_leaf_nodes(tree, path=None):
    ''' Print all the leaves of tree in order with their parent tags '''
    if path is None:
        path = []
    if len(tree) == 1:
        # Root node has only one element
        list_leaf_nodes(tree[0], [])
    else:
        if isinstance(tree[1], list):
            # Show child nodes
            itemPath = path + [tree[0]]
            for item in tree[1]:
                list_leaf_nodes(item, itemPath)
        else:
            # Show leaf node
            print('%-8s %-8s %s' % (tree[0], tree[1], format_path(path)))

def format_path(path):
    return '/'.join(item.split('-')[0] for item in path)
    
if __name__ == '__main__':
    tree = parser.parse(text, debug=debug)
#    pprint(tree)
    list_leaf_nodes(tree)
    
_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
