On Sun, May 13, 2007 at 03:04:36PM -0700, Alan wrote:
> I'm looking for a more elegant way to parse sections of text files that 
> are bordered by BEGIN/END delimiting phrases, like this:
> 
> some text
> some more text
> BEGIN_INTERESTING_BIT
> someline1
> someline2
> someline3
> END_INTERESTING_BIT
> more text
> more text
> 
> What I have been doing is clumsy, involving converting to a string and 
> slicing out the required section using split('DELIMITER'): 
> 
> import sys
> infile = open(sys.argv[1], 'r')
> #join list elements with @ character into a string
> fileStr = '@'.join(infile.readlines())
> #Slice out the interesting section with split, then split again into
> #lines using @
> resultLine = fileStr.split('BEGIN_INTERESTING_BIT')[1].split('END_INTERESTING_BIT')[0].split('@')
> for line in resultLine:
>     do things
> 
> Can anyone point me at a better way to do this?
> 

Possibly over-kill, but ...

How much fun are you interested in having?  Others have given you
the "low fun" easy way.  Now ask yourself whether this task is
likely to become more complex (the interesting parts more hidden in
a more complex grammar) and perhaps you also can't wait to have
some fun.  If so, consider this suggestion:

1. Write grammar rules that describe your input text.  In your
   case, those rules might look something like the following:

       Seq ::= {InterestingChunk | UninterestingChunk}*
       InterestingChunk ::= BeginToken InterestingSeq EndToken
       InterestingSeq ::= Word*


2. For each rule, write a Python function that tries to recognize
   what the rule describes.  To do its job, each function might
   call other functions that implement other grammar rules and
   might call a tokenizer function (see below) when it needs
   another token from the input stream.  Example:

       def InterestingChunk_reco(self):
           if self.token_type == Tok_Begin:
               self.get_token()
               if self.InterestingSeq_reco():
                   if self.token_type == Tok_End:
                       self.get_token()
                       return True
                   else:
                       self.Error('bad interesting sequence')

3. Write a tokenizer function.  Each time this function is called,
   it returns the next "token" (probably a word) from the input
   stream and a code that indicates the token type.  Recognizer
   functions call this tokenizer function each time another token
   is needed.  In your case there might be 3 token types: (1) plain
   word, (2) BeginTok, and (3) EndTok.
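The tokenizer in step 3 can be sketched as a generator that yields
(token_type, token) pairs.  (This is a minimal illustration; the
token-type names are my own, not a standard API, and I've added a
fourth type for end-of-input, which a recognizer also needs.)

```python
# Sketch of the step-3 tokenizer: yield (token_type, token) pairs.
Tok_EOF, Tok_Begin, Tok_End, Tok_Word = range(1, 5)

def tokenize(text):
    # Split on whitespace and classify each word.
    for word in text.split():
        if word == 'BEGIN_INTERESTING_BIT':
            yield Tok_Begin, word
        elif word == 'END_INTERESTING_BIT':
            yield Tok_End, word
        else:
            yield Tok_Word, word
    # Signal end of input so recognizers know when to stop.
    yield Tok_EOF, None

toks = list(tokenize('aaa BEGIN_INTERESTING_BIT bbb END_INTERESTING_BIT'))
print(toks)
# prints [(4, 'aaa'), (2, 'BEGIN_INTERESTING_BIT'), (4, 'bbb'),
#         (3, 'END_INTERESTING_BIT'), (1, None)]
```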

If you do the above, you have just written your first recursive
descent parser.

Then, the next time you are at a party, beer bar, or wedding, any
time the conversation comes even remotely close to the subject of
parsing text, you say, "Well, for that kind of problem I usually
write a recursive descent parser.  It's the most powerful way and
the only way to go.  ..." Now, that's how to impress your friends
and relations.

But, seriously, recursive descent parsers are quite easy and are a
useful technique to have in your tool bag.  And, like I said above:
It's fun.
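For the record, the "low fun" easy way is just a flag-based loop --
something like this sketch (assuming, as in your sample, that the
delimiters appear on lines by themselves):

```python
import io

def interesting_lines(infile):
    # Yield the lines between the delimiters (delimiters excluded).
    # Assumes each delimiter sits on its own line.
    inside = False
    for line in infile:
        line = line.strip()
        if line == 'BEGIN_INTERESTING_BIT':
            inside = True
        elif line == 'END_INTERESTING_BIT':
            inside = False
        elif inside:
            yield line

sample = """some text
BEGIN_INTERESTING_BIT
someline1
someline2
END_INTERESTING_BIT
more text
"""
print(list(interesting_lines(io.StringIO(sample))))
# prints ['someline1', 'someline2']
```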

Besides, if your problem becomes more complex, and, for example,
the input is not quite so line oriented, you will need a more
powerful approach.

Wikipedia has a better explanation than mine plus an example and
links: http://en.wikipedia.org/wiki/Recursive_descent_parser

I've attached a sample solution and sample input.

Also, be aware that there are parser generators for Python.

Dave


-- 
Dave Kuhlman
http://www.rexx.com/~dkuhlman
#!/usr/bin/env python
"""
Recognize and print out interesting parts of input.
A recursive descent parser is used to scan the input.

Usage:
    python recursive_descent_parser.py [options] <infile>
Options:
    -h, --help      Display this help message.
Example:
    python recursive_descent_parser.py infile

Grammar:
    Seq ::= {InterestingChunk | UninterestingChunk}*
    InterestingChunk ::= BeginToken InterestingSeq EndToken
    InterestingSeq ::= Word*
"""


#
# Imports

import sys
import getopt


#
# Globals and constants

# Token types:
Tok_EOF, Tok_Begin, Tok_End, Tok_Word = range(1, 5)


#
# Classes

class InterestingParser(object):
    def __init__(self, infilename=None):
        self.token = None
        self.token_type = None
        if infilename:
            self.infilename = infilename
            self.read_input()
            self.get_token()

    def read_input(self):
        # Split the whole file into whitespace-separated tokens.
        self.input = []
        with open(self.infilename, 'r') as infile:
            for line in infile:
                self.input.extend(line.split())
        self.input_iterator = iter(self.input)

    def parse(self):
        return self.Seq_reco()

    def get_token(self):
        # Fetch the next token and classify it.
        token = next(self.input_iterator, None)
        self.token = token
        if token is None:
            self.token_type = Tok_EOF
        elif token == 'BEGIN_INTERESTING_BIT':
            self.token_type = Tok_Begin
        elif token == 'END_INTERESTING_BIT':
            self.token_type = Tok_End
        else:
            self.token_type = Tok_Word

    def Seq_reco(self):
        # Seq ::= {InterestingChunk | UninterestingChunk}*
        while True:
            if self.token_type == Tok_EOF:
                return True
            elif self.token_type == Tok_Begin:
                self.InterestingChunk_reco()
            else:
                self.get_token()

    def InterestingChunk_reco(self):
        # InterestingChunk ::= BeginToken InterestingSeq EndToken
        if self.token_type == Tok_Begin:
            self.get_token()
            if self.InterestingSeq_reco():
                if self.token_type == Tok_End:
                    self.get_token()
                    return True
                else:
                    self.Error('bad interesting sequence')
        return False

    def InterestingSeq_reco(self):
        # InterestingSeq ::= Word*
        while True:
            if self.token_type == Tok_Word:
                print('interesting: "%s"' % (self.token, ))
                self.get_token()
            elif self.token_type == Tok_End:
                return True
            else:
                msg = 'unknown token type -- token: %s  token_type: %s' % (
                    self.token, self.token_type, )
                self.Error(msg)

    def Error(self, msg):
        print(msg)
        sys.exit(1)


#
# Functions

def test(infilename):
    parser = InterestingParser(infilename)
    parser.parse()


USAGE_TEXT = __doc__

def usage():
    print(USAGE_TEXT)
    sys.exit(1)


def main():
    args = sys.argv[1:]
    try:
        opts, args = getopt.getopt(args, 'h', ['help', ])
    except getopt.GetoptError:
        usage()
    for opt, val in opts:
        if opt in ('-h', '--help'):
            usage()
    if len(args) != 1:
        usage()
    test(args[0])


if __name__ == '__main__':
    main()

aaa bbb ccc

ddd eee
BEGIN_INTERESTING_BIT fff ggg
hhh iii END_INTERESTING_BIT jjj kkk
BEGIN_INTERESTING_BIT
ppp qqq rrr
sss ttt
END_INTERESTING_BIT
lll mmm
_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor