On Sun, May 13, 2007 at 03:04:36PM -0700, Alan wrote:
> I'm looking for a more elegant way to parse sections of text files that
> are bordered by BEGIN/END delimiting phrases, like this:
>
> some text
> some more text
> BEGIN_INTERESTING_BIT
> someline1
> someline2
> someline3
> END_INTERESTING_BIT
> more text
> more text
>
> What I have been doing is clumsy, involving converting to a string and
> slicing out the required section using split('DELIMITER'):
>
> import sys
> infile = open(sys.argv[1], 'r')
> #join list elements with @ character into a string
> fileStr = '@'.join(infile.readlines())
> #Slice out the interesting section with split, then split again into
> lines using @
> resultLine = fileStr.split('BEGIN_INTERESTING_BIT')[1].split('END_INTERESTING_BIT')[0].split('@')
> for line in resultLine:
>     do things
>
> Can anyone point me at a better way to do this?
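For reference, the "low fun" line-by-line approach mentioned below might look something like the following minimal sketch (the function name is just illustrative, and it assumes the BEGIN/END phrases always sit on lines of their own):

    import sys

    def interesting_lines(infile):
        # Yield only the lines between the BEGIN/END delimiter phrases.
        inside = False
        for line in infile:
            line = line.rstrip('\n')
            if line == 'BEGIN_INTERESTING_BIT':
                inside = True
            elif line == 'END_INTERESTING_BIT':
                inside = False
            elif inside:
                yield line

    if __name__ == '__main__':
        infile = open(sys.argv[1], 'r')
        for line in interesting_lines(infile):
            print line          # do things with each interesting line
        infile.close()

A generator like this never builds the whole file into one string, so the '@'-join-and-split trick is not needed at all.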
Possibly over-kill, but ...

How much fun are you interested in having? Others have given you the "low fun" easy way. Now ask yourself whether this task is likely to become more complex (the interesting parts more hidden in a more complex grammar) and perhaps you also can't wait to have some fun. If so, consider this suggestion:

1. Write grammar rules that describe your input text. In your case, those rules might look something like the following:

       Seq ::= {InterestingChunk | UninterestingChunk}*
       InterestingChunk ::= BeginToken InterestingSeq EndToken
       InterestingSeq ::= InterestingChunk*

2. For each rule, write a Python function that tries to recognize what the rule describes. To do its job, each function might call other functions that implement other grammar rules, and might call a tokenizer function (see below) when it needs another token from the input stream. Example:

       def InterestingChunk_reco(self):
           if self.token_type == Tok_Begin:
               self.get_token()
               if self.InterestingSeq_reco():
                   if self.token_type == Tok_End:
                       self.get_token()
                       return True
                   else:
                       self.Error('bad interesting sequence')

3. Write a tokenizer function. Each time this function is called, it returns the next "token" (probably a word) from the input stream and a code that indicates the token type. Recognizer functions call this tokenizer function each time another token is needed. In your case there might be 3 token types: (1) plain word, (2) BeginTok, and (3) EndTok.

If you do the above, you have just written your first recursive descent parser.

Then, the next time you are at a party, beer bar, or wedding, any time the conversation comes even remotely close to the subject of parsing text, you say, "Well, for that kind of problem I usually write a recursive descent parser. It's the most powerful way and the only way to go. ..." Now, that's how to impress your friends and relations.

But, seriously, recursive descent parsers are quite easy to write and are a useful technique to have in your tool bag. And, like I said above: it's fun. Besides, if your problem becomes more complex and, for example, the input is not quite so line oriented, you will need a more powerful approach.

Wikipedia has a better explanation than mine, plus an example and links:

    http://en.wikipedia.org/wiki/Recursive_descent_parser

I've attached a sample solution and sample input.

Also, be aware that there are parser generators for Python.

Dave

-- 
Dave Kuhlman
http://www.rexx.com/~dkuhlman
#!/usr/bin/env python
# -*- mode: pymode; coding: latin1; -*-
"""
Recognize and print out interesting parts of input.
A recursive descent parser is used to scan the input.

Usage:
    python recursive_descent_parser.py [options] <infile>
Options:
    -h, --help      Display this help message.
Example:
    python recursive_descent_parser.py infile

Grammar:
    Seq ::= {InterestingChunk | UninterestingChunk}*
    InterestingChunk ::= BeginToken InterestingSeq EndToken
    InterestingSeq ::= InterestingChunk*
"""

#
# Imports
import sys
import getopt

#
# Globals and constants

# Token types:
Tok_EOF, Tok_Begin, Tok_End, Tok_Word = range(1, 5)

#
# Classes

class InterestingParser(object):
    def __init__(self, infilename=None):
        self.current_token = None
        if infilename:
            self.infilename = infilename
            self.read_input()
            #print self.input
            self.get_token()

    def read_input(self):
        self.infile = open(self.infilename, 'r')
        self.input = []
        for line in self.infile:
            self.input.extend(line.rstrip('\n').split(' '))
        self.infile.close()
        self.input_iterator = iter(self.input)

    def parse(self):
        return self.Seq_reco()

    def get_token(self):
        try:
            token = self.input_iterator.next()
        except StopIteration, e:
            token = None
        self.token = token
        if token is None:
            self.token_type = Tok_EOF
        elif token == 'BEGIN_INTERESTING_BIT':
            self.token_type = Tok_Begin
        elif token == 'END_INTERESTING_BIT':
            self.token_type = Tok_End
        else:
            self.token_type = Tok_Word

    def Seq_reco(self):
        while True:
            if self.token_type == Tok_EOF:
                return True
            elif self.token_type == Tok_Begin:
                self.InterestingChunk_reco()
            else:
                self.get_token()

    def InterestingChunk_reco(self):
        if self.token_type == Tok_Begin:
            self.get_token()
            if self.InterestingSeq_reco():
                if self.token_type == Tok_End:
                    self.get_token()
                    return True
                else:
                    self.Error('bad interesting sequence')

    def InterestingSeq_reco(self):
        while True:
            if self.token_type == Tok_Word:
                print 'interesting: "%s"' % (self.token, )
                self.get_token()
            elif self.token_type == Tok_End:
                return True
            else:
                msg = 'unknown token type -- token: %s token_type: %s' % (
                    self.token, self.token_type, )
                self.Error(msg)

    def Error(self, msg):
        print msg
        sys.exit(1)

#
# Functions

def test(infilename):
    parser = InterestingParser(infilename)
    parser.parse()

USAGE_TEXT = __doc__

def usage():
    print USAGE_TEXT
    sys.exit(-1)

def main():
    args = sys.argv[1:]
    try:
        opts, args = getopt.getopt(args, 'h', ['help', ])
    except:
        usage()
    name = 'nobody'
    for opt, val in opts:
        if opt in ('-h', '--help'):
            usage()
    if len(args) != 1:
        usage()
    infilename = args[0]
    test(infilename)

if __name__ == '__main__':
    #import pdb; pdb.set_trace()
    main()
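If you would rather get the interesting words back as data than have them printed, one hypothetical adaptation of the attached parser (the subclass name and the collected attribute below are made up for illustration, not part of the attached script) might be:

    # Hypothetical adaptation: collect the interesting words in a list
    # instead of printing them.
    class CollectingParser(InterestingParser):
        def __init__(self, infilename=None):
            self.collected = []
            InterestingParser.__init__(self, infilename)

        def InterestingSeq_reco(self):
            while True:
                if self.token_type == Tok_Word:
                    self.collected.append(self.token)
                    self.get_token()
                elif self.token_type == Tok_End:
                    return True
                else:
                    self.Error('unexpected token: %s' % (self.token, ))

    # Usage:
    #     parser = CollectingParser('infile')
    #     parser.parse()
    #     print parser.collected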
aaa bbb ccc ddd eee
BEGIN_INTERESTING_BIT
fff ggg hhh iii
END_INTERESTING_BIT
jjj kkk
BEGIN_INTERESTING_BIT
ppp qqq rrr sss ttt
END_INTERESTING_BIT
lll mmm
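Running the attached script over this sample input (saved as, say, sample.txt, and invoked with "python recursive_descent_parser.py sample.txt") should print something like:

    interesting: "fff"
    interesting: "ggg"
    interesting: "hhh"
    interesting: "iii"
    interesting: "ppp"
    interesting: "qqq"
    interesting: "rrr"
    interesting: "sss"
    interesting: "ttt"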