On Wed, 8 Dec 2004 15:11:55 +0000, Max Noel <[EMAIL PROTECTED]> wrote:
>
> On Dec 8, 2004, at 14:42, Jesse Noller wrote:
>
> > Hello,
> >
> > I'm trying to do some text processing with python on a fairly large
> > text file (actually, XML, but I am handling it as plaintext as all I
> > need to do is find/replace/move) and I am having problems with trying
> > to identify two lines in the text file, and remove everything in
> > between those two lines (but not the two lines) and then write the
> > file back (I know the file IO part).
>
> Okay, here are some hints: you need to identify when you enter a <foo>
> block and when you exit a </foo> block, keeping in mind that this may
> happen on the same line (e.g. <foo>blah</foo>). The rest is trivial.
> The rest of your message is included as a spoiler space if you want to
> find the solution by yourself -- however, a 17-line program that does
> that is included at the end of this message. It prints the resulting
> file to the standard out, for added flexibility: if you want the result
> to be in a file, just redirect stdout (python blah.py > out.txt).
>
> Oh, one last thing: don't use readlines(), it uses up a lot of memory
> (especially with big files), and you don't need it since you're reading
> the file sequentially. Use the file iterator instead.
>
> > I'm trying to do this with the re module - the two tags look like:
> >
> > <foo>
> > ...
> > a bunch of text (~1500 lines)
> > ...
> > </foo>
> >
> > I need to identify the first tag, and the second, and unconditionally
> > strip out everything in between those two tags, making it look like:
> >
> > <foo>
> > </foo>
> >
> > I'm familiar with using read/readlines to pull the file into memory
> > and alter the contents via string.replace(str, newstr) but I am not
> > sure where to begin with this other than the typical open/readlines.
> >
> > I'd start with something like:
> >
> > re1 = re.compile('^\<foo\>')
> > re2 = re.compile('^\<\/foo\>')
> >
> > f = open('foobar.txt', 'r')
> > for lines in f.readlines()
> >     match = re.match(re1, line)
> >
> > But I'm lost after this point really, as I can identify the two lines,
> > but I am not sure how to do the processing.
> >
> > thank you
> > -jesse
> > _______________________________________________
> > Tutor maillist - [EMAIL PROTECTED]
> > http://mail.python.org/mailman/listinfo/tutor
>
> #!/usr/bin/env python
>
> import sre
>
> reStart = sre.compile('^\s*\<foo\>')
> reEnd = sre.compile('\</foo\>\s*$')
>
> inBlock = False
>
> fileSource = open('foobar.txt')
>
> for line in fileSource:
>     if reStart.match(line): inBlock = True
>     if not inBlock: print line
>     if reEnd.match(line): inBlock = False
>
> fileSource.close()
>
> -- Max
> maxnoel_fr at yahoo dot fr -- ICQ #85274019
> "Look at you hacker... A pathetic creature of meat and bone, panting
> and sweating as you run through my corridors... How can you challenge a
> perfect, immortal machine?"
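For reference, here is a minimal sketch of the quoted hints, assuming a current Python 3 and assuming the opening and closing tags normally sit on their own lines. The names strip_block, START and END are hypothetical and not from the thread. Unlike the quoted program, which as written also drops the <foo> and </foo> lines themselves, this version keeps both tag lines, as the original question asked. A line that opens and closes the block at once (e.g. <foo>blah</foo>) is passed through unchanged.

```python
import re

# Hypothetical names; the tag "foo" is the placeholder used in the thread.
START = re.compile(r"^\s*<foo>")   # opening tag at the start of a line
END = re.compile(r"</foo>\s*$")    # closing tag at the end of a line


def strip_block(lines):
    """Yield lines, dropping everything strictly between the tag lines."""
    in_block = False
    for line in lines:
        if not in_block:
            yield line
            # Enter the block unless it also closes on this same line.
            if START.search(line) and not END.search(line):
                in_block = True
        elif END.search(line):
            # Closing tag found: keep this line and leave the block.
            in_block = False
            yield line
```

Because strip_block accepts any iterable of lines, an open file object can be passed in directly, so the file is read sequentially rather than loaded with readlines(). Something like sys.stdout.writelines(strip_block(open('foobar.txt'))) would then print the result, which can be redirected to a file as suggested above. Using search() rather than match() matters for END, since match() only tests at the start of the line.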
Thanks a bunch for all of your fast responses, they helped a lot - I'll
post what I cook up back to the list as soon as I complete it.

Thanks!

-jesse
_______________________________________________
Tutor maillist - [EMAIL PROTECTED]
http://mail.python.org/mailman/listinfo/tutor