On Dec 8, 2004, at 14:42, Jesse Noller wrote:

Hello,

I'm trying to do some text processing with python on a fairly large
text file (actually, XML, but I am handling it as plaintext as all I
need to do is find/replace/move) and I am having problems with trying
to identify two lines in the text file, and remove everything in
between those two lines (but not the two lines) and then write the
file back (I know the file IO part).

Okay, here are some hints: you need to identify when you enter a <foo> block and when you exit a </foo> block, keeping in mind that this may happen on the same line (e.g. <foo>blah</foo>). The rest is trivial.
The rest of your message is included as spoiler space in case you want to find the solution by yourself. However, a short program that does the job is included at the end of this message. It prints the resulting file to standard output, for added flexibility: if you want the result to be in a file, just redirect stdout (python blah.py > out.txt).


Oh, one last thing: don't use readlines(). It reads the whole file into a list, which uses up a lot of memory (especially with big files), and you don't need it since you're reading the file sequentially. Use the file iterator instead.
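
To make the difference concrete, here is a minimal sketch of reading through the file iterator. The function name count_matching and the file name in the comment are placeholders of mine, not anything from your program:

```python
def count_matching(lines, needle):
    """Count the lines containing needle, looking at one line at a time."""
    count = 0
    # 'lines' can be an open file or any other iterable of strings.
    # Iterating over a file yields one line per step, so the whole
    # file never has to sit in memory at once.
    for line in lines:
        if needle in line:
            count += 1
    return count

# Typical use with a real file (the name is only an example):
# f = open('foobar.txt')
# print(count_matching(f, '<foo>'))
# f.close()
```

Because the function only needs something it can loop over, you can try it on a plain list of strings before pointing it at the big file.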

I'm trying to do this with the re module - the two tags look like:

<foo>
    ...
    a bunch of text (~1500 lines)
    ...
</foo>

I need to identify the first tag, and the second, and unconditionally
strip out everything in between those two tags, making it look like:

<foo>
</foo>

I'm familiar with using read/readlines to pull the file into memory
and alter the contents via string.replace(str, newstr) but I am not
sure where to begin with this other than the typical open/readlines.

I'd start with something like:

import re

re1 = re.compile(r'^<foo>')
re2 = re.compile(r'^</foo>')

f = open('foobar.txt', 'r')
for line in f.readlines():
    match = re.match(re1, line)

But I'm lost after this point really, as I can identify the two lines,
but I am not sure how to do the processing.

thank you
-jesse
_______________________________________________
Tutor maillist  -  [EMAIL PROTECTED]
http://mail.python.org/mailman/listinfo/tutor


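#!/usr/bin/env python

import re
import sys

# Opening tag at the start of a line, closing tag at the end of a line.
reStart = re.compile(r'^\s*<foo>')
reEnd = re.compile(r'</foo>\s*$')

inBlock = False

fileSource = open('foobar.txt')

for line in fileSource:
    # search(), not match(): match() only looks at the start of the line.
    isStart = reStart.search(line)
    isEnd = reEnd.search(line)
    # Keep the two tag lines themselves; drop only the lines between them.
    if isStart or isEnd or not inBlock:
        sys.stdout.write(line)  # the line already ends with a newline
    if isStart: inBlock = True
    if isEnd: inBlock = False

fileSource.close()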
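
If you would rather check the logic without touching foobar.txt, the same idea can be wrapped in a generator and fed a few lines by hand. This is only a sketch: the names START, END and strip_between are my own, and it works on whole lines, so a one-line <foo>blah</foo> is passed through unchanged.

```python
import re

# Assumed tag layout: <foo> opens a line, </foo> closes a line.
START = re.compile(r'^\s*<foo>')
END = re.compile(r'</foo>\s*$')

def strip_between(lines):
    """Yield lines, skipping those strictly between the two tag lines."""
    in_block = False
    for line in lines:
        is_start = START.search(line)
        is_end = END.search(line)
        # Tag lines are always kept; other lines only outside a block.
        if is_start or is_end or not in_block:
            yield line
        if is_start:
            in_block = True
        if is_end:
            in_block = False

sample = ["before\n", "<foo>\n", "text 1\n", "text 2\n", "</foo>\n", "after\n"]
result = list(strip_between(sample))
```

Here result holds the "before", "<foo>", "</foo>" and "after" lines, in that order. Passing an open file instead of the sample list works the same way, one line at a time.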



-- Max
maxnoel_fr at yahoo dot fr -- ICQ #85274019
"Look at you hacker... A pathetic creature of meat and bone, panting and sweating as you run through my corridors... How can you challenge a perfect, immortal machine?"

