| I have about 150 lines of python extracting text from large file, the
| problem I need a few lines to clean first to avoid the problem the
| script is facing

Hello,

This seems like a well laid out task. If you post what you are trying and the 
problems you are encountering, that would be helpful.

One suggestion that I have is that you switch problems 1 and 2. If the ordering
is broken (e.g. HHFR instead of HFRH) then knowing where to put the
parenthetical comment is going to be a problem. Also, you said that you wanted
it put after the "F" reference; did you mean that it should look like this:

| AFTER your process
|| H 00100 "a friend in need is a friend indeed so select the best
|| friend 
| as soon as you can blah"
|| F Old London book (xyz blah words)  <=== parenthetical here?
|| R Cool

It's a little hard to tell from what you've said, but it looks like the "|" was 
an unnecessary addition. If your record markers were always a single character 
at the beginning of a line, those are easy enough to find--provided there is 
never an H, F, or R that is a NON-record marker at the beginning of a line as a 
single character.

######
>>> text='''H This is the start.
... F here is a reference. 
... Right here is a non-reference R but it's not a single character starting the line
... so it won't be matched; and the single one in the middle isn't at the start.
... R cool'''
>>> import re
>>> text = '\n'+text     # make the first one like all the others: preceded by a newline character
>>> re.findall(r'\n([HFR])\b', text)
['H', 'F', 'R']
>>> re.split(r'\n([HFR])\b', text)
['', 'H', ' This is the start.', 'F', " here is a reference. \nRight here is a non-reference R but it's not a single character starting the line\nso it won't be matched; and the single one in the middle isn't at the start.", 'R', ' cool']

######

That last list has all the groups with the identifier preceding the 
corresponding data.
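
If it helps, here is a rough sketch of turning that list into (marker, data)
pairs. The function name split_records is just something I made up for
illustration, and it assumes the same rule as above: a record marker is a
lone H, F, or R at the very start of a line.

```python
import re

def split_records(text):
    """Return a list of (marker, data) pairs found in text.

    Assumes a marker is a single H, F, or R at the start of a line.
    """
    # Prepend a newline so the first marker looks like all the others.
    pieces = re.split(r'\n([HFR])\b', '\n' + text)
    # pieces[0] is whatever came before the first marker; after that the
    # list alternates marker, data, marker, data, ...
    markers = pieces[1::2]
    data = pieces[2::2]
    return [(m, d.strip()) for m, d in zip(markers, data)]

sample = "H This is the start.\nF here is a reference.\nR cool"
print(split_records(sample))
```

Any newlines inside a record's data are left in place; only the whitespace at
the two ends is stripped.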

Finally, I'm not sure how you are checking the correctness of the HFR sequence, 
but the findall used above suggests a way to do it:

-do the findall
-join the results together
-replace 'HFR' with '.'
-if the whole string isn't dots then there was a problem, and the number of dots
before the first non-dot tells you how many correct records there were.

######
>>> bad='''
... H
... F
... R
... R
... '''
>>> re.findall(r'\n([HFR])\b', bad)
['H', 'F', 'R', 'R']
>>> ''.join(_)            # the _ refers to the last output
'HFRR'
>>> _.replace('HFR', '.')
'.R'
>>> len(_),_.count('.')
(2, 1)

######

Notice that since not all the HFRs were complete, not all the characters are
periods (and so the count of periods is not the same as the length of the
string). In this case there was one correct record (thus one leading dot)
before the problem occurred.
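
The findall/join/replace steps could be wrapped up in a small function along
these lines. This is only a sketch: the name count_good_records is
hypothetical, and it relies on the same assumption that a marker is a lone H,
F, or R at the start of a line.

```python
import re

def count_good_records(text):
    """Return (good, ok) for the marker sequence in text.

    good is the number of complete HFR records before the first problem;
    ok is True only if every marker belongs to a complete HFR record.
    """
    # Collect the markers in order, e.g. 'HFRR'.
    markers = ''.join(re.findall(r'\n([HFR])\b', '\n' + text))
    # Collapse each complete record into a single dot, e.g. '.R'.
    dots = markers.replace('HFR', '.')
    # The leading dots are the records seen before anything went wrong.
    good = len(dots) - len(dots.lstrip('.'))
    ok = dots.count('.') == len(dots)
    return good, ok

bad = "\nH\nF\nR\nR\n"
print(count_good_records(bad))
```

For the bad sample this gives one good record and flags a problem, which
matches the interactive session above.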

/c
_______________________________________________
Tutor maillist  -  [email protected]
http://mail.python.org/mailman/listinfo/tutor
