| I have about 150 lines of python extracting text from large file, the
| problem I need a few lines to clean first to avoid the problem the
| script is facing
Hello,
This seems like a well laid out task. If you post what you are trying and the
problems you are encountering, that would be helpful.
One suggestion that I have is that you switch problems 1 and 2. If the ordering
is broken (e.g. HHFR instead of HFRH) then knowing where to put the
parenthetical comment is going to be a problem. Also, you said that you wanted
it put after the "F" reference; did you mean that it should look like this:
| AFTER your process
|| H 00100 "a friend in need is a friend indeed so select the best
|| friend
| as soon as you can blah"
|| F Old London book (xyz blah words) <=== parenthetical here?
|| R Cool
It's a little hard to tell from what you've said, but it looks like the "|" was
an unnecessary addition. If your record markers were always a single character
at the beginning of a line, those are easy enough to find--provided there is
never an H, F, or R that is a NON-record marker at the beginning of a line as a
single character.
######
>>> text='''H This is the start.
... F here is a reference.
... Right here is a non-reference R but it's not a single character starting the line
... so it won't be matched; and the single one in the middle isn't at the start.
... R cool'''
>>> import re
>>> # make the first one like all the others: preceded by a newline character
>>> text = '\n' + text
>>> re.findall(r'\n([HFR])\b', text)
['H', 'F', 'R']
>>> re.split(r'\n([HFR])\b', text)
['', 'H', ' This is the start.', 'F', " here is a reference. \nRight here is a
non-reference R but it's not a single character starting the line\nso it won't
be matched; and the single one in the middle isn't at the start.", 'R', ' cool']
######
That last list has all the groups with the identifier preceding the
corresponding data.
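
If it helps, that flat list can be folded into (marker, data) pairs. This is only a sketch: the name split_records is something I made up for illustration, and it assumes the same rule as above (a marker is a single H, F, or R at the very start of a line).

```python
import re

def split_records(text):
    """Return a list of (marker, data) pairs (hypothetical helper)."""
    # Prepend a newline so the first marker is preceded by one like the rest.
    parts = re.split(r'\n([HFR])\b', '\n' + text)
    # parts[0] is whatever came before the first marker; after that the
    # list alternates marker, data, marker, data, ...
    markers = parts[1::2]
    data = parts[2::2]
    return [(m, d.strip()) for m, d in zip(markers, data)]

sample = "H This is the start.\nF here is a reference.\nRight here, not a marker.\nR cool"
pairs = split_records(sample)
for marker, body in pairs:
    print(marker, repr(body))
```

Continuation lines stay attached to the record they belong to, so the "F" data above keeps its second line.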
Finally, I'm not sure how you are checking the correctness of the HFR sequence,
but the findall used above suggests a way to do it:
-do the findall
-join the results together
-replace 'HFR' with '.'
-if the whole string isn't dots then there was a problem, and the number of dots
before the first non-dot tells you how many correct records there were.
######
>>> bad='''
... H
... F
... R
... R
... '''
>>> re.findall(r'\n([HFR])\b', bad)
['H', 'F', 'R', 'R']
>>> ''.join(_) # the _ refers to the last output
'HFRR'
>>> _.replace('HFR', '.')
'.R'
>>> len(_),_.count('.')
(2, 1)
######
Notice that since not all the HFR groups were complete, not all the characters
are periods (and so the count of periods is not the same as the length of the
string). In this case there was one correct record (thus one leading dot) before
the problem occurred.
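
The same check could be wrapped up in a function along these lines. Again, just a sketch: count_good_records is a made-up name, and it assumes the markers are found with the same findall pattern used above.

```python
import re

def count_good_records(text):
    """Return (number of complete HFR groups before the first problem, dot string).

    Hypothetical helper; assumes markers are single H/F/R characters
    at the start of a line.
    """
    # Prepend a newline so a marker on the first line is found too.
    markers = ''.join(re.findall(r'\n([HFR])\b', '\n' + text))
    dots = markers.replace('HFR', '.')
    # Leading dots are the good records seen before anything went wrong.
    good = len(dots) - len(dots.lstrip('.'))
    return good, dots

good, dots = count_good_records("H\nF\nR\nR\n")
print(good, dots)
if dots.strip('.'):
    print("sequence problem after record", good)
```

If the dot string is nothing but dots, the whole file was in HFR order and the count is simply the number of records.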
/c