Paul Melvin wrote:
Hi,
Thanks very much for all your suggestions; I am looking into those from
Hugo and Alan.
The file is not very big, only 700KB (~20000 lines), which I think should be
fine to load into memory?
I have two further questions though, please. The lines are like this:
<img width="13" height="15" alt="NEW"
src="/m/I/I/star.png" />
<strong><a href="/browse/post/5354361/">Revenge
(2011)</a></strong>
</td>
<td class="final">
<span title="Exact date/time: 05-01-2011 23:08"
class="ageVeryNew">5 days </span>
</td>
<td class="final">
<span title="Exact date/time: 18-01-2011 16:06"
class="ageVeryNew">65 minutes </span>
Etc., with a chunk (from one NEW to the next) being about 60 lines. I need to
extract info from these lines, e.g. /browse/post/5354361/ and Revenge (2011),
to pass back to the output. Is re the best option for getting all these
various bits, maybe with a generic function that I pass the search strings to?
And if I use Alan's split suggestion, I assume the last piece would be the
rest of the file. Would the next() option just let me search for the next
/browse/post/5354361/ etc. after the NEW (maybe putting this info into a
list)?
One way to handle "the rest of the file" is to add a marker at the end
of the data. So if you read the whole thing with readlines(), you can
append another "NEW" so that all matches are between one NEW and the next.
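A minimal sketch of that sentinel idea, assuming the lines are already in a
list (e.g. from readlines()). The function name split_chunks and the default
marker 'alt="NEW"' are only illustrative choices, not anything from your
script:

```python
def split_chunks(lines, marker='alt="NEW"'):
    """Group lines into chunks, each starting at a line containing marker."""
    # Append a sentinel so the last real chunk is closed like all the others.
    lines = list(lines) + [marker]
    chunks = []
    current = None  # stays None until the first marker line is seen
    for line in lines:
        if marker in line:
            if current is not None:
                chunks.append(current)
            current = [line]
        elif current is not None:
            current.append(line)
    # The chunk opened by the sentinel itself is deliberately never appended.
    return chunks
```

Anything before the first marker is dropped, and each returned chunk can then
be searched on its own for the link and title.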
Thanks again
paul
<snip>
If this file is valid html, or xml, then perhaps you should use one of
the html or xml parsing tools, rather than anything so esoteric as
regex. In any case, it now appears that NEW won't necessarily be
unique, so you might want to start with 'alt="NEW"' or something like
that. A key question becomes whether this data was automatically
generated, or whether it might have variations from one sample to the
next (for example, alt = "NEW" with different spacing, or ALT="NEW"),
and whether it's definitely valid html, or just close.
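As a rough illustration of the parser route, here is a sketch using the
standard library's html.parser (Python 3 naming; in Python 2 the module is
HTMLParser). The class name PostLinkParser and the "/browse/post/" prefix
test are assumptions based on the sample lines above, and I have only tried
it against that fragment, not the real file:

```python
from html.parser import HTMLParser


class PostLinkParser(HTMLParser):
    """Collect (href, link text) pairs for anchors under /browse/post/."""

    def __init__(self):
        super().__init__()
        self.posts = []    # finished (href, title) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased, so <A HREF=...> also works.
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith("/browse/post/"):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse the line break inside "Revenge\n(2011)" to one space.
            title = " ".join("".join(self._text).split())
            self.posts.append((self._href, title))
            self._href = None


sample = '''<strong><a href="/browse/post/5354361/">Revenge
(2011)</a></strong>
</td>
<td class="final">
<span title="Exact date/time: 05-01-2011 23:08"
class="ageVeryNew">5 days </span>'''

parser = PostLinkParser()
parser.feed(sample)
parser.close()
print(parser.posts)
```

Because the parser normalises tag and attribute case and ignores spacing
around "=", the variations mentioned above matter less than they would with
a hand-written pattern.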
DaveA
_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor