Thank you. I subclassed SGMLParser (from sgmllib), borrowing ideas from
DiveIntoPython's BaseHTMLProcessor.py. It appears to work:
###
from sgmllib import SGMLParser

class Scrape(SGMLParser):
    # tags to keep in the output; all other markup is dropped
    TAGS_TO_SCRAPE = ['p', 'br', 'b', 'i']

    def reset(self):
        self.pieces = []
        self.isScraping = 0
You might find these threads on comp.lang.python interesting:
http://tinyurl.com/5zmpn
http://tinyurl.com/6mxmb
Peter Kim wrote:
> Which method is best and most pythonic to scrape text data with
> minimal formatting?
> I'm trying to read a large html file and strip out most of the markup,
> but leaving the
> Which method is best and most pythonic to scrape text data with
> minimal formatting?
Use the HTMLParser module.
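
For what it's worth, here is a minimal sketch of how the HTMLParser approach could look. It is written against the Python 3 module html.parser, not the older HTMLParser module name, and it is only one way to do it. The class name KeepSimpleTags, the helper scrape() and the KEEP whitelist are made up for illustration (the whitelist just mirrors the TAGS_TO_SCRAPE list posted in this thread). Attributes on kept tags are dropped, and text inside script or style elements is not filtered out.

```python
from html import escape
from html.parser import HTMLParser


class KeepSimpleTags(HTMLParser):
    """Drop all markup except a small whitelist of formatting tags."""

    # Hypothetical whitelist, mirroring TAGS_TO_SCRAPE from the thread.
    KEEP = {"p", "br", "b", "i"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes are deliberately dropped.
        if tag in self.KEEP:
            self.pieces.append("<%s>" % tag)

    def handle_endtag(self, tag):
        # <br> is a void element, so no closing tag is emitted for it.
        if tag in self.KEEP and tag != "br":
            self.pieces.append("</%s>" % tag)

    def handle_data(self, data):
        # Character references were decoded by the parser; re-escape the
        # text so the output is still safe to treat as HTML.
        self.pieces.append(escape(data, quote=False))

    def output(self):
        return "".join(self.pieces)


def scrape(html):
    """Return html with only the whitelisted tags left in place."""
    parser = KeepSimpleTags()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.output()
```

Usage would be something like scrape(open("big.html").read()), where big.html stands in for whatever file is being cleaned. The parser is event driven, so it copes with nested and unbalanced markup that tends to trip up a regex-only approach.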
> I want to change the above to:
>
> Trigger: Debate on budget in Feb-Mar. New moves to
> cut medical costs by better technology.
>
> Since I wanted some practice in regex, I starte
Which method is best and most pythonic to scrape text data with
minimal formatting?
I'm trying to read a large html file and strip out most of the markup,
but leaving the simple formatting like , , and . For example:
Trigger:
Debate on budget in Feb-Mar. New moves to cut medical costs by better