Re: [Tutor] best way to scrape html

2005-02-19 Thread Peter Kim
Thank you, I subclassed SGMLParser.py, borrowing ideas from DiveIntoPython's BaseHTMLProcessor.py. It appears to work:

###
class Scrape(SGMLParser):
    TAGS_TO_SCRAPE = ['p', 'br', 'b', 'i']

    def reset(self):
        self.pieces = []
        self.isScraping = 0

Re: [Tutor] best way to scrape html

2005-02-16 Thread Kent Johnson
You might find these threads on comp.lang.python interesting:
http://tinyurl.com/5zmpn
http://tinyurl.com/6mxmb

Peter Kim wrote:
> Which method is best and most pythonic to scrape text data with minimal formatting? I'm trying to read a large html file and strip out most of the markup, but leaving the

Re: [Tutor] best way to scrape html

2005-02-15 Thread Alan Gauld
> Which method is best and most pythonic to scrape text data with
> minimal formatting?

Use the HTMLParser module.

> I want to change the above to:
>
> Trigger: Debate on budget in Feb-Mar. New moves to
> cut medical costs by better technology.
>
> Since I wanted some practice in regex, I starte
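
For readers following the HTMLParser suggestion today, here is a minimal sketch of the idea in Python 3, where the parser lives in `html.parser` (the `sgmllib` module used elsewhere in this thread was removed in Python 3). The class name `KeepSimpleTags`, the `KEEP` whitelist, and the helper `strip_markup` are hypothetical names chosen for illustration; they are not taken from any poster's code. The sketch keeps a handful of simple formatting tags, drops every other tag along with all attributes, and passes the text through.

```python
from html import escape
from html.parser import HTMLParser


class KeepSimpleTags(HTMLParser):
    """Drop all markup except a small whitelist of formatting tags."""

    # Hypothetical whitelist; adjust to whichever tags should survive.
    KEEP = {"p", "br", "b", "i"}

    def __init__(self):
        # convert_charrefs=True hands entities to handle_data already decoded.
        super().__init__(convert_charrefs=True)
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        # Attributes are deliberately discarded; only the bare tag is kept.
        if tag in self.KEEP:
            self.pieces.append(f"<{tag}>")

    def handle_endtag(self, tag):
        # <br> is a void element, so it never gets a closing tag.
        if tag in self.KEEP and tag != "br":
            self.pieces.append(f"</{tag}>")

    def handle_data(self, data):
        # Re-escape &, < and > so the output is still valid HTML text.
        # Note: text inside <script> or <style> is also passed through here.
        self.pieces.append(escape(data, quote=False))

    def output(self):
        return "".join(self.pieces)


def strip_markup(html_text):
    """Return html_text with only the whitelisted tags left in place."""
    parser = KeepSimpleTags()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return parser.output()
```

Calling `strip_markup('<font color="red"><b>Trigger:</b></font> Debate on budget')` should leave the `<b>` pair and the text while dropping the `<font>` wrapper. A parser-based approach like this tends to cope with nested or malformed markup better than a hand-written regex, which is presumably why it was recommended.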

[Tutor] best way to scrape html

2005-02-15 Thread Peter Kim
Which method is best and most pythonic to scrape text data with minimal formatting? I'm trying to read a large HTML file and strip out most of the markup, but leaving the simple formatting like <p>, <b>, and <i>. For example: Trigger: Debate on budget in Feb-Mar. New moves to cut medical costs by better