[Tutor] best way to scrape html

2005-02-15 Thread Peter Kim
thinking of using sgmllib.py (as in the Dive Into Python example). Or is this where I should be using libxml2.py? As you can tell, this is my first foray into both parsing and regex, so advice on best practice would be very helpful. Thanks, Peter Kim

[Tutor] help with HTMLParseError

2005-02-17 Thread Peter Kim
I'm using HTMLParser.py to parse XHTML, and an invalid tag is throwing an exception. How do I handle this? 1. Below is the faulty markup. Notice the missing >. Both Firefox and IE6 correct it automatically, but HTMLParser is less forgiving. My code has to be able to treat this gracefully because I don
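For anyone reading this thread on a current Python 3: HTMLParseError was removed in Python 3.5, and html.parser.HTMLParser now tries to carry on past malformed tags instead of raising. A minimal sketch of what that looks like follows. The markup string is a made-up stand-in for the faulty markup (the original is not shown here), and the TextCollector class and collect_text() helper are hypothetical names, not anything from the post.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects text nodes, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.pieces.append(data)


def collect_text(markup):
    parser = TextCollector()
    parser.feed(markup)   # does not raise on the malformed closing tag below
    parser.close()
    return parser.pieces


# Made-up example markup: the closing </b is missing its '>'.
broken = "<p>one</p><b>two</b</p>"
pieces = collect_text(broken)
```

How the parser recovers after the broken tag is an implementation detail that can differ between Python versions, so it is safer to rely only on the text before the damage and to check the result rather than assume a particular repair.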

Re: [Tutor] best way to scrape html

2005-02-19 Thread Peter Kim
Thank you, I subclassed SGMLParser.py, borrowing ideas from DiveIntoPython's BaseHTMLProcessor.py. It appears to work:

###
class Scrape(SGMLParser):
    TAGS_TO_SCRAPE = ['p', 'br', 'b', 'i']

    def reset(self):
        self.pieces = []
        self.isScraping = 0
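For readers on Python 3, where sgmllib is no longer in the standard library, roughly the same approach can be sketched with html.parser.HTMLParser. This is only an illustration and not Peter's code: the class name, TAGS_TO_SCRAPE, and pieces come from the post, while the handler bodies, the depth counter (used here in place of isScraping), and the scrape() helper are assumptions.

```python
from html.parser import HTMLParser


class Scrape(HTMLParser):
    TAGS_TO_SCRAPE = ['p', 'br', 'b', 'i']

    def reset(self):
        # HTMLParser.__init__ calls reset(), so state is set up here.
        super().reset()
        self.pieces = []
        self.depth = 0  # assumption: counts how many scraped tags are open

    def _is_container(self, tag):
        # <br> has no content, so it never opens a scraping region.
        return tag in self.TAGS_TO_SCRAPE and tag != 'br'

    def handle_starttag(self, tag, attrs):
        if self._is_container(tag):
            self.depth += 1

    def handle_endtag(self, tag):
        if self._is_container(tag) and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while inside at least one scraped tag.
        if self.depth > 0:
            self.pieces.append(data)


def scrape(html):
    """Hypothetical convenience wrapper: return the scraped text."""
    parser = Scrape()
    parser.feed(html)
    parser.close()
    return ''.join(parser.pieces)


result = scrape('<div>skip<p>keep <b>this</b></p>drop</div>')
```

A counter rather than a 0/1 flag lets nested tags such as <b> inside <p> close without ending the scrape early.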