thinking of using sgmllib.py (as in the Dive into Python
example). Is this where I should be using libxml2.py? As you can
tell, this is my first foray into both parsing and regex, so advice on
best practice would be very helpful.
Thanks,
Peter Kim
I'm using HTMLParser.py to parse XHTML, and an invalid tag is throwing an
exception. How do I handle this?
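One way to keep going after a parse error is to wrap the feed() call and keep whatever text was collected before the failure. The following is only a minimal sketch, not the poster's code: the names TextCollector and parse_leniently are made up for illustration, and it uses the Python 3 module name html.parser (in Python 2 the module is HTMLParser, and the exception raised on bad markup is HTMLParser.HTMLParseError). Recent Python 3 versions tolerate most malformed markup without raising, so there the except branch is mainly a safety net.

```python
from html.parser import HTMLParser  # Python 3 name; in Python 2 the module is HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects the text data seen while parsing."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.pieces = []

    def handle_data(self, data):
        # Called by the parser for each run of text between tags.
        self.pieces.append(data)


def parse_leniently(markup):
    """Feed markup to the parser; on an error, keep what was collected so far.

    Returns a (text, error) pair, where error is None if parsing succeeded.
    """
    parser = TextCollector()
    error = None
    try:
        parser.feed(markup)
        parser.close()
    except Exception as exc:  # Python 2 would raise HTMLParser.HTMLParseError here
        error = exc
    return "".join(parser.pieces), error


text, error = parse_leniently("<p>Hello <b>world</b></p>")
```

Catching the error this way only salvages the text seen before the bad tag; it does not repair the markup the way a browser does.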
1. Below is the faulty markup. Notice the missing >. Both Firefox
and IE6 correct automatically but HTMLParser is less forgiving. My
code has to be able to treat this gracefully because I don
Thank you. I subclassed SGMLParser (from sgmllib.py), borrowing ideas from
Dive Into Python's BaseHTMLProcessor.py. It appears to work:
###
class Scrape(SGMLParser):
    TAGS_TO_SCRAPE = ['p', 'br', 'b', 'i']

    def reset(self):
        self.pieces = []
        self.isScraping = 0