Thank you. I subclassed sgmllib.SGMLParser, borrowing ideas from Dive Into Python's BaseHTMLProcessor.py. It appears to work:
###
from sgmllib import SGMLParser
import htmlentitydefs

class Scrape(SGMLParser):
    TAGS_TO_SCRAPE = ['p', 'br', 'b', 'i']

    def reset(self):
        self.pieces = []
        self.isScraping = 0
        SGMLParser.reset(self)

    def unknown_starttag(self, tag, attrs):
        """Called for each start tag; attrs is a list of (attr, value) tuples,
        e.g. for <pre class='top'>, tag='pre', attrs=[('class', 'top')]"""
        for i, v in attrs:
            if ('name' == i) and ('anchor' in v):     # name='anchor*'
                self.isScraping += 1                  # begin scraping
                break
            elif ('class' == i) and ('txtend' == v):  # class='txtend'
                self.isScraping -= 1                  # stop scraping
                break
        if self.isScraping and tag in self.TAGS_TO_SCRAPE:
            self.pieces.append("<%(tag)s>" % locals())

    def unknown_endtag(self, tag):
        """Called for each end tag, e.g. for </pre>, tag='pre'"""
        if self.isScraping and tag in self.TAGS_TO_SCRAPE:
            self.pieces.append("</%(tag)s>" % locals())

    def handle_charref(self, ref):
        """Called for each character reference, e.g. for '&#160;', ref='160'"""
        if self.isScraping:
            self.pieces.append("&#%(ref)s;" % locals())

    def handle_entityref(self, ref):
        """Called for each entity reference, e.g. for '&copy;', ref='copy'"""
        if self.isScraping:
            self.pieces.append("&%(ref)s" % locals())
            if htmlentitydefs.entitydefs.has_key(ref):
                self.pieces.append(";")  # standard HTML entities end with ';'

    def handle_data(self, text):
        """Called for each block of plain text, i.e. outside of any tag"""
        if self.isScraping:
            self.pieces.append(text)

    def output(self):
        return "".join(self.pieces)  # return processed HTML as a single string
###

On Wed, 16 Feb 2005 06:48:18 -0000, Alan Gauld <[EMAIL PROTECTED]> wrote:
>
> > Which method is best and most pythonic to scrape text data with
> > minimal formatting?
>
> Use the HTMLParser module.
>
> > I want to change the above to:
> >
> > <p><b>Trigger:</b> Debate on budget in Feb-Mar. New moves to
> > cut medical costs by better technology.</p>
> >
> > Since I wanted some practice in regex, I started with something like
> > this:
>
> Using regex is usually the wrong way to parse html for anything
> beyond the trivial. The parser module helps deal with the
> complexities.
>
> > So I'm thinking of using sgmllib.py (as in the Dive into Python
> > example). Is this where I should be using libxml2.py? As you can
> > tell this is my first foray into both parsing and regex so advice in
> > terms of best practice would be very helpful.
>
> There is an html parser which is built on the sgml one.
> It's rather more specific to your task.
>
> Alan G.
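
For reference, a rough driver along these lines exercises the class. The sample markup is made up, but it follows the name='anchor*' / class='txtend' convention the handlers look for, and it assumes the Scrape class above is in the same module:

###
if __name__ == '__main__':
    # made-up sample: scraping starts at the name='anchor1' tag
    # and stops at the class='txtend' tag
    sample = """<html><body>
    <a name='anchor1'></a>
    <p><b>Trigger:</b> Debate on budget in Feb-Mar. New moves to
    cut medical costs by better technology.</p>
    <p class='txtend'>trailing text that is not scraped</p>
    </body></html>"""

    scraper = Scrape()
    scraper.feed(sample)    # SGMLParser tokenises the markup and calls the handlers
    scraper.close()         # flush any buffered data
    print scraper.output()  # scraped text with minimal <p>/<b>/<i>/<br> markup
###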