Thank you, I subclassed sgmllib's SGMLParser, borrowing ideas from
Dive Into Python's BaseHTMLProcessor.py. It appears to work:
###
from sgmllib import SGMLParser
import htmlentitydefs

class Scrape(SGMLParser):
    TAGS_TO_SCRAPE = ['p', 'br', 'b', 'i']

    def reset(self):
        self.pieces = []
        self.isScraping = 0
        SGMLParser.reset(self)

    def unknown_starttag(self, tag, attrs):
        """Called for each start tag; attrs is a list of (attr, value)
        tuples, e.g. for <pre class='top'>, tag='pre',
        attrs=[('class', 'top')]"""
        for i, v in attrs:
            if ('name' == i) and ('anchor' in v):        # name='anchor*'
                self.isScraping += 1                     # begin scraping
                break
            elif ('class' == i) and ('txtend' == v):     # class='txtend'
                self.isScraping -= 1                     # stop scraping
                break
        if self.isScraping and tag in self.TAGS_TO_SCRAPE:
            self.pieces.append("<%(tag)s>" % locals())

    def unknown_endtag(self, tag):
        """Called for each end tag, e.g. for </pre>, tag='pre'"""
        if self.isScraping and tag in self.TAGS_TO_SCRAPE:
            self.pieces.append("</%(tag)s>" % locals())

    def handle_charref(self, ref):
        """Called for each character reference, e.g. for '&#160;',
        ref='160'"""
        if self.isScraping:
            self.pieces.append("&#%(ref)s;" % locals())

    def handle_entityref(self, ref):
        """Called for each entity reference, e.g. for '&copy;',
        ref='copy'"""
        if self.isScraping:
            self.pieces.append("&%(ref)s" % locals())
            if htmlentitydefs.entitydefs.has_key(ref):
                self.pieces.append(";")  # standard HTML entities end with ';'

    def handle_data(self, text):
        """Called for each block of plain text, i.e. outside of any tag"""
        if self.isScraping:
            self.pieces.append(text)

    def output(self):
        return "".join(self.pieces)  # return processed HTML as a single string
###
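In case it's useful, this is roughly how I drive the class; the filename
below is just a placeholder for the sketch, not from my real script:

    # rough usage sketch -- 'page.html' is a placeholder filename
    scraper = Scrape()
    scraper.feed(open('page.html').read())
    scraper.close()
    print scraper.output()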
On Wed, 16 Feb 2005 06:48:18 -, Alan Gauld <[EMAIL PROTECTED]> wrote:
>
> > Which method is best and most pythonic to scrape text data with
> > minimal formatting?
>
> Use the HTMLParser module.
>
> > I want to change the above to:
> >
> > Trigger: Debate on budget in Feb-Mar. New moves to
> > cut medical costs by better technology.
> >
> > Since I wanted some practice in regex, I started with something like
> this:
>
> Using regex is usually the wrong way to parse html for anything
> beyond the trivial. The parser module helps deal with the
> complexities.
>
> > So I'm thinking of using sgmllib.py (as in the Dive into Python
> > example). Is this where I should be using libxml2.py? As you can
> > tell this is my first foray into both parsing and regex so advice in
> > terms of best practice would be very helpful.
>
> There is an html parser which is built on the sgml one.
> Its rather more specific to your task.
>
> Alan G.
>
>
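(For comparison, Alan's suggested HTMLParser module would look roughly like
the sketch below. I haven't actually switched to it, and the attribute
triggers are just carried over from my SGMLParser version above.)

    # Untested sketch of an HTMLParser-module equivalent (Python 2); the
    # name='anchor*' / class='txtend' triggers are assumptions copied from
    # the Scrape class above, not a tested drop-in replacement.
    from HTMLParser import HTMLParser

    class ScrapeHTML(HTMLParser):
        TAGS_TO_SCRAPE = ['p', 'br', 'b', 'i']

        def reset(self):
            HTMLParser.reset(self)
            self.pieces = []
            self.isScraping = 0

        def handle_starttag(self, tag, attrs):
            for name, value in attrs:
                if value is None:          # valueless attrs, e.g. <option selected>
                    continue
                if name == 'name' and 'anchor' in value:     # name='anchor*'
                    self.isScraping += 1
                    break
                elif name == 'class' and value == 'txtend':  # class='txtend'
                    self.isScraping -= 1
                    break
            if self.isScraping and tag in self.TAGS_TO_SCRAPE:
                self.pieces.append("<%s>" % tag)

        def handle_endtag(self, tag):
            if self.isScraping and tag in self.TAGS_TO_SCRAPE:
                self.pieces.append("</%s>" % tag)

        def handle_data(self, data):
            if self.isScraping:
                self.pieces.append(data)

        def output(self):
            return "".join(self.pieces)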