Thank you, I subclassed sgmllib's SGMLParser, borrowing ideas from
Dive Into Python's BaseHTMLProcessor.py.  It appears to work:

###
from sgmllib import SGMLParser
import htmlentitydefs

class Scrape(SGMLParser):
    TAGS_TO_SCRAPE = ['p', 'br', 'b', 'i']

    def reset(self):
        self.pieces = []
        self.isScraping = 0
        SGMLParser.reset(self)

    def unknown_starttag(self, tag, attrs):
        """Called for each start tag; attrs is a list of (attr, value) tuples,
        e.g. for <pre class='top'>, tag='pre', attrs=[('class', 'top')]"""
        for i, v in attrs:
            if ('name' == i) and ('anchor' in v):      # name='anchor*'
                self.isScraping += 1                   # begin scraping
                break
            elif ('class' == i) and ('txtend' == v):   # class='txtend'
                self.isScraping -= 1                   # stop scraping
                break
        if self.isScraping and tag in self.TAGS_TO_SCRAPE:
            self.pieces.append("<%(tag)s>" % locals())

    def unknown_endtag(self, tag):
        """Called for each end tag, e.g. for </pre>, tag='pre'"""
        if self.isScraping and tag in self.TAGS_TO_SCRAPE:
            self.pieces.append("</%(tag)s>" % locals())

    def handle_charref(self, ref):
        """Called for each character reference, e.g. for '&#160;', ref='160'"""
        if self.isScraping:
            self.pieces.append("&#%(ref)s;" % locals())

    def handle_entityref(self, ref):
        """Called for each entity reference, e.g. for '&copy;', ref='copy'"""
        # SGMLParser dispatches entity references to handle_entityref, and
        # they are written back out as '&name', not '&#name' (which is the
        # numeric character-reference form).
        if self.isScraping:
            self.pieces.append("&%(ref)s" % locals())
            if ref in htmlentitydefs.entitydefs:
                self.pieces.append(";")  # standard HTML entities end with ';'

    def handle_data(self, text):
        """Called for each block of plain text, i.e. outside of any tag"""
        if self.isScraping:
            self.pieces.append(text)

    def output(self):
        return "".join(self.pieces)  # return processed HTML as a single string
###
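For anyone reading this on Python 3, sgmllib was removed there; the same stateful start/stop-scraping idea ports almost line-for-line to the standard library's html.parser.HTMLParser.  A minimal sketch follows -- the Scrape3 name and the sample HTML are made up for illustration, everything else mirrors the class above:

```python
from html.parser import HTMLParser

class Scrape3(HTMLParser):
    """Port of the Scrape class above to html.parser (Python 3)."""
    TAGS_TO_SCRAPE = ['p', 'br', 'b', 'i']

    def reset(self):
        self.pieces = []
        self.isScraping = 0
        HTMLParser.reset(self)

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # value can be None for valueless attributes, so guard first
            if name == 'name' and value and 'anchor' in value:  # name='anchor*'
                self.isScraping += 1                            # begin scraping
                break
            elif name == 'class' and value == 'txtend':         # class='txtend'
                self.isScraping -= 1                            # stop scraping
                break
        if self.isScraping and tag in self.TAGS_TO_SCRAPE:
            self.pieces.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if self.isScraping and tag in self.TAGS_TO_SCRAPE:
            self.pieces.append("</%s>" % tag)

    def handle_data(self, data):
        # with the default convert_charrefs=True, char/entity refs
        # arrive here already decoded, so no separate hooks are needed
        if self.isScraping:
            self.pieces.append(data)

    def output(self):
        return "".join(self.pieces)

s = Scrape3()
s.feed("<a name='anchor1'></a><p><b>Trigger:</b> Debate on budget.</p>"
       "<div class='txtend'></div><p>ignored</p>")
print(s.output())  # -> <p><b>Trigger:</b> Debate on budget.</p>
```

Note one behavioral difference: because convert_charrefs decodes references into plain text, the scraped output contains literal characters rather than re-emitted `&#160;`-style references.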

On Wed, 16 Feb 2005 06:48:18 -0000, Alan Gauld <[EMAIL PROTECTED]> wrote:
> 
> > Which method is best and most pythonic to scrape text data with
> > minimal formatting?
> 
> Use the HTMLParser module.
> 
> > I want to change the above to:
> >
> > <p><b>Trigger:</b> Debate on budget in Feb-Mar.  New moves to
> > cut medical costs by better technology.</p>
> >
> > Since I wanted some practice in regex, I started with something like
> this:
> 
> Using regex is usually the wrong way to parse html for anything
> beyond the trivial. The parser module helps deal with the
> complexities.
> 
> > So I'm thinking of using sgmllib.py (as in the Dive into Python
> > example).  Is this where I should be using libxml2.py?  As you can
> > tell this is my first foray into both parsing and regex so advice in
> > terms of best practice would be very helpful.
> 
> There is an html parser which is built on the sgml one.
> It's rather more specific to your task.
> 
> Alan G.
> 
>
_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
