[Tutor] best way to scrape html

2005-02-15 Thread Peter Kim
Which method is best and most pythonic to scrape text data with
minimal formatting?

I'm trying to read a large html file and strip out most of the markup,
but leave the simple formatting tags like <p>, <b>, and <i>.  For example:

Trigger: 
Debate on budget in Feb-Mar. New moves to cut medical costs by better
technology.

I want to change the above to:

Trigger: Debate on budget in Feb-Mar.  New moves to
cut medical costs by better technology.

Since I wanted some practice in regex, I started with something like this:

pattern = r"(?:<)(.+?)(?: ?.*?>)(.*?)(</\1>)"
result = re.compile(pattern, re.IGNORECASE | re.VERBOSE |
re.DOTALL).findall(html)

But it's getting messy real fast and somehow the non-greedy parts
don't seem to work as intended.  Also I realized that the html file is
going to be 10,000+ lines, so I wonder if regex can be used for large
strings.
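A quick illustration of where the non-greedy regex route "gets messy real fast" (a modern Python 3 sketch; the pattern and sample strings here are illustrative, not from the original post):

```python
import re

# A naive non-greedy tag/content matcher looks fine on trivial markup...
pat = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)
simple = "<b>bold</b> and <i>italic</i>"
print(pat.findall(simple))      # [('b', 'bold'), ('i', 'italic')]

# ...but it silently misses any tag that carries attributes, so real-world
# HTML forces ever-growing special cases into the pattern:
attributed = '<b class="x">bold</b>'
print(pat.findall(attributed))  # []
```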

So I'm thinking of using sgmllib.py (as in the Dive into Python
example).  Is this where I should be using libxml2.py?  As you can
tell this is my first foray into both parsing and regex so advice in
terms of best practice would be very helpful.

Thanks,
Peter Kim
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


[Tutor] help with HTMLParseError

2005-02-17 Thread Peter Kim
I'm using HTMLParser.py to parse XHTML, and an invalid tag is throwing
an exception.  How do I handle this?

1. Below is the faulty markup.  Notice the missing >.  Both Firefox
and IE6 correct automatically but HTMLParser is less forgiving.  My
code has to handle this gracefully because I don't have control over
the XHTML source.

###/

/###

2. Below is the current code that raises a self.error("malformed start
tag") at line 301 in HTMLParser.py due to the invalid markup.

###/
from HTMLParser import HTMLParser

def parseHTML(htmlsource):
    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            print "<%s>" % tag,
        def handle_endtag(self, tag):
            print "</%s>" % tag,
    MyParser = MyHTMLParser()
    MyParser.feed(htmlsource)
    MyParser.close()
    return MyParser.output()

if __name__ == "__main__":
    htmlsource = r""
    result = parseHTML(htmlsource)
/###

3. I think the ideal solution is to be able to do something like
below, but I don't know how.

###/
class MyHTMLParseError(HTMLParseError):
    if self.message == "malformed start tag":
        text.append(">")
    else:
        raise
/###
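For reference, here is how lenient parsing looks on today's Python 3, where html.parser dropped strict mode and HTMLParseError entirely, so malformed start tags are tolerated instead of raising (the 2005 HTMLParser module behaved differently; the class and sample markup below are my own illustration):

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collects start-tag names; malformed input is tolerated, not fatal."""
    def __init__(self):
        super().__init__()
        self.tags = []
    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

collector = TagCollector()
collector.feed('<p>intro</p>')    # well-formed fragment
collector.feed('<b class="oops')  # malformed: the closing ">" is missing
collector.close()                 # no exception is raised
print(collector.tags)             # ['p'] -- the broken tag never parses
```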

Thanks in advance for the help!


Re: [Tutor] best way to scrape html

2005-02-19 Thread Peter Kim
Thank you.  I subclassed SGMLParser, borrowing ideas from
Dive Into Python's BaseHTMLProcessor.  It appears to work:

###
class Scrape(SGMLParser):
    TAGS_TO_SCRAPE = ['p', 'br', 'b', 'i']
    def reset(self):
        self.pieces = []
        self.isScraping = 0
        SGMLParser.reset(self)
    def unknown_starttag(self, tag, attrs):
        """Called for each start tag; attrs is a list of (attr, value) tuples,
        e.g. for <pre class='top'>, tag='pre', attrs=[('class', 'top')]"""
        for i, v in attrs:
            if ('name' == i) and ('anchor' in v):       # name='anchor*'
                self.isScraping += 1                    # begin scraping
                break
            elif ('class' == i) and ('txtend' == v):    # class='txtend'
                self.isScraping -= 1                    # stop scraping
                break
        if self.isScraping and tag in self.TAGS_TO_SCRAPE:
            self.pieces.append("<%(tag)s>" % locals())
    def unknown_endtag(self, tag):
        """Called for each end tag, e.g. for </pre>, tag='pre'"""
        if self.isScraping and tag in self.TAGS_TO_SCRAPE:
            self.pieces.append("</%(tag)s>" % locals())
    def handle_charref(self, ref):
        """Called for each character reference, e.g. for '&#160;', ref='160'"""
        if self.isScraping:
            self.pieces.append("&#%(ref)s;" % locals())
    def handle_entityref(self, ref):
        """Called for each entity reference, e.g. for '&copy;', ref='copy'"""
        if self.isScraping:
            self.pieces.append("&%(ref)s" % locals())
            if htmlentitydefs.entitydefs.has_key(ref):
                self.pieces.append(";")  # standard HTML entities end with ';'
    def handle_data(self, text):
        """Called for each block of plain text, i.e. outside of any tag"""
        if self.isScraping:
            self.pieces.append(text)
    def output(self):
        return "".join(self.pieces)  # return processed HTML as a single string
###
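For anyone reading this on Python 3, where sgmllib was removed, roughly the same scrape-between-markers idea can be sketched with html.parser (the class and sample markup below are my own illustration, not the poster's code):

```python
from html.parser import HTMLParser

class Scrape3(HTMLParser):
    """Keep text and simple tags between a name='anchor*' start marker
    and a class='txtend' stop marker, discarding everything else."""
    TAGS_TO_KEEP = {'p', 'br', 'b', 'i'}

    def __init__(self):
        super().__init__()
        self.pieces = []
        self.scraping = 0

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == 'name' and value and 'anchor' in value:
                self.scraping += 1      # begin scraping
                break
            elif name == 'class' and value == 'txtend':
                self.scraping -= 1      # stop scraping
                break
        if self.scraping and tag in self.TAGS_TO_KEEP:
            self.pieces.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if self.scraping and tag in self.TAGS_TO_KEEP:
            self.pieces.append("</%s>" % tag)

    def handle_data(self, data):
        if self.scraping:
            self.pieces.append(data)

    def output(self):
        return "".join(self.pieces)

s = Scrape3()
s.feed('<a name="anchor1"><b>Trigger:</b> Debate on budget.'
       '<div class="txtend">ignored</div>')
print(s.output())   # <b>Trigger:</b> Debate on budget.
```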

On Wed, 16 Feb 2005 06:48:18 -, Alan Gauld <[EMAIL PROTECTED]> wrote:
> 
> > Which method is best and most pythonic to scrape text data with
> > minimal formatting?
> 
> Use the HTMLParser module.
> 
> > I want to change the above to:
> >
> > Trigger: Debate on budget in Feb-Mar.  New moves to
> > cut medical costs by better technology.
> >
> > Since I wanted some practice in regex, I started with something like
> this:
> 
> Using regex is usually the wrong way to parse html for anything
> beyond the trivial. The parser module helps deal with the
> complexities.
> 
> > So I'm thinking of using sgmllib.py (as in the Dive into Python
> > example).  Is this where I should be using libxml2.py?  As you can
> > tell this is my first foray into both parsing and regex so advice in
> > terms of best practice would be very helpful.
> 
> There is an html parser which is built on the sgml one.
> It's rather more specific to your task.
> 
> Alan G.
> 
>