Re: [Tutor] best way to scrape html

Alan Gauld Tue, 15 Feb 2005 22:53:14 -0800

> Which method is best and most pythonic to scrape text data with
> minimal formatting?


Use the HTMLParser module.

> I want to change the above to:
>
> <p><b>Trigger:</b> Debate on budget in Feb-Mar.  New moves to
> cut medical costs by better technology.</p>
>
> Since I wanted some practice in regex, I started with something like
> this:

Using regex is usually the wrong way to parse HTML for anything
beyond the trivial. Nested tags, attributes and inconsistent
whitespace quickly defeat a simple pattern. The parser module
helps deal with those complexities.
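
As a rough illustration of the parser approach, here is a minimal sketch.
It assumes a current Python, where the HTMLParser module lives at
html.parser. The class name, the scrape() helper, the choice to keep only
<p> and <b>, and the sample input are all made up for the example, since
the original markup is not shown in this thread.

```python
from html import escape
from html.parser import HTMLParser


class MinimalFormatScraper(HTMLParser):
    """Keep text plus a small whitelist of tags; drop all other markup."""

    KEEP = {"p", "b"}  # assumption: the only tags worth preserving

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Whitelisted tags are re-emitted without their attributes.
        if tag in self.KEEP:
            self.parts.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in self.KEEP:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Character references arrive decoded, so re-escape the text.
        self.parts.append(escape(data, quote=False))

    def result(self):
        # Collapse runs of whitespace (including newlines) to one space.
        return " ".join("".join(self.parts).split())


def scrape(html):
    parser = MinimalFormatScraper()
    parser.feed(html)
    parser.close()
    return parser.result()


# Hypothetical input, invented for this sketch.
sample = ('<p class="x"><font size="2"><b>Trigger:</b> Debate on\n'
          ' budget in Feb-Mar.</font></p>')
print(scrape(sample))
# prints: <p><b>Trigger:</b> Debate on budget in Feb-Mar.</p>
```

The parser does the tokenizing, so the handlers only decide what to keep.
Adding another tag to KEEP is a one-line change, which is much harder to
get right with a regex.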

> So I'm thinking of using sgmllib.py (as in the Dive into Python
> example).  Is this where I should be using libxml2.py?  As you can
> tell this is my first foray into both parsing and regex so advice in
> terms of best practice would be very helpful.

There is an HTML parser which is built on the SGML one.
It's rather more specific to your task.

Alan G.

_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
