"Lawrence D'Oliveiro" <[EMAIL PROTECTED]> writes:
> I've been using HTMLParser to scrape Web sites. The trouble with this
> is, there's a lot of malformed HTML out there. Real browsers have to be
> written to cope gracefully with this, but HTMLParser does not. Not only
> does it raise an except
[Richie]
> But Tidy fails on huge numbers of real-world HTML pages. [...]
> Is there a Python HTML tidier which will do as good a job as a browser?
[Walter]
> You can also use the HTML parser from libxml2
[Paul]
> libxml2 will attempt to parse HTML if asked to [...] See how it fixes
> up the mi
In article <[EMAIL PROTECTED]>,
Rene Pijlman <[EMAIL PROTECTED]> wrote:
>2. Use something more forgiving, like BeautifulSoup.
>http://www.crummy.com/software/BeautifulSoup/
That sounds like what I'm after!
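
[Editorial sketch, not from any poster in this thread.] For readers on a
current Python 3, the forgiving approach can also be tried with the standard
library alone. This assumes Python 3.5 or later, where html.parser.HTMLParser
no longer raises on malformed markup. It is neither BeautifulSoup nor
anything proposed above, and the names LinkCollector, extract_links and
BROKEN are hypothetical, made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, tolerating sloppy markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag the parser recognises;
        # attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(markup):
    """Return every href found in markup, in document order."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()  # flush anything still buffered
    return parser.links


# Hypothetical malformed page: unclosed <b> and <a>, an unquoted
# attribute value, and a misspelled closing tag.
BROKEN = (
    '<html><body><p>Hello <b>world'
    '<a href="http://example.com/one">one</p>'
    '<a href=two.html>two</boddy>'
)

if __name__ == "__main__":
    for link in extract_links(BROKEN):
        print(link)
```

This only extracts data from the tag stream. It does not repair the document
tree the way a browser, libxml2 or BeautifulSoup would, so it is a fit for
simple scraping rather than for tidying.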
Richie Hindle wrote:
>
> But Tidy fails on huge numbers of real-world HTML pages. Simple things like
> misspelled tags make it fail:
>
> >>> from mx.Tidy import tidy
> >>> results = tidy("Hello world!")
[Various error messages]
> Is there a Python HTML tidier which will do as good a job as a bro
Rene Pijlman wrote:
> Lawrence D'Oliveiro:
>> I've been using HTMLParser to scrape Web sites. The trouble with this
>> is, there's a lot of malformed HTML out there. Real browsers have to be
>> written to cope gracefully with this, but HTMLParser does not.
>
> There are two solutions to this:
>
[Daniel]
> You could try HTMLTidy (http://www.egenix.com/files/python/mxTidy.html)
> as a first step to get well formed HTML.
But Tidy fails on huge numbers of real-world HTML pages. Simple things like
misspelled tags make it fail:
>>> from mx.Tidy import tidy
>>> results = tidy("Hello world!"
Lawrence D'Oliveiro wrote:
> I've been using HTMLParser to scrape Web sites. The trouble with this
> is, there's a lot of malformed HTML out there. Real browsers have to be
> written to cope gracefully with this, but HTMLParser does not. Not only
> does it raise an exception, but the parser obje
Lawrence D'Oliveiro:
>I've been using HTMLParser to scrape Web sites. The trouble with this
>is, there's a lot of malformed HTML out there. Real browsers have to be
>written to cope gracefully with this, but HTMLParser does not.
There are two solutions to this:
1. Tidy the source before parsin