Re: HTMLParser fragility

2006-04-10 Thread John J. Lee
"Lawrence D'Oliveiro" <[EMAIL PROTECTED]> writes:
> I've been using HTMLParser to scrape Web sites. The trouble with this
> is, there's a lot of malformed HTML out there. Real browsers have to be
> written to cope gracefully with this, but HTMLParser does not. Not only
> does it raise an except

Re: HTMLParser fragility

2006-04-07 Thread Richie Hindle
[Richie]
> But Tidy fails on huge numbers of real-world HTML pages. [...]
> Is there a Python HTML tidier which will do as good a job as a browser?

[Walter]
> You can also use the HTML parser from libxml2

[Paul]
> libxml2 will attempt to parse HTML if asked to [...] See how it fixes
> up the mi

Re: HTMLParser fragility

2006-04-06 Thread Lawrence D'Oliveiro
In article <[EMAIL PROTECTED]>, Rene Pijlman <[EMAIL PROTECTED]> wrote: >2. Use something more forgiving, like BeautifulSoup. >http://www.crummy.com/software/BeautifulSoup/ That sounds like what I'm after!
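
A minimal sketch of the "forgiving parser" idea discussed in this thread, under stated assumptions: it uses Python 3's standard-library html.parser (which is more lenient than the 2006-era HTMLParser module the posters describe), not BeautifulSoup itself. The names LinkCollector and extract_links and the sample markup are hypothetical, chosen only for illustration. It creates a fresh parser object for every document and keeps whatever was collected if parsing fails, so one bad page cannot leave a shared parser in an unusable state.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper class)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return the links found in html_text, keeping partial results on error."""
    parser = LinkCollector()  # fresh parser per document, never reused
    try:
        parser.feed(html_text)
        parser.close()
    except Exception:
        # Assumption: links gathered before a failure are still worth returning.
        pass
    return parser.links


# Deliberately malformed input: unclosed tags, an unquoted attribute,
# and a misspelled closing tag.
broken = '<html><body><p>Unclosed <a href="/one">first<b><a href=two.html>second</p></boddy>'
print(extract_links(broken))
```

For pages that are badly broken, a dedicated library such as BeautifulSoup, as suggested above, is likely to recover more structure than this sketch does.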

Re: HTMLParser fragility

2006-04-06 Thread Paul Boddie
Richie Hindle wrote:
>
> But Tidy fails on huge numbers of real-world HTML pages. Simple things like
> misspelled tags make it fail:
>
> >>> from mx.Tidy import tidy
> >>> results = tidy("Hello world!")

[Various error messages]

> Is there a Python HTML tidier which will do as good a job as a bro

Re: HTMLParser fragility

2006-04-06 Thread Walter Dörwald
Rene Pijlman wrote:
> Lawrence D'Oliveiro:
>> I've been using HTMLParser to scrape Web sites. The trouble with this
>> is, there's a lot of malformed HTML out there. Real browsers have to be
>> written to cope gracefully with this, but HTMLParser does not.
>
> There are two solutions to this:
>

Re: HTMLParser fragility

2006-04-05 Thread Richie Hindle
[Daniel]
> You could try HTMLTidy (http://www.egenix.com/files/python/mxTidy.html)
> as a first step to get well formed HTML.

But Tidy fails on huge numbers of real-world HTML pages. Simple things like misspelled tags make it fail:

>>> from mx.Tidy import tidy
>>> results = tidy("Hello world!"

Re: HTMLParser fragility

2006-04-05 Thread Daniel Dittmar
Lawrence D'Oliveiro wrote:
> I've been using HTMLParser to scrape Web sites. The trouble with this
> is, there's a lot of malformed HTML out there. Real browsers have to be
> written to cope gracefully with this, but HTMLParser does not. Not only
> does it raise an exception, but the parser obje

Re: HTMLParser fragility

2006-04-05 Thread Rene Pijlman
Lawrence D'Oliveiro:
>I've been using HTMLParser to scrape Web sites. The trouble with this
>is, there's a lot of malformed HTML out there. Real browsers have to be
>written to cope gracefully with this, but HTMLParser does not.

There are two solutions to this:

1. Tidy the source before parsin