" found in attribute value

Duncan Booth Wed, 27 Dec 2006 01:31:00 -0800

John Nagle <[EMAIL PROTECTED]> wrote:

> And this came out, via prettify:
> 
><addresssnippet siteurl="http%3A//apartmentsapart.com" 
> url="http%3A//www.apartmentsapart.com/Europe/Spain/Madrid/FAQ">
>      <param name="movie"
>      value="/images/offersBanners/sw04.swf?binfot=We offer 
> fantastic rates for selected weeks or days!!&amp;blinkt=Click here 
> &gt;&gt;&gt;&amp;linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408">
> >>&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />
></param>
> 
> BeautifulSoup seems to have become confused by the ">>>" within
> a quoted attribute value.  It first parsed it right, but then stuck
> in an extra, totally bogus line.  Note the entity "&linkurl;", which
> appears nowhere in the original.  It looks like code to handle a
> missing quote mark did the wrong thing.


I don't think I would quibble with what BeautifulSoup extracted from that 
mess. The input isn't valid HTML, so any output has to be a guess at what 
was meant. A lot of HTML parsing code would assume that a quote was missing 
and that the tag was terminated by the first '>'. IE and Firefox seem to 
assume that the '>' is allowed inside the attribute. BeautifulSoup seems to 
have given you the best of both worlds: the attribute is parsed to the 
closing quote, but the tag itself ends at the first '>'.
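
To make the two policies concrete, here is a minimal sketch in plain Python. 
It is not BeautifulSoup's code; the helper names and the shortened sample 
tag are made up for illustration, assuming the original markup had a literal 
">>>" inside the quoted value.

```python
def naive_tag_end(html, start=0):
    """Policy 1: the tag ends at the first '>' found, quotes are ignored."""
    return html.find(">", start)


def quote_aware_tag_end(html, start=0):
    """Policy 2: a '>' inside a quoted attribute value does not end the tag."""
    quote = None  # the quote character we are currently inside, if any
    for i in range(start, len(html)):
        ch = html[i]
        if quote:
            if ch == quote:
                quote = None  # closing quote reached
        elif ch in "\"'":
            quote = ch        # opening quote of an attribute value
        elif ch == ">":
            return i          # unquoted '>' terminates the tag
    return -1


# Hypothetical, simplified stand-in for the tag in the original page.
tag = '<param name="movie" value="a.swf?blinkt=Click here >>>&linkurl=/Offer/2408">'

print(tag[:naive_tag_end(tag) + 1])        # cut off in the middle of the value
print(tag[:quote_aware_tag_end(tag) + 1])  # the whole tag
```

The first policy leaves the text after the first '>' behind as stray 
content, which is the kind of leftover that shows up as the extra line in 
the prettified output.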

As for inserting a semicolon after linkurl, I think you'll find it is just 
being nice and cleaning up an unterminated entity. Browsers (or at least 
IE) will often accept entities without the terminating semicolon, so that's 
a common problem in badly formed HTML that BeautifulSoup can fix.
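
A rough sketch of that kind of cleanup, using only the standard library. 
The terminate_entities helper and its regular expression are my own 
assumption about what such a fix-up could look like, not BeautifulSoup's 
actual implementation. html.unescape is the stdlib function, which, like 
browsers, accepts a few well-known entities without the semicolon.

```python
import html
import re

# Anything that looks like "&name" or "&#123" and is not followed by ';'
# gets a terminating semicolon appended (hypothetical cleanup rule).
_UNTERMINATED = re.compile(r"&(#?\w+)(?![\w;])")


def terminate_entities(text):
    """Append ';' to entity-like references that lack one."""
    return _UNTERMINATED.sub(r"&\1;", text)


# A stray text fragment like the one left over after the early tag end.
print(terminate_entities(">>&linkurl=/Offer/2408"))

# Already terminated entities are left alone.
print(terminate_entities("rates!!&amp;blinkt=Click"))

# Browser-style leniency: known entities decode even without the ';'.
print(html.unescape("&gt&gt&gt"))
```

Note that the rule cannot tell a real entity from a query-string parameter 
such as "&linkurl=", which is how "&linkurl;" can end up in the output even 
though it never appeared in the input.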

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: BeautifulSoup bug when ">>>" found in attribute value
