John Nagle <[EMAIL PROTECTED]> wrote:

> And this came out, via prettify:
>
><addresssnippet siteurl="http%3A//apartmentsapart.com"
> url="http%3A//www.apartmentsapart.com/Europe/Spain/Madrid/FAQ">
> <param name="movie"
> value="/images/offersBanners/sw04.swf?binfot=We offer
> fantastic rates for selected weeks or days!!&blinkt=Click here
> >>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408">
> >>&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />
></param>
>
> BeautifulSoup seems to have become confused by the ">>>" within
> a quoted attribute value. It first parsed it right, but then stuck
> in an extra, totally bogus line. Note the entity "&linkurl;", which
> appears nowhere in the original. It looks like code to handle a
> missing quote mark did the wrong thing.
I don't think I would quibble with what BeautifulSoup extracted from that mess. The input isn't valid HTML, so any output has to be a guess at what was meant.

A lot of HTML parsing code would assume that a quote was missing and that the tag was terminated by the first '>'. IE and Firefox seem to assume that the '>' is allowed inside the attribute. BeautifulSoup seems to have given you the best of both worlds: the attribute is parsed to the closing quote, but the tag itself ends at the first '>'.

As for inserting a semicolon after linkurl, I think you'll find it is just being nice and cleaning up an unterminated entity. Browsers (or at least IE) will often accept entities without the terminating semicolon, so that's a common problem in badly formed HTML that BeautifulSoup can fix.
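To make the two guesses concrete, here is a rough sketch in plain Python. It is not BeautifulSoup's actual code, and the function names and the shortened sample string are made up for illustration. It shows the two ways of deciding where a tag ends, plus a naive cleanup that appends the missing semicolon to anything that looks like an unterminated entity.

```python
import re


def tag_end_first_gt(html: str, start: int = 0) -> int:
    """Guess 1: the tag ends at the first '>' after start, quoted or not."""
    return html.find(">", start)


def tag_end_quote_aware(html: str, start: int = 0) -> int:
    """Guess 2: a '>' inside a quoted attribute value does not end the tag."""
    quote = None
    for i in range(start, len(html)):
        ch = html[i]
        if quote:
            if ch == quote:  # closing quote found, leave the attribute value
                quote = None
        elif ch in "\"'":
            quote = ch  # opening quote, remember which kind
        elif ch == ">":
            return i
    return -1


# Hypothetical pattern: '&' plus a name that is not followed by ';'.
# It does not check whether the name is a real entity.
_UNTERMINATED = re.compile(r"&([A-Za-z][A-Za-z0-9]*)(?![A-Za-z0-9;])")


def terminate_entities(text: str) -> str:
    """Append ';' to every '&name' that lacks one."""
    return _UNTERMINATED.sub(r"&\1;", text)


# Shortened stand-in for the markup quoted above (an assumption, not the real page).
sample = '<param name="movie" value="a.swf?t=Click here >>>&linkurl=/Offer/2408">'

print(sample[: tag_end_first_gt(sample) + 1])     # tag cut short inside the value
print(sample[: tag_end_quote_aware(sample) + 1])  # whole tag, up to the real '>'
print(terminate_entities("&linkurl=/Offer/2408"))
```

A parser that mixes the two guesses, reading the attribute to its closing quote but ending the tag at the first '>', would produce the kind of leftover fragment seen in the prettify output, and the entity cleanup would account for the "&linkurl;" that is not in the original.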
