-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
I'm continuing to try and run webcheck on the debian website; it now fails with a crash in beautifulsoup: webcheck: http://www.slf.ch/
[...]
  File "/usr/share/webcheck/parsers/html/beautifulsoup.py", line 60, in parse
    base = myurllib.normalizeurl(htmlunescape(base['href']).strip())
  File "/var/lib/python-support/python2.4/BeautifulSoup.py", line 419, in __getitem__
    return self._getAttrMap()[key]
KeyError: 'href'

This occurs after running it on http://www.nl.debian.org for a while; continuing with webcheck -c does work though, and webcheck doesn't crash then anymore...
[...]
Versions of packages webcheck recommends:
ii  python-beautifulsoup  3.0.1-2  error-tolerant HTML parser for Pyt
This is a bug in the BeautifulSoup version in etch (3.0.1-2 has the problem, 3.0.4-1 does not). Maybe I should change to a versioned Recommends (or maybe a Conflicts with older versions).
I could include some workaround code, which would simplify backports, but I don't think that is worth the effort.
Anyway, the problem is that the base tag is expected to have an href attribute (the used find is supposed
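For illustration only, a defensive lookup could look roughly like the sketch below. This is not webcheck's code: it assumes a current Python 3 and uses the standard library's html.parser instead of BeautifulSoup, and the names BaseHrefFinder and find_base_href are made up for the example. The point is simply to skip a base tag that carries no href rather than indexing it blindly.

```python
from html.parser import HTMLParser


class BaseHrefFinder(HTMLParser):
    """Record the href of the first <base> tag that actually has one."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        # Only look at <base> tags, and only until one with an href is found.
        if tag != "base" or self.base_href is not None:
            return
        # dict.get() returns None for a missing attribute instead of
        # raising KeyError, which is the failure seen in the traceback.
        href = dict(attrs).get("href")
        if href:
            self.base_href = href.strip()


def find_base_href(html):
    """Return the stripped base href of the document, or None if absent."""
    finder = BaseHrefFinder()
    finder.feed(html)
    finder.close()
    return finder.base_href
```

With this, a page containing only something like <base target="_blank"> yields None, and the caller can fall back to the document's own URL.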
By the way, if you're crawling a very big website (like Debian's) I would highly recommend using Python 2.5 instead of 2.4. Python 2.5 has much better performing sets.
- -- - -- arthur - [EMAIL PROTECTED] - http://people.debian.org/~adejong --
-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) iD8DBQFGnNtDVYan35+NCKcRAhICAJ9zdpi+9cJQxlZu+1QZehnLLlg1NgCgxBgC 1OTmUlqfoZU5YMN8lY+LgDg= =7xdl -----END PGP SIGNATURE-----