Package: feedparser
Version: 4.1-11
Tags: patch

Trying to parse the RSS feed http://www.projekt6.de/?feed=podcast with feedparser yields the following traceback:

[EMAIL PROTECTED]:~$ python
Python 2.5.2 (r252:60911, Apr 21 2008, 11:12:42)
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import feedparser
>>> f = feedparser.parse('http://www.projekt6.de/?feed=podcast')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/var/lib/python-support/python2.5/feedparser.py", line 2624, in parse
    feedparser.feed(data)
  File "/var/lib/python-support/python2.5/feedparser.py", line 1441, in feed
    sgmllib.SGMLParser.feed(self, data)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 138, in goahead
    k = self.parse_endtag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 315, in parse_endtag
    self.finish_endtag(tag)
  File "/usr/lib/python2.5/sgmllib.py", line 355, in finish_endtag
    self.unknown_endtag(tag)
  File "/var/lib/python-support/python2.5/feedparser.py", line 476, in unknown_endtag
    method()
  File "/var/lib/python-support/python2.5/feedparser.py", line 1318, in _end_content
    value = self.popContent('content')
  File "/var/lib/python-support/python2.5/feedparser.py", line 700, in popContent
    value = self.pop(tag)
  File "/var/lib/python-support/python2.5/feedparser.py", line 641, in pop
    output = _resolveRelativeURIs(output, self.baseuri, self.encoding)
  File "/var/lib/python-support/python2.5/feedparser.py", line 1594, in _resolveRelativeURIs
    p.feed(htmlSource)
  File "/var/lib/python-support/python2.5/feedparser.py", line 1441, in feed
    sgmllib.SGMLParser.feed(self, data)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 291, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 333, in finish_starttag
    self.unknown_starttag(tag, attrs)
  File "/var/lib/python-support/python2.5/feedparser.py", line 1589, in unknown_starttag
    _BaseHTMLProcessor.unknown_starttag(self, tag, attrs)
  File "/var/lib/python-support/python2.5/feedparser.py", line 1458, in unknown_starttag
    value = unicode(value, self.encoding)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-8: unsupported Unicode code range

Attached is a patch, currently applied in the Ubuntu package, that fixes this problem. Adding this to the Debian package would let Ubuntu drop their changes and auto-sync.

-- 
David Futcher (bobbo)
Ubuntu Universe Contributor
http://www.bobbo.me.uk
http://www.launchpad.net/~bobbo

--- /var/lib/python-support/python2.5/feedparser.py 2008-01-23 20:10:27.000000000 +0100
+++ feedparser.py 2008-07-28 11:01:38.000000000 +0200
@@ -1455,7 +1455,7 @@
         # thanks to Kevin Marks for this breathtaking hack to deal with (valid) high-bit attribute values in UTF-8 feeds
         for key, value in attrs:
             if type(value) != type(u''):
-                value = unicode(value, self.encoding)
+                value = unicode(value, self.encoding, errors='replace')
             uattrs.append((unicode(key, self.encoding), value))
         strattrs = u''.join([u' %s="%s"' % (key, value) for key, value in uattrs]).encode(self.encoding)
         if tag in self.elements_no_end_tag:
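
For readers unfamiliar with the errors argument, here is a minimal sketch of what the one-line change does. It is written for Python 3, where the rough counterpart of Python 2's unicode(value, encoding, errors='replace') is bytes.decode(encoding, errors='replace'). The helper name decode_attr and the sample bytes are made up for illustration and are not part of feedparser.

```python
# Python 3 sketch of the behaviour the patch relies on.
# decode_attr is a hypothetical helper, not a feedparser function.

def decode_attr(value, encoding="utf-8"):
    """Decode an attribute value, substituting U+FFFD for undecodable bytes."""
    if isinstance(value, str):
        # Already text: nothing to decode (mirrors the type check in the patch).
        return value
    return value.decode(encoding, errors="replace")


# Illustrative data: Latin-1 encoded bytes in a feed that claims to be UTF-8.
bad = b"caf\xe9"

# Strict decoding (the default) raises, as in the traceback above.
try:
    bad.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With errors="replace" the invalid byte becomes the replacement character.
result = decode_attr(bad)
```

The trade-off is that the offending bytes are lost and show up as U+FFFD in the attribute value, but parsing of the rest of the feed continues instead of aborting.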