Package: feedparser
Version: 4.1-11
Tags: patch

Trying to parse the RSS feed http://www.projekt6.de/?feed=podcast with
feedparser yields the following traceback:

$ python
Python 2.5.2 (r252:60911, Apr 21 2008, 11:12:42)
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import feedparser
>>> f = feedparser.parse('http://www.projekt6.de/?feed=podcast')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/var/lib/python-support/python2.5/feedparser.py", line 2624, in parse
    feedparser.feed(data)
  File "/var/lib/python-support/python2.5/feedparser.py", line 1441, in feed
    sgmllib.SGMLParser.feed(self, data)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 138, in goahead
    k = self.parse_endtag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 315, in parse_endtag
    self.finish_endtag(tag)
  File "/usr/lib/python2.5/sgmllib.py", line 355, in finish_endtag
    self.unknown_endtag(tag)
  File "/var/lib/python-support/python2.5/feedparser.py", line 476, in unknown_endtag
    method()
  File "/var/lib/python-support/python2.5/feedparser.py", line 1318, in _end_content
    value = self.popContent('content')
  File "/var/lib/python-support/python2.5/feedparser.py", line 700, in popContent
    value = self.pop(tag)
  File "/var/lib/python-support/python2.5/feedparser.py", line 641, in pop
    output = _resolveRelativeURIs(output, self.baseuri, self.encoding)
  File "/var/lib/python-support/python2.5/feedparser.py", line 1594, in _resolveRelativeURIs
    p.feed(htmlSource)
  File "/var/lib/python-support/python2.5/feedparser.py", line 1441, in feed
    sgmllib.SGMLParser.feed(self, data)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 291, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 333, in finish_starttag
    self.unknown_starttag(tag, attrs)
  File "/var/lib/python-support/python2.5/feedparser.py", line 1589, in unknown_starttag
    _BaseHTMLProcessor.unknown_starttag(self, tag, attrs)
  File "/var/lib/python-support/python2.5/feedparser.py", line 1458, in unknown_starttag
    value = unicode(value, self.encoding)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-8: unsupported Unicode code range

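Attached is a patch, currently applied in the Ubuntu package, that
fixes this problem by passing errors='replace' to unicode(), so
undecodable bytes in attribute values are replaced instead of raising
UnicodeDecodeError. Adding this to the Debian package would let Ubuntu
drop their changes and auto-sync.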
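
For illustration, here is a minimal standalone sketch of the behaviour
the patch relies on. It is not feedparser code: decode_attr and the
sample bytes are hypothetical, and it is written for Python 3 (bytes
and str) rather than the Python 2.5 unicode() call in the patch. The
real feed presumably contains different bytes; this only shows how a
strict decode fails while errors='replace' substitutes U+FFFD.

```python
def decode_attr(value, encoding, errors="strict"):
    """Return value as text, decoding byte strings with the given error policy.

    Hypothetical helper mirroring the patched line; not part of feedparser.
    """
    if isinstance(value, bytes):
        return value.decode(encoding, errors)
    return value  # already text, leave it alone


# Assumed sample: Latin-1 bytes in an attribute of a feed declared as UTF-8.
raw = b"caf\xe9 au lait"

# Strict decoding (the pre-patch behaviour) raises on the invalid byte.
try:
    decode_attr(raw, "utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With errors="replace" the bad byte becomes U+FFFD and parsing can go on.
lenient = decode_attr(raw, "utf-8", errors="replace")
```

The trade-off is that the offending characters are lost in the parsed
attribute value, but the rest of the feed still parses.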

-- 
David Futcher (bobbo)
Ubuntu Universe Contributor
http://www.bobbo.me.uk
http://www.launchpad.net/~bobbo
--- /var/lib/python-support/python2.5/feedparser.py	2008-01-23 20:10:27.000000000 +0100
+++ feedparser.py	2008-07-28 11:01:38.000000000 +0200
@@ -1455,7 +1455,7 @@
         # thanks to Kevin Marks for this breathtaking hack to deal with (valid) high-bit attribute values in UTF-8 feeds
         for key, value in attrs:
             if type(value) != type(u''):
-                value = unicode(value, self.encoding)
+                value = unicode(value, self.encoding, errors='replace')
             uattrs.append((unicode(key, self.encoding), value))
         strattrs = u''.join([u' %s="%s"' % (key, value) for key, value in uattrs]).encode(self.encoding)
         if tag in self.elements_no_end_tag:
