Package: python-beautifulsoup
Version: 3.0.4-1
Severity: normal

BeautifulSoup appears to honor the declared Content-Type charset when it
handles character entities in the text of an HTML document, but it raises
a UnicodeDecodeError when the same entity occurs inside an attribute value.

The following program reproduces the error:

#---- cut here ----
#!/usr/bin/python

from BeautifulSoup import BeautifulSoup

input1 = '''
    <html>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
    <h1>Here is a Latin-1 entity: &#174;</h1>
    </html>  
'''

print BeautifulSoup(input1)

input2 = '''
    <html>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
    <meta name="Description" content="Here is a Latin-1 entity: &#174;" />
    </html>  
'''

print BeautifulSoup(input2)

#---- cut here ----

Here's what it produces on my system:

<html>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<h1>Here is a Latin-1 entity: &#174;</h1>
</html>

Traceback (most recent call last):
  File "./bug.py", line 21, in <module>
    print BeautifulSoup(input2)
  File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 1282, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 946, in __init__
    self._feed()
  File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 971, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 291, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 340, in finish_starttag
    self.handle_starttag(tag, method, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 376, in handle_starttag
    method(attrs)
  File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 1372, in start_meta
    self._feed(self.declaredHTMLEncoding)
  File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 971, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)
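
A possible reading of the traceback, which I have not verified against the
sgmllib source: the attribute-value path turns &#174; into the single byte
0xae, and that byte is then implicitly decoded as ASCII once the document is
being re-parsed as Unicode. The sketch below only reproduces the error message
outside of BeautifulSoup and sgmllib. The helper name try_decode is
hypothetical, and the syntax targets a newer Python than the 2.5 installed on
the system reported here.

```python
# Sketch only: shows why byte 0xae cannot be decoded as ASCII.
# It does not call BeautifulSoup or sgmllib.

# &#174; refers to code point 174 (0xAE), the registered sign.
# Stored as one byte, it lies outside the 7-bit ASCII range.
raw = b'\xae'


def try_decode(data, encoding):
    """Return the decoded text, or the error message if decoding fails.

    Hypothetical helper, used here for illustration only.
    """
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return str(exc)


# Decoding as ASCII fails with the message seen in the traceback.
ascii_result = try_decode(raw, 'ascii')

# Decoding with the charset declared in the meta tag succeeds.
latin1_result = try_decode(raw, 'iso-8859-1')

print(ascii_result)
print(ord(latin1_result))
```

If this reading is right, the declared charset (iso-8859-1) would be enough to
decode the byte, which suggests the attribute path is not using it.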

-- System Information:
Debian Release: lenny/sid
  APT prefers testing
  APT policy: (990, 'testing'), (500, 'stable'), (400, 'unstable'), (1, 'experimental')
Architecture: i386 (i686)

Kernel: Linux 2.6.24-1-686 (SMP w/2 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages python-beautifulsoup depends on:
ii  python                        2.5.2-1    An interactive high-level object-o
ii  python-support                0.7.7      automated rebuilding support for P

python-beautifulsoup recommends no packages.

-- no debconf information


