> Here you go:
>
> >>> import types
> >>> print types.StringTypes
> (<type 'str'>, <type 'unicode'>)
> >>> import sys
> >>> print sys.version
> 2.3.4 (#2, May 29 2004, 03:31:27)
> [GCC 3.3.3 (Debian 20040417)]
> >>> print type(u'hello') in types.StringTypes
> True
> >>> sys.getdefaultencoding()
> 'ascii'

[CCing Leonard Richardson: we found a bug and a correction to the code.
See below.]

Ok, this is officially a mystery.  *grin*  Let me try some tests too.

######
>>> import BeautifulSoup
>>> soup = BeautifulSoup.BeautifulSoup(u"<html>\xbb</html>")
>>> import re
>>> result = soup.fetchText(re.compile('.*'))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "BeautifulSoup.py", line 465, in fetchText
    return self.fetch(recursive=recursive, text=text, limit=limit)
  File "BeautifulSoup.py", line 491, in fetch
    return self._fetch(name, attrs, text, limit, generator)
  File "BeautifulSoup.py", line 193, in _fetch
    if self._matches(i, text):
  File "BeautifulSoup.py", line 251, in _matches
    chunk = str(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbb' in position 0: ordinal not in range(128)
######

Gaaa!  Ok, that's not right.  Well, at least I'm seeing the same results
as you.  *grin*  This seems like a bug in BeautifulSoup; let me look at
the flow of values again... ah!  I see.  That was silly.

The problem is that 'chunk' can be a NavigableString or a
NavigableUnicodeString, and neither of those types is in
types.StringTypes.  They are subclasses of the string types, so
type(chunk) gives back the subclass itself, not str or unicode.  So the
bit of code here:

    if not type(chunk) in types.StringTypes:

never worked properly: the test is always true for these chunks, so
str(chunk) gets called even on unicode text.  *grin*

A possible fix is to change the check for direct types into a check for
subclass or isinstance; we can change the line in BeautifulSoup.py:250
from:

    if not type(chunk) in types.StringTypes:

to:

    if not isinstance(chunk, basestring):

Testing the change now...

######
>>> soup = BeautifulSoup.BeautifulSoup(u"<html>\xbb</html>")
>>> result = soup.fetchText(re.compile('.*'))
>>> result
[u'\xbb']
######

Ah, better.  *grin*

One other problem is the implementation of __repr__(); I know it's
convenient for it to delegate to str(), but that poses a problem:

######
>>> soup = BeautifulSoup.BeautifulSoup(u"<html>\xbb</html>")
>>> soup
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "BeautifulSoup.py", line 374, in __repr__
    return str(self)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbb' in position 6: ordinal not in range(128)
######

repr() should never fail like this, regardless of our default encoding.
The cheap way out might be to just not implement __repr__(), but that's
probably not so nice.  *grin*  I'd have to look at the implementation of
__str__() some more to see if there's a good general way to fix this.

Best of wishes!

_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
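
For anyone who wants to see the type-check problem in isolation, here is a minimal sketch that does not need BeautifulSoup at all. The class NavigableStringLike and the two helper functions are made up for illustration; they only assume that the real classes subclass the built-in string types. STRING_TYPES stands in for types.StringTypes (or basestring in the isinstance case), and the snippet is written so that it should also run on newer Pythons, where basestring no longer exists.

```python
# Hypothetical stand-in for a string subclass such as NavigableString.
# It is NOT the real BeautifulSoup class, just a plain subclass of str.
class NavigableStringLike(str):
    pass


# Stand-in for types.StringTypes, kept to (str,) so the sketch is portable.
STRING_TYPES = (str,)


def is_string_by_exact_type(value):
    # Mirrors the original check: only the exact type is compared,
    # so subclasses of str are not recognised.
    return type(value) in STRING_TYPES


def is_string_by_isinstance(value):
    # Mirrors the proposed fix: isinstance() also accepts subclasses.
    return isinstance(value, STRING_TYPES)


chunk = NavigableStringLike("hello")

print(is_string_by_exact_type("hello"))  # True: a plain str matches exactly
print(is_string_by_exact_type(chunk))    # False: the subclass is not in the tuple
print(is_string_by_isinstance(chunk))    # True: isinstance() follows inheritance
```

The exact-type test rejects the subclass instance, which is why the original code fell through to str(chunk); the isinstance() test treats it as a string and leaves it alone.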