Hi, I'm having bang-my-head-against-a-wall moments trying to figure all of this out.
A word of warning: this is the first time I've tried using Unicode or Beautiful Soup, so if I'm being stupid, please forgive me. I'm trying to scrape results from Google as a test case with Beautiful Soup. I've seen people recommend it here, so maybe somebody can recognize what I'm doing wrong:

>>> import re
>>> import urllib
>>> from BeautifulSoup import BeautifulSoup
>>> file = urllib.urlopen("http://www.google.com/search?q=beautifulsoup")
>>> file = file.read().decode("utf-8")
>>> soup = BeautifulSoup(file)
>>> results = soup('p','g')
>>> x = results[1].a.renderContents()
>>> type(x)
<type 'unicode'>
>>> print x
Matt Croydon::Postneo 2.0 » Blog Archive » Mobile Screen Scraping <b>...</b>

So far so good. But what I really want is just the text, so I try something like:

>>> y = results[1].a.fetchText(re.compile('.+'))
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
  File "BeautifulSoup.py", line 466, in fetchText
    return self.fetch(recursive=recursive, text=text, limit=limit)
  File "BeautifulSoup.py", line 492, in fetch
    return self._fetch(name, attrs, text, limit, generator)
  File "BeautifulSoup.py", line 194, in _fetch
    if self._matches(i, text):
  File "BeautifulSoup.py", line 252, in _matches
    chunk = str(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbb' in position 26: ordinal not in range(128)

Is this a bug? Come to think of it, I'm not even sure how printing x worked, since it printed non-ASCII characters.

If I convert to a string first:

>>> filestr = file.encode("utf-8")
>>> soup = BeautifulSoup(filestr)
>>> soup('p','g')[1].font.fetchText(re.compile('.+'))
['Mobile Screen Scraping with ', 'BeautifulSoup', ' and Python for Series 60. ',
 'BeautifulSoup', ' 2', 'BeautifulSoup',
 ' 3. I haven\xe2€™t had enough time to work up a proper hack for ', '...',
 'www.postneo.com/2005/03/28/', 'mobile-screen-scraping-with-', 'beautifulsoup',
 '-and-python-for-series-60 - 19k - Aug 24, 2005 - ', ' ', 'Cached', ' - ',
 'Similar pages']

The regex works, but things like "I haven\xe2€™t" get a bit mangled :) In filestr, it was represented as haven\xe2\x80\x99t which I guess is the ASCII representation for UTF-8.

_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
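
For anyone who wants to poke at this without Beautiful Soup or a network connection, here is a minimal sketch of the two symptoms in isolation. It is written for Python 3 so that it runs on a current interpreter, whereas the session above is Python 2. There, calling str() on a unicode object implicitly encodes it with the ASCII codec, which is the step the traceback ends on. The helper name ascii_encode_error and the sample string are made up for illustration. The cp1252 decode at the end is only an assumption about how bytes like \xe2\x80\x99 can turn into something resembling the mangled output, not a confirmed description of what BeautifulSoup.py does.

```python
# Python 3 sketch. The helper name and sample text are illustrative only;
# nothing here comes from BeautifulSoup itself.

def ascii_encode_error(text):
    """Return the UnicodeEncodeError raised by encoding text as ASCII, or None."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as exc:
        return exc
    return None

# U+00BB is the character named in the traceback (a right-pointing guillemet).
err = ascii_encode_error("Postneo 2.0 \u00bb Blog Archive")
print(err)

# U+2019 (right single quotation mark) is the curly apostrophe in "haven't".
apostrophe = "\u2019"
utf8_bytes = apostrophe.encode("utf-8")
print(utf8_bytes)  # three non-ASCII bytes

# Assumption: decoding those UTF-8 bytes with a single-byte codec such as
# cp1252 yields three unrelated characters instead of one apostrophe.
mangled = utf8_bytes.decode("cp1252")
print(ascii(mangled))

# Decoding with the codec that produced the bytes round-trips cleanly.
round_tripped = utf8_bytes.decode("utf-8")
print(round_tripped == apostrophe)
```

The first half suggests that the UnicodeEncodeError comes from the ASCII encode rather than from the regex. The second half shows that \xe2\x80\x99 is three UTF-8 bytes for a single character, so it only displays correctly when decoded as UTF-8.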