<snip>
> Today I had a csv file in utf-8 encoding, but part of the accented
> characters were mangled. The data were scraped from a website and it
> turned out that at least some of the data were mangled on the website
> already. Bits of the text were actually cp1252 (or cp850), I think,
> even though the webpage was in utf-8. Is there any package that helps
> to correct such issues?

The links in the Wikipedia article may help:

http://en.wikipedia.org/wiki/Charset_detection

International Components for Unicode (ICU) does charset detection:

http://userguide.icu-project.org/conversion/detection

Python wrapper:

http://pypi.python.org/pypi/PyICU
http://packages.debian.org/wheezy/python-pyicu

Example:

import icu

russian_text = u'Здесь некий текст на русском языке.'
# encode the sample text to get bytes in an "unknown" encoding
encoded_text = russian_text.encode('windows-1251')

cd = icu.CharsetDetector()
cd.setText(encoded_text)
match = cd.detect()        # best single guess
matches = cd.detectAll()   # all candidates, best first

>>> match.getName()
'windows-1251'
>>> match.getConfidence()
33
>>> match.getLanguage()
'ru'
>>> [m.getName() for m in matches]
['windows-1251', 'ISO-8859-6', 'ISO-8859-8-I', 'ISO-8859-8']
>>> [m.getConfidence() for m in matches]
[33, 13, 8, 8]
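(Editor's aside, not from the thread: once the detector points at
cp1252 for the stray bits, one way to repair a file like the one
described above is to decode field by field, trying utf-8 first and
falling back. A minimal sketch; the helper name is made up.)

def decode_field(raw_bytes):
    """Decode one raw csv field from a file with mixed encodings."""
    try:
        return raw_bytes.decode('utf-8')
    except UnicodeDecodeError:
        # not valid utf-8; assume these are the cp1252 bytes
        # that the detector flagged
        return raw_bytes.decode('cp1252')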
====>

Hi Mark, Eryksun,

Thank you very much for your suggestions.

Mark (sorry if I repeat myself, but I think my earlier reply got
lost): charset seems worth looking into. In hindsight I knew about
chardet (with a 'd'); I just forgot about it. Re: your other remark:
I think encoding issues are such a common phenomenon that one can
never be too inexperienced to start reading about them.

The ICU module seems very cool too. I like the fact that you can even
calculate a level of confidence. I wonder how it performs in my
language (Dutch), where accented characters are not very common. Most
of the text is ascii (the printable chars in 0-127), and those are
(I think) useless for trying to figure out the encoding. After all,
utf-8, latin-1, cp1252 and iso-8859-1 are all supersets of ascii. But
in practice I treat those last three encodings as the same anyway (or
was there some sneaky difference with fancy quotes?).

I did a quick check, and 0.2% of the street names in my data (about
300K records) contain one or more accented characters (ordinals >
127). Since only part of the records are mangled, I may need to run
getName() on every record that has accented characters in it.

Regards,
Albert-Jan
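(Editor's aside: a rough sketch of that last step, using the PyICU
calls shown earlier. Assumptions not in the thread: Python 2, a file
called 'streets.csv', and the street name in the first column.)

import csv
import icu

def has_non_ascii(field):
    # only records with ordinals > 127 carry any encoding information
    return any(ord(c) > 127 for c in field)

cd = icu.CharsetDetector()
with open('streets.csv', 'rb') as f:
    for row in csv.reader(f):
        name = row[0]  # assumed: street name in the first column
        if not has_non_ascii(name):
            continue  # pure ascii looks the same in all four encodings
        cd.setText(name)
        match = cd.detect()
        print('%s -> %s (%d)' % (name, match.getName(),
                                 match.getConfidence()))

One caveat: the detection is statistical, so confidence on a single
short street name will be low. It may work better to concatenate the
suspect fields and detect once, or to sanity-check each per-record
guess against the two or three encodings actually expected.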