On Fri, 2007-08-03 at 18:43 +0200, Filippo Giunchedi wrote:
> [2007-08-03 18:15:00] INFO :: [EMAIL PROTECTED] :: ::
> contactPersonalChanged :: msn.msnw.NotificationClient :: {'personal':
> 'In qualche caso, la distinzione tra un ammasso globulare ed uno
> galattico pu\xf2 non risultare del tutto immediata:', 'self':
> 'instance', 'userHandle': '[EMAIL PROTECTED]'}

Do you know what character that should have been? In latin1 (and
windows-1252), 0xf2 is ò (U+00F2 LATIN SMALL LETTER O WITH GRAVE). Does
that character make sense in that sentence?

I wonder if anyone really knows whether the strings in the MSN protocol
are supposed to be utf-8 or windows-1252. It would be interesting to see
what the value of this string is when the personal message contains
other characters such as ‘ (U+2018 LEFT SINGLE QUOTATION MARK). If the
string is windows-1252 then that character would show up as '\x91'.

> this is probably because glue.py is not using the "errors" argument to
> string.decode() (see http://docs.python.org/lib/string-methods.html) which was
> introduced in 2.3.
>
> I am not sure about the proper fix here, msnContact.personal.decode("utf-8",
> "ignore") does the trick, although "replace" looks saner. (I'm currently
> using "replace")

Could you try msnContact.personal.decode('windows-1252') and see if
that triggers any other errors?

In any case, it would certainly be sane to use 'replace', or to fall
back to windows-1252 if utf-8 decoding fails. Perhaps the chardet module
could even be used.

> thanks,
> filippo

--
Sam Morris
http://robots.org.uk/

PGP key id 1024D/5EA01078
3412 EA18 1277 354B 991B C869 B219 7FDB 5EA0 1078
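
P.S. For what it's worth, the "try utf-8, fall back to windows-1252" idea could look roughly like the sketch below. It is only an illustration under a few assumptions: decode_personal is a hypothetical helper name (nothing by that name exists in glue.py as far as I know), and the raw value is assumed to be a byte string as received from the wire. Whether the MSN protocol actually mixes the two encodings is still an open question.

```python
def decode_personal(raw):
    """Decode a byte string, preferring utf-8 and falling back to windows-1252.

    Hypothetical helper for illustration; not part of glue.py.
    """
    try:
        # Strict decoding: any invalid utf-8 sequence raises UnicodeDecodeError.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assume windows-1252 instead. "replace" substitutes U+FFFD for the
        # few byte values that windows-1252 leaves undefined (such as 0x81),
        # so this call cannot raise.
        return raw.decode("windows-1252", "replace")


if __name__ == "__main__":
    # The fragment from the log above: 0xf2 followed by a space is not valid utf-8.
    print(decode_personal(b"pu\xf2 non risultare"))
```

One caveat with this approach: a windows-1252 string that happens to also be valid utf-8 would be decoded as utf-8, so the fallback is a heuristic and not a guarantee.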