On Fri, 2007-08-03 at 18:43 +0200, Filippo Giunchedi wrote:
> [2007-08-03 18:15:00] INFO :: [EMAIL PROTECTED] ::  ::
> contactPersonalChanged :: msn.msnw.NotificationClient :: {'personal':
> 'In qualche caso, la distinzione tra un ammasso globulare ed uno
> galattico pu\xf2 non risultare del tutto immediata:', 'self':
> 'instance', 'userHandle': '[EMAIL PROTECTED]'}

Do you know what character that should have been? In latin1 (and
windows-1252), 0xf2 is ò (U+00F2 LATIN SMALL LETTER O WITH GRAVE). Does
that character make sense in that sentence?
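
A quick check of that reading, as a rough sketch in present-day Python 3 rather than the Python 2 of the log. The byte string below is copied from the quoted excerpt; the variable names are only illustrative.

```python
# Bytes as they appear in the quoted log line (Python 3 bytes literal).
raw = b"pu\xf2 non risultare"

# Both single-byte encodings map 0xf2 to U+00F2.
as_latin1 = raw.decode("latin-1")
as_cp1252 = raw.decode("windows-1252")

# 0xf2 would start a four-byte UTF-8 sequence, but the next byte is a
# plain space, so strict UTF-8 decoding is rejected.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(as_latin1, as_cp1252 == as_latin1, utf8_ok)
```

Decoded this way the word reads "può", which is ordinary Italian, so the byte looks like a single-byte encoding and not UTF-8.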

I wonder if anyone really knows whether the strings in the MSN protocol
are supposed to be utf-8 or windows-1252. It would be interesting to see
what the value of this string is when the personal message contains
other characters such as ‘ (U+2018 LEFT SINGLE QUOTATION MARK). If the
string is windows-1252 then that character would show up as '\x91'.
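
To make the test concrete, here is a small sketch (again Python 3, names illustrative) of what U+2018 looks like on the wire in each candidate encoding. It says nothing about what the MSN protocol actually sends; it only shows which bytes to look for.

```python
quote = "\u2018"  # LEFT SINGLE QUOTATION MARK

cp1252_bytes = quote.encode("windows-1252")  # single byte 0x91
utf8_bytes = quote.encode("utf-8")           # three bytes: e2 80 98

# latin-1 has no curly quotes: byte 0x91 decodes to the C1 control
# character U+0091, which is how the two 8-bit encodings can be told apart.
latin1_view = cp1252_bytes.decode("latin-1")

print(cp1252_bytes, utf8_bytes, repr(latin1_view))
```

A lone '\x91' in the log would therefore point to windows-1252, while '\xe2\x80\x98' would point to utf-8.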

> this is probably because glue.py is not using the "errors" argument to
> string.decode() (see http://docs.python.org/lib/string-methods.html) which was
> introduced in 2.3.
> 
> I am not sure about the proper fix here, msnContact.personal.decode("utf-8",
> "ignore") does the trick, although "replace" looks saner. (I'm currently 
> using "replace")

Could you try msnContact.personal.decode('windows-1252') and see if
that triggers any other errors?

In any case, it would certainly be sane to use 'replace'... or to fall
back to windows-1252 if utf-8 decoding fails. Perhaps the chardet module
could even be used.
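
The fallback idea could look roughly like the following. This is a sketch in Python 3, not glue.py's actual code: `decode_personal` is a hypothetical helper name, and it assumes the personal message arrives as raw bytes.

```python
def decode_personal(raw: bytes) -> str:
    """Decode a personal message: try UTF-8 first, then windows-1252.

    Hypothetical helper for illustration only.
    """
    try:
        # Strict UTF-8 first: valid UTF-8 is rarely produced by accident.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # windows-1252 leaves a few bytes (e.g. 0x81) undefined, so use
        # "replace" to avoid raising a second time.
        return raw.decode("windows-1252", errors="replace")


print(decode_personal(b"pu\xf2 non risultare"))
print(decode_personal("pu\u00f2".encode("utf-8")))
```

Trying UTF-8 first matters because almost any byte string decodes "successfully" as windows-1252, so the order cannot be reversed. The chardet route would replace the try/except with a guess at the encoding, at the cost of an extra dependency.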

> thanks,
> filippo

-- 
Sam Morris
http://robots.org.uk/

PGP key id 1024D/5EA01078
3412 EA18 1277 354B 991B  C869 B219 7FDB 5EA0 1078

