On 01/12/12 12:28, eryksun wrote:
UTF-8 was designed to encode all of Unicode in a way that can seamlessly pass through libraries that process C strings (i.e. an array of non-null bytes terminated by a null byte). Byte values less than 128 are ASCII; beyond ASCII, UTF-8 uses 2-4 bytes, and all byte values are greater than 127, with standardized byte order. In contrast, UTF-16 and UTF-32 have null bytes in the string and platform-determined byte order. The length and order of the optional byte order mark (BOM) distinguishes UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE.
That's not quite right. The UTF-16BE and UTF-16LE character sets do not take BOMs, because the encoding already specifies the byte order:

py> s = u'abçЙ'
py> s.encode('utf-16LE')
'a\x00b\x00\xe7\x00\x19\x04'
py> s.encode('utf-16BE')
'\x00a\x00b\x00\xe7\x04\x19'

In contrast, plain ol' UTF-16 with no BE or LE suffix is ambiguous without a BOM, so it uses one:

py> s.encode('utf-16')
'\xff\xfea\x00b\x00\xe7\x00\x19\x04'

The same applies to UTF-32.
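For anyone who wants to check the UTF-32 case themselves, here is a rough sketch. It assumes Python 3 (the session above is Python 2, where string literals and reprs look different), and the variable names are just for illustration. The plain 'utf-32' codec writes the BOM in the machine's native byte order, so the check uses sys.byteorder rather than assuming a little-endian box.

```python
import codecs
import sys

s = 'abçЙ'

# With an explicit byte order in the codec name, no BOM is written.
le = s.encode('utf-32-le')
be = s.encode('utf-32-be')

# Plain 'utf-32' prepends a BOM in the platform's native byte order.
plain = s.encode('utf-32')

# Pick the body that matches this machine's byte order.
native_body = le if sys.byteorder == 'little' else be

print(plain == codecs.BOM_UTF32 + native_body)
print(le.startswith(codecs.BOM_UTF32_LE), be.startswith(codecs.BOM_UTF32_BE))
```

Decoding goes the other way: the plain 'utf-32' decoder reads the BOM to pick the byte order and strips it, so plain.decode('utf-32') should give back the original string.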
There's also a UTF-8 BOM used on Windows. Python calls this encoding "utf-8-sig".
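A small sketch of what the "utf-8-sig" codec does in practice, assuming Python 3; the variable names are made up for illustration. On encoding it prepends the three-byte signature, and on decoding it strips the signature if one is present.

```python
import codecs

text = 'abçЙ'

plain = text.encode('utf-8')        # no signature
signed = text.encode('utf-8-sig')   # signature + the same UTF-8 bytes

# Decoding signed data as plain UTF-8 leaves U+FEFF at the front of the string.
leaked = signed.decode('utf-8')

# The utf-8-sig decoder strips the signature, and accepts data without one.
clean = signed.decode('utf-8-sig')
also_clean = plain.decode('utf-8-sig')

print(signed == codecs.BOM_UTF8 + plain)
print(repr(leaked[0]), clean == text, also_clean == text)
```

Since the utf-8-sig decoder accepts input with or without the signature, it is a reasonable choice for reading files of unknown origin, while plain 'utf-8' is the one to use when writing.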
UTF-8-sig, an abomination, but sadly not just a Microsoft abomination. Google Docs also uses it. Although the Unicode standard does allow using a BOM (not actually a Byte Order Mark, more of a "UTF-8 signature"), doing so is annoying and silly.

-- 
Steven