On 01/12/12 12:28, eryksun wrote:

UTF-8 was designed to encode all of Unicode in a way that can seamlessly
pass through libraries that process C strings (i.e. an array of non-null
bytes terminated by a null byte). Byte values less than 128 are ASCII;
beyond ASCII, UTF-8 uses 2-4 bytes, and all byte values are greater
than 127, with standardized byte order. In contrast, UTF-16 and UTF-32
have null bytes in the string and platform-determined byte order. The
length and order of the optional byte order mark (BOM) distinguishes
UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE.

That's not quite right. The UTF-16BE and UTF-16LE encodings do not
take BOMs, because the encoding name already specifies the byte order:

py> s = u'abçЙ'
py> s.encode('utf-16LE')
'a\x00b\x00\xe7\x00\x19\x04'
py> s.encode('utf-16BE')
'\x00a\x00b\x00\xe7\x04\x19'


In contrast, plain ol' UTF-16 with no BE or LE suffix is ambiguous without
a BOM, so it uses one:

py> s.encode('utf-16')
'\xff\xfea\x00b\x00\xe7\x00\x19\x04'


The same applies to UTF-32.
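
For anyone who wants to check the UTF-32 case, here is a rough sketch. It
assumes Python 3 (bytes literals) rather than the Python 2 session above,
and it uses the BOM constants from the standard codecs module. The
variable names are only for illustration.

```python
import codecs
import sys

s = u'ab\xe7\u0419'   # the same string as above, written with escapes

# Explicit byte order in the codec name: no BOM is written.
le = s.encode('utf-32-le')
be = s.encode('utf-32-be')
print(le.startswith(codecs.BOM_UTF32_LE))   # False
print(be.startswith(codecs.BOM_UTF32_BE))   # False

# Plain 'utf-32': a BOM in the platform's native byte order comes first.
plain = s.encode('utf-32')
native = le if sys.byteorder == 'little' else be
print(plain == codecs.BOM_UTF32 + native)   # True

# Decoding: 'utf-32' consumes the BOM, while 'utf-32-le' treats it as
# an ordinary U+FEFF character and leaves it in the result.
print(plain.decode('utf-32') == s)                                       # True
print((codecs.BOM_UTF32_LE + le).decode('utf-32-le') == u'\ufeff' + s)   # True
```

The last line is the practical reason to care: if you decode BOM-prefixed
data with one of the LE/BE codecs, the BOM survives as a stray character
at the start of the string.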


There's also a UTF-8 BOM used on Windows. Python calls this encoding
"utf-8-sig".

UTF-8-sig is an abomination, but sadly not just a Microsoft abomination:
Google Docs also uses it.

Although the Unicode standard does allow a BOM in UTF-8 (where it is not
actually a Byte Order Mark, more of a "UTF-8 signature"), using one is
annoying and silly.
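
To see why the signature is annoying in practice, here is a small sketch,
again assuming Python 3 and the standard codecs module. It shows what
"utf-8-sig" adds and what happens when such data is read back as plain
UTF-8.

```python
import codecs

s = u'ab\xe7\u0419'

plain = s.encode('utf-8')
with_sig = s.encode('utf-8-sig')

# The signature is just U+FEFF encoded as UTF-8: the bytes EF BB BF.
print(with_sig == codecs.BOM_UTF8 + plain)   # True

# Decoded as plain UTF-8, the signature leaks through as a leading
# U+FEFF character, which then breaks comparisons and parsers.
leaked = with_sig.decode('utf-8')
print(leaked == s)                # False
print(leaked == u'\ufeff' + s)    # True

# The 'utf-8-sig' decoder strips the signature if it is present and
# accepts data without one, so it is the safer choice for reading.
print(with_sig.decode('utf-8-sig') == s)   # True
print(plain.decode('utf-8-sig') == s)      # True
```

So when reading files of unknown origin, decoding with "utf-8-sig" is
harmless; when writing, plain "utf-8" avoids inflicting the signature on
whoever reads the file next.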



--
Steven