On 18/05/13 05:23, Albert-Jan Roskam wrote:

> I was curious what the "high" four-byte ut8 unicode characters look like.

By the way, your sentence above reflects a misunderstanding. Unicode characters (strictly 
speaking, code points) are not "bytes", four or otherwise. They are abstract 
entities represented by a number between 0 and 1114111, or in hex, 0x10FFFF. Code points 
can represent characters, or parts of characters (e.g. accents, diacritics, combining 
characters and similar), or non-characters.

Much confusion comes from conflating bytes and code points, or bytes and 
characters. The first step to being a Unicode wizard is to always keep them 
distinct in your mind. By analogy, the floating point number 23.42 is stored in 
memory or on disk as a bunch of bytes, but there is nothing to be gained from 
confusing the number 23.42 with the bytes 0xEC51B81E856B3740, which is how it 
is stored as a C double (in little-endian byte order).
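
If you want to see those bytes for yourself, something like this should do the 
trick in Python 3 (the '<d' format asks the struct module for a little-endian 
C double, which is the byte order shown above):

>>> import struct, binascii
>>> binascii.hexlify(struct.pack('<d', 23.42))
b'ec51b81e856b3740'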

Unicode code points are abstract entities, but in the real world, they have to be stored 
in a computer's memory, or written to disk, or transmitted over a wire, and that requires 
*bytes*. So there are three Unicode schemes for storing code points as bytes. These are 
called *encodings*. Only encodings involve bytes, so strictly speaking it is 
nonsense to talk about "four-byte" Unicode characters: that conflates the 
abstract Unicode character set with one of various concrete encodings.
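
You can see the distinction clearly in Python 3.3 or later: a string holds code 
points, and no bytes exist until you choose an encoding. (U+1F40D below is just 
an arbitrary example of a code point outside the ASCII range.)

>>> s = '\U0001F40D'   # one code point
>>> hex(ord(s))
'0x1f40d'
>>> s.encode('utf-8')  # bytes appear only once you pick an encoding
b'\xf0\x9f\x90\x8d'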

There are three standard Unicode encodings. (These are not to be confused with the dozens 
of "legacy encodings", a.k.a. code pages, used prior to the Unicode standard. 
They do not cover the entire range of Unicode, and are not part of the Unicode standard.) 
These encodings are:

UTF-8
UTF-16
UTF-32 (also sometimes known as UCS-4)

plus at least one older, obsolete encoding, UCS-2.
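
To get a feel for how different these encodings are, here is a quick sketch in 
Python 3, encoding the same three code points (U+0041, U+00E9, U+20AC) with 
each of them. I use the explicitly big-endian codecs to keep the output simple; 
byte order is discussed below.

>>> import binascii
>>> for enc in ('utf-8', 'utf-16-be', 'utf-32-be'):
...     print(enc, binascii.hexlify('A\u00e9\u20ac'.encode(enc)))
...
utf-8 b'41c3a9e282ac'
utf-16-be b'004100e920ac'
utf-32-be b'00000041000000e9000020ac'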

UTF-32 is the least common of the three, but the simplest: it maps every code 
point directly to four bytes. In what follows, I will use these conventions:

- code points are written using the standard Unicode notation, U+xxxx where the 
x's are hexadecimal digits;

- bytes are written in hexadecimal, using a leading 0x.

Code point U+0000 -> bytes 0x00000000
Code point U+0001 -> bytes 0x00000001
Code point U+0002 -> bytes 0x00000002
...
Code point U+10FFFF -> bytes 0x0010FFFF


UTF-32 is simple because the mapping is trivial, and uncommon because, for 
typical English-language text, it wastes a lot of memory.
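
You can check the mapping with Python 3's explicitly big-endian 'utf-32-be' 
codec (endianness is the complication I discuss next):

>>> import binascii
>>> binascii.hexlify('\u0002'.encode('utf-32-be'))
b'00000002'
>>> binascii.hexlify('\U0010FFFF'.encode('utf-32-be'))
b'0010ffff'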

The only complication is that UTF-32 depends on the endianness of your system. 
In the above examples I glossed over this. There are two common orders in which 
the bytes can be stored:

- "big endian", where the most-significant (largest) byte is on the left 
(lowest address);
- "little endian", where the most-significant (largest) byte is on the right.

So in a little-endian system, we have this instead:

Code point U+0000 -> bytes 0x00000000
Code point U+0001 -> bytes 0x01000000
Code point U+0002 -> bytes 0x02000000
...
Code point U+10FFFF -> bytes 0xFFFF1000

(Note that little-endian is not merely the reverse of big-endian. It is the 
order of bytes that is reversed, not the order of digits, or the order of bits 
within each byte.)
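
Here is what that looks like in Python 3, encoding the same code point with 
the explicitly big-endian and little-endian UTF-32 codecs:

>>> import binascii
>>> binascii.hexlify('\U0010FFFF'.encode('utf-32-be'))
b'0010ffff'
>>> binascii.hexlify('\U0010FFFF'.encode('utf-32-le'))
b'ffff1000'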

So when you receive a bunch of bytes that you know represents text encoded 
using UTF-32, you can gather the bytes in groups of four and convert them to 
Unicode code points. But you need to know the endianness. One way to signal 
that is to add a Byte Order Mark (BOM) at the beginning of the bytes. If the 
first four bytes look like 0x0000FEFF, then you have big-endian UTF-32; if 
they look like 0xFFFE0000, then you have little-endian.
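
Python's codecs module knows both marks, and the plain 'utf-32' codec uses a 
leading BOM to work out the byte order when decoding. Roughly:

>>> import codecs
>>> codecs.BOM_UTF32_BE
b'\x00\x00\xfe\xff'
>>> codecs.BOM_UTF32_LE
b'\xff\xfe\x00\x00'
>>> b'\xff\xfe\x00\x00\x41\x00\x00\x00'.decode('utf-32')  # BOM says little-endian
'A'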

So that's UTF-32. UTF-16 is a little more complicated.

UTF-16 divides the Unicode range into two groups:

* The first (approximately) 65000 code points, each of which is represented as 
two bytes;

* Everything else, where each code point is represented as a pair of double 
bytes, a so-called "surrogate pair".

For the first 65000-odd code points, the mapping is trivial, and relatively 
compact:

code point U+0000 => bytes 0x0000
code point U+0001 => bytes 0x0001
code point U+0002 => bytes 0x0002
...
code point U+FFFF => bytes 0xFFFF


Code points beyond U+FFFF are encoded as a pair of double bytes (four bytes in 
total):

code point U+10000 => bytes 0xD800 DC00
...
code point U+10FFFF => bytes 0xDBFF DFFF


Notice a potential ambiguity here. If you receive the double byte 0xD800, is 
that the start of a surrogate pair, or the code point U+D800? The Unicode 
standard resolves this ambiguity by officially reserving code points U+D800 
through U+DFFF for use as surrogates in UTF-16; they will never be assigned 
characters of their own.
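
If you want to see the arithmetic behind surrogate pairs, here is a little 
sketch in Python (surrogate_pair is just a name I made up for this example); 
its results agree with Python's own 'utf-16-be' codec:

>>> def surrogate_pair(cp):
...     # split a code point above U+FFFF into a high and a low surrogate
...     cp -= 0x10000
...     return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)
...
>>> [hex(n) for n in surrogate_pair(0x10000)]
['0xd800', '0xdc00']
>>> [hex(n) for n in surrogate_pair(0x10FFFF)]
['0xdbff', '0xdfff']
>>> import binascii
>>> binascii.hexlify('\U0010FFFF'.encode('utf-16-be'))
b'dbffdfff'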

Like UTF-32, UTF-16 also has to distinguish between big-endian and 
little-endian. It does so with a leading BOM, only this time it is two bytes, 
not four:

0xFEFF => big-endian
0xFFFE => little-endian
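
Again, Python knows these marks, and the plain 'utf-16' codec uses a leading 
BOM when decoding:

>>> import codecs
>>> codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE
(b'\xfe\xff', b'\xff\xfe')
>>> b'\xff\xfe\x41\x00'.decode('utf-16')   # BOM says little-endian
'A'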


Last but not least, we have UTF-8. UTF-8 is slowly becoming the standard for 
storing Unicode on disk, because it is very compact for common English-language 
text, backwards-compatible with ASCII text files, and doesn't require a BOM. 
(Although Microsoft software sometimes adds a UTF-8 signature at the start of 
files, namely the three bytes 0xEF 0xBB 0xBF.)
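
Python spells that variant 'utf-8-sig': encoding with it prepends the 
signature, and decoding with it strips the signature off again:

>>> 'A'.encode('utf-8-sig')
b'\xef\xbb\xbfA'
>>> b'\xef\xbb\xbfA'.decode('utf-8-sig')
'A'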

UTF-8 is also a variable-width encoding. Unicode code points are mapped to one, 
two, three or four bytes, as needed:

Code points U+0000 to U+007F => 1 byte
Code points U+0080 to U+07FF => 2 bytes
Code points U+0800 to U+FFFF => 3 bytes
Code points U+10000 to U+10FFFF => 4 bytes

(Older versions of UTF-8 allowed sequences of up to six bytes, but now that 
Unicode is officially limited to code points no higher than U+10FFFF, UTF-8 
never needs more than four bytes.)
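
You can confirm the byte counts by encoding one code point from each range; 
something like this should work in Python 3.3 or later:

>>> for c in '\u007f\u0080\u07ff\u0800\uffff\U00010000\U0010ffff':
...     print('U+%04X -> %d byte(s)' % (ord(c), len(c.encode('utf-8'))))
...
U+007F -> 1 byte(s)
U+0080 -> 2 byte(s)
U+07FF -> 2 byte(s)
U+0800 -> 3 byte(s)
U+FFFF -> 3 byte(s)
U+10000 -> 4 byte(s)
U+10FFFF -> 4 byte(s)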



--
Steven