On 18/05/13 20:01, Albert-Jan Roskam wrote:

Thanks for all your replies. I knew about code points, but I thought that to represent 
a unicode string (code points) as a utf-8 byte string (bytes), characters 0-127 
take 1 byte (of 8 bits), then 128-255 (accented chars)
take 2 bytes, and so on up to 4 bytes for East Asian languages. But later, on Joel Spolsky's 
"standard" page about unicode, I read that it goes up to 6 bytes. That's what I implied when 
I mentioned "utf8".

The UTF-8 encoding scheme was originally designed to go up to 6 bytes, but since 
Unicode itself is limited to code points no higher than U+10FFFF (1114111 in 
decimal), no more than 4 bytes are needed for UTF-8.
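
A quick check in Python 3:

py> hex(1114111)                          # the highest Unicode code point, as hex
'0x10ffff'
py> len(chr(0x10FFFF).encode('utf-8'))    # even the very last code point needs only 4 bytes
4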

Also, it is wrong to say that the 4-byte UTF-8 values are "East Asian languages". The full Unicode 
range contains 17 "planes" of 65,536 code points. The first such plane is called the "Basic 
Multilingual Plane", and it includes all the code points that can be represented in 1 to 3 UTF-8 bytes. 
The BMP includes in excess of 13,000 East Asian code points, e.g.:


py> import unicodedata as ud
py> c = '\u3050'
py> print(c, ud.name(c), c.encode('utf-8'))
ぐ HIRAGANA LETTER GU b'\xe3\x81\x90'


The 4-byte UTF-8 values are in the second and subsequent planes, known collectively 
as the supplementary planes (the first of them is the "Supplementary Multilingual 
Plane"). They include historical character sets such as Egyptian hieroglyphs and 
cuneiform, musical and mathematical symbols, emoji, gaming symbols, ancient Arabian 
and Persian scripts, and many others.

http://en.wikipedia.org/wiki/Plane_(Unicode)
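
By way of contrast with the hiragana example above, here is a code point from one of 
the supplementary planes, which really does need four UTF-8 bytes (the musical G clef, 
picked purely as an example):

py> import unicodedata as ud
py> c = '\U0001D11E'
py> print(c, ud.name(c), c.encode('utf-8'))
𝄞 MUSICAL SYMBOL G CLEF b'\xf0\x9d\x84\x9e'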


I always viewed the codepage as "the bunch of chars on top of ascii", e.g. 
cp1252 (latin-1) is ascii (0-127) + another 128 characters that are used in Europe (the 
euro sign, Scandinavian and Mediterranean (Spanish) chars, but not Slavic ones).

Well, that's certainly common, but not all legacy encodings are supersets of 
ASCII. For example:

http://en.wikipedia.org/wiki/Big5

although I see that Python's implementation of Big5 is *technically* incorrect, 
if *practically* useful, since it does include ASCII.


A certain locale implies a certain codepage (on Windows), but where does the 
locale category LC_CTYPE fit in this story?

No idea :-)




UTF-8
UTF-16
UTF-32 (also sometimes known as UCS-4)

plus at least one older, obsolete encoding, UCS-2.

Isn't UCS-2 the internal unicode encoding for CPython (narrow builds)? Or maybe 
this is a different abbreviation. I read about the basic multilingual plane (BMP) and 
surrogate pairs and all. The author suggested that messing with surrogate pairs 
is a topic to dive into in case one's nail bed is being derusted. I 
wholeheartedly agree.

UCS-2 is a fixed-width encoding that is identical to UTF-16 for code points up 
to U+FFFF. It differs from UTF-16 in that it *cannot* encode code points 
U+10000 and higher; in other words, it does not support surrogate pairs. So 
UCS-2 is obsolete in the sense that it cannot represent the whole set of Unicode 
characters.

In Python 3.2 and older, Python has a choice between a *narrow build* that uses 
UTF-16 (including surrogates) for strings in memory, or a *wide build* that 
uses UTF-32. The choice is made when you compile the Python interpreter. Other 
programming languages may use other systems.
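
If you're not sure which sort of build you are running, sys.maxunicode will tell you 
(this only distinguishes anything on 3.2 and older; from 3.3 onwards it is always 
1114111):

py> import sys
py> sys.maxunicode    # 65535 on a narrow build, 1114111 on a wide build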

Python 3.3 uses a different, more flexible scheme for keeping strings in 
memory. Depending on the largest code point in a string, the string will be 
stored in either Latin-1 (one byte per character), UCS-2 (two bytes per 
character, and no surrogates) or UTF-32 (four bytes per character). This means 
that there is no longer a need for surrogate pairs, but only strings that 
*need* four bytes per character will use four bytes.
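
You can watch this happening in Python 3.3 or later with sys.getsizeof. The exact 
figures include some per-string overhead and vary between versions and platforms, 
but they grow at roughly one, two and four bytes per character:

py> import sys
py> sys.getsizeof('a' * 1000)             # every char fits in one byte (Latin-1)
py> sys.getsizeof('\u3050' * 1000)        # needs two bytes per char (UCS-2 range)
py> sys.getsizeof('\U0001D11E' * 1000)    # needs four bytes per char (beyond U+FFFF)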



- "big endian", where the most-significant (largest) byte is on the left 
(lowest address);
- "little endian", where the most-significant (largest) byte is on the right.


Why is endianness relevant only for utf-32, but not for utf-8 and utf-16? Is "utf-8" a 
shorthand for saying "utf-8 le"?

Endianness is relevant for UTF-16 too.

It is not relevant for UTF-8 because UTF-8 defines the order that multiple 
bytes must appear. UTF-8 is defined in terms of *bytes*, not multi-byte words. 
So the code point U+3050 is encoded into three bytes, *in this order*:

0xE3 0x81 0x90

There's no question about which byte comes first, because the order is set. But 
UTF-16 defines the encoding in terms of double-byte words, so the question of 
how words are stored becomes relevant. A 16-bit word can be laid out in memory 
in at least two ways:

[most significant byte] [least significant byte]

[least significant byte] [most significant byte]

so U+3050 could legitimately appear as bytes 0x3050 or 0x5030 depending on the 
machine you are using.
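
You can see both layouts from Python itself; binascii.hexlify just displays the raw 
bytes as hex digits:

py> from binascii import hexlify
py> hexlify('\u3050'.encode('utf-16-be'))   # big-endian: most significant byte first
b'3050'
py> hexlify('\u3050'.encode('utf-16-le'))   # little-endian: least significant byte first
b'5030'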

It's hard to talk about endianness without getting confused, or at least for me 
it is :-) Even though I've written down 0x3050 and 0x5030, it is important to 
understand that they both have the same numeric value of 12368 in decimal. The 
difference is just in how the bytes are laid out in memory. By analogy, Arabic 
numerals used in English and other Western languages are written in *big endian 
order*:

1234 means 1 THOUSAND 2 HUNDREDS 3 TENS 4 UNITS

Imagine a language that wrote numbers in *little endian order*, but using the 
same digits. You would count:

0
1
2
...
01  # no UNITS 1 TEN
11  # 1 UNIT 1 TEN
21  # 2 UNITS 1 TEN
...
4321  # 4 UNITS 3 TENS 2 HUNDREDS 1 THOUSAND


Since both UTF-16 and UTF-32 are defined in terms of 16 or 32 bit words, 
endianness is relevant; since UTF-8 is defined in terms of 8-bit bytes, it is 
not.

Fortunately, all(?) modern computing hardware has standardized on the same 
"endianness" of individual bytes. This was not always the case, but today if 
you receive a byte with bits:

0b00110000

then there is no(?) doubt that it represents decimal 48, not 12.
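
Python reads that bit pattern the same way:

py> 0b00110000
48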



So when you receive a bunch of bytes that you know represents text encoded 
using UTF-32, you can bunch the bytes in groups of four and convert them to 
Unicode code points. But you need to know the endianness. One way to do that is 
to add a Byte Order Mark at the beginning of the bytes. If you look at the 
first four bytes, and it looks like 0x0000FEFF, then you have big-endian 
UTF-32. But if it looks like 0xFFFE0000, then you have little-endian.
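
Both possible UTF-32 byte order marks are available as constants in the codecs 
module, and you can see that the explicitly-endian codecs don't write a BOM at all:

py> import codecs
py> codecs.BOM_UTF32_BE
b'\x00\x00\xfe\xff'
py> codecs.BOM_UTF32_LE
b'\xff\xfe\x00\x00'
py> 'a'.encode('utf-32-be')     # byte order given explicitly, so no BOM
b'\x00\x00\x00a'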

So each byte starts with a BOM? Or each file? I find utf-32 indeed the easiest 
to understand.

Certainly not each byte! That would be impossible, since the BOM itself is *two 
bytes* for UTF-16 and *four bytes* for UTF-32.

Remember, a BOM is not compulsory. If you decide beforehand that you will 
always use big-endian UTF-16, say, there is no need to waste time with a BOM. 
But then you're responsible for producing big-endian words even if your 
hardware is little-endian.

A BOM is useful when you're transmitting a file to somebody else, and they *might* not 
have the same endianness as you. If you can pass a message on via some other channel, you 
can say "I'm about to send you a file in little-endian UTF-16" and all will be 
good. But since you normally can't, you just insert the BOM at the start of the file, and 
they can auto-detect the endianness.

How do they do that? Because they read the first two bytes. If they read them as 
0xFFFE, that tells them that their byte-order and my byte-order are mismatched, so 
they should use the opposite byte-order from their system default. If they read them 
as 0xFEFF, our byte-orders match, and we're good to go.

You can stick a BOM at the beginning of every string, but that's rather 
wasteful, and it leads to difficulty with string processing (especially 
concatenating strings), so it's best not to use BOMs except *at most* once per 
file.


In utf-8, how does a system "know" that a given octet of bits is to be interpreted as a 
single-byte character, rather than "hold on, these eight bits are gibberish as they are right 
now, let's check what happens if we add the next eight bits", in other words a multibyte char 
(forgive me the naive phrasing ;-)? The reason I mention it in the context of BOMs: why aren't 
these needed to indicate "multibyte char ahead!"?

Because UTF-8 is a very cunning system that was designed by very clever people 
(Dave Prosser and Ken Thompson) to be unambiguous when read one byte at a time.

When reading a stream of UTF-8 bytes, you look at the first bit of the current 
byte. If it is a zero, then you have a single-byte code, so you can decode that 
byte and move on to the next one. A single byte with a leading 0 bit gives you 128 
possible values, covering code points U+0000 to U+007F. (If this sounds like 
ASCII, that's not a coincidence.)

But if the current byte starts with the bits 110, then you throw those three bits 
away and keep the remaining five. Then you read the next byte, check that it 
starts with the bits 10, and keep the six bits following that. That gives you 5+6 = 
11 useful bits from the two bytes read, enough for 2,048 distinct values, which 
covers code points up to U+07FF.

If the current byte starts with the bits 1110, then you throw those four bits away 
and keep the remaining four. Then you read in two more bytes, check that they both 
start with the bits 10, and keep the six bits following in each. This gives you 4+6+6 
= 16 bits in total, or 65,536 distinct values, which covers code points up to U+FFFF.

If the current byte starts with 11110, you throw away those five bits and read 
in the next three bytes. This gives you 3+6+6+6 = 21 bits, or 2,097,152 distinct 
values, which covers code points up to U+1FFFFF. That is comfortably beyond 
U+10FFFF, the highest code point Unicode actually defines, which is why four 
bytes are enough.

(Notice that the number of leading 1s in the first byte tells you how many 
bytes you need to read. Also note that not all byte sequences are valid UTF-8.) 
In summary:

U+0000 - U+007F => 0xxxxxxx
U+0080 - U+07FF => 110xxxxx 10xxxxxx
U+0800 - U+FFFF => 1110xxxx 10xxxxxx 10xxxxxx
U+10000 - U+1FFFFF => 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
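
You can watch the scheme at work in Python 3.3 or later by printing the bits of each 
encoded byte; the four characters below are arbitrary picks, one from each size class:

py> for c in 'A\u00e9\u3050\U0001D11E':
...     print(c.encode('utf-8'), [format(b, '08b') for b in c.encode('utf-8')])
...
b'A' ['01000001']
b'\xc3\xa9' ['11000011', '10101001']
b'\xe3\x81\x90' ['11100011', '10000001', '10010000']
b'\xf0\x9d\x84\x9e' ['11110000', '10011101', '10000100', '10011110']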



So that's UTF-32. UTF-16 is a little more complicated.

UTF-16 divides the Unicode range into two groups:

* The first (approximately) 65,000 code points, those in the Basic Multilingual 
Plane, each of which is represented as a single two-byte unit;

* Everything else, which is represented as a pair of two-byte units, a so-called 
"surrogate pair" (see the example below).


Just as I thought I was starting to understand it.... Sorry. 
len(unichr(63000).encode("utf-8")) returns three bytes.

You're using UTF-8. I'm talking about UTF-16.
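
In Python 3, where chr plays the role of Python 2's unichr, you can see the 
difference directly. Code point 63000 is inside the BMP, so it takes three bytes 
in UTF-8 but only two in UTF-16:

py> len(chr(63000).encode('utf-8'))      # three bytes in UTF-8
3
py> len(chr(63000).encode('utf-16-le'))  # but only two bytes in UTF-16
2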





--
Steven
