[EMAIL PROTECTED] wrote:
>
> If you are looking for a ready-made tool, I don't know of one.
> If you are looking for the spec, I got the following from the Unicode
> Standard, version 3.0:
>
> Scalar value           UTF-16             1st byte  2nd byte  3rd byte  4th byte
> 000000000xxxxxxx       000000000xxxxxxx   0xxxxxxx
> 00000yyyyyxxxxxx       00000yyyyyxxxxxx   110yyyyy  10xxxxxx
> zzzzyyyyyyxxxxxx       zzzzyyyyyyxxxxxx   1110zzzz  10yyyyyy  10xxxxxx
> uuuuuzzzzyyyyyyxxxxxx  110110wwwwzzzzyy+  11110uuu  10uuzzzz  10yyyyyy  10xxxxxx
>                        110111yyyyxxxxxx
>
> where uuuuu = wwww+1 (to account for the addition of 10000 base 16 as in
> Section 3.7, surrogates)
>
> When converting a Unicode scalar value to UTF-8, the shortest form that
> can represent those values shall be used. This practice preserves
> uniqueness of coding. For example, the Unicode binary value
> <0000000000000001> is encoded as <00000001>, not as <11000000 10000001>.
> The latter is an example of an irregular UTF-8 byte sequence. Irregular
> UTF-8 sequences shall not be used for encoding any other information.
>
> To which I add that Java, in particular, uses an irregular UTF-8
> sequence to encode the <0000000000000000> character, so that it can
> encode it unambiguously in an environment that would otherwise use an
> all-zero byte to indicate end-of-string.
>
> -- hendrik
>
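For anyone who wants to try the table out, here is a rough Python sketch of it, not a vetted tool. The function name `utf8_encode` and the `java_nul` flag are my own inventions for illustration. The flag assumes the irregular form Java uses for the zero character is the two-byte sequence <11000000 10000000>. The sketch always picks the shortest form otherwise, as the quoted rule requires.

```python
def utf8_encode(scalar, java_nul=False):
    """Encode one Unicode scalar value as UTF-8 bytes, row by row from the table.

    `java_nul` is a hypothetical switch: when set, U+0000 is written as the
    irregular two-byte sequence C0 80 instead of a single zero byte.
    """
    if not 0 <= scalar <= 0x10FFFF or 0xD800 <= scalar <= 0xDFFF:
        # Surrogate code points are not scalar values and have no UTF-8 form.
        raise ValueError("not a Unicode scalar value: %#x" % scalar)

    if scalar == 0 and java_nul:
        # Deliberately irregular: avoids an all-zero byte inside the string.
        return bytes([0xC0, 0x80])

    if scalar < 0x80:
        # 0xxxxxxx
        return bytes([scalar])
    if scalar < 0x800:
        # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (scalar >> 6),
                      0x80 | (scalar & 0x3F)])
    if scalar < 0x10000:
        # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (scalar >> 12),
                      0x80 | ((scalar >> 6) & 0x3F),
                      0x80 | (scalar & 0x3F)])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (scalar >> 18),
                  0x80 | ((scalar >> 12) & 0x3F),
                  0x80 | ((scalar >> 6) & 0x3F),
                  0x80 | (scalar & 0x3F)])
```

A quick sanity check is to compare against Python's built-in codec, e.g. `utf8_encode(0x20AC) == "\u20ac".encode("utf-8")`. Because each range test is tried in ascending order, the function never emits a longer sequence than necessary, except for the explicit `java_nul` case.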
thanks for the information! ready to read it~