Re: [PATCH 0/5] Improve text protocol

Bill Spitzak Mon, 15 Apr 2013 12:14:42 -0700

Jan Arne Petersen wrote:

* Changes offsets to be Unicode character instead of byte based


No, PLEASE DON'T DO THIS!!!

You think you are making things "easier", but you are making them much, much harder. You may not believe it, but "how many characters are in this UTF-8" will generate dozens of different answers and should never be used as part of a communication API. Possible differences:

1. A lot of things really count UTF-16 code units, not Unicode codepoints, due to being designed for Windows.

2. Handling of invalid byte sequences. Some consider one byte a character, some consider up to 4 bytes stopping at the first byte that fails the UTF-8 parsing, some consider all trailing bytes no matter how long, some consider the N bytes determined by the lead byte no matter what they are (the first is the most common and the first two are the only ones recommended, but the others exist, sometimes multiple rules in the same decoder!). And don't you dare spout the nonsense that somehow invalid byte sequences won't happen, or that if they are there it is "not UTF-8" and thus somehow saying this means it will magically not ever go through the API.

3. Disagreement about whether the encoding of UTF-16 surrogate halves, the characters 0xNNFFFE and 0xNNFFFF, the C0 and C1 control characters, code points greater than 0x10FFFF, etc, are "characters" or "errors". If errors, many decoders count them as 3 or 4 characters rather than one.


4. How to count combining characters.

5. How to count double-width characters, tabs, various whitespace.

6. Normalization. Almost anything that actually wants to decode Unicode (other than to translate it to UTF-16 for Windows filenames) wants to do extra analysis and will do normalization. This is hundreds of pages of documentation from Unicode and certainly should not be part of a low-level API.
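To make a few of the points above concrete, here is a small Python sketch. The sample string and the stray 0xFF byte are made up for illustration and do not come from the patch series; the only claim is that the same data gives different "lengths" depending on what you count and how the decoder treats errors.

```python
import unicodedata

# Illustrative sample (an assumption, not from the patches):
# 'e' + U+0301 COMBINING ACUTE ACCENT + U+1F600 (a code point outside the BMP)
text = "e\u0301\U0001F600"

utf8_bytes = len(text.encode("utf-8"))                 # what a byte offset counts
code_points = len(text)                                # Python str counts code points
utf16_units = len(text.encode("utf-16-le")) // 2       # what UTF-16 based APIs count
nfc_points = len(unicodedata.normalize("NFC", text))   # after composing 'e' + accent

print(utf8_bytes, code_points, utf16_units, nfc_points)  # 7 3 4 2

# Invalid input: a stray 0xFF byte between two ASCII letters.
raw = b"a\xffb"
replaced = raw.decode("utf-8", errors="replace")  # 0xFF becomes U+FFFD
ignored = raw.decode("utf-8", errors="ignore")    # 0xFF is silently dropped
print(len(raw), len(replaced), len(ignored))      # 3 3 2
```

Four different answers for one short string, and two more for three bytes of invalid input. The byte count is the only one here that does not depend on a decoder policy or a normalization form.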

PS: You will notice that Windows and everything else working with UTF-16 count the surrogate pairs as 2 units. For reasons that totally baffle me, the very same people who say "oh you must measure your UTF-8 in 'characters'" see nothing wrong with this! Why don't you think a little: go change all your UTF-16 code to measure "characters" and realize what a STUPID idea it is.


_______________________________________________
wayland-devel mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/wayland-devel
