Jan Arne Petersen wrote:

* Changes offsets to be Unicode character instead of byte based

No, PLEASE DON'T DO THIS!!!

You think you are making things "easier", but you are making them much, much harder. You may not believe it, but "how many characters are in this UTF-8 string" will generate dozens of different answers, and it should never be used as part of a communication API. Possible differences:

1. A lot of things really count UTF-16 code units, not Unicode code points, due to being designed for Windows.

2. Handling of invalid byte sequences. Some decoders count each bad byte as one character; some count up to 4 bytes, stopping at the first byte that fails UTF-8 parsing; some count all trailing bytes, no matter how many there are; some count the N bytes determined by the lead byte, no matter what they are. The first is the most common and the first two are the only ones recommended, but the others exist, sometimes with multiple rules in the same decoder! And don't you dare spout the nonsense that invalid byte sequences won't happen, or that data containing them is "not UTF-8" and that saying so means it will magically never go through the API.

3. Disagreement about whether the encodings of UTF-16 surrogate halves, the code points 0xNNFFFE and 0xNNFFFF, the C0 and C1 control characters, code points greater than 0x10FFFF, etc., are "characters" or "errors". If they are treated as errors, many decoders count them as 3 or 4 characters rather than one.

4. How to count combining characters.

5. How to count double-width characters, tabs, various whitespace.

6. Normalization. Almost anything that actually wants to decode Unicode (other than to translate it to UTF-16 for Windows filenames) wants to do extra analysis and will do normalization. That is specified in hundreds of pages of documentation from Unicode and certainly should not be part of a low-level API.
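
To make the list above concrete, here is a minimal sketch in Python (chosen only for illustration; it is not part of the protocol under discussion). The helper names `counts` and `replace_per_byte` and the sample strings are hypothetical. It measures one valid string several ways, then decodes one truncated sequence under three error policies: Python's built-in "replace" and "ignore" handlers, plus a hand-written handler that emits one U+FFFD per rejected byte.

```python
import codecs
import unicodedata


def counts(text: str) -> dict:
    """Return several plausible "lengths" of the same valid text."""
    return {
        "utf8_bytes": len(text.encode("utf-8")),
        "code_points": len(text),
        # UTF-16 code units: two bytes each in UTF-16-LE (no BOM).
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "nfc_code_points": len(unicodedata.normalize("NFC", text)),
        "non_combining": sum(1 for ch in text if unicodedata.combining(ch) == 0),
    }


def replace_per_byte(exc):
    """Hypothetical error policy: one U+FFFD for every rejected byte."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    return ("\ufffd" * (exc.end - exc.start), exc.end)


codecs.register_error("replace_per_byte", replace_per_byte)

# 'e' + COMBINING ACUTE ACCENT + U+1F600 (outside the BMP)
sample = "e\u0301\U0001F600"
valid_counts = counts(sample)

# A 4-byte sequence cut short after 3 bytes, between 'a' and 'z'.
bad = b"a\xf0\x9f\x98z"
invalid_counts = {
    policy: len(bad.decode("utf-8", errors=policy))
    for policy in ("replace", "ignore", "replace_per_byte")
}

print(valid_counts)
print(invalid_counts)
```

For the valid sample this should report 7 UTF-8 bytes, 3 code points, 4 UTF-16 code units, 2 code points after NFC normalization, and 2 non-combining code points. The three error policies give three different "character" counts for the same five invalid bytes. The byte count is the only number that does not depend on any of these choices.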

PS: You will notice that Windows and everything else working with UTF-16 counts surrogate pairs as 2 units. For reasons that totally baffle me, the very same people who say "oh, you must measure your UTF-8 in 'characters'" see nothing wrong with this! Think about it a little: go change all your UTF-16 code to measure "characters" and you will realize what a STUPID idea it is.
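
The comparison with UTF-16 can be sketched the same way, again in Python purely for illustration. The function names below are hypothetical, and `code_point_to_byte_offset` assumes its input is valid UTF-8, which is exactly the assumption a real API cannot make. Code-unit offsets (16-bit units in UTF-16, bytes in UTF-8) index the buffer directly, while a code-point offset has to be found by walking the data from the start.

```python
def utf16_units(text: str) -> int:
    """Length as UTF-16 APIs report it: surrogate pairs count as 2."""
    return len(text.encode("utf-16-le")) // 2


def utf8_units(text: str) -> int:
    """The UTF-8 analogue of the above: the number of bytes."""
    return len(text.encode("utf-8"))


def code_point_to_byte_offset(data: bytes, index: int) -> int:
    """Find the byte offset of the index-th code point by linear scan.

    Assumes `data` is valid UTF-8; invalid input would need one of the
    (mutually incompatible) error policies discussed earlier.
    """
    offset = 0
    seen = 0
    while seen < index:
        if offset >= len(data):
            raise IndexError("code point index past end of data")
        offset += 1  # step over the lead byte
        # Step over continuation bytes, which have the form 0b10xxxxxx.
        while offset < len(data) and (data[offset] & 0xC0) == 0x80:
            offset += 1
        seen += 1
    return offset


text = "a\U0001F600b"
data = text.encode("utf-8")

print(utf16_units(text), utf8_units(text), len(text))
print([code_point_to_byte_offset(data, i) for i in range(len(text) + 1)])
```

Here the same three-code-point string is 4 units long in UTF-16 and 6 units long in UTF-8, and mapping code point index 2 to its byte offset (5) requires scanning everything before it.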

_______________________________________________
wayland-devel mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/wayland-devel
