On Tue, 08 May 2018 07:07:24 +0000 Silvan Jegen <[email protected]> wrote:
> On Mon, May 7, 2018 at 5:11 AM Joshua Watt <[email protected]> wrote:
> > IMHO, if you are doing UTF-8 (which you should), you should *always*
> > specify any offset in the string as a byte offset. I have a few
> > reasons for this justification:
>
> I agree with this as well. I thought some more about how to spell out my
> gut feeling on this matter in more technical terms.
>
> UTF-8 is a byte (sequence) representation of Unicode code points. This
> indicates to me that an offset within an UTF-8-encoded string should also
> be given in bytes. Specifying the offset in Unicode points mixes the
> abstraction of the Unicode code point with (one of) its representations as
> a byte sequence. This is reflected in the fact that an offset in Unicode
> code points is not applicable to the UTF-8 string without first processing
> the string.
>
> Unicode code points do not give us that much either since what we most
> likely want are grapheme clusters anyway (which, like any more advanced
> Unicode processing, should be handled by a specialised library):
> http://utf8everywhere.org/#myth.strlen
>
>
> Cheers,
>
> Silvan

This message made me feel obliged to turn my own gut feeling into
words. This is not to be construed as an argument, but more of an
explanation.

I view Wayland protocols as rather high level: their responsibility is
to specify the type and the purpose of the data they are transporting.
In this case, the data is a Unicode string, and the purpose is display.
Or, the data is a number and the purpose is indexing. I think that when
a protocol starts to specify the encoding of that data as well, it can
no longer be thought of as high level.

In this view, indexing a Unicode string in terms of bytes would be akin
to indexing any other vector of Foo in bytes. (I didn't actually check
if there is any other vector or bytes type available in Wayland.)

As you noted, there is some mixing between abstraction levels in the
protocol.
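
As a rough illustration of the two kinds of offset discussed above, here
is a minimal Python sketch. The sample string and the helper name
`byte_offset` are assumptions made up for this example and are not taken
from any Wayland protocol. It shows that a code point offset has to be
converted by walking the string, while a byte offset can be applied to
the UTF-8 buffer directly.

```python
def byte_offset(text: str, cp_offset: int) -> int:
    """Convert a code point offset into a byte offset in UTF-8.

    Hypothetical helper for illustration only. It must encode every
    code point before the offset, i.e. it processes the string first.
    """
    return len(text[:cp_offset].encode("utf-8"))


text = "M\u00fcnchen"            # "München", with a precomposed U+00FC
encoded = text.encode("utf-8")

# Code point count and byte count differ: U+00FC needs two bytes in UTF-8.
print(len(text), len(encoded))

# Offset of the first "n", once in code points and once in bytes.
cp = text.index("n")
print(cp, byte_offset(text, cp))

# A byte offset slices the encoded buffer without any decoding step.
print(encoded[byte_offset(text, cp):].decode("utf-8"))
```

The conversion in `byte_offset` is linear in the length of the prefix,
which is the "first processing the string" cost mentioned in the quoted
message.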
Hardcoding that it's not *just* Unicode, but also the particular
encoding (UTF-8), eliminates problems with byte indexing we would have
encountered if we decided to use things like Punycode (München =>
Mnchen-3ya). Knowing that it's always UTF-8 allows the protocol to use
a tailored indexing scheme. While I consider this a layer-breaking
hack, this property partially counters the above reasoning.

* * *

To be honest, neither Unicode code points nor graphemes nor clusters
are what we're truly looking for here. To understand what I mean, I
recommend playing with this string: नमस्ते

According to the Rust book [0], it's composed of 6 code points:
['न', 'म', 'स', '्', 'त', 'े'], but moving the cursor around, I would
be led to believe it's only 4 "pieces" long.

Cheers,
Dorota

[0] https://doc.rust-lang.org/book/second-edition/ch08-02-strings.html
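
The code point count and the Punycode form mentioned above can be
checked with a short Python sketch that uses only the standard library.
This is an illustration under stated assumptions: the variable name
`word` is made up here, and the code counts code points and UTF-8 bytes
only. It does not attempt to find cursor positions or grapheme clusters,
which needs a specialised library.

```python
# The Devanagari string from the message, written as escapes so the
# exact code points are unambiguous.
word = "\u0928\u092e\u0938\u094d\u0924\u0947"

# Python strings are sequences of code points, so this lists each one.
print([f"U+{ord(c):04X}" for c in word])
print(len(word))                   # number of code points
print(len(word.encode("utf-8")))   # number of bytes in UTF-8

# Punycode for comparison: the non-ASCII letter is moved out of its
# original position, so byte offsets into the encoded form do not
# follow the order of the text.
print("M\u00fcnchen".encode("punycode"))
```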
_______________________________________________
wayland-devel mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/wayland-devel
