Re: [PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol

Dorota Czaplejewicz Thu, 10 May 2018 02:44:21 -0700

On Tue, 08 May 2018 07:07:24 +0000
Silvan Jegen <[email protected]> wrote:

> On Mon, May 7, 2018 at 5:11 AM Joshua Watt <[email protected]> wrote:
> > IMHO, if you are doing UTF-8 (which you should), you should *always*
> > specify any offset in the string as a byte offset. I have a few
> > reasons for this justification:  
> 
> I agree with this as well. I thought some more about how to spell out my
> gut feeling on this matter in more technical terms.
> 
> UTF-8 is a byte (sequence) representation of Unicode code points. This
> indicates to me that an offset within an UTF-8-encoded string should also
> be given in bytes. Specifying the offset in Unicode points mixes the
> abstraction of the Unicode code point with (one of) its representations as
> a byte sequence. This is reflected in the fact that an offset in Unicode
> code points is not applicable to the UTF-8 string without first processing
> the string.
> 
> Unicode code points do not give us that much either since what we most
> likely want are grapheme clusters anyway (which, like any more advanced
> Unicode processing, should be handled by a specialised library):
> http://utf8everywhere.org/#myth.strlen
> 
> 
> Cheers,
> 
> Silvan

This message made me feel obliged to turn my own gut feeling into words. This 
is not to be construed as an argument, but more of an explanation.

I view wayland protocols as rather high level: their responsibility is to 
specify the type and the purpose of the data they are transporting. In this 
case, the data is a Unicode string, and the purpose is display. Or, the data is 
a number and the purpose is indexing.

I think that when a protocol starts to specify the byte-level representation 
of that data, it can no longer be thought of as high level. In this view, 
indexing a Unicode string in terms of bytes would be akin to indexing any 
other vector of Foo in bytes. (I didn't actually check whether any other 
vector or bytes type is available in wayland.)

As you noted, there is some mixing between abstraction levels in the protocol. 
Hardcoding that it's not *just* Unicode, but also the particular encoding 
(UTF-8) eliminates problems with byte indexing we would have encountered if we 
decided to use things like Punycode (München => Mnchen-3ya). Knowing that it's 
always UTF-8 allows the protocol to use a tailored indexing scheme. While I 
consider this a layer-breaking hack, nevertheless, this property partially 
counters the above reasoning.
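
To make the Punycode point concrete, here is a minimal Python sketch. It is
only an illustration, not anything taken from the protocol; the variable
names are made up. It shows that a code point index is independent of the
encoding, while a byte offset computed for UTF-8 lands on a different
character once the same string is encoded as Punycode.

```python
# Illustration only: names here are hypothetical, not part of text-input.
text = "München"

utf8 = text.encode("utf-8")      # UTF-8 bytes of the string
puny = text.encode("punycode")   # Punycode form, as in the example above

# The code point index of the first "n" does not depend on any encoding.
cp_index = text.index("n")

# The matching byte offset is only meaningful for one specific encoding.
utf8_offset = len(text[:cp_index].encode("utf-8"))

print(cp_index, utf8_offset)
print(utf8[utf8_offset:utf8_offset + 1])   # the byte at that offset in UTF-8
print(puny[utf8_offset:utf8_offset + 1])   # same offset, Punycode bytes
```

In UTF-8 the offset points at the "n"; applied to the Punycode bytes, the
same number points somewhere else. Fixing the encoding in the protocol is
what makes a byte offset unambiguous.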

* * *

To be honest, neither Unicode code points nor grapheme clusters are what 
we're truly looking for here. To understand what I mean, I recommend playing 
with this word:

नमस्ते

According to the Rust book [0], it's composed of 6 code points: ['न', 'म', 'स', 
'्', 'त', 'े'], yet when I move the cursor through it, I would be led to 
believe it's only 4 "pieces" long.
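
The different ways of counting this word can be checked with a short Python
sketch. This is only a rough illustration using the standard library; the
helper name byte_offset is made up, and counting non-mark code points is a
crude stand-in for real text segmentation, which needs a dedicated Unicode
library.

```python
import unicodedata

# The word from above, spelled out as its six code points.
word = "\u0928\u092e\u0938\u094d\u0924\u0947"

def byte_offset(s: str, cp_offset: int) -> int:
    """Convert a code point offset to a UTF-8 byte offset.

    This has to scan the prefix: the conversion is not possible
    without first processing the string.
    """
    return len(s[:cp_offset].encode("utf-8"))

code_points = len(word)                 # number of Unicode code points
utf8_bytes = len(word.encode("utf-8"))  # number of bytes in UTF-8

# UTF-8 byte offset of every code point boundary, including the end.
offsets = [byte_offset(word, i) for i in range(len(word) + 1)]

# Crude heuristic (an assumption, not a segmentation algorithm):
# count code points that are not combining marks.
non_marks = sum(
    1 for c in word if not unicodedata.category(c).startswith("M")
)

print(code_points, utf8_bytes, offsets, non_marks)
```

Only every third byte offset is a valid code point boundary here, and not
every code point boundary is a place where a cursor would stop.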

Cheers,
Dorota

[0] https://doc.rust-lang.org/book/second-edition/ch08-02-strings.html


_______________________________________________
wayland-devel mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/wayland-devel
