On Sun, May 06, 2018 at 10:37:57PM +0200, Dorota Czaplejewicz wrote:
> On Sat, 5 May 2018 13:37:44 +0200
> Silvan Jegen <[email protected]> wrote:
>
> > On Sat, May 05, 2018 at 11:09:10AM +0200, Dorota Czaplejewicz wrote:
> > > On Fri, 4 May 2018 22:32:15 +0200
> > > Silvan Jegen <[email protected]> wrote:
> > >
> > > > On Thu, May 03, 2018 at 10:46:47PM +0200, Dorota Czaplejewicz wrote:
> > > > > On Thu, 3 May 2018 21:55:40 +0200
> > > > > Silvan Jegen <[email protected]> wrote:
> >
> > [...]
> >
> > > In the end, I'm not an expert in that area either - perhaps treating
> > > client side strings as UTF-8 buffers makes sense, but at the moment
> > > I'm still leaning towards the code point abstraction.
> >
> > Someone (™) should probably implement a client making use of the protocol
> > to see what the real world impact of this protocol change would be.
> >
> > The editor in the weston project uses pango for its text layout:
> >
> > https://cgit.freedesktop.org/wayland/weston/tree/clients/editor.c#n824
> >
> > so it would have to parse the UTF-8 string twice. The same is most likely
> > true for all programs using GTK...
> >
>
> I made an attempt to dig deeper, and while I stopped short of becoming
> this Someone for now, I gathered what I think are some important
> results.
>
> First, the state of the libraries. There's a lot of data I gathered,
> so I'll keep this section rather dense. First, another contender
> for the title of text layout library, and that one uses code points
> exclusively:
>
> https://github.com/silnrsi/graphite/blob/master/include/graphite2/Segment.h
> `gr_make_seg`
>
> https://github.com/silnrsi/graphite/blob/master/tests/examples/simple.c
>
> Afterwards, I focused on GTK and Qt. As an input method plugin
> developer, I looked at the IM interfaces and internal data structures
> they expose. The results were not that clear - no mention of "code
> points", some references to "bytes", many to "characters" (not
> "chars").
> What is certain is that there's a lot of converting going on

Yes, it's very unfortunate that a lot of developers do not strive for
more clarity and precision in terminology when processing text.

> behind the scenes anyway. First off, GTK seems to be moving away from
> bytes, judging by the comments:
>
> gtk 3.22 (`gtkimcontext.c`)
>
> `gtk_im_context_delete_surrounding`
>
> > * Asks the widget that the input context is attached to to delete
> > * characters around the cursor position by emitting the
> > * GtkIMContext::delete_surrounding signal. Note that @offset and @n_chars
> > * are in characters not in bytes which differs from the usage other
> > * places in #GtkIMContext.
>
> `gtk_im_context_get_preedit_string`
>
> > * @cursor_pos: (out): location to store position of cursor (in characters)
> > * within the preedit string.
>
> `gtk_im_context_get_surrounding`
>
> > * @cursor_index: (out): location to store byte index of the insertion
> > * cursor within @text.
>
> gtkEntry seems to store things internally as characters.

They mention "characters" but what they most likely mean are Unicode
code points. One would think they would try to keep their APIs consistent
but that doesn't seem to be the case.

> While GTK using code points internally is not a proof of anything,
> it's a suggestion that there is a reason not to use bytes.
>
> Then, Qt, from https://doc.qt.io/qt-5/qinputmethodevent.html#setCommitString
>
> > replaceLength specifies the number of characters to be replaced
>
> a confirmation that "characters" means "code points" comes from
> https://doc.qt.io/qt-5/qlineedit.html#cursorPosition-prop . The value
> reported when "æþ|" is displayed is 2.

https://doc.qt.io/qt-5/qstring.html

Qt uses UTF-16 internally so they *could* also be counting "QChars" which
are 16-bit (assuming the position is 0 indexed):

Python 3.6.5 (default, Apr 14 2018, 13:17:30)
[GCC 7.3.1 20180406] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> "æþ"
'æþ'
>>> "æþ".encode("utf-16")
b'\xff\xfe\xe6\x00\xfe\x00'

If they are really doing that you would only notice it with characters
outside of the BMP because:

"(Unicode characters with code values above 65535 are stored using
surrogate pairs, i.e., two consecutive QChars.)"

I think everybody agrees that (Unicode) text handling is a mess in general...

> I also spent more time than I should writing a demo implementation
> of an input method and a client connecting to it to check out the
> proposed interfaces. Predictably, it gave me a lot of trouble
> on the edges between bytes and code points, but I blame it on
> Rust's scarcity of UTF handling functions. The hack is available at
> https://code.puri.sm/dorota.czaplejewicz/impoc

Thanks for taking the time! I compiled and ran it but my Rust is weak...

Rust has an interesting String type:

https://doc.rust-lang.org/std/string/struct.String.html#utf-8

It's UTF-8 encoded but you are not allowed to index into it.

> My impression at the moment is that it doesn't matter much how offsets
> within UTF strings are encoded, but that code points slightly better
> reflect what's going on in the GUI toolkits, apart from the benefits
> mentioned in my other emails. There seems to be so much going on
> behind the scenes and the parsing is so cheap that it doesn't make
> sense to worry about the computational aspect, just try to make things
> easier to get right.
>
> Unless someone chimes in with more arguments, I'm going to keep using
> code points in following revisions.

The only argument I have for using byte offsets instead of Unicode code
points is that you will have to parse the UTF-8 string twice in case your
text rendering library lets you only use byte lengths. That seems to be the
case for pango, which I assume is commonly used.

If I come up with more arguments I will send another mail...


Cheers,

Silvan
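
P.S. To make the three ways of counting concrete, here is a minimal
sketch of converting a code point offset (what the protocol would carry)
into a UTF-8 byte offset (what a byte-based layout library would want)
and into a UTF-16 code unit offset (what a QChar count would be). The
helper name `offsets` is made up for illustration and is not part of any
toolkit or of the proposed protocol; the extra pass over the prefix is
the "parsing twice" cost mentioned above.

```python
def offsets(text, cp_offset):
    """Return (utf8_byte_offset, utf16_unit_offset) for a code point offset.

    Hypothetical helper for illustration only. Python 3 strings are
    indexed by code point, so slicing gives the prefix before the cursor.
    """
    prefix = text[:cp_offset]
    # Number of bytes the prefix occupies when encoded as UTF-8.
    utf8_offset = len(prefix.encode("utf-8"))
    # "utf-16-le" omits the BOM; every UTF-16 code unit is two bytes.
    utf16_offset = len(prefix.encode("utf-16-le")) // 2
    return utf8_offset, utf16_offset


# Cursor after two code points, all inside the BMP.
print(offsets("æþ", 2))            # (4, 2)

# Same cursor position, but the second code point (U+1F600) is outside
# the BMP, so it takes four UTF-8 bytes and a UTF-16 surrogate pair.
print(offsets("æ\U0001F600þ", 2))  # (6, 3)
```

With BMP-only text the code point count and the UTF-16 count agree,
which is why the Qt documentation example alone cannot tell the two
apart.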
