Again, thanks for all the valuable feedback. I've looked into it a bit more and things are falling into place now, though I couldn't find any concrete information on what the value of "cluster" in HarfBuzz means, or on the values returned by ICU's BreakIterator. I'm interpreting both as byte offsets, which I think is correct.

To get values in byte offsets I had to use a UText instead of a UnicodeString in combination with the BreakIterator. When using a UnicodeString, the data I pass into the constructor is converted from UTF-8 to UTF-16, and the values returned by the BreakIterator don't align with the byte offsets in the HarfBuzz clusters (hb_glyph_info_t).

My current thinking on how to calculate the caret position is as follows.

*Let's say I want to position the caret just before the 2nd grapheme:*

- find the byte offset of the 2nd grapheme (using BreakIterator)
- find the HB cluster to which the grapheme belongs, based on the byte offset
- using the start and end byte offsets of the cluster, check how many graphemes are part of the HB cluster. We divide the x_advance by this number so we know how far to move the cursor per grapheme in the cluster.

I created an image that clarifies the meaning of graphemes, glyphs, clusters and the byte values. You can find the image here:
https://www.flickr.com/photos/diederick/15749726814/

Just wanted to share this approach and hopefully get some feedback.

Best
D

On Sat, Jan 24, 2015 at 3:43 PM, Diederick Huijbers ☾ <[email protected]> wrote:

> Hi Richard,
>
> It seems that gmail automatically replied to your email address, not to
> the list.
>
> I'll paste my message here again:
>
> ----
>
> I've posted some test code which uses Freetype to load a font,
> Harfbuzz for shaping and ICU to get the graphemes. This is all
> experimental and I cannot verify if my code is the best/correct way.
>
> But this is a start that I'm using to calculate the caret offset for
> strings with ligatures. It does not yet contain the code to do this.
>
> https://gist.github.com/roxlu/da3251cb2045823922fa
>
> Needs to link with ICU, Freetype and Harfbuzz.
>
> D.
>
> ---
>
> Thanks for your answer; I see how I can arrive at the byte offsets when
> thinking about it, but not how to use ICU / Harfbuzz.
>
> On Sat, Jan 24, 2015 at 3:34 PM, Richard Wordingham <[email protected]> wrote:
>
>> On Sat, 24 Jan 2015 13:45:37 +0100
>> Diederick Huijbers ☾ <[email protected]> wrote:
>>
>> > Thanks so much Richard, one question though .... (see below)
>>
>> Please reply to the list ( [email protected] ), not just
>> to me.
>>
>> > > The ICU positions translate to byte offsets as:
>>
>> > > Position 0 = Byte offset 0
>> > > Position 1 = Byte offset 3
>> > > Position 2 = Byte offset 6
>> > > Position 3 = Byte offset 9
>> > > Position 4 = Byte offset 10 (previous character was ASCII space)
>> > > Position 5 = Byte offset 13
>> > > Position 6 = Byte offset 16
>> > > Position 7 = Byte offset 19
>> > > Position 8 = Byte offset 20
>> > > Position 9 = Byte offset 23
>> > > Position 10 = Byte offset 26
>> > > Position 11 = Byte offset 29 (end of string, so no cluster, no
>> > > glyphs)
>>
>> > > The ICU positions are 16-bit word offsets in UTF-16. I don't know
>> > > if there is a UTF-8 interface; I believe ICU word segmentation that
>> > > needs dictionary lookup is broken for UTF-8.
>>
>> > How did you arrive to this mapping? I'm wondering what structs hold
>> > these information.
>>
>> If it's precomputed for you, I think that will be done by ICU rather
>> than by HarfBuzz.
>>
>> I know the lengths of Unicode characters (by codepoint) in the UTF-8 and
>> UTF-16 encodings. I also knew that the HarfBuzz cluster numbers would
>> be byte offsets, so I checked my workings that way. I would
>> generate such a table by stepping through the string, character by
>> character. Strictly, one should ensure that the UTF-8 string consists
>> only of UTF-8 characters, e.g. no CESU-8 or Latin-1 masquerading as
>> UTF-8. I would treat surrogate codepoints (U+D800 to U+DFFF) as
>> corresponding to two UTF-8 bytes. If the string originates as a
>> sequence of characters in UTF-8, there will be no lone surrogates to
>> create trouble.
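
A minimal sketch of how such a table could be generated by stepping through the string character by character. It assumes well-formed UTF-8 and does not validate it, so the CESU-8 and lone-surrogate caveats above still apply. The names utf8_sequence_length and utf16_to_utf8_offsets are made up for illustration; they are not ICU or HarfBuzz API.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Length in bytes of the UTF-8 sequence introduced by lead byte b.
// Assumes well-formed input; anything unexpected is treated as one byte
// so that the loop below always makes progress.
static std::size_t utf8_sequence_length(unsigned char b) {
    if (b < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;                          // stray continuation or invalid byte
}

// table[i] is the UTF-8 byte offset that corresponds to UTF-16 code-unit
// offset i. The last entry is the end of the string.
std::vector<std::size_t> utf16_to_utf8_offsets(const std::string& utf8) {
    std::vector<std::size_t> table;
    std::size_t byte = 0;
    while (byte < utf8.size()) {
        std::size_t len =
            utf8_sequence_length(static_cast<unsigned char>(utf8[byte]));
        if (len > utf8.size() - byte) len = utf8.size() - byte;  // truncated tail
        table.push_back(byte);
        // A 4-byte UTF-8 character is a surrogate pair (two code units) in
        // UTF-16. Both code units are mapped to the start of the character.
        if (len == 4) table.push_back(byte);
        byte += len;
    }
    table.push_back(utf8.size());
    return table;
}
```

For a string made of three 3-byte characters, a space, three more, a space and three more, this gives the offsets 0, 3, 6, 9, 10, 13, 16, 19, 20, 23, 26, 29, which matches the table quoted above.
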
>>
>> I would test the generation of this conversion table using a mixture of
>> 1-byte, 2-byte and 4-byte characters.
>>
>> Richard.
>> _______________________________________________
>> HarfBuzz mailing list
>> [email protected]
>> http://lists.freedesktop.org/mailman/listinfo/harfbuzz
>>
>
> --
> Apollo +++++++++
> Interactive Media
> +++++++++++++++
> Diederick Huijbers ===
> [email protected]
> ====================
> Zeeburgerpad 74 ::::::::
> 1019 AD Amsterdam
> mobile 06 - 12 44 09 22
> phone 020 - 707 78 96
> //\\//\\//\\//\\//\\//\\//\\//\\//\\
> www.apollomedia.nl +++
> ++++++++++++++++
>
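
P.S. The caret calculation outlined at the top of this message could be sketched roughly as follows. This is only an illustration under some assumptions: left-to-right text, clusters sorted by ascending byte offset and covering the string without gaps, and advances already summed per cluster and converted to drawing units. Cluster and caret_x are hypothetical names. In real code the cluster ranges would be derived from hb_glyph_info_t.cluster, the advances from hb_glyph_position_t.x_advance, and grapheme_starts from the BreakIterator boundaries.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one HarfBuzz cluster: the byte range it covers
// in the UTF-8 text and the total x advance of its glyphs.
struct Cluster {
    std::size_t start_byte;  // first byte covered by the cluster
    std::size_t end_byte;    // one past the last byte covered
    double advance;          // summed x_advance of the cluster's glyphs
};

// Returns the x position of a caret placed just before the grapheme with
// index grapheme_index. grapheme_starts holds the byte offset at which each
// grapheme begins. An index past the last grapheme puts the caret at the
// end of the line.
double caret_x(const std::vector<Cluster>& clusters,
               const std::vector<std::size_t>& grapheme_starts,
               std::size_t grapheme_index) {
    double x = 0.0;
    if (grapheme_index >= grapheme_starts.size()) {
        for (const Cluster& c : clusters) x += c.advance;
        return x;
    }
    const std::size_t target = grapheme_starts[grapheme_index];
    for (const Cluster& c : clusters) {
        if (target >= c.end_byte) {  // cluster lies entirely before the caret
            x += c.advance;
            continue;
        }
        // The target grapheme is in this cluster: count the graphemes the
        // cluster contains and how many of them precede the target.
        std::size_t total = 0, before = 0;
        for (std::size_t g : grapheme_starts) {
            if (g >= c.start_byte && g < c.end_byte) {
                ++total;
                if (g < target) ++before;
            }
        }
        // Split the cluster's advance evenly between its graphemes.
        if (total > 0) x += c.advance * before / total;
        return x;
    }
    return x;
}
```

For example, with a ligature cluster that covers two graphemes and has an advance of 20, the caret before the second grapheme lands 10 units into the cluster. Splitting the advance evenly is an approximation and will not match the visual split of every ligature.
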
