On Sat, 24 Jan 2015 09:52:32 +0100 Diederick Huijbers <[email protected]> wrote:
(I'm assuming the post was meant to be directed to the list - there are
others there with more experience than me.)

> Thanks for your explanation. Your describing a situation where you
> load UTF-16 strings into Harfbuzz, though I'm using UTF-8 string. I
> guess it's the same for UTF-8?

I didn't look carefully enough at your post. I somehow thought you got
cluster numbers 0 to 10. You didn't; you got cluster numbers 0, 3, 6,
9, 10, 13, 16, 19, 20, 23 and 26.

For UTF-8 input, the cluster numbers are the byte offsets of the first
character corresponding to the HarfBuzz cluster. The reported cluster
numbers are, by design, weakly monotonic as one progresses through the
list of glyphs - increasing for LTR writing and decreasing for RTL
writing.

The ICU positions translate to byte offsets as:

Position 0 = Byte offset 0
Position 1 = Byte offset 3
Position 2 = Byte offset 6
Position 3 = Byte offset 9
Position 4 = Byte offset 10 (previous character was ASCII space)
Position 5 = Byte offset 13
Position 6 = Byte offset 16
Position 7 = Byte offset 19
Position 8 = Byte offset 20
Position 9 = Byte offset 23
Position 10 = Byte offset 26
Position 11 = Byte offset 29 (end of string, so no cluster, no glyphs)

The ICU positions are 16-bit code unit offsets in UTF-16. I don't know
if there is a UTF-8 interface; I believe ICU word segmentation that
needs dictionary lookup is broken for UTF-8.

> I'm still trying to find a solution to map ICU graphmemes to Harfbuzz
> glyphs so I can calculate the X-offset of the caret I'm drawing. Can
> someone maybe describe how to use the Harfbuzz API and/or ICU library
> to do that?

So, in a simple case, to locate a boundary at position 1, one proceeds:

Position 1 = byte offset 3

There is a cluster at '3', so add up the advance widths of glyphs in
the previous clusters.

That is the basic algorithm, which will work for straightforward
writing systems like Vietnamese or Chinese so long as ligatures are
avoided.
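
A minimal sketch of that basic algorithm in Python, for the LTR case
only. It does not call ICU or HarfBuzz: the function names, the
(cluster, advance) tuple layout and the advance widths in the usage
note are assumptions made for illustration. The glyph list stands in
for what a shaper would report for UTF-8 input.

```python
def utf16_to_utf8_offsets(text):
    """Map each UTF-16 code unit position (as ICU reports boundaries)
    to the UTF-8 byte offset of the same boundary."""
    offsets = {}
    u16 = 0
    u8 = 0
    for ch in text:
        offsets[u16] = u8
        # Characters outside the BMP take two UTF-16 code units.
        u16 += 2 if ord(ch) > 0xFFFF else 1
        u8 += len(ch.encode("utf-8"))
    offsets[u16] = u8  # end of string
    return offsets


def caret_x(glyphs, byte_offset, text_byte_length):
    """Sum the advances of all glyphs in clusters before byte_offset.

    glyphs is a list of (cluster, x_advance) pairs for an LTR run,
    where cluster is a UTF-8 byte offset. Returns None when no cluster
    starts at byte_offset, i.e. when the simple rule has no data.
    """
    cluster_starts = {cluster for cluster, _ in glyphs}
    at_end = byte_offset == text_byte_length
    if not at_end and byte_offset not in cluster_starts:
        return None
    return sum(adv for cluster, adv in glyphs if cluster < byte_offset)
```

For a stand-in string of three groups of three 3-byte characters
separated by ASCII spaces, utf16_to_utf8_offsets() gives position 1 as
byte offset 3 and position 4 as byte offset 10, as in the table above.
With a made-up advance of 600 for the first glyph,
caret_x([(0, 600), (3, 600)], 3, 6) adds up to 600.
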
The rule is slightly different for RTL scripts.

Complications arise with ligatures and with Indic rearrangement.
Unfortunately, Unicode took Devanagari as the prototypical Indic
script, but half-forms are not an early Indic feature. Let us return
to my Tai Tham example:

<U+1A2F TAI THAM LETTER DA, U+1A60 TAI THAM SIGN SAKOT,
U+1A45 TAI THAM LETTER WA, U+1A60, U+1A75 TAI THAM SIGN TONE-1,
U+1A3F TAI THAM LETTER LOW YA, U+1A20 TAI THAM LETTER HIGH KA>

The grapheme cluster starts and contents are:

pos=0 byte offset=0 cpts: 1A2F, 1A60
pos=2 byte offset=6 cpts: 1A45, 1A60, 1A75
pos=5 byte offset=15 cpts: 1A3F
pos=6 byte offset=18 cpts: 1A20
pos=7 byte offset=21

HarfBuzz reports two clusters, at offsets 0 and 18. The glyphs, with
advance widths in brackets, are

Cluster 0: uni1A2F(1212), uni1A601A45(0), uni1A75(0), uni1A601A3F(464)
Cluster 18: uni1A20(1910)

The boundary at pos=0 is at x = 0.
The boundary at pos=6 is at x = 1212 + 464 = 1676.
The boundary at pos=7 is at x = 1676 + 1910 = 3586.

For pos=2, we have no data. The simple trick is to render the string
up to pos=2. I have to admit I do not know the ins and outs of
justification. When we do this, we get:

Cluster 0: uni1A2F(1212), uni1A60(0)

From this, we may decide that the boundary at pos=2 is at x = 1212.
Note, however, the glyph uni1A60 does not appear in the rendering of
the complete string!

For pos=5, we repeat the trick and render the string up to pos=5. We
then get:

Cluster 0: , uni1A60(0), uni25CC(1787), uni1A75(0)

From this we may decide that the boundary at pos=5 is at
x = 1212 + 1787 = 2999.

What has gone severely wrong here is the insertion of the dreaded
dashed circle. This happens for this string with *old* versions of
HarfBuzz such as the one LibreOffice is clearly using. My font clears
up the dashed circle when there is a consonant following U+1A60 in
some canonically equivalent string of Tai Tham characters, but leaves
it in a case like this because the string is *linguistically wrong*.
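
The arithmetic for pos=0, 2, 6 and 7 can be sketched in Python as
follows. This is only an illustration under stated assumptions: the
glyph lists are the (cluster, advance) values quoted above, and
shape_prefix() is a hypothetical stand-in for re-running the shaper on
the string truncated at the boundary, here backed by a lookup table
rather than a real HarfBuzz call. Only LTR text is handled.

```python
# Full-string shaping result from the example: (cluster, x_advance).
FULL_RUN = [(0, 1212), (0, 0), (0, 0), (0, 464), (18, 1910)]
TEXT_BYTE_LENGTH = 21

# Result of shaping the prefix that ends at byte offset 6 (pos=2).
PREFIX_RUNS = {6: [(0, 1212), (0, 0)]}


def shape_prefix(byte_offset):
    """Hypothetical stand-in: a real implementation would shape the
    first byte_offset bytes of the text and return the glyph list."""
    return PREFIX_RUNS[byte_offset]


def boundary_x(full_run, byte_offset, text_byte_length, shape_prefix):
    """Return the x position of a grapheme boundary in an LTR run."""
    cluster_starts = {cluster for cluster, _ in full_run}
    if byte_offset == text_byte_length or byte_offset in cluster_starts:
        # A cluster starts here: add up everything before it.
        return sum(adv for cluster, adv in full_run if cluster < byte_offset)
    # No cluster starts here: fall back to the width of the prefix.
    return sum(adv for _, adv in shape_prefix(byte_offset))


positions = {
    off: boundary_x(FULL_RUN, off, TEXT_BYTE_LENGTH, shape_prefix)
    for off in (0, 6, 18, 21)
}
```

Here positions comes out as 0, 1212, 1676 and 3586 for byte offsets
0, 6, 18 and 21. The fallback shapes one prefix per boundary that
falls inside a cluster, so a real caret implementation would probably
want to cache those results.
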
Even with up-to-date HarfBuzz, we still get a glyph for the substring
that does not appear in the full string. However, the cursor position
would then be calculated as x = 1212, i.e. the same as for the
previous grapheme cluster boundary. This is not unreasonable, for the
grapheme cluster merely leads to the addition of non-spacing glyphs.

Note that if one just examined the rendering of the string between
pos=2 and pos=5, the glyphs uni1A2F(1212), uni1A601A45(0) would be
replaced by uni1A45(1212), yet another glyph which does not appear in
the rendering of the complete string.

I hope this helps.

Richard.
