On Fri, 23 Jan 2015 22:45:19 +0100 Diederick Huijbers <[email protected]> wrote:
> This seems to be a 1:1 match, but my biggest question is how I can
> map the ICU boundaries to the correct HB-buffer/clusters?

HarfBuzz cluster n starts at position n of the string, counted in
UTF-16 code units, assuming you're loading UTF-16 strings into
HarfBuzz. Some ICU boundaries will not have a corresponding cluster
boundary. For example, when rendering English, the three-character
string "fit" will have ICU boundaries at positions 0, 1, 2 and 3, but,
for a good font, there will be only two HarfBuzz clusters, those
starting at positions 0 and 2. The reason is that for English, "fi" is
best rendered by a ligature.

While ligatures can mostly be handled by evenly splitting the glyph
between the components, sometimes this is spectacularly wrong. For
example, Khmer <U+1780 KHMER LETTER KA, U+17D2 KHMER SIGN COENG, U+179A
KHMER LETTER RO> formally splits into grapheme clusters <U+1780,
U+17D2> and <U+179A>, but although it appears as two glyphs, the
left-hand one derives from <U+17D2, U+179A> and the right-hand one
derives from <U+1780>. HarfBuzz reports the string as a single cluster.

> *String manipulation:*
> When I want the user to manipulate the text inside the input field,
> with e.g. delete and backspace keys, should I manipulate the
> graphemes? or the UTF-8 codepoints? or maybe something else?

Standard practice is to kick the users of complex scripts in the teeth
and deny them access to characters inside a 'grapheme cluster'. (In
one script I work with, having 3 or 4 marks within a grapheme cluster
is not unusual. Correcting the base character is impossible - I have
to retype the entire cluster.)

Deleting backwards just deletes one character, while deleting forwards
deletes a whole grapheme cluster. The left and right arrows move one
grapheme cluster at a time. I haven't worked out how cursor
positioning is done for grapheme clusters merged by ligatures.
Perhaps it is done by interpolation for European scripts and simply
given up on for Indic scripts, snapping the cursor to the boundaries of
the HarfBuzz cluster.

LibreOffice 4.3.3.2 currently gets very confused by the sequence
<U+1A2F TAI THAM LETTER DA, U+1A60 TAI THAM SIGN SAKOT, U+1A45 TAI THAM
LETTER WA, U+1A60, U+1A75 TAI THAM SIGN TONE-1, U+1A3F TAI THAM LETTER
LOW YA, U+1A20 TAI THAM LETTER HIGH KA>. <U+1A60, U+1A45> and <U+1A75>
are non-spacing glyphs. <U+1A60, U+1A3F> is a spacing combining mark
which starts below the base character. The grapheme clusters are
<U+1A2F, U+1A60>, <U+1A45, U+1A60, U+1A75>, <U+1A3F> and <U+1A20>. The
successive cursor positions are:

  Before U+1A2F (correct)
  After U+1A2F (defensible)
  3/4 of the way through U+1A20 (wildly wrong!)
  Before U+1A20 (correct)
  After U+1A20 (correct)

A civilised method of cursor positioning for knowledgeable users is to
disable shaping of a cluster when the cursor is within the cluster -
the user can then see what he is doing. This is particularly useful if
transposing characters results in a visually identical but canonically
inequivalent string. The disadvantage is that there may be significant
reflow issues when working with paragraphs.

There doesn't seem to be a convention for switching between stepping by
grapheme and stepping by character.

Richard.
_______________________________________________
HarfBuzz mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/harfbuzz
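
As a rough illustration of the two cursor strategies mentioned in the
message (interpolating inside a ligature, or snapping to the HarfBuzz
cluster), here is a minimal Python sketch. It assumes left-to-right
text, a first cluster starting at offset 0, ICU boundaries and cluster
starts expressed in the same code units, and one advance per cluster
that the caller has already summed from the glyph advances. The names
cluster_spans and caret_positions and all the advance values are made
up for illustration; none of this is HarfBuzz or ICU API, and snapping
to the cluster start rather than the nearer edge is an arbitrary choice.

```python
from bisect import bisect_right


def cluster_spans(cluster_starts, text_len):
    """Turn cluster start offsets into (start, end) spans over the text."""
    starts = sorted(set(cluster_starts))
    ends = starts[1:] + [text_len]
    return list(zip(starts, ends))


def caret_positions(boundaries, cluster_starts, advances, text_len,
                    interpolate=True):
    """Map each text boundary to an x offset (hypothetical helper).

    advances[i] is the total advance of the i-th cluster in logical
    order. Assumes left-to-right text and a first cluster at offset 0.
    """
    spans = cluster_spans(cluster_starts, text_len)
    starts = [start for start, _ in spans]

    # x offset of each cluster's left edge, plus the final right edge.
    edges = [0.0]
    for advance in advances:
        edges.append(edges[-1] + advance)

    result = {}
    for b in boundaries:
        if b >= text_len:
            result[b] = edges[-1]          # end of text
            continue
        i = bisect_right(starts, b) - 1    # cluster containing b
        start, end = spans[i]
        if b == start or not interpolate:
            result[b] = edges[i]           # exact hit, or snap to start
        else:
            # Split the cluster's advance evenly between the boundaries
            # that fall strictly inside it.
            inner = sorted(p for p in boundaries if start < p < end)
            k = inner.index(b) + 1
            result[b] = edges[i] + advances[i] * k / (len(inner) + 1)
    return result


# "fit": ICU boundaries 0..3, clusters at 0 ("fi" ligature) and 2 ("t").
fit = caret_positions([0, 1, 2, 3], [0, 2], [10.0, 6.0], 3)
# fit == {0: 0.0, 1: 5.0, 2: 10.0, 3: 16.0}

# Khmer <U+1780, U+17D2, U+179A>: one cluster, so snap, don't interpolate.
khmer = caret_positions([0, 2, 3], [0], [20.0], 3, interpolate=False)
# khmer == {0: 0.0, 2: 0.0, 3: 20.0}
```

Interpolation gives a plausible caret for "fi", but for the Khmer
sequence the glyph order does not follow the character order, which is
why the sketch offers snapping as the alternative.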
