On Sat, 23 May 2020 09:09:48 +0300 Eli Zaretskii <[email protected]> wrote:

> > Date: Fri, 22 May 2020 22:22:49 +0100
> > From: Richard Wordingham <[email protected]>
> >
> > > The current support for producing ligatures works in the same way
> > > as complex text shaping for scripts that require that, like
> > > Arabic and Khmer: the sequences of characters that can be
> > > displayed as ligatures are identified in advance with suitable
> > > regular expressions, and the display engine then passes these
> > > sequences to hb_shape to produce the ligatures.
> > >
> > > This works well for scripts that require complex shaping, because
> > > such scripts generally have well-defined rules for the sequences
> > > of codepoints that need shaping.
> >
> > They may of course have more than one set of such rules, with the
> > rule sets defining different sets of sequences.
>
> Who are "they" in this context?

Devanagari and Tai Tham are two examples I am aware of.

Devanagari has different rules for positioning of Vedic marks between
fonts using the script tags dev and dev2 for it on one hand and the
unofficial script tag dev3, which follows the USE rules for character
ordering. For tag dev, Microsoft says that <consonant, virama,
candrabindu, consonant> is one cluster; others, including Unicode, say
it's two. Candrabindu in the middle and candrabindu at the end mean
different things; the former nasalises a consonant, while the latter
nasalises a vowel. The visual distinction exists, at least when
half-forms are used.

Tai Tham has an issue with the mark U+1A58 TAI THAM SIGN MAI KANG LAI.
It is, at least formally, a non-spacing mark. It occurs at the
juncture of two syllables in the same word. Modern, printed Tai Khuen
happily treats it as syllable-final. In more traditional styles, it
starts syllables, going above the first consonant, and so to the right
of a vowel mark reordered to the left hand side of the syllable. Some
fonts seem to just let it hang over the start of the next syllable,
taking pot luck with what's there.
That gives two different syllable structures. As I supported the style
found in a certain dictionary, it sometimes belongs with the syllable
before, and sometimes with the syllable after it. I therefore ended up
defining the sequences to be shaped as a sequence of one or more
syllables joined together by U+1A58. Fortunately, normal cursor motion
is controlled by a different definition. (I'm still using Emacs 24.4
with the restoration of interactive commands forward-char-intrusive and
backward-char-intrusive and their interface within the C code.)

> I understand that the number of combinations is theoretically
> unbounded. I'm asking if it is also unbounded in practice. That is,
> do font designers add ligatures for arbitrary combinations of
> characters, regardless of some reasonable set of requirements? For
> example, is the set of ligatures of Latin characters shown here:
>
> https://en.wikipedia.org/wiki/Orthographic_ligature#Latin_alphabet
>
> reasonably complete, or should I expect any number of other arbitrary
> combinations of Latin characters popping up in fonts? And if the
> latter, then what is the purpose of providing such arbitrary
> ligatures?

Doesn't the existence of ligatures for 'Eisenhower' and 'Chamberlain'
provide enough of an answer? If you claim to support handwriting
fonts, then you can expect others - 'sh', 'tt' and 'ing' are fairly
obvious ones.

You may also find ligatures being used to sort out kerning issues. One
problem I've observed with computer fonts is that the spacing of glyphs
in a string is not consistent. This appears to be due to the way the
positioning of the glyphs is rounded. The problem can be bad enough
that the designer ends up fixing the problem by combining them into a
single glyph, which formally is a ligature. I've not noticed this in
ASCII fonts, but then I haven't looked hard at them.

The 'tt' ligature can arise because the two t's are crossed by a single
stroke.
Crossing the 't' in 'lt' might be handled by a special 't' glyph, or
one might just form an 'lt' ligature. The ending 'ing' is common
enough that I unconsciously developed an abbreviated way of writing it.

> I'm not talking about Arabic. Emacs has a set of regular expressions
> for sequences of Arabic characters that need shaping, misc-lang.el in
> Emacs. If the set is incomplete, we can augment it.

That regular expression treats every Arabic word as in need of shaping.

> If a font requires special shaping for any sequence of any number of
> 26 (or maybe 52) ASCII letters, then the Emacs display engine will
> need to be redesigned. So this extreme possibility doesn't bother me.

In general, they do require it. But how is this worse than handling
Arabic? Is the problem that you want to keep the option of line
wrapping splitting words for ASCII, but are not bothered for Arabic or
other human languages?

ASCII does not satisfyingly suffice for English.

> > How would you handle the possibility that all three of <æ>, <a, e>
> > and <a, ZWJ, e> might be rendered by the same glyph, although they
> > are comprised of 1, 2 and 3 characters respectively?
>
> By using a composition rule that matches both <a, e> and <a, ZWJ, e>.
> The rules are regexp-based, and expressing the above as a regexp is
> simple. Once a sequence of characters matches the regexp, Emacs calls
> the shaper (hb_shape etc.) to produce the font glyphs for the
> sequence, and displays the glyphs that the shaper returns.

I think you mean that Emacs would store the position of components by
an index that was the sequence of characters, not the glyph ID. That
would also deal with precomposed characters - it would be the character
sequence that mattered, and for cursor movement and rendering, the
canonically equivalent sequence(s) and the precomposed character would
remain distinct.

Richard.
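
[Illustrative note, not part of the original exchange.] The kind of
composition rule Eli describes, a single regexp matching both <a, e>
and <a, ZWJ, e>, can be sketched roughly as follows. This is only a
minimal illustration in Python; it is not Emacs Lisp and does not
reflect Emacs's actual composition machinery. The names AE_RULE and
shaping_runs are hypothetical, and a real display engine would hand
each matched run to the shaper (hb_shape) instead of merely reporting
its span.

```python
import re

ZWJ = "\u200d"  # U+200D ZERO WIDTH JOINER

# Hypothetical composition rule: "a", an optional ZWJ, then "e".
# (Assumption for illustration; Emacs's own rules are written in Emacs Lisp.)
AE_RULE = re.compile("a" + ZWJ + "?e")


def shaping_runs(text):
    """Return (start, end) character spans that would be passed to the shaper."""
    return [m.span() for m in AE_RULE.finditer(text)]


# Two-character <a, e>, three-character <a, ZWJ, e>, one-character <æ>.
sample = "ae a\u200de \u00e6"
print(shaping_runs(sample))
```

The spans are indices into the character sequence, not glyph IDs. The
precomposed <æ> is a single character and never matches the rule, so
the three spellings stay distinct at the character level even if a font
draws them with the same glyph.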
_______________________________________________
HarfBuzz mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/harfbuzz
