Richard, I am working on a new update of InSC for Unicode 8.0, which is available at https://github.com/roozbehp/unicode-data.
After that, we'll push it into HarfBuzz.

It would be best if you suggested updates to the Unicode property instead, including potentially subdividing a property value. That way, users of all implementations (including Microsoft's Universal Shaping Engine) would benefit. Please take a look and send your suggestions to me or to the UTC (or file bugs at https://github.com/roozbehp/unicode-data/issues).

If there is still a need to change something in HarfBuzz after that, we can do that too.

On Mon, Feb 23, 2015 at 4:31 PM, Richard Wordingham <[email protected]> wrote:

> On Sun, 1 Feb 2015 01:39:42 +0000
> Richard Wordingham <[email protected]> wrote:
>
> > I've been having some problems with spurious dotted circles in various
> > versions of HarfBuzz, and I thought I would share before proposing a
> > complete solution to Behdad.
>
> Well, no-one has shown any interest, so I will go ahead with my
> proposals/requests. For ease of reference, I have deleted little from
> my original post.
>
> > I've been looking at 3 versions of HarfBuzz:
>
> > 'LibreOffice 4.3.4', i.e. whatever (clearly old) version of HarfBuzz
> > is in that version of LibreOffice.
>
> When checking the version later, I saw 'LibreOffice 4.3.3.2', so it's
> possible LibreOffice 4.3.4 is different.
>
> > 'HarfBuzz 0.9.38+', i.e. the latest sources at some time today.
>
> Some time on Saturday 31 January 2015 might be more precise.
>
> > 'New ISC', i.e. HarfBuzz 0.9.38+ plus changes to Indic Syllable
> > Category (ISC) as I suggested on the Unicode list on 17 May 2014 (post
> > 'Indic Syllable Categories'
> > http://www.unicode.org/mail-arch/unicode-ml/y2014-m05/0038.html).
>
> > These categories are defined in HarfBuzz by file
> > hb-ot-shape-complex-indic-table.cc. I was about to formally submit my
> > suggestions to the Unicode Technical Committee, but then I discovered
> > that the changes would adversely affect HarfBuzz.
>
> > The first problem arose with U+1A7B MAI SAM.
> > While there is no problem with its uses to indicate word (or phrase)
> > repetition by marking the last akshara and to indicate the merger of
> > two 1-consonant vowelless consonant stacks, a dotted circle occurs in
> > the example /thanon/ <U+1A33 HIGH THA, U+1A60 SAKOT, U+1A36 NA,
> > U+1A7B MAI SAM, U+1A6B SIGN O, U+1A41 RA>. The problem is that MAI
> > SAM has an ISC of 'other', so U+25CC is inserted before SIGN O.
> > Making MAI SAM a 'dependent vowel' as I had suggested fixed this
> > problem.
>
> > The second problem arose with U+1A7A RA HAAM, and could also arise
> > with U+1A7C KARAN. The problem is that with the influx of foreign
> > loans into Thai, in Thailand there are now clusters of two consonants
> > in which the *first* consonant cluster is silent. In most cases,
> > there is no way for Tai Tham to show which is silent, but when the
> > tail of the second consonant rises to the hanging baseline, the
> > placement of the cancellation marks tends to show which consonant is
> > cancelled. A (hypothetical) example is the English surname 'Dawes',
> > which is represented with three consonants in Thai. The
> > transliteration of 'w' is marked as silent. Conversely, 'Howes'
> > would be written with the transliteration of the 's' as silent. This
> > prevents the font deciding the placement of the cancellation mark on
> > a cluster by cluster basis. Following the lead of Thai, this would be
> > written <U+1A2F DA, U+1A6C SIGN OA BELOW, U+1A45 WA, U+1A7A RA HAAM,
> > U+1A60 SAKOT, U+1A48 HIGH SA>.
>
> > LibreOffice 4.3.4 splits the cluster into three syllables, <WA,
> > SAKOT>, <RA HAAM> and <HIGH SA>, and the problem is simply that the
> > subscript form cannot be generated until after the syllable
> > boundaries are dropped. This is simply a variant of the font-soluble
> > but for the future eliminated tone and SAKOT problem.
> >
> > HarfBuzz 0.9.38+ also splits the cluster into three syllables, <WA>,
> > <RA HAAM>, <U+25CC, SAKOT, HIGH SA> because RA HAAM has an ISC of
> > 'other'. New ISC marks RA HAAM as a 'pure killer'. Unfortunately,
> > this does not change the misdeduced syllable structure. I think the
> > analysis needs to treat the sequence 'pure killer', 'invisible
> > stacker' as being within a single syllable. Is this too much to ask
> > for?
> >
> > The third problem arose with U+1A7F TAI THAM COMBINING CRYPTOGRAMMIC
> > DOT, and possibly is not a real problem. I have too few examples of
> > the character's use. CRYPTOGRAMMIC DOT currently has an ISC of
> > 'other', so LibreOffice 4.3.4 and HarfBuzz 0.9.38+ split the sequence
> > <U+1A49 HIGH HA, U+1A7F CRYPTOGRAMMIC DOT, U+1A63 SIGN AA> into three
> > syllables, <HIGH HA>, <CRYPTOGRAMMIC DOT> and <U+25CC, SIGN AA>. It
> > is possible that the input sequence will not occur in the wild. In
> > 'New ISC', CRYPTOGRAMMIC DOT is reclassified as a 'nukta', and the
> > sequence is treated as a single syllable, as desired.
>
> The first (MAI SAM) and third (CRYPTOGRAMMIC DOT) problems are solved
> by recategorising the characters as interior members of the syllable,
> with Indic Syllabic categories 'matra' and 'nukta' respectively. I
> recommend that HarfBuzz make these changes, in the file
> hb-ot-shape-complex-indic-table.cc.
>
> MAI SAM is an odd matra, as its placement is determined by its phonetic
> role, not its visual position. It is worth noting that this mark is a
> superscript version of U+1A92 TAI THAM THAM DIGIT TWO, and that its
> core meaning is that there are two of something, not just one of
> something. A classification as 'Consonant_medial' would work just as
> well.
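The dotted-circle behaviour described above for CRYPTOGRAMMIC DOT can be made concrete with a toy model. This is only a sketch under stated assumptions: it is not HarfBuzz's syllable analyser, and the function split_clusters, the category names and the three rules it applies are all hypothetical, chosen just to reproduce the clusters reported in the quoted message.

```python
DOTTED_CIRCLE = "\u25CC"

# Hypothetical category names used only by this sketch.
MARKS = {"nukta", "matra", "stacker"}


def split_clusters(text, isc):
    """Toy cluster splitter (an assumption, not HarfBuzz's algorithm).

    - a consonant starts a cluster, unless it follows a stacker;
    - a mark joins the current cluster, or gets a dotted-circle base
      if the current cluster has no base to attach to;
    - anything else ('other') forms a cluster of its own.
    """
    clusters = []
    has_base = False       # can the current cluster still take marks?
    after_stacker = False  # was the previous character a stacker?
    for ch in text:
        cat = isc.get(ch, "other")
        if cat == "consonant":
            if has_base and after_stacker:
                clusters[-1].append(ch)  # subjoined consonant
            else:
                clusters.append([ch])
            has_base, after_stacker = True, False
        elif cat in MARKS:
            if not has_base:
                clusters.append([DOTTED_CIRCLE])
                has_base = True
            clusters[-1].append(ch)
            after_stacker = cat == "stacker"
        else:
            clusters.append([ch])
            has_base, after_stacker = False, False
    return clusters


def show(clusters):
    """Format clusters as U+XXXX code points separated by ' | '."""
    return " | ".join(
        " ".join(f"U+{ord(c):04X}" for c in cluster) for cluster in clusters
    )


HIGH_HA, CRYPTO_DOT, SIGN_AA = "\u1A49", "\u1A7F", "\u1A63"
sequence = [HIGH_HA, CRYPTO_DOT, SIGN_AA]

# CRYPTOGRAMMIC DOT is absent here, so it falls back to 'other'.
current = {HIGH_HA: "consonant", SIGN_AA: "matra"}
# The reclassification proposed in the quoted message.
proposed = {**current, CRYPTO_DOT: "nukta"}

print(show(split_clusters(sequence, current)))   # U+1A49 | U+1A7F | U+25CC U+1A63
print(show(split_clusters(sequence, proposed)))  # U+1A49 U+1A7F U+1A63
```

With the 'other' classification the vowel sign loses its base and picks up U+25CC; once the dot is treated as a syllable-internal mark, the three characters stay in one cluster. A real shaper applies far more rules than these three.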
>
> For the second problem, an analogue can be found in the Khmer
> sequences <U+179C KHMER LETTER VO, U+17CD KHMER SIGN TOANDAKHIAT,
> U+17D2 KHMER SIGN COENG, U+179F KHMER LETTER SA> and its anagram
> <U+179C, U+17D2, U+179F, U+17CD>, which render without complaint and
> slightly differently in both HarfBuzz and Windows 7 (other versions not
> tested).
>
> Now, at present, U+17CD is classified as 'Vowel_Dependent' by an
> explicit override in gen-indic-table.py, the generator of
> hb-ot-shape-complex-indic-table.cc. The same treatment would suffice
> for U+1A7A RA HAAM and U+1A7C KARAN.
>
> > The next problem was with the admittedly unusual writing <U+1A93 THAM
> > DIGIT THREE, U+1A60 SAKOT, U+1A34 LOW TA> 'three times'. None of the
> > three versions allowed the digit to be treated as a consonant base,
> > and so U+25CC was introduced before SAKOT. Does the SEA engine need
> > to be specifically instructed to treat Tai Tham decimal numbers as
> > potential character bases?
>
> The answer, I see, is that it does need to be so instructed.
>
> > Some of my changes for 'New ISC' had bad consequences. Changing
> > U+1A53 TAI THAM LETTER LAE from a letter to an independent vowel
> > resulted in <U+1A29 LOW CA, U+1A60 SAKOT, U+1A53 LAE> being split into
> > two syllables, <LOW CA, SAKOT> and <LAE>. While the font can work
> > round this, this is not good.
>
> Occasional subscripting of ancient independent vowels has been
> reported, and I think HarfBuzz should support this behaviour.
>
> > Changing U+1A74 TAI THAM SIGN MAI KANG from 'dependent vowel' to
> > 'bindu' resulted in the word <U+1A37 BA, U+1A74 MAI KANG, U+1A75
> > TONE-1> being split into two syllables, <BA, MAI KANG> and <U+25CC,
> > TONE-1>. This seems odd; U+0ECD LAO NIGGAHITA is classified by
> > Unicode as 'bindu', yet regularly has tone marks mounted on it. Is
> > the syllable splitting here a HarfBuzz error?
>
> The problem with anusvara (Indic syllabic category 'bindu') is that
> there are two types - those that terminate the syllable (a subgroup of
> Indic syllabic category type OT_SM), and those that are more matra-like
> (the rare category type OT_A). The file
> hb-ot-shape-complex-indic-table.cc maps them to the category type
> OT_SM, but the SEA syllable analyser is set up for category OT_A.
> At present, assignments to Indic category OT_A are done by
> executable code checking the character codes, and many of the
> characters in this group are in fact Vedic tone marks!
>
> I think this is an area where HarfBuzz will just have to override the
> Unicode settings - the general categorisations don't help with layout
> constraints.
>
> Richard.
> _______________________________________________
> HarfBuzz mailing list
> [email protected]
> http://lists.freedesktop.org/mailman/listinfo/harfbuzz
>
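The override approach mentioned in the quoted message (an explicit per-character override layered over the Unicode property values) might look roughly like the following. This is a minimal sketch, not the code in gen-indic-table.py: the names UNICODE_ISC, OVERRIDES and shaping_category are hypothetical, and the property values are typed in by hand from what the thread says rather than parsed from the Unicode data file.

```python
# Stand-in for values parsed from the Unicode data file. Only the value
# stated in this thread is listed; this is NOT the real data.
UNICODE_ISC = {
    0x1A7A: "Other",  # TAI THAM SIGN RA HAAM ('other' per the thread)
}

# Hypothetical override table: shaper-specific values that win over
# the Unicode property.
OVERRIDES = {
    0x17CD: "Vowel_Dependent",  # KHMER SIGN TOANDAKHIAT (existing override, per the thread)
    0x1A7A: "Vowel_Dependent",  # TAI THAM SIGN RA HAAM (suggested in the thread)
    0x1A7C: "Vowel_Dependent",  # TAI THAM SIGN KARAN (suggested in the thread)
}


def shaping_category(codepoint):
    """Return the category the shaper would use for a code point.

    An override wins if present; otherwise the Unicode value is used,
    falling back to 'Other' for code points that are not listed.
    """
    if codepoint in OVERRIDES:
        return OVERRIDES[codepoint]
    return UNICODE_ISC.get(codepoint, "Other")


for cp in (0x17CD, 0x1A7A, 0x1A7C, 0x1A7F):
    print(f"U+{cp:04X} -> {shaping_category(cp)}")
```

Keeping the overrides in a separate table leaves the Unicode-derived data untouched, so any override can be dropped once the property itself is updated.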
