Re: Reason for breaking display of لا

Marc Lehmann Mon, 02 Jan 2023 13:43:17 -0800

On Wed, Dec 28, 2022 at 01:27:15PM -0500, Thomas Guyot-Sionnest 
<[email protected]> wrote:
> On 2022-12-28 04:49, Marc Lehmann wrote:
> > On Wed, Dec 28, 2022 at 10:46:58AM +0330, Avesta Sabayemoghadam 
> > <[email protected]> wrote:
> > > character takes 2 bytes so normally لا is an array of two chars with the
> > > size of 4 bytes. But لا has its special Unicode value "U+FEFB" which 
> > > takes
> > urxvt does not store characters in bytes, so this does not apply. urxvt
> > has no trouble storing that character, and the links you provided explain
> > that.
> > 
> > investigating the real issue is on our todo, but this is a complicated
> > problem, and at this point, urxvt does not support arabic
> > rendering/combining.
> 
> Isn't that like ligatures?


It's not exactly the same, but yes, it's pretty much like ligatures.

The issue is that ligatures are optional, while in arabic (at least that is
my understanding) text becomes pretty much unreadable without shape
combining.

The issue for terminals is that they are cell based, that is, every separate
character has its own grid position. Arabic script breaks that, and
therefore, the characters will not be combined, at least not in general (they
would be combined if they fit exactly).

The other issue is that xft/freetype have removed rendering support for
any kind of ligatures, and since that is the technology being used, it's
unlikely to work even if everything else were right.

> If we take the ligature "ﬁ" for instance, it can be de-normalized into its
> individual components "f" and "i", but cannot be normalized back.

Actually, it has to. At least that's my understanding. And non-cell-based
input generally does that (i.e. ﻝ + ﺍ looks very different without the
intervening " + ", or at least, should). If you mean that you can't just
blindly do it, that is true, as which glyph it combines into depends on
context, i.e. the same two characters have multiple different combined
glyphs, something that is beyond what urxvt can do.
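
To make the context dependence concrete, here is a minimal Python sketch
using only the standard unicodedata module. It is an illustration, not
anything urxvt does: it picks two of the precomposed lam-alef presentation
forms in Unicode and shows that both decompose to the same two letters, so
going back the other way needs to know the position in the word.

```python
import unicodedata

# The two plain letters: ARABIC LETTER LAM and ARABIC LETTER ALEF.
LAM, ALEF = "\u0644", "\u0627"

# Two precomposed presentation forms of the same letter pair.
ISOLATED = "\uFEFB"  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
FINAL = "\uFEFC"     # ARABIC LIGATURE LAM WITH ALEF FINAL FORM


def decompose(s):
    """Apply compatibility decomposition (NFKD) to a string."""
    return unicodedata.normalize("NFKD", s)


def code_points(s):
    """Format a string as a list of U+XXXX code point labels."""
    return [f"U+{ord(c):04X}" for c in s]


for form in (ISOLATED, FINAL):
    print(unicodedata.name(form), "->", code_points(decompose(form)))

# Normalization never recombines the pair, and even if it did, it could
# not know which of the two forms above was meant.
print(code_points(unicodedata.normalize("NFKC", LAM + ALEF)))
```

Both presentation forms map to the same pair, which is why a context-free
lookup cannot choose the right combined glyph.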

> It appears the difference here is that these two characters are always shown
> in their combined form as they're specific to Arabic script. I'm suspecting
> this is done by the font's ligatures as they still show as two characters,
> you can always get the cursor in between and press space, then you'll get
> the individual characters...

Urxvt does combining, but for these two characters, there are multiple
possibilities that cannot be chosen without context analysis.

The solution here would be to utilize something like harfbuzz, but it is
not clear how to apply harfbuzz to a cell based terminal - either you
break the terminal, or the script, and urxvt chooses to break the script,
essentially declaring arabic script as fundamentally incompatible with
terminal output.

I am sure a better compromise exists, but it's not clear what it is, and
we wouldn't know how to find it.

> These decimal values are the same as the hex values above for the single
> Unicode char (1x 24bit char, so 3x 8bit) and composing characters (2x 16-bit
> chars, so 4x 8bit total). Note that you cannot go back to the 3-bytes
> version after doing the NFKC normalization... You can find more info about
> the normalization forms at https://unicode.org/reports/tr15/.

With unicode, it is generally much better to talk about unicode code points
and characters, and not about their specific encoding (there are no 3 octet
characters in unicode, that would be specific to the encoding used).

It also doesn't help with urxvt, as urxvt doesn't encode characters
internally but uses unicode for everything.
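
As a small illustration of why byte counts are a property of the encoding
and not of the character, the following Python sketch reports the size of
the single code point U+FEFB and of the pair U+0644 U+0627 in three common
encodings. The helper name describe is made up for this example, and the
sketch says nothing about how urxvt stores text internally.

```python
SINGLE = "\uFEFB"      # one code point: lam-alef isolated form
PAIR = "\u0644\u0627"  # two code points: lam, then alef


def describe(s):
    """Return the code points of s and its size in bytes per encoding."""
    return {
        "code_points": [f"U+{ord(c):04X}" for c in s],
        "utf8_bytes": len(s.encode("utf-8")),
        # The -le variants avoid counting a byte order mark.
        "utf16_bytes": len(s.encode("utf-16-le")),
        "utf32_bytes": len(s.encode("utf-32-le")),
    }


for label, s in (("single", SINGLE), ("pair", PAIR)):
    print(label, describe(s))
```

The same code point takes 3 octets in UTF-8, 2 in UTF-16 and 4 in UTF-32,
so "a 3 byte character" only makes sense once the encoding is named.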

> If you're working with de-normalized text it should be fairly simple to
> write a filter that combines these two but I presume there's a lot more
> ligatures in Arabic that would have to be handled.

urxvt already does that (you can see the table it uses in
src/table/compose.h), but that isn't enough for many scripts.
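
A rough Python sketch of what a context-free, table-driven composition
step looks like, and where it stops helping. The table entries and the
function name compose_pairs are invented for this example; they are not
the contents of src/table/compose.h or urxvt's actual code.

```python
# Hypothetical excerpt of a composition table:
# (base character, following character) -> precomposed character.
COMPOSE = {
    ("e", "\u0301"): "\u00E9",  # e + COMBINING ACUTE ACCENT -> e with acute
    ("a", "\u0308"): "\u00E4",  # a + COMBINING DIAERESIS -> a with diaeresis
}


def compose_pairs(text):
    """Merge adjacent pairs that have an entry in COMPOSE, left to right."""
    out = []
    for ch in text:
        if out and (out[-1], ch) in COMPOSE:
            # Replace the previous character with the precomposed one.
            out[-1] = COMPOSE[(out[-1], ch)]
        else:
            out.append(ch)
    return "".join(out)


print(len(compose_pairs("cafe\u0301")))   # the accent is folded into the e
print(len(compose_pairs("\u0644\u0627")))  # lam + alef: no entry, unchanged
```

A table like this maps each pair to exactly one result. For lam + alef
there is no single right entry, because the correct combined form depends
on the neighbouring letters.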

> So, I'm not sure if there's an easy fix for that, maybe allowing font
> ligatures would suffice...

urxvt already allows font ligatures, but for multiple reasons, they are
not being rendered by freetype (some being due to freetype having removed
the code, and some being xft not getting the chance because letters are
rendered in separate draw calls).

> FWIW I use FiraCode in urxvt and ligatures aren't shown - everywhere else I
> use that font where ligatures works I get the combined form. As a last test
> I tried disabling ligatures in VS Code and it reverted to the individual
> form, even slightly overlapped so that was even worse, so I'm even more
> convinced it's done by ligatures now...

Yes, the problems affect everything using ligatures. And the solution is not
known either - for example, if firacode has a ligature, but it's not wide
enough to fill the space, you get gaps. There is no way to tell the font
renderer to render those gaps, so urxvt has to split rendering into multiple
calls, which in turn breaks ligatures, even *if* the font renderer would
render them in the first place.
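
The effect of splitting can be sketched in a few lines of Python. This is
a simplified model under stated assumptions, not urxvt's rendering code:
the cell width, the glyph advances and the function name
split_into_draw_calls are all hypothetical.

```python
CELL_WIDTH = 10  # hypothetical cell width in pixels


def split_into_draw_calls(glyphs, cell_width=CELL_WIDTH):
    """Group one row of cells into strings that are drawn in one call each.

    glyphs is a list of (character, advance) pairs, one per cell. A run
    has to end after any glyph whose advance differs from the cell width,
    because the renderer would otherwise place the next glyph off the grid.
    """
    calls = []
    run = ""
    for ch, advance in glyphs:
        run += ch
        if advance != cell_width:
            # End the run here so the next glyph starts at its own cell.
            calls.append(run)
            run = ""
    if run:
        calls.append(run)
    return calls


# "f" is narrower than the cell, so "f" and "i" end up in different calls
# and the renderer never sees them together.
row = [("a", 10), ("f", 8), ("i", 10), ("b", 10)]
print(split_into_draw_calls(row))
```

In this model a font where every advance matches the cell width yields a
single call per row, and any mismatch separates neighbouring letters,
which is exactly where a ligature would have had to form.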

The strict grid of a terminal requires characters to have exactly matching
widths. Most modern fonts do not support anything like that (monospace
does not mean the characters have a fixed width, something that the old
x core font system could handle, but modern systems such as harfbuzz and
fontconfig do not), and rather than simply forbidding use of those fonts,
urxvt tries to make the best of it, and something has to give.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      [email protected]
      -=====/_/_//_/\_,_/ /_/\_\

_______________________________________________
rxvt-unicode mailing list
[email protected]
http://lists.schmorp.de/mailman/listinfo/rxvt-unicode
