On Wed, Dec 28, 2022 at 01:27:15PM -0500, Thomas Guyot-Sionnest <[email protected]> wrote:
> On 2022-12-28 04:49, Marc Lehmann wrote:
> > On Wed, Dec 28, 2022 at 10:46:58AM +0330, Avesta Sabayemoghadam
> > <[email protected]> wrote:
> > > character takes 2 bytes so normally لا is an array of two chars with the
> > > size of 4 bytes. But لا has it's special Unicode value "U+FEFB" which
> > > takes
> >
> > urxvt does not store characters in bytes, so this does not apply. urxvt
> > has no trouble storing that character, and the links you provided explain
> > that.
> >
> > investigating the real issue is on our todo, but this is a complicated
> > problem, and at this point, urxvt does not support arabic
> > rendering/combining.
>
> Isn't that like ligatures?

It's not exactly the same, but yes, it's pretty much like ligatures. The
issue is that ligatures are optional, while in arabic (at least that is my
understanding) text becomes pretty much unreadable without shape combining.

The issue for terminals is that they are cell based, that is, every
separate character has its own grid position. Arabic script breaks that,
and therefore, the characters will not be combined, at least not in general
(they would be combined if they fit exactly).

The other issue is that xft/freetype have removed rendering support for any
kind of ligatures, and since that is the technology being used, it's
unlikely to work even if everything else were right.

> If we take the ligature "fi" for instance, it can be de-normalized into its
> individual components "f" and "i", but cannot be normalized back.

Actually, it has to. At least that's my understanding. And non-cell-based
input generally does that (i.e. ﻝ + ﺍ looks very different without the
intervening " + ", or at least, should).

If you mean that you can't just blindly do it, that is true, as which glyph
it combines into depends on context, i.e. the same two characters have
multiple different combined glyphs, something that is beyond what urxvt can
do.

> It appears the difference here is that these two characters are always shown
> in their combined form as they're specific to Arabic script. I'm suspecting
> this is done by the font's ligatures as they still shows as two characters,
> you can always get the cursor in between and press space, then you'll get
> the individual characters...

Urxvt does combining, but for these two characters, there are multiple
possibilities that cannot be chosen without context analysis. The solution
here would be to use something
like harfbuzz, but it is not clear how to apply harfbuzz to a cell-based
terminal: either you break the terminal, or the script, and urxvt chooses
to break the script, essentially declaring arabic script as fundamentally
incompatible with terminal output. I am sure a better compromise exists,
but it's not clear what it is, and we wouldn't know how to find it.

> These decimal values are the same as the hex values above for the single
> Unicode char (1x 24bit char, so 3x 8bit) and composing characters (2x 16-bit
> chars, so 4x 8bit total). Note that you cannot go back to the 3-bytes
> version after doing the NKFC normalization... You can find more info about
> the normalization forms at https://unicode.org/reports/tr15/.

With unicode, it is generally much better to talk about unicode code points
and characters, and not about their specific encoding (there are no 3-octet
characters in unicode, that would be specific to the encoding used). It
also doesn't help with urxvt, as urxvt doesn't encode characters internally
but uses unicode for everything.

> If you're working with de-normalized text it should be fairly simple to
> write a filter that combines these two but I presume there's a lot more
> ligatures in Arabic that would have to be handled.

urxvt already does that (you can see the table it uses in
src/table/compose.h), but that isn't enough for many scripts.

> So, I'm not sure if there's an easy fix for that, maybe allowing font
> ligatures would suffice...

urxvt already allows font ligatures, but for multiple reasons, they are not
being rendered by freetype (some being due to freetype having removed the
code, and some being xft not getting the chance because letters are
rendered in separate draw calls).

> FWIW I use FiraCode in urxvt and ligatures aren't shown - everywhere else I
> use that font where ligatures works I get the combined form.
> As a last test I tried disabling ligatures in VS Code and it reverted to
> the individual form, even slightly overlapped so that was even worse, so
> I'm even more convinced it's done by ligatures now...

Yes, the problems affect everything using ligatures. And the solution is
not known either. For example, if firacode has a ligature, but it's not
wide enough to fill the space, you get gaps. There is no way to tell the
font renderer to render those gaps, so urxvt has to split rendering into
multiple calls, which in turn breaks ligatures, even *if* the font renderer
would render them in the first place.

The strict grid of a terminal requires characters to have exactly matching
widths. Most modern fonts do not support anything like that (monospace does
not mean the characters have a fixed width, something that the old x core
font system could handle, but modern systems such as harfbuzz and
fontconfig do not), and rather than simply forbidding use of those fonts,
urxvt tries to make the best of it, and something has to give.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      [email protected]
      -=====/_/_//_/\_,_/ /_/\_\

_______________________________________________
rxvt-unicode mailing list
[email protected]
http://lists.schmorp.de/mailman/listinfo/rxvt-unicode
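
To make the code point versus encoding distinction from this thread concrete, here is a minimal sketch using Python's standard unicodedata module. It is only an illustration and has nothing to do with how urxvt works internally; the helper names codepoints() and byte_lengths() are made up for this example. It shows that the byte counts quoted in the thread are properties of UTF-8, not of the characters, and that compatibility normalization turns U+FEFB into two code points that canonical composition does not merge back.

```python
import unicodedata

LAM_ALEF = "\uFEFB"  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
FI = "\uFB01"        # LATIN SMALL LIGATURE FI


def codepoints(s):
    """Hypothetical helper: list the code points of s as U+XXXX strings."""
    return [f"U+{ord(c):04X}" for c in s]


def byte_lengths(s):
    """Hypothetical helper: encoded size of s in three common encodings."""
    return {enc: len(s.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}


# Compatibility normalization replaces the presentation form by lam + alef.
decomposed = unicodedata.normalize("NFKC", LAM_ALEF)

# One code point; its size in bytes depends entirely on the encoding chosen.
print(codepoints(LAM_ALEF), byte_lengths(LAM_ALEF))

# Two code points after normalization.
print(codepoints(decomposed), byte_lengths(decomposed))

# Canonical composition (NFC) does not recombine them into U+FEFB.
print(unicodedata.normalize("NFC", decomposed) == decomposed)

# The Latin "fi" ligature behaves the same way under NFKC.
print(codepoints(unicodedata.normalize("NFKC", FI)))
```

Normalization only changes which code points are stored. Choosing the right joined glyph for a given context is a shaping problem, which is the part that needs something like harfbuzz.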
