index sorting in texi2any in C issue with spaces

Patrice Dumas Wed, 31 Jan 2024 01:15:45 -0800

Hello,

I implemented index sorting in C, with an XS interface, in texi2any.
When Unicode collation is wanted, based on my understanding of
Eli's suggestions, a collation locale is set to "en_US.utf-8" by
  newlocale (LC_COLLATE_MASK, "en_US.utf-8", 0)
and then strxfrm_l is used (which should be the same as using
strcoll_l).  This is used when conversion in C with XS is enabled with
the environment variable TEXINFO_XS_CONVERT=1, for now only for HTML,
and only if the TEST customization variable is not set.
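
To make the setup described above concrete, here is a minimal sketch of
comparing two strings through strxfrm_l sort keys.  It is not the
texi2any code: the function names open_collation_locale and
compare_with_collation are hypothetical, and the fallback to the "C"
locale when "en_US.utf-8" is not installed is an assumption made only so
that the sketch can run anywhere.  The feature test macro is what glibc
needs; other systems may need different headers.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for newlocale and strxfrm_l on glibc */
#endif
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: create a locale object used for collation only.
   Falls back to the "C" locale if "en_US.utf-8" is unavailable (an
   assumption of this sketch, not something the message describes). */
locale_t
open_collation_locale (void)
{
  locale_t loc = newlocale (LC_COLLATE_MASK, "en_US.utf-8", (locale_t) 0);
  if (loc == (locale_t) 0)
    loc = newlocale (LC_COLLATE_MASK, "C", (locale_t) 0);
  return loc;
}

/* Hypothetical helper: compare A and B by their strxfrm_l sort keys.
   Returns a value <0, 0 or >0, like strcmp. */
int
compare_with_collation (const char *a, const char *b, locale_t loc)
{
  size_t len_a, len_b;
  char *key_a, *key_b;
  int result;

  assert (a != NULL && b != NULL);

  /* A first call with a size of 0 returns the length needed for the key. */
  len_a = strxfrm_l (NULL, a, 0, loc) + 1;
  len_b = strxfrm_l (NULL, b, 0, loc) + 1;
  key_a = malloc (len_a);
  key_b = malloc (len_b);
  if (key_a == NULL || key_b == NULL)
    {
      /* Out of memory: fall back to a plain byte comparison. */
      free (key_a);
      free (key_b);
      return strcmp (a, b);
    }

  strxfrm_l (key_a, a, len_a, loc);
  strxfrm_l (key_b, b, len_b, loc);
  /* Comparing the transformed keys with strcmp gives the locale order. */
  result = strcmp (key_a, key_b);

  free (key_a);
  free (key_b);
  return result;
}
```

Calling compare_with_collation ("H r", "Ha", loc) with a locale obtained
from open_collation_locale is one way to check how a given system orders
strings that contain spaces; the result depends on the collation data of
the installed locale.  The locale object should be released with
freelocale when it is no longer needed.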


On my Debian GNU/Linux system, the result is good except for the
treatment of spaces.  Indeed, spaces (and non-alphanumeric characters,
but that is not really an issue) are ignored when sorting.  This follows
the Unicode collation standard, but leads to awkward sorting for indices;
for example, 'H r' is sorted after 'Ha'.  In Perl, it is possible to
customize the Unicode::Collate collation, and we use
'variable' => 'Non-Ignorable'.
Here is the corresponding comment in the code:

  # The 'Non-Ignorable' for variable collation elements means that they are
  # treated as normal characters.   This allows to have spaces and punctuation
  # marks sort before letters.
  # http://www.unicode.org/reports/tr10/#Variable_Weighting

If somebody knows how to get the same result in C, please tell me.

Also I have no idea how portable this setup is, but I guess testers and
time will tell.

-- 
Pat
