Re: index sorting in texi2any in C issue with spaces

Gavin Smith Wed, 31 Jan 2024 12:11:25 -0800

On Wed, Jan 31, 2024 at 10:15:08AM +0100, Patrice Dumas wrote:
> Hello,
> 
> I implemented index sorting in C with XS interface in texi2any.
> When unicode collation is wanted, based on my understanding of
> Eli suggestions, a collation locale is set to "en_US.utf-8", by
>   newlocale (LC_COLLATE_MASK, "en_US.utf-8", 0)
> and then strxfrm_l is used (which should be the same as using
> strcoll_l).  With conversion in C/with XS set with environment variable
> TEXINFO_XS_CONVERT=1 and for now only for HTML, if TEST customization
> variable is not set.


It seems like a pretty obscure interface.  It is barely
documented: newlocale is in the Linux man pages but not the
glibc manual, and strxfrm_l is only in the POSIX standard
(https://pubs.opengroup.org/onlinepubs/9699919799/functions/strxfrm.html).
I don't know of any other way of accessing the collation functionality.
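
For reference, here is a minimal sketch of how the interface is used,
assuming a glibc-like system where newlocale, strxfrm_l and freelocale
are declared in <locale.h> and <string.h>.  It is not the texi2any
code; collation_key and compare_in_locale are hypothetical names made
up for illustration.

```c
#define _POSIX_C_SOURCE 200809L  /* for newlocale, strxfrm_l, freelocale */
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: build a collation key for S in locale LOC.
   Returns a malloc'd string, or NULL on allocation failure. */
static char *
collation_key (const char *s, locale_t loc)
{
  size_t len;
  char *key;

  assert (s != NULL);
  /* A first call with size 0 only reports the length of the key. */
  len = strxfrm_l (NULL, s, 0, loc);
  key = malloc (len + 1);
  if (key)
    strxfrm_l (key, s, len + 1, loc);
  return key;
}

/* Hypothetical helper: compare A and B by their collation keys in the
   locale named LOCALE_NAME.  Stores a value <0, 0 or >0 in *RESULT and
   returns 0; returns -1 if the locale is unavailable or memory runs
   out. */
int
compare_in_locale (const char *locale_name, const char *a, const char *b,
                   int *result)
{
  char *ka, *kb;
  int status = -1;
  locale_t loc = newlocale (LC_COLLATE_MASK, locale_name, (locale_t) 0);

  if (loc == (locale_t) 0)
    return -1;

  ka = collation_key (a, loc);
  kb = collation_key (b, loc);
  if (ka && kb)
    {
      /* Keys compare bytewise in the same order strcoll_l would give. */
      *result = strcmp (ka, kb);
      status = 0;
    }
  free (ka);
  free (kb);
  freelocale (loc);
  return status;
}
```

A caller would pass something like "en_US.UTF-8" as LOCALE_NAME and
must be prepared for the -1 return, since that locale need not be
installed.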

Do you know how portable it is?  The documentation for the corresponding
Gnulib module says the following:

  Portability problems not fixed by Gnulib:
  
  This function is missing on many platforms: FreeBSD 6.0, NetBSD 5.0,
  OpenBSD 6.0, Minix 3.1.8, AIX 5.1, HP-UX 11, IRIX 6.5, Solaris 11.3,
  Cygwin 1.7.x, mingw, MSVC 14, Android 4.4.

<https://www.gnu.org/software/gnulib/manual/html_node/strxfrm_005fl.html>

Would it be possible to have an option for "current locale" collation,
which could use more standard interfaces?
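
As a rough sketch of what "current locale" collation could look like
with only ISO C interfaces (setlocale, strcoll, qsort), something along
these lines might do.  The function names are hypothetical and the
index entries are assumed to be plain NUL-terminated strings.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort callback: compare two strings with the collation rules of the
   current LC_COLLATE locale. */
static int
compare_entries (const void *pa, const void *pb)
{
  const char *a = *(const char *const *) pa;
  const char *b = *(const char *const *) pb;
  return strcoll (a, b);
}

/* Hypothetical helper: sort the N strings in ENTRIES in place. */
void
sort_index_entries (const char **entries, size_t n)
{
  assert (entries != NULL || n == 0);
  if (n > 1)
    qsort (entries, n, sizeof entries[0], compare_entries);
}
```

The program would call setlocale (LC_COLLATE, "") once at startup to
pick up the user's environment; without that call, strcoll uses the
"C" locale and compares bytes.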

Moreover, en_US.utf-8 will use collation appropriate for (US) English.
There may be language-specific "tailoring" for other languages (e.g.
Swedish) that the user may wish to use instead.  Hence, it may be
a good idea to allow use of a user-specified locale for collation through
the C code.
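
One possible shape for this, purely as an illustration: try the locale
name the user asked for (for example "sv_SE.UTF-8"), then fall back to
"en_US.UTF-8", then to "C".  The function name and the fallback order
are assumptions of this sketch, not an existing texi2any interface.

```c
#define _POSIX_C_SOURCE 200809L  /* for newlocale */
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: return a collation locale, trying REQUESTED
   first (may be NULL), then "en_US.UTF-8", then "C".  *CHOSEN receives
   the name that worked, or NULL if none did.  The caller releases the
   result with freelocale. */
locale_t
open_collation_locale (const char *requested, const char **chosen)
{
  const char *candidates[] = { requested, "en_US.UTF-8", "C" };
  size_t i;

  assert (chosen != NULL);
  for (i = 0; i < sizeof candidates / sizeof candidates[0]; i++)
    {
      locale_t loc;

      if (candidates[i] == NULL)
        continue;
      loc = newlocale (LC_COLLATE_MASK, candidates[i], (locale_t) 0);
      if (loc != (locale_t) 0)
        {
          *chosen = candidates[i];
          return loc;
        }
    }
  *chosen = NULL;
  return (locale_t) 0;
}
```

Reporting the chosen name lets the caller warn when the requested
locale was not available and a fallback was used.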

> On my debian GNU/Linux, the result is good except for the treatment of
> spaces.  Indeed, spaces (and non alphanumeric characters, but it is
> not really an issue) are ignored when sorting, which sticks to the Unicode
> collation standard, but leads to an awkward sorting for indices, for
> example 'H r' is sorted after 'Ha'.  In perl, it is possible to
> customize the Unicode::Collate collation, we use 'variable' => 
> 'Non-Ignorable'.

I think either way is in accordance with the collation standard.  The
standard gives four options and "Non-ignorable" is one of them:

http://www.unicode.org/reports/tr10/#Variable_Weighting

I doubt it is possible to customize the collation of a locale with
a function such as newlocale.  I expect the collation order is fixed
when the locale is defined.

I found some locale definition files on my system under
/usr/share/i18n/locales (a location mentioned in the man page of the
"locale" command) and there is a file iso14651_t1_common which appears
to be based on the Unicode Collation tables.  I have only skimmed this
file and don't understand the file format well (it's supposed to be
documented in the output of "man 5 locale"), but it is really part of
glibc internals.

In that file, space has a line

<U0020> IGNORE;IGNORE;IGNORE;<U0020> % SPACE

which appears to define space as a fourth-level collation element,
corresponding to the Shifted option at the link above:

  "Shifted: Variable collation elements are reset to zero at levels one
  through three. In addition, a new fourth-level weight is appended..."

In the Default Unicode Collation Element Table (DUCET), space has the line

0020  ; [*0209.0020.0002] # SPACE

with the "*" character denoting it as a "variable" collation element.
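
To illustrate why the two options order 'H r' and 'Ha' differently,
here is a toy comparison at the primary level only.  Only the weight
for space (0x0209) is taken from the DUCET line above; the letter
weights are invented for the example (ASCII assumed), and real
collation also has the second to fourth levels, which this sketch
leaves out.

```c
#include <assert.h>
#include <stddef.h>

enum variable_option { NON_IGNORABLE, SHIFTED };

/* Toy primary weights.  Space uses the DUCET value; letter weights are
   made up, alphabetical and above the space weight.  Upper and lower
   case share a primary weight, as case only matters at a later level. */
static unsigned
primary_weight (char c)
{
  if (c == ' ')
    return 0x0209;
  if (c >= 'a' && c <= 'z')
    return 0x1000 + (unsigned) (c - 'a');
  if (c >= 'A' && c <= 'Z')
    return 0x1000 + (unsigned) (c - 'A');
  return 0x0FFF;                /* anything else: arbitrary toy value */
}

/* Return the next primary weight in *P and advance *P, or 0 at the end
   of the string.  With SHIFTED, the variable element (space) is
   skipped because its primary weight is reset to zero. */
static unsigned
next_primary (const char **p, enum variable_option opt)
{
  while (**p)
    {
      char c = *(*p)++;
      if (opt == SHIFTED && c == ' ')
        continue;
      return primary_weight (c);
    }
  return 0;
}

/* Compare A and B at the primary level: -1, 0 or 1. */
int
compare_primary (const char *a, const char *b, enum variable_option opt)
{
  assert (a != NULL && b != NULL);
  for (;;)
    {
      unsigned wa = next_primary (&a, opt);
      unsigned wb = next_primary (&b, opt);
      if (wa != wb)
        return wa < wb ? -1 : 1;
      if (wa == 0)
        return 0;
    }
}
```

With NON_IGNORABLE, the space weight is lower than any letter weight,
so 'H r' sorts before 'Ha'.  With SHIFTED, the space is skipped and the
comparison is effectively 'Hr' against 'Ha', so 'H r' sorts after 'Ha'.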

I expect it would require creating a glibc locale to change the collation
order, which is not something we can do.
