Re: index sorting in texi2any in C issue with spaces

Eli Zaretskii Wed, 31 Jan 2024 22:52:13 -0800

> From: Gavin Smith <gavinsmith0...@gmail.com>
> Date: Wed, 31 Jan 2024 20:10:56 +0000
> 
> It seems like a pretty obscure interface.  It is barely
> documented - newlocale is in the Linux Man Pages but not the
> glibc manual, and strxfrm_l was only in the Posix standard
> (https://pubs.opengroup.org/onlinepubs/9699919799/functions/strxfrm.html).
> I don't know of any other way of accessing the collation functionality.
> 
> Do you know how portable it is?


AFAIK, this is glibc-specific.

In general, the implementations of Unicode TR10 differ among
platforms, with glibc offering the most complete and compatible
implementation and the CLDR DB to support it (what you discovered in
/usr/share/i18n/locales on your system).  MS-Windows has similar
functionality, though it differs in effect; see

  https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-comparestringw

It supports various flags, described here:

  https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-comparestringex

that affect the handling of collation weights.  For example, the
NORM_IGNORESYMBOLS flag will have an effect similar to what Patrice
found: spaces (and other punctuation characters) are ignored when
sorting.

CompareStringW accepts "wide" strings, i.e. a string should be
converted to UTF-16 encoding before calling it.  There's a similar
CompareStringA, which accepts 'char *' strings, but it can only
compare strings whose characters are all representable in the current
system locale's codeset; if we want to have all text represented
internally in UTF-8, we should probably convert UTF-8 to UTF-16 and
use CompareStringW.
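
To make the above concrete, here is a rough sketch of what I mean by
converting to UTF-16 and calling CompareStringW.  The function name
index_key_compare and the choice of LOCALE_USER_DEFAULT are my own
assumptions for illustration, not anything texi2any has today; the
non-Windows branch falls back to strcoll only so that the sketch
compiles elsewhere.

```c
#include <string.h>

#ifdef _WIN32
#include <windows.h>
#include <stdlib.h>

/* Convert a NUL-terminated UTF-8 string to a malloc'd UTF-16 string.
   Return NULL on invalid input or allocation failure.  */
static wchar_t *
utf8_to_utf16 (const char *s)
{
  int n = MultiByteToWideChar (CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, NULL, 0);
  wchar_t *w;

  if (n <= 0)
    return NULL;
  w = malloc (n * sizeof *w);
  if (w && MultiByteToWideChar (CP_UTF8, MB_ERR_INVALID_CHARS,
                                s, -1, w, n) == 0)
    {
      free (w);
      w = NULL;
    }
  return w;
}
#endif

/* Hypothetical helper: compare two UTF-8 index keys, returning a
   negative, zero or positive value like strcmp.  If IGNORE_SYMBOLS
   is nonzero, ask Windows to ignore symbols when comparing.  */
int
index_key_compare (const char *a, const char *b, int ignore_symbols)
{
#ifdef _WIN32
  wchar_t *wa = utf8_to_utf16 (a);
  wchar_t *wb = utf8_to_utf16 (b);
  int r = 0;

  if (wa && wb)
    r = CompareStringW (LOCALE_USER_DEFAULT,
                        ignore_symbols ? NORM_IGNORESYMBOLS : 0,
                        wa, -1, wb, -1);
  free (wa);
  free (wb);
  /* CompareStringW returns 0 on failure, otherwise CSTR_LESS_THAN (1),
     CSTR_EQUAL (2) or CSTR_GREATER_THAN (3).  Fall back to a plain
     byte comparison if conversion or comparison failed.  */
  return r ? r - 2 : strcmp (a, b);
#else
  (void) ignore_symbols;   /* no equivalent flag in this fallback */
  return strcoll (a, b);
#endif
}
```

A function of this shape could be wrapped in a qsort callback; whether
the flag should be on by default is a separate question.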

I don't know about *BSD and other platforms, but wouldn't be surprised
if they offered something of their own, still different from glibc
and/or strict TR10/CLDR compliance.

> Moreover, en_US.utf-8 will use collation appropriate for (US) English.
> There may be language-specific "tailoring" for other languages (e.g.
> Swedish) that the user may wish to use instead.  Hence, it may be
> a good idea to allow use of a user-specified locale for collation through
> the C code.

Probably.  Note that CompareStringW gives the caller finer control:
they can tailor the handling of different weight categories, beyond
setting the locale for which the collation is needed.  Also,
CompareStringW identifies the locale differently from the POSIX-style
setlocale and similar APIs (but that's something for the
implementation to figure out).
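
On the glibc side, a user-specified collation locale could look
something like the sketch below.  It assumes a platform that provides
newlocale and strxfrm_l (glibc does); the helper names
open_collation_locale and make_sort_key are made up for this example,
and falling back to the "C" locale when the requested one is missing
is just one possible policy.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: open a locale object for collation only.
   If NAME is NULL, empty, or not installed, fall back to "C".  */
locale_t
open_collation_locale (const char *name)
{
  locale_t loc = (locale_t) 0;

  if (name && *name)
    loc = newlocale (LC_COLLATE_MASK, name, (locale_t) 0);
  if (loc == (locale_t) 0)
    loc = newlocale (LC_COLLATE_MASK, "C", (locale_t) 0);
  return loc;
}

/* Hypothetical helper: return a malloc'd sort key for TEXT in LOC,
   or NULL on allocation failure.  Keys from the same locale can be
   compared with strcmp.  */
char *
make_sort_key (const char *text, locale_t loc)
{
  /* With a zero size, strxfrm_l only reports the length needed.  */
  size_t n = strxfrm_l (NULL, text, 0, loc);
  char *key = malloc (n + 1);

  if (key)
    strxfrm_l (key, text, n + 1, loc);
  return key;
}
```

The caller would free each key and call freelocale on the locale
object once sorting is done.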

> I found some locale definition files on my system under
> /usr/share/i18n/locales (location mention in man page of the "locale"
> command) and there is a file iso14651_t1_common which appears to be
> based on the Unicode Collation tables.  I have only skimmed this file
> and don't understand the file format well (it's supposed to be documented
> in the output of "man 5 locale"), but is really part of glibc internals.
> 
> In that file, space has a line
> 
> <U0020> IGNORE;IGNORE;IGNORE;<U0020> % SPACE
> 
> which appears to define space as a fourth-level collation element,
> corresponding to the Shifted option at the link above:
> 
>   "Shifted: Variable collation elements are reset to zero at levels one
>   through three. In addition, a new fourth-level weight is appended..."
> 
> In the Default Unicode Collation Element Table (DUCET), space has the line
> 
> 0020  ; [*0209.0020.0002] # SPACE
> 
> with the "*" character denoting it as a "variable" collation element.
> 
> I expect it would require creating a glibc locale to change the collation
> order, which is not something we can do.

I think if we want to ponder these aspects we should talk to the glibc
developers about the available options.
