On Wed, Jan 31, 2024 at 10:15:08AM +0100, Patrice Dumas wrote:
> Hello,
>
> I implemented index sorting in C with XS interface in texi2any.
> When unicode collation is wanted, based on my understanding of
> Eli suggestions, a collation locale is set to "en_US.utf-8", by
>   newlocale (LC_COLLATE_MASK, "en_US.utf-8", 0)
> and then strxfrm_l is used (which should be the same as using
> strcoll_l).  With conversion in C/with XS set with environment variable
> TEXINFO_XS_CONVERT=1 and for now only for HTML, if TEST customization
> variable is not set.

It seems like a pretty obscure interface.  It is barely documented:
newlocale is in the Linux man pages but not the glibc manual, and
strxfrm_l appears only in the POSIX standard
(https://pubs.opengroup.org/onlinepubs/9699919799/functions/strxfrm.html).
I don't know of any other way of accessing the collation functionality.

Do you know how portable it is?  The documentation for the
corresponding Gnulib module says the following:

  Portability problems not fixed by Gnulib:

    This function is missing on many platforms: FreeBSD 6.0,
    NetBSD 5.0, OpenBSD 6.0, Minix 3.1.8, AIX 5.1, HP-UX 11,
    IRIX 6.5, Solaris 11.3, Cygwin 1.7.x, mingw, MSVC 14, Android 4.4.

<https://www.gnu.org/software/gnulib/manual/html_node/strxfrm_005fl.html>

Would it be possible to have a "current locale" collation option that
uses more standard interfaces, such as strcoll or strxfrm?

Moreover, en_US.utf-8 will use collation appropriate for (US) English.
There may be language-specific "tailoring" for other languages (e.g.
Swedish) that the user may wish to use instead.  Hence, it may be a
good idea to allow use of a user-specified locale for collation through
the C code.

> On my debian GNU/Linux, the result is good except for the treatment of
> spaces.  Indeed, spaces (and non alphanumeric characters, but it is
> not really an issue) are ignored when sorting, which sticks to the Unicode
> collation standard, but leads to an awkward sorting for indices, for
> example 'H r' is sorted after 'Ha'.  In perl, it is possible to
> customize the Unicode::Collate collation, we use 'variable' =>
> 'Non-Ignorable'.

I think either way is in accordance with the collation standard.  The
standard gives four options and "Non-ignorable" is one of them:

http://www.unicode.org/reports/tr10/#Variable_Weighting

I doubt it is possible to customize the collation of a locale with a
function such as newlocale.  I expect the collation order is fixed when
the locale is defined.

I found some locale definition files on my system under
/usr/share/i18n/locales (a location mentioned in the man page of the
"locale" command), and there is a file iso14651_t1_common which appears
to be based on the Unicode Collation tables.  I have only skimmed this
file and don't understand the file format well (it's supposed to be
documented in the output of "man 5 locale"), and it is really part of
glibc internals.

In that file, space has a line

  <U0020> IGNORE;IGNORE;IGNORE;<U0020> % SPACE

which appears to define space as a fourth-level collation element,
corresponding to the Shifted option at the link above:

  "Shifted: Variable collation elements are reset to zero at levels one
  through three.  In addition, a new fourth-level weight is appended..."

In the Default Unicode Collation Element Table (DUCET), space has the
line

  0020  ; [*0209.0020.0002] # SPACE

with the "*" character denoting it as a "variable" collation element.

I expect it would require creating a glibc locale to change the
collation order, which is not something we can do.
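
To make the two locale suggestions above more concrete, here is a rough
sketch of how a comparison could take either a user-specified locale
name or fall back to the current locale with plain strxfrm.  This is
not the texi2any code: the function names sort_key and
compare_index_entries are made up for illustration, and I am assuming a
glibc-like system where newlocale and strxfrm_l are declared in
<locale.h>.

```c
#define _POSIX_C_SOURCE 200809L  /* expose newlocale/strxfrm_l on glibc */
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: build a malloc'd sort key for S.  If LOC is
   (locale_t) 0, use the current locale through ISO C strxfrm;
   otherwise use POSIX strxfrm_l.  Returns NULL if malloc fails. */
char *
sort_key (const char *s, locale_t loc)
{
  size_t len;
  char *key;

  assert (s != NULL);
  /* A first call with size 0 only reports the length needed. */
  if (loc != (locale_t) 0)
    len = strxfrm_l (NULL, s, 0, loc);
  else
    len = strxfrm (NULL, s, 0);

  key = malloc (len + 1);
  if (key == NULL)
    return NULL;

  if (loc != (locale_t) 0)
    strxfrm_l (key, s, len + 1, loc);
  else
    strxfrm (key, s, len + 1);
  return key;
}

/* Hypothetical helper: compare A and B with the collation of
   LOCALE_NAME, or of the current locale if LOCALE_NAME is NULL.
   Stores a value <0, 0 or >0 in *RESULT.  Returns 0 on success,
   -1 if the locale cannot be created or memory runs out. */
int
compare_index_entries (const char *locale_name,
                       const char *a, const char *b, int *result)
{
  locale_t loc = (locale_t) 0;
  char *ka, *kb;
  int status = -1;

  if (locale_name != NULL)
    {
      loc = newlocale (LC_COLLATE_MASK, locale_name, (locale_t) 0);
      if (loc == (locale_t) 0)
        return -1;              /* locale not available here */
    }

  ka = sort_key (a, loc);
  kb = sort_key (b, loc);
  if (ka != NULL && kb != NULL)
    {
      /* Transformed keys compare bytewise in collation order. */
      *result = strcmp (ka, kb);
      status = 0;
    }

  free (ka);
  free (kb);
  if (loc != (locale_t) 0)
    freelocale (loc);
  return status;
}
```

With a NULL locale name only setlocale and strxfrm are needed, so that
path should work on the platforms listed as lacking strxfrm_l.  The
caller would have to call setlocale (LC_COLLATE, "") first for the
user's environment to take effect; otherwise the program stays in the
"C" locale and the comparison is plain byte order.  In real code the
locale object and the sort keys would be created once and reused rather
than rebuilt for every comparison.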