> From: Gavin Smith <gavinsmith0...@gmail.com>
> Date: Wed, 31 Jan 2024 20:10:56 +0000
>
> It seems like a pretty obscure interface.  It is barely
> documented - newlocale is in the Linux Man Pages but not the
> glibc manual, and strxfrm_l was only in the Posix standard
> (https://pubs.opengroup.org/onlinepubs/9699919799/functions/strxfrm.html).
> I don't know of any other way of accessing the collation functionality.
>
> Do you know how portable it is?

AFAIK, this is glibc-specific.  In general, the implementations of
Unicode TR10 differ among platforms, with glibc offering the most
complete and compatible implementation and the CLDR DB to support it
(what you discovered in /usr/share/i18n/locales on your system).

MS-Windows has similar functionality, though different in effect, see

  https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-comparestringw

It supports various flags, described here:

  https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-comparestringex

that affect the handling of collation weights.  For example, the
NORM_IGNORESYMBOLS flag will have an effect similar to what Patrice
found: spaces (and other punctuation characters) are ignored when
sorting.

CompareStringW accepts "wide" strings, i.e. a string should be
converted to UTF-16 encoding before calling it.  There's a similar
CompareStringA, which accepts 'char *' strings, but it can only
compare strings whose characters are all representable in the current
system locale's codeset; if we want to have all text represented
internally in UTF-8, we should probably convert UTF-8 to UTF-16 and
use CompareStringW.

I don't know about *BSD and other platforms, but wouldn't be surprised
if they offered something of their own, still different from glibc
and/or from strict TR10/CLDR compliance.

> Moreover, en_US.utf-8 will use collation appropriate for (US) English.
> There may be language-specific "tailoring" for other languages (e.g.
> Swedish) that the user may wish to use instead.  Hence, it may be
> a good idea to allow use of a user-specified locale for collation through
> the C code.

Probably.  Note that CompareStringW gives the caller finer control:
they can tailor the handling of different weight categories, beyond
setting the locale for which the collation is needed.
Also, the locale argument is specified differently for CompareStringW
than for the Posix-style setlocale or similar APIs (but that's
something for the implementation to figure out).

> I found some locale definition files on my system under
> /usr/share/i18n/locales (location mention in man page of the "locale"
> command) and there is a file iso14651_t1_common which appears to be
> based on the Unicode Collation tables.  I have only skimmed this file
> and don't understand the file format well (it's supposed to be documented
> in the output of "man 5 locale"), but is really part of glibc internals.
>
> In that file, space has a line
>
> <U0020> IGNORE;IGNORE;IGNORE;<U0020> % SPACE
>
> which appears to define space as a fourth-level collation element,
> corresponding to the Shifted option at the link above:
>
> "Shifted: Variable collation elements are reset to zero at levels one
> through three.  In addition, a new fourth-level weight is appended..."
>
> In the Default Unicode Collation Element Table (DUCET), space has the line
>
> 0020 ; [*0209.0020.0002] # SPACE
>
> with the "*" character denoting it as a "variable" collation element.
>
> I expect it would require creating a glibc locale to change the collation
> order, which is not something we can do.

I think if we want to ponder these aspects we should talk to the glibc
developers about the available options.
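
To make the glibc side of the discussion concrete, here is a minimal
sketch of collating with a caller-supplied locale via newlocale and
strxfrm_l.  It assumes a platform that provides these interfaces, as
glibc does.  The helper names make_sort_key and compare_in_locale are
made up for illustration, and the fallback to strcmp when the locale
cannot be loaded is an assumption of the sketch, not something decided
in this thread.

```c
#define _POSIX_C_SOURCE 200809L  /* request newlocale, strxfrm_l, freelocale */
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a malloc'ed collation key for S in LOC,
   or NULL on allocation failure.  The caller frees the result.  */
static char *
make_sort_key (const char *s, locale_t loc)
{
  /* With a zero size, strxfrm_l only reports the length needed
     (not counting the terminating null byte).  */
  size_t len = strxfrm_l (NULL, s, 0, loc);
  char *key = malloc (len + 1);
  if (key)
    strxfrm_l (key, s, len + 1, loc);
  return key;
}

/* Hypothetical helper: compare A and B by the collation rules of
   LOCALE_NAME (e.g. "en_US.UTF-8" or "sv_SE.UTF-8").  Falling back to
   strcmp when the locale is unavailable is an assumption of this
   sketch.  */
static int
compare_in_locale (const char *locale_name, const char *a, const char *b)
{
  assert (a != NULL && b != NULL);

  locale_t loc = newlocale (LC_COLLATE_MASK, locale_name, (locale_t) 0);
  if (loc == (locale_t) 0)
    return strcmp (a, b);

  char *ka = make_sort_key (a, loc);
  char *kb = make_sort_key (b, loc);
  /* Keys produced by strxfrm_l are meant to be compared bytewise.  */
  int result = (ka && kb) ? strcmp (ka, kb) : strcmp (a, b);

  free (ka);
  free (kb);
  freelocale (loc);
  return result;
}
```

A caller wanting Swedish tailoring would pass "sv_SE.UTF-8" as the
locale name, provided that locale is installed on the system.  None of
this changes how variable collation elements such as space are
weighted; that is still determined by the locale definition itself.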