On Sun, Feb 04, 2024 at 12:17:16PM +0100, Patrice Dumas wrote:
> On Thu, Feb 01, 2024 at 10:16:07PM +0000, Gavin Smith wrote:
> > An alternative is not to have such a variable but just to have an option
> > to collate according to the user's locale. Then the user would run e.g.
> > "LC_COLLATE=ll_LL.UTF-8 texi2any ..." to use collation from the ll_LL.UTF-8
> > locale. They would have to have the locale installed that was appropriate
> > for whichever manual they were processing (assuming the "variable weighting"
> > option is appropriate.)
>
> I do not like that possibility, I think that we should avoid using the user
> locales when it comes to document output in general. If we use the user
> locale I think that it should be by using strxfrm in C and "use locale" in
> Perl, not by checking a specific LC_COLLATE value in the environment.

(Note that "cmp" is documented not to work with "use locale" for UTF-8
strings:

    "lt", "le", "ge", "gt" and "cmp" use the collation (sort) order
    specified by the current "LC_COLLATE" locale if a "use locale" form
    that includes collation is in effect. See perllocale. Do not mix
    these with Unicode, only use them with legacy 8-bit locale
    encodings. The standard "Unicode::Collate" and
    "Unicode::Collate::Locale" modules offer much more powerful
    solutions to collation issues.

- from "man perlop".)

> Here is my updated thinking on the possibilities
>
> 1) lexicographic sorting on unicode strings (corresponds to
>    USE_UNICODE_COLLATION=0 currently)
> 2) unicode default sorting obtained by Unicode::Collate in Perl and
>    strxfrm_l in C with "en_US.utf-8", the current default ("en_US.utf-8"
>    could be different on different platforms, a list instead of only one
>    possibility if "en_US.utf-8" is not always available...)
> 3) sorting based on @documentlanguage using, in perl
>    Unicode::Collate::Locale with locale @documentlanguage and in C
>    strxfrm_l with "@documentlanguage.utf-8" (at least on GNU/Linux,
>    the locale name setup for strxfrm_l could be different on other platforms).
> 4) sorting based on a customization variable, such as COLLATION_LANGUAGE.
>    it would be the same as the previous one, with @documentlanguage
>    replaced by COLLATION_LANGUAGE.
> 5) sorting based on the user locale, using strxfrm in C and
>    "use locale" and regular sorting on unicode (internal perl encoded) strings
>    in Perl.
>

My concern here is that there are far too many options for the user to
decide between. They also interact with whether XS or pure Perl modules
are being used (which depends on environment variables such as
TEXINFO_XS_STRUCTURE and other things).

As far as possible the interface should not specify whether the sorting
is done in C or Perl. What's of interest to the user are the following
three things: speed, correctness, and language-specific tailoring.

I think many possibilities can be covered with three customization
variables, USE_UNICODE_COLLATION, COLLATION_LOCALE and COLLATION_LANGUAGE:

1) would be done with USE_UNICODE_COLLATION=0 as you say. This could
also be implemented in C with strcmp (as Andreas pointed out).

2) is two different types of sorting; as you said earlier, the sorting
in C may have a different treatment of "variable elements". The first
would be accessed with USE_UNICODE_COLLATION=1 and the second with
COLLATION_LOCALE=en_US.UTF-8 or possibly COLLATION_LOCALE=en_US. Using
strxfrm with en_US would not be the default because of the handling of
spaces and also because the interface isn't very portable.

3) and 4) are again potentially different between C and pure Perl. I
propose that COLLATION_LOCALE would be used for accessing system locales
(with strxfrm or strcoll in C, but in theory this is language-independent.)

COLLATION_LANGUAGE would be an argument to use for
Unicode::Collate::Locale to get language-specific tailoring, which in
language-independent terms means to use the UCA with tailoring, with
variable collation elements treated as "non-ignorable". If there is ever
a separate implementation of the UCA in texi2any with access to
tailoring, COLLATION_LANGUAGE would govern it as well.

For 3), accessing @documentlanguage seems like an unnecessary extra at
the moment. Again, there would be the problem of strxfrm_l and
Unicode::Collate::Locale doing different things with variable collation
elements. There is no guarantee that the user has the appropriate locale
installed either (for use with strxfrm_l) or that the language is
supported by Unicode::Collate::Locale.

I'm ok with 5) not being implemented (using LC_COLLATE from the user's
locale).

Does this all look okay? Have I missed anything?

> I forgot about one possibility, until there is a possibility to have
> Non-ignorable Weighting in C it could make sense to have as another
> possibility for C, the possibility to call perl code to obtain 2), which
> would lead to
>
> 6) in C use Perl sorting corresponding to 2).
>
> Could be named 'perldefault'.

I can't understand what you are proposing here. Is this not just the
same as using Unicode::Collate? What difference does it make if the
Unicode::Collate module is called from C or Perl code?

> 1) and 2) are already implemented and currently customized with
> USE_UNICODE_COLLATION. I do not think that we need 5), but we could
> implement it if users ask for it. We do not need to implement the other
> options right away, but we may want to think about the way to select
> those options such as not to change the customization options when they
> are implemented. I think that the options are
>
> * use only one variable with a textual value, for example with, for 1-5
>   above
>   USE_COLLATION=basic/default/documentlanguage/custom/locale
> * use different variables as switches between the different options, for
>   instance USE_UNICODE_COLLATION to switch to 1), and more or less
>   one variable for each of the other possibilities.

I think it's tiresome for a user to read through a list of collation
options in the documentation. Especially for the "custom" language,
"-c COLLATION_LANGUAGE=se_SE" is much more concise than
"-c USE_COLLATION=custom -c COLLATION_LANGUAGE=se_SE" (or, as I suggested
"LC_COLLATE=se_SE texi2any -c USE_COLLATION=locale"). It's always better
if the user can specify a single setting rather than having to use
several in combination.

"default" is not a good name for an option value, in my opinion, as all
it says is that that option is the default, but not anything about what
it means.