On Sun, Feb 04, 2024 at 03:42:22PM +0000, Gavin Smith wrote:
> On Sun, Feb 04, 2024 at 12:17:16PM +0100, Patrice Dumas wrote:
> > On Thu, Feb 01, 2024 at 10:16:07PM +0000, Gavin Smith wrote:
> > > An alternative is not to have such a variable but just to have an option
> > > to collate according to the user's locale.  Then the user would run e.g.
> > > "LC_COLLATE=ll_LL.UTF-8 texi2any ..." to use collation from the
> > > ll_LL.UTF-8 locale.  They would have to have the locale installed that
> > > was appropriate for whichever manual they were processing (assuming the
> > > "variable weighting" option is appropriate.)
> >
> > I do not like that possibility, I think that we should avoid using the user
> > locales when it comes to document output in general.  If we use the user
> > locale I think that it should be by using strxfrm in C and "use locale" in
> > Perl, not by checking a specific LC_COLLATE value in the environment.
>
> (Note that "cmp" is documented not to work with "use locale" for UTF-8
> strings:
>
>   "lt", "le", "ge", "gt" and "cmp" use the collation (sort) order
>   specified by the current "LC_COLLATE" locale if a "use locale" form
>   that includes collation is in effect. See perllocale. Do not mix
>   these with Unicode, only use them with legacy 8-bit locale encodings.
>   The standard "Unicode::Collate" and "Unicode::Collate::Locale" modules
>   offer much more powerful solutions to collation issues.
>
> - from "man perlop".)

Thanks.  This is very confusing to me, then, as it is not described that
way in perllocale, especially in the section:
https://perldoc.perl.org/perllocale#Category-LC_COLLATE%3A-Collation%3A-Text-Comparisons-and-Sorting
There is more information at the end of the page that may correspond
better to the perlop information.

Not important at all anyway, since we agree that using the user locale
is not a good idea in any case.

> > Here is my updated thinking on the possibilities
> >
> > 1) lexicographic sorting on unicode strings (corresponds to
> >    USE_UNICODE_COLLATION=0 currently)
> > 2) unicode default sorting obtained by Unicode::Collate in Perl and
> >    strxfrm_l in C with "en_US.utf-8", the current default ("en_US.utf-8"
> >    could be different on different platforms, a list instead of only one
> >    possibility if "en_US.utf-8" is not always available...)
> > 3) sorting based on @documentlanguage using, in perl
> >    Unicode::Collate::Locale with locale @documentlanguage and in C
> >    strxfrm_l with "@documentlanguage.utf-8" (at least on GNU/Linux,
> >    the locale name setup for strxfrm_l could be different on other
> >    platforms).
> > 4) sorting based on a customization variable, such as COLLATION_LANGUAGE.
> >    it would be the same as the previous one, with @documentlanguage
> >    replaced by COLLATION_LANGUAGE.
> > 5) sorting based on the user locale, using strxfrm in C and
> >    "use locale" and regular sorting on unicode (internal perl encoded)
> >    strings in Perl.
>
> My concern here is that there are far too many options for the user to
> decide between.  They also interact with whether XS or pure Perl modules
> are being used (which depends on environment variables such as
> TEXINFO_XS_STRUCTURE and other things).  As far as possible the
> interface should not specify whether the sorting is done in C or Perl.

Ok, in principle, but I am not sure that it is really possible given the
differences.
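
To illustrate the "list instead of only one possibility" idea from 2), here
is a rough sketch in C of trying several locale names in turn.  It is not
taken from the Texinfo sources; the function name is hypothetical, and the
only library calls assumed are POSIX newlocale and freelocale.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only.
   CANDIDATES is a NULL-terminated array of locale names.  Return the
   first name for which newlocale succeeds and store the locale object
   in *LOC (the caller releases it with freelocale).  Return NULL and
   set *LOC to 0 if no candidate is available; a caller could then fall
   back to sorting without any transformation. */
static const char *
first_available_collation_locale (const char *const *candidates,
                                  locale_t *loc)
{
  assert (candidates != NULL && loc != NULL);

  *loc = (locale_t) 0;
  for (size_t i = 0; candidates[i] != NULL; i++)
    {
      locale_t l = newlocale (LC_COLLATE_MASK, candidates[i], (locale_t) 0);
      if (l != (locale_t) 0)
        {
          *loc = l;
          return candidates[i];
        }
    }
  return NULL;
}
```

A caller could pass something like { "en_US.utf-8", "en_US.UTF-8", NULL },
but which names to list, and in which order, is exactly the
platform-dependent part, so that list is only an assumption.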

> What's of interest to the user are the following three things: speed,
> correctness, and language-specific tailoring.

The point is that C is better for speed, but Perl is better for correctness.

> I think many possibilities can be covered with three customization
> variables, USE_UNICODE_COLLATION, COLLATION_LOCALE and COLLATION_LANGUAGE:
>
> 1) would be done with USE_UNICODE_COLLATION=0 as you say.  This could
> also be implemented in C with strcmp (as Andreas pointed out).

strcmp is always used in C: for collation, the string is first transformed
with strxfrm_l, and the transformed strings are compared with strcmp.  If
USE_UNICODE_COLLATION=0 the string is not transformed, which amounts to
using strcmp on the original string.  Therefore it is already implemented
that way in C, as can be seen in tp/Texinfo/XS/main/manipulate_indices.c.

> 2) is two different types of sorting; as you said earlier, the
> sorting in C may have a different treatment of "variable elements".
> The first would be accessed with USE_UNICODE_COLLATION=1 and the second
> with COLLATION_LOCALE=en_US.UTF-8 or possibly COLLATION_LOCALE=en_US.
> Using strxfrm with en_US would not be the default because of the handling
> of spaces and also because the interface isn't very portable.

So, what you mean here is that with USE_UNICODE_COLLATION=1 and no
COLLATION_LOCALE, the C code should call the Perl code that sorts indices
using Unicode::Collate instead of doing the sorting in C.  Did I get it
right?

If COLLATION_LOCALE is set, in C strxfrm_l would be used to do the string
transformation and sorting.  If COLLATION_LOCALE is set in Perl, it is not
clear to me what the output would be.  Would it be ignored?

The advantage I see in your proposal is that we would never need to select
a specific locale, as is done currently with en_US.UTF-8.  The downside is
that in most cases, users will not get the speed increase of using C, as
it requires knowing about COLLATION_LOCALE, which is likely to remain
relatively obscure.
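
To make the point about strcmp concrete, here is a minimal sketch of the
idea in C.  It is not the code from manipulate_indices.c; the struct and
function names are made up for illustration, and passing a null locale
stands in for USE_UNICODE_COLLATION=0.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical index entry: the text and the key used for sorting. */
typedef struct {
  const char *text;
  char *sort_key;   /* allocated by make_sort_key */
} INDEX_ENTRY_KEY;

/* With LOC == 0 the key is a plain copy of TEXT, so comparing keys with
   strcmp is the same as comparing the original strings.  Otherwise the
   key is the strxfrm_l transformation of TEXT in locale LOC. */
static char *
make_sort_key (const char *text, locale_t loc)
{
  size_t len = (loc == (locale_t) 0)
                 ? strlen (text)
                 : strxfrm_l (NULL, text, 0, loc);  /* required length */
  char *key = malloc (len + 1);
  assert (key != NULL);

  if (loc == (locale_t) 0)
    memcpy (key, text, len + 1);
  else
    strxfrm_l (key, text, len + 1, loc);
  return key;
}

/* The comparison is always strcmp on the keys. */
static int
compare_keys (const void *a, const void *b)
{
  const INDEX_ENTRY_KEY *ea = a;
  const INDEX_ENTRY_KEY *eb = b;
  return strcmp (ea->sort_key, eb->sort_key);
}

/* Compute the keys, then sort the entries by key. */
static void
sort_entries (INDEX_ENTRY_KEY *entries, size_t n, locale_t loc)
{
  for (size_t i = 0; i < n; i++)
    entries[i].sort_key = make_sort_key (entries[i].text, loc);
  qsort (entries, n, sizeof *entries, compare_keys);
}

/* Release the keys once sorting is done. */
static void
free_sort_keys (INDEX_ENTRY_KEY *entries, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      free (entries[i].sort_key);
      entries[i].sort_key = NULL;
    }
}
```

With a null locale, a UTF-8 encoded "é" sorts after "z", since strcmp
compares bytes as unsigned char.  A locale obtained from newlocale would
change only the keys, not the comparison.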

This downside is not problematic right now, as Perl is more correct.
However, if there is a possibility to get variable elements set to
"non-ignorable" in C, possibly by using a hardcoded locale of en_US, it
will not be possible to automatically get the option that is both correct
and faster.  The user would still have to set COLLATION_LOCALE to get it.

So, even if it is practical in the short term, I wonder if we should not
already plan for a future in which C would be both correct and faster, but
still with a cumbersome interface that requires setting a specific locale.

> 3) and 4) are again potentially different between C and pure
> Perl.  I propose that COLLATION_LOCALE would be used for accessing
> system locales (with strxfrm or strcoll in C, but in theory this is
> language-independent.)  COLLATION_LANGUAGE would be an argument to use
> for Unicode::Collate::Locale to get language-specific tailoring, which
> in language-independent terms means to use the UCA with tailoring, with
> variable collation elements treated as "non-ignorable".  If there is
> ever a separate implementation of the UCA in texi2any with access to
> tailoring, COLLATION_LANGUAGE would govern it as well.

If I understand correctly, COLLATION_LANGUAGE would only change what is
done in Perl, with Unicode::Collate::Locale used if COLLATION_LANGUAGE is
set.  In that case, since Perl is called from C if COLLATION_LOCALE is not
set, COLLATION_LANGUAGE would apply to C as well, because C calls Perl
unless COLLATION_LOCALE is set.

Note that it is still no clearer to me what would happen with
COLLATION_LOCALE in the Perl case.

> For 3), accessing @documentlanguage seems like an unnecessary extra
> at the moment.  Again, there would be the problem of strxfrm_l and
> Unicode::Collate::Locale doing different things with variable collation
> elements.
> There is no guarantee that the user has the appropriate
> locale installed either (for use with strxfrm_l)

It seems to me that following @documentlanguage would be more desirable
than letting the user specify a specific COLLATION_LANGUAGE (or
COLLATION_LOCALE).  Indeed, it seems to me to be more aligned with Texinfo,
in which information is supposed to come primarily from the Texinfo manual.

Also, COLLATION_LANGUAGE and COLLATION_LOCALE suffer from the same problems
that you describe for @documentlanguage based customization.

Also, if COLLATION_LANGUAGE and/or COLLATION_LOCALE is implemented, it
would be very easy to use what comes from @documentlanguage instead of any
of these user-supplied values, so it is a bit strange not to do it.

Lastly, and more importantly, even if it is implemented later, I think that
the 'interface' with customization variables should be designed now.

> or that the language
> is supported by Unicode::Collate::Locale.

This is not an issue: if the language is not supported, there is a fallback
to the default behaviour of Unicode::Collate.

> > 6) in C use Perl sorting corresponding to 2).
> >
> > Could be named 'perldefault'.
>
> I can't understand what you are proposing here.  Is this not just the
> same as using Unicode::Collate?  What difference does it make if
> the Unicode::Collate module is called from C or Perl code?

Speed and consistency.  My idea was that if C is used, it is supposed to be
used for everything in the default case, with exceptions only when needed
and mostly for tests (when TEST=1).  But I have no problem if things are
done differently.

As a side note, transliteration of file names also differs between C and
Perl: the Perl function is used if TEST=1, but otherwise the results are
different if TEXINFO_XS_CONVERT=1.

-- 
Pat