Re: index sorting in texi2any in C issue with spaces

Gavin Smith Sun, 04 Feb 2024 07:42:52 -0800

On Sun, Feb 04, 2024 at 12:17:16PM +0100, Patrice Dumas wrote:
> On Thu, Feb 01, 2024 at 10:16:07PM +0000, Gavin Smith wrote:
> > An alternative is not to have such a variable but just to have an option
> > to collate according to the user's locale.  Then the user would run e.g.
> > "LC_COLLATE=ll_LL.UTF-8 texi2any ..." to use collation from the ll_LL.UTF-8
> > locale.  They would have to have the locale installed that was appropriate
> > for whichever manual they were processing (assuming the "variable weighting"
> > option is appropriate.)
> 
> I do not like that possibility, I think that we should avoid using the user
> locales when it comes to document output in general.  If we use the user
> locale I think that it should be by using strxfrm in C and "use locale" in
> Perl, not by checking a specific LC_COLLATE value in the environment.


(Note that "cmp" is documented not to work with "use locale" for UTF-8
strings:

       "lt", "le", "ge", "gt" and "cmp" use the collation (sort) order
       specified by the current "LC_COLLATE" locale if a "use locale" form
       that includes collation is in effect.  See perllocale.  Do not mix
       these with Unicode, only use them with legacy 8-bit locale encodings.
       The standard "Unicode::Collate" and "Unicode::Collate::Locale" modules
       offer much more powerful solutions to collation issues.

- from "man perlop".)

> Here is my updated thinking on the possibilities
> 
> 1) lexicographic sorting on unicode strings (corresponds to
>                                  USE_UNICODE_COLLATION=0 currently)
> 2) unicode default sorting obtained by Unicode::Collate in Perl and
>    strxfrm_l in C with "en_US.utf-8", the current default ("en_US.utf-8"
>    could be different on different platforms, a list instead of only one
>    possibility if "en_US.utf-8" is not always available...)
> 3) sorting based on @documentlanguage using, in perl
>    Unicode::Collate::Locale with locale @documentlanguage and in C
>    strxfrm_l with "@documentlanguage.utf-8" (at least on GNU/Linux,
>    the locale name setup for strxfrm_l could be different on other platforms).
> 4) sorting based on a customization variable, such as COLLATION_LANGUAGE.
>    it would be the same as the previous one, with @documentlanguage
>    replaced by COLLATION_LANGUAGE.
> 5) sorting based on the user locale, using strxfrm in C and
>    "use locale" and regular sorting on unicode (internal perl encoded) strings
>    in Perl.
> 

My concern here is that there are far too many options for the user to
decide between.  They also interact with whether XS or pure Perl modules
are being used (which depends on environment variables such as
TEXINFO_XS_STRUCTURE and other things).  As far as possible the
interface should not specify whether the sorting is done in C or Perl.

What matters to the user comes down to three things: speed,
correctness, and language-specific tailoring.

I think many possibilities can be covered with three customization
variables, USE_UNICODE_COLLATION, COLLATION_LOCALE and COLLATION_LANGUAGE:

1) would be done with USE_UNICODE_COLLATION=0 as you say.  This could
also be implemented in C with strcmp (as Andreas pointed out).
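
As a rough sketch of what the strcmp approach could look like (the
function names here are made up for illustration and are not taken from
the texi2any sources): for valid UTF-8, comparing bytes with strcmp gives
the same order as comparing code points, so no decoding is needed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort callback: compare two index keys byte by byte with strcmp.
   For valid UTF-8 this is the same as code point order.  */
int
compare_keys_bytewise (const void *a, const void *b)
{
  const char *const *ka = a;
  const char *const *kb = b;
  return strcmp (*ka, *kb);
}

/* Hypothetical helper: sort an array of NUL-terminated UTF-8 keys
   in place.  */
void
sort_keys_bytewise (const char **keys, size_t n)
{
  assert (keys != NULL || n == 0);
  qsort (keys, n, sizeof keys[0], compare_keys_bytewise);
}
```

Note that with this ordering all uppercase ASCII letters sort before
lowercase ones, and a space sorts before any letter, so "a b" comes
before "ab".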

2) is two different types of sorting; as you said earlier, the
sorting in C may have a different treatment of "variable elements".
The first (Unicode::Collate) would be accessed with
USE_UNICODE_COLLATION=1 and the second (strxfrm_l) with
COLLATION_LOCALE=en_US.UTF-8 or possibly COLLATION_LOCALE=en_US.
Using strxfrm with en_US would not be the default because of the handling
of spaces and also because the interface isn't very portable.

3) and 4) are again potentially different between C and pure
Perl.  I propose that COLLATION_LOCALE would be used for accessing
system locales (with strxfrm or strcoll in C, although in theory this
does not depend on the implementation language).  COLLATION_LANGUAGE
would be the argument passed to Unicode::Collate::Locale to get
language-specific tailoring, which in language-independent terms means
using the UCA with tailoring, with variable collation elements treated
as "non-ignorable".  If there is ever a separate implementation of the
UCA in texi2any with access to tailoring, COLLATION_LANGUAGE would
govern it as well.
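
To make the COLLATION_LOCALE side concrete, here is a minimal sketch of
comparing two keys under a named system locale with newlocale and
strxfrm_l.  It assumes a POSIX.1-2008 system such as glibc; the helper
names are hypothetical, and this is not how the existing texi2any C code
is written.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a malloc'd sort key for S in locale LOC,
   or NULL on allocation failure.  */
char *
make_sort_key (const char *s, locale_t loc)
{
  assert (s != NULL);
  /* With a zero size, strxfrm_l only reports the length needed.  */
  size_t len = strxfrm_l (NULL, s, 0, loc);
  char *key = malloc (len + 1);
  if (key)
    strxfrm_l (key, s, len + 1, loc);
  return key;
}

/* Compare A and B under the named system locale (e.g. "en_US.UTF-8").
   Stores the comparison in *RESULT and returns 0; returns -1 if the
   locale cannot be loaded or memory runs out.  */
int
compare_in_locale (const char *locale_name, const char *a,
                   const char *b, int *result)
{
  locale_t loc = newlocale (LC_COLLATE_MASK, locale_name, (locale_t) 0);
  if (loc == (locale_t) 0)
    return -1;

  char *ka = make_sort_key (a, loc);
  char *kb = make_sort_key (b, loc);
  int status = -1;
  if (ka && kb)
    {
      /* Transformed keys compare correctly with plain strcmp.  */
      *result = strcmp (ka, kb);
      status = 0;
    }
  free (ka);
  free (kb);
  freelocale (loc);
  return status;
}
```

In real use the keys would be computed once per index entry and cached
rather than rebuilt for every comparison.  The -1 return matters because
the requested locale may simply not be installed on the user's system.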

For 3), accessing @documentlanguage seems like an unnecessary extra
at the moment.  Again, there would be the problem of strxfrm_l and
Unicode::Collate::Locale doing different things with variable collation
elements.  There is no guarantee that the user has the appropriate
locale installed either (for use with strxfrm_l) or that the language
is supported by Unicode::Collate::Locale.

I'm ok with 5) not being implemented (using LC_COLLATE from the user's
locale).
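
Putting the three variables together, the selection could be sketched
roughly as below.  The enum, the function name and especially the
precedence (COLLATION_LANGUAGE over COLLATION_LOCALE over
USE_UNICODE_COLLATION) are my assumptions for illustration only; what
should happen when several are set at once is not decided.

```c
#include <stddef.h>

enum collation_method
{
  COLLATE_CODEPOINT,      /* 1): plain comparison, e.g. strcmp */
  COLLATE_UCA_DEFAULT,    /* USE_UNICODE_COLLATION=1: UCA, no tailoring */
  COLLATE_SYSTEM_LOCALE,  /* COLLATION_LOCALE: strxfrm or strcoll */
  COLLATE_UCA_TAILORED    /* COLLATION_LANGUAGE: UCA with tailoring */
};

/* Hypothetical selection logic.  An unset or empty string means the
   variable was not given.  The precedence order is an assumption.  */
enum collation_method
choose_collation (int use_unicode_collation,
                  const char *collation_locale,
                  const char *collation_language)
{
  if (collation_language && *collation_language)
    return COLLATE_UCA_TAILORED;
  if (collation_locale && *collation_locale)
    return COLLATE_SYSTEM_LOCALE;
  return use_unicode_collation ? COLLATE_UCA_DEFAULT : COLLATE_CODEPOINT;
}
```

The point is that a single setting such as "-c COLLATION_LANGUAGE=se_SE"
is enough to pick a method, without a separate switch variable.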

Does this all look okay?  Have I missed anything?

> I forgot about one possibility, until there is a possibility to have
> Non-ignorable Weighting in C it could make sense to have as another
> possibility for C, the possibility to call perl code to obtain 2), which
> would lead to
> 
> 6) in C use Perl sorting corresponding to 2).
>
> Could be named 'perldefault'.

I can't understand what you are proposing here.  Is this not just the
same as using Unicode::Collate?  What difference does it make if
the Unicode::Collate module is called from C or Perl code?

> 1) and 2) are already implemented and currently customized with
> USE_UNICODE_COLLATION.  I do not think that we need 5), but we could
> implement it if users ask for it.  We do not need to implement the other
> options right away, but we may want to think about the way to select
> those options such as not to change the customization options when they
> are implemented.  I think that the options are
> 
> * use only one variable with a textual value, for example with, for 1-5
>   above
>   USE_COLLATION=basic/default/documentlanguage/custom/locale
> * use different variables as switches between the different options, for
>   instance USE_UNICODE_COLLATION to switch to 1), and more or less
>   one variable for each of the other possibilities.

I think it's tiresome for a user to read through a list of collation
options in the documentation.

Especially for the "custom" language, "-c COLLATION_LANGUAGE=se_SE" is
much more concise than "-c USE_COLLATION=custom -c COLLATION_LANGUAGE=se_SE"
(or, as I suggested "LC_COLLATE=se_SE texi2any -c USE_COLLATION=locale").
It's always better if the user can specify a single setting rather than
having to use several in combination.

"default" is not a good name for an option value, in my opinion, as it
only says that the value is the default and nothing about what it
actually does.
