Re: index sorting in texi2any in C issue with spaces

Patrice Dumas Sun, 04 Feb 2024 11:39:10 -0800

On Sun, Feb 04, 2024 at 03:42:22PM +0000, Gavin Smith wrote:
> On Sun, Feb 04, 2024 at 12:17:16PM +0100, Patrice Dumas wrote:
> > On Thu, Feb 01, 2024 at 10:16:07PM +0000, Gavin Smith wrote:
> > > An alternative is not to have such a variable but just to have an option
> > > to collate according to the user's locale.  Then the user would run e.g.
> > > "LC_COLLATE=ll_LL.UTF-8 texi2any ..." to use collation from the
> > > ll_LL.UTF-8 locale.  They would have to have the locale installed
> > > that was appropriate for whichever manual they were processing
> > > (assuming the "variable weighting" option is appropriate.)
> > 
> > I do not like that possibility, I think that we should avoid using the user
> > locales when it comes to document output in general.  If we use the user
> > locale I think that it should be by using strxfrm in C and "use locale" in
> > Perl, not by checking a specific LC_COLLATE value in the environment.
> 
> (Note that "cmp" is documented not to work with "use locale" for UTF-8
> strings:
> 
>        "lt", "le", "ge", "gt" and "cmp" use the collation (sort) order
>        specified by the current "LC_COLLATE" locale if a "use locale" form
>        that includes collation is in effect.  See perllocale.  Do not mix
>        these with Unicode, only use them with legacy 8-bit locale encodings.
>        The standard "Unicode::Collate" and "Unicode::Collate::Locale" modules
>        offer much more powerful solutions to collation issues.
> 
> - from "man perlop".)


Thanks.  This is confusing to me, then, as perllocale does not describe
it that way, especially in this section:
https://perldoc.perl.org/perllocale#Category-LC_COLLATE%3A-Collation%3A-Text-Comparisons-and-Sorting
There is more information at the end of that page that may correspond
better to what perlop says.  It is not important anyway, since we agree
that using the user locale is not a good idea in any case.

> > Here is my updated thinking on the possibilities
> > 
> > 1) lexicographic sorting on unicode strings (corresponds to
> >                                  USE_UNICODE_COLLATION=0 currently)
> > 2) unicode default sorting obtained by Unicode::Collate in Perl and
> >    strxfrm_l in C with "en_US.utf-8", the current default ("en_US.utf-8"
> >    could be different on different platforms, a list instead of only one
> >    possibility if "en_US.utf-8" is not always available...)
> > 3) sorting based on @documentlanguage using, in perl
> >    Unicode::Collate::Locale with locale @documentlanguage and in C
> >    strxfrm_l with "@documentlanguage.utf-8" (at least on GNU/Linux,
> >    the locale name setup for strxfrm_l could be different on other
> >    platforms).
> > 4) sorting based on a customization variable, such as COLLATION_LANGUAGE.
> >    it would be the same as the previous one, with @documentlanguage
> >    replaced by COLLATION_LANGUAGE.
> > 5) sorting based on the user locale, using strxfrm in C and
> >    "use locale" and regular sorting on unicode (internal perl encoded)
> >    strings in Perl.
> > 
> 
> My concern here is that there are far too many options for the user to
> decide between.  They also interact with whether XS or pure Perl modules
> are being used (which depends on environment variables such as
> TEXINFO_XS_STRUCTURE and other things).  As far as possible the
> interface should not specify whether the sorting is done in C or Perl.

Ok, in principle, but I am not sure that it is really possible given the
differences.

> What's of interest to the user are the following three things: speed,
> correctness, and language-specific tailoring.

The point is that C is better for speed, but Perl is better for correctness.

> I think many possibilities can be covered with three customization
> variables, USE_UNICODE_COLLATION, COLLATION_LOCALE and COLLATION_LANGUAGE:
> 
> 1) would be done with USE_UNICODE_COLLATION=0 as you say.  This could
> also be implemented in C with strcmp (as Andreas pointed out).

In C, strcmp is always used for the comparison; collation is obtained by
first transforming the string with strxfrm_l.  If USE_UNICODE_COLLATION=0
the string is not transformed, which amounts to using strcmp on the
original string.  Therefore it is already implemented that way in C, as
can be seen in tp/Texinfo/XS/main/manipulate_indices.c.
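
To make the "transform, then strcmp" idea concrete, here is a minimal
sketch.  It is not the code from manipulate_indices.c; the function names
make_sort_key and compare_entries are hypothetical, and passing a null
locale_t to mean "no transformation" is an assumption made only for this
illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: build the key that is later compared with strcmp.
   A null collation_locale stands for USE_UNICODE_COLLATION=0 (assumption):
   the key is then a plain copy of the original string. */
char *make_sort_key(const char *s, locale_t collation_locale)
{
    if (collation_locale == (locale_t) 0)
        return strdup(s);

    /* First call only computes the length of the transformed string. */
    size_t len = strxfrm_l(NULL, s, 0, collation_locale);
    char *key = malloc(len + 1);
    if (key != NULL)
        strxfrm_l(key, s, len + 1, collation_locale);
    return key;
}

/* Hypothetical comparison: strcmp is used in every case, only the keys
   differ.  Returns 0 if a key could not be allocated. */
int compare_entries(const char *a, const char *b, locale_t collation_locale)
{
    char *key_a = make_sort_key(a, collation_locale);
    char *key_b = make_sort_key(b, collation_locale);
    int result = (key_a != NULL && key_b != NULL) ? strcmp(key_a, key_b) : 0;
    free(key_a);
    free(key_b);
    return result;
}
```

A caller would create the locale once with newlocale (for instance with
LC_COLLATE_MASK and the desired locale name) and release it with
freelocale; real code would also compute each key once per index entry
rather than once per comparison.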

> 2) is two different types of sorting; as you said earlier, the
> sorting in C may have a different treatment of "variable elements".
> The first would be accessed with USE_UNICODE_COLLATION=1 and the second
> with COLLATION_LOCALE=en_US.UTF-8 or possibly COLLATION_LOCALE=en_US.
> Using strxfrm with en_US would not be the default because of the handling
> of spaces and also because the interface isn't very portable.

So, what you mean here is that with USE_UNICODE_COLLATION=1 and no
COLLATION_LOCALE, C code should call the Perl code that sorts indices using
Unicode::Collate instead of doing the sorting in C.  Did I get it right?

If COLLATION_LOCALE is set, in C strxfrm_l would be used to do the
string transformation and sorting.

If COLLATION_LOCALE is set in Perl, it is not clear to me what would be
the output.  Would it be ignored?

The advantage I see of your proposal is that we would never need to
select a specific locale, as is done currently with en_US.UTF-8.
The downside is that in most cases, the users will not get the speed
increase of using C, as it requires knowing about COLLATION_LOCALE
which is likely to remain relatively obscure.  This downside is not
problematic right now, as Perl is more correct.  However, if it becomes
possible to get variable elements set to "non-ignorable" in C, possibly
by using a hardcoded locale of en_US, it will not be possible to get the
option that is both correct and faster automatically.  The user would
still have to set COLLATION_LOCALE to get it.  So, even if this is
practical in the short term, I wonder if we should not already plan for
a future in which C would be both correct and faster, but still with
a cumbersome interface that requires setting a specific locale.

> 3) and 4) are again potentially different between C and pure
> Perl.  I propose that COLLATION_LOCALE would be used for accessing
> system locales (with strxfrm or strcoll in C, but in theory this is
> language-independent.)  COLLATION_LANGUAGE would be an argument to use
> for Unicode::Collate::Locale to get language-specific tailoring, which
> in language-independent terms means to use the UCA with tailoring, with
> variable collation elements treated as "non-ignorable".  If there is
> ever a separate implementation of the UCA in texi2any with access to
> tailoring, COLLATION_LANGUAGE would govern it as well.

If I understand correctly, COLLATION_LANGUAGE would only change what is
done in Perl, with Unicode::Collate::Locale used if COLLATION_LANGUAGE is
set.  In that case, since Perl is called from C unless COLLATION_LOCALE
is set, COLLATION_LANGUAGE would also apply when the C code is used.
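
To check my reading of the proposal, the selection could be sketched as
below.  This is only an illustration: the enum and the function
choose_collation_backend are hypothetical names, and the precedence
between USE_UNICODE_COLLATION=0 and the two other variables is an
assumption, as it has not been settled in this thread.

```c
#include <stddef.h>

/* Hypothetical names for the sorting back ends discussed in this thread. */
enum collation_backend {
    COLLATE_BYTES,             /* strcmp on untransformed strings */
    COLLATE_SYSTEM_LOCALE,     /* strxfrm_l with COLLATION_LOCALE */
    COLLATE_PERL_UCA,          /* Perl Unicode::Collate, called from C */
    COLLATE_PERL_UCA_TAILORED  /* Perl Unicode::Collate::Locale */
};

/* A NULL pointer means that the customization variable is not set. */
enum collation_backend
choose_collation_backend(int use_unicode_collation,
                         const char *collation_locale,
                         const char *collation_language)
{
    /* COLLATION_LOCALE wins: sorting stays in C with the system locale. */
    if (collation_locale != NULL)
        return COLLATE_SYSTEM_LOCALE;
    /* Otherwise Perl is called, with tailoring if a language is given. */
    if (collation_language != NULL)
        return COLLATE_PERL_UCA_TAILORED;
    if (use_unicode_collation)
        return COLLATE_PERL_UCA;
    /* Assumption: USE_UNICODE_COLLATION=0 only matters if nothing else
       is set. */
    return COLLATE_BYTES;
}
```

With this reading, setting both COLLATION_LOCALE and COLLATION_LANGUAGE
selects strxfrm_l and COLLATION_LANGUAGE is ignored.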

Note that it is still not clear to me what would happen with
COLLATION_LOCALE in the Perl case.

> For 3), accessing @documentlanguage seems like an unnecessary extra
> at the moment.  Again, there would be the problem of strxfrm_l and
> Unicode::Collate::Locale doing different things with variable collation
> elements.  There is no guarantee that the user has the appropriate
> locale installed either (for use with strxfrm_l) 

It seems to me that following @documentlanguage would be more desirable
than letting the user specify a COLLATION_LANGUAGE (or COLLATION_LOCALE).
Indeed, it seems to me to be more aligned with Texinfo, in which
information is supposed to come primarily from the Texinfo manual.
COLLATION_LANGUAGE and COLLATION_LOCALE also suffer from the same
problems that you describe for @documentlanguage based customization.
Finally, if COLLATION_LANGUAGE and/or COLLATION_LOCALE is implemented,
it would be very easy to use what comes from @documentlanguage instead
of any of these user-supplied values, so it is a bit strange not to do
it.
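
As an illustration of how small that step would be, here is a sketch.
The function names are hypothetical, and letting an explicit
customization value take precedence over @documentlanguage is an
assumption, not something agreed upon.  The "<language>.utf-8" name is
the GNU/Linux form mentioned earlier in this thread; other platforms may
need a different name.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: pick the language that drives collation.
   Assumption: a non-empty user-supplied value wins, otherwise the value
   of @documentlanguage is used.  Returns NULL if neither is available. */
const char *effective_collation_language(const char *collation_language,
                                         const char *document_language)
{
    if (collation_language != NULL && collation_language[0] != '\0')
        return collation_language;
    if (document_language != NULL && document_language[0] != '\0')
        return document_language;
    return NULL;
}

/* Hypothetical helper: write "<language>.utf-8" into buf, for use as a
   locale name.  Returns 0 on success, -1 if there is no language or if
   the name does not fit in size bytes. */
int build_locale_name(const char *language, char *buf, size_t size)
{
    if (language == NULL)
        return -1;
    int written = snprintf(buf, size, "%s.utf-8", language);
    return (written < 0 || (size_t) written >= size) ? -1 : 0;
}
```

The same effective language could be passed to Unicode::Collate::Locale
on the Perl side, so that both implementations start from the same input.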

Lastly, and more importantly, even if it is implemented later, I think
that the 'interface' with customization variables should be designed now.

> or that the language
> is supported by Unicode::Collate::Locale.

This is not an issue: if the language is not supported, there is a
fallback to the default behaviour of Unicode::Collate.

> > 6) in C use Perl sorting corresponding to 2).
> >
> > Could be named 'perldefault'.
> 
> I can't understand what you are proposing here.  Is this not just the
> same as using Unicode::Collate?  What difference does it make if
> the Unicode::Collate module is called from C or Perl code?

Speed and consistency.  My idea was that if C is used, it is supposed to
be used for everything in the default case, with exceptions only when
needed and mostly for tests (when TEST=1).  But I have no problem if
things are done differently.

As a side note, transliteration of file names also differs between C
and Perl: the Perl function is used if TEST=1, but otherwise the
results are different if TEXINFO_XS_CONVERT=1.

-- 
Pat
