Re: texi2any is too slow because of Unicode::Collate

Gavin Smith Sat, 11 Feb 2023 12:30:38 -0800

On Sat, Feb 11, 2023 at 10:02:55PM +0200, Eli Zaretskii wrote:
> > From: Gavin Smith <gavinsmith0...@gmail.com>
> > Date: Sat, 11 Feb 2023 19:46:12 +0000
> > 
> > On Sat, Feb 11, 2023 at 08:04:15PM +0100, Patrice Dumas wrote:
> > > Other than that I do not have much other idea than disabling it, for
> > > instance if documentlanguage is en.  The result with Unicode::Collate is
> > > better for accented letters, but not so useful in english.  There could
> > > even be a customization variable to use Unicode::Collate even in
> > > english.
> > 
> > I think it's a good idea to disable it for "en" at least, along with
> > a customization variable.
> 
> How many manuals set documentlanguage?  With the proliferation of
> documentencoding set to UTF-8, I think disabling the collation for
> "en" will be next to futile.


If I understand correctly, until recently the indices were sorted with
more standard Perl facilities, but this produced worse results for
non-English text, such as text containing many accented characters.
Unicode::Collate is now used to sort the indices "properly".  Use of
UTF-8 may not be a relevant factor.
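
To illustrate the difference in results, here is a minimal sketch in
Python (not texi2any's Perl code, and not Unicode::Collate itself).  The
helper accent_insensitive_key and the sample entries are made up for
illustration; stripping combining marks is only a crude stand-in for
real Unicode collation.

```python
import unicodedata

def accent_insensitive_key(entry):
    # Hypothetical helper: decompose accented letters (NFD), then drop
    # the combining marks so "é" sorts next to "e".
    decomposed = unicodedata.normalize("NFD", entry)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Tie-break on the original string to keep the order deterministic.
    return (base.casefold(), entry)

entries = ["zebra", "église", "eclipse", "Émile", "apple"]

# Plain code point order: accented letters land after "z".
by_codepoint = sorted(entries)

# Accent-aware order: accented letters sort with their base letters.
by_collation = sorted(entries, key=accent_insensitive_key)

print(by_codepoint)
print(by_collation)
```

For an index with only ASCII entries the two orders differ only in how
case is handled, which is why the simpler method may be acceptable for
some languages.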

Could we investigate further which languages the old method causes a
problem for?  It might be okay for more languages than just English.

> How come format_printindex takes such a large proportion of the
> processing?  Isn't that strange?  Index entries are usually a small
> proportion of the overall manual's text, so processing the manual
> should take the lion's share.  The index in the manual you were timing
> has about 8K entries, but the entire manual is 100K lines, so the
> index is less than 10% of the total volume.  How come its processing
> is so expensive?

It's the sorting of the index entries into alphabetical order, I presume.
There isn't a similar sorting process for the rest of the manual.
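
As a rough sketch of why sorting can dominate even for a small share of
the text: a comparison sort invokes its comparison on the order of
n log n times, so an expensive comparison is paid many times per entry.
The Python below is illustrative only; the entry names, the counters and
the 8000-entry size (taken from the figure quoted above) are
assumptions, and how texi2any actually calls Unicode::Collate is not
stated in this thread.

```python
import functools
import random

calls = {"compare": 0, "key": 0}

def compare(a, b):
    # Stand-in for an expensive pairwise collation comparison.
    calls["compare"] += 1
    ka, kb = a.casefold(), b.casefold()
    return (ka > kb) - (ka < kb)

def sort_key(entry):
    # Stand-in for computing a sort key once per entry.
    calls["key"] += 1
    return entry.casefold()

# About as many entries as the index mentioned above, in shuffled order.
entries = [f"entry{i:04d}" for i in range(8000)]
random.Random(0).shuffle(entries)

by_compare = sorted(entries, key=functools.cmp_to_key(compare))
by_key = sorted(entries, key=sort_key)

print(calls)
```

Both approaches give the same order, but the pairwise version runs the
costly step far more often than once per entry.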
