On Sat, Feb 11, 2023 at 10:02:55PM +0200, Eli Zaretskii wrote:
> > From: Gavin Smith <gavinsmith0...@gmail.com>
> > Date: Sat, 11 Feb 2023 19:46:12 +0000
> >
> > On Sat, Feb 11, 2023 at 08:04:15PM +0100, Patrice Dumas wrote:
> > > Other than that I do not have much other idea than disabling it, for
> > > instance if documentlanguage is en. The result with Unicode::Collate is
> > > better for accented letters, but not so useful in english. There could
> > > even be a customization variable to use Unicode::Collate even in
> > > english.
> >
> > I think it's a good idea to disable it for "en" at least, along with
> > a customization variable.
>
> How many manuals set documentlanguage? With the proliferation of
> documentencoding set to UTF-8, I think disabling the collation for
> "en" will be next to futile.

If I understand correctly, until recently more standard Perl facilities
were used for sorting the indices, but this produced worse results for
non-English text, such as that containing many accented characters.
Unicode::Collate is used to sort the indices "properly". Use of UTF-8
may not be a relevant factor.

Could we investigate further which languages it causes a problem for?
The old method might be okay for more languages than just English.

> How come format_printindex takes such a large proportion of the
> processing? Isn't that strange? Index entries are usually a small
> proportion of the overall manual's text, so processing the manual
> should take the lion's share. The index in the manual you were timing
> has about 8K entries, but the entire manual is 100K lines, so the
> index is less than 10% of the total volume. How come its processing
> is so expensive?

It's the sorting of the index entries into alphabetical order, I
presume. There isn't a similar sorting process for the rest of the
manual.
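
To illustrate the ordering difference mentioned above, here is a minimal sketch comparing Perl's built-in sort with Unicode::Collate. The entries are made up for the example, and this is not the code texi2any itself uses; it only assumes the documented Unicode::Collate->new and ->sort interface with default settings.

```perl
use strict;
use warnings;
use utf8;                 # the string literals below contain non-ASCII text
use Unicode::Collate;     # core module shipped with Perl

# Hypothetical index entries, not taken from any real manual.
my @entries = ('zebra', 'édition', 'apple', 'Emacs', 'eval');

# Built-in sort compares by code point: uppercase letters come first,
# accented letters come after all of plain ASCII.
my @by_codepoint = sort @entries;

# Unicode::Collate applies the Unicode Collation Algorithm
# (default settings, no language-specific tailoring).
my $collator = Unicode::Collate->new();
my @collated = $collator->sort(@entries);

binmode(STDOUT, ':encoding(UTF-8)');
print "code point: @by_codepoint\n";
print "collated:   @collated\n";
```

With these entries the built-in sort puts "Emacs" first and "édition" last, while the collator places "édition" between "apple" and "Emacs". For an English manual with few accented entries, the visible difference is mostly in how case is handled.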