On Mon, Feb 13, 2023 at 08:47:26AM +0000, Gavin Smith wrote:
> > Other than that I do not have much other idea than disabling it, for
> > instance if documentlanguage is en. The result with Unicode::Collate is
> > better for accented letters, but not so useful in english. There could
> > even be a customization variable to use Unicode::Collate even in
> > english.
>
> Another possibility is to use getSortKey:
>
> "$sortKey = $Collator->getSortKey($string)"
> -- see 4.3 Form Sort Key, UTS #10.
>
> Returns a sort key.
>
> You compare the sort keys using a binary comparison and get the
> result of the comparison of the strings using UCA.
>
> $Collator->getSortKey($a) cmp $Collator->getSortKey($b)
>
> is equivalent to
>
> $Collator->cmp($a, $b)
>
> From perlperf man page:
>
> Using a subroutine as part of your sort is a powerful way to get
> exactly what you want, but will usually be slower than the built-in
> alphabetic "cmp" and numeric "<=>" sort operators. It is possible to
> make multiple passes over your data, building indices to make the
> upcoming sort more efficient, and to use what is known as the "OM"
> (Orcish Maneuver) to cache the sort keys in advance. The cache lookup,
> while a good idea, can itself be a source of slowdown by enforcing a
> double pass over the data - once to setup the cache, and once to sort
> the data. Using "pack()" to extract the required sort key into a
> consistent string can be an efficient way to build a single string to
> compare, instead of using multiple sort keys, which makes it possible
> to use the standard, written in "c" and fast, perl "sort()" function on
> the output, and is the basis of the "GRT" (Guttman Rossler Transform).
> Some string combinations can slow the "GRT" down, by just being too
> plain complex for its own good.
>
> We could try caching sort keys and see if it is fast enough. If so, we
> could still use Unicode::Collate without any setting for this.
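
For reference, the cached-sort-key idea quoted above could look roughly like the sketch below. It is only an illustration, not the actual Texinfo code: the helper name sort_with_cached_keys and the sample strings are made up, and the collator is created with default options. Each key is computed once with getSortKey, and the sort itself then uses the built-in cmp.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Hypothetical helper (not taken from Texinfo): sort strings in UCA order,
# computing each sort key once and comparing keys with the built-in cmp.
sub sort_with_cached_keys {
    my ($collator, @strings) = @_;

    # Pass 1: build the cache, one binary sort key per distinct string.
    my %key_for;
    $key_for{$_} //= $collator->getSortKey($_) for @strings;

    # Pass 2: plain binary comparison on the cached keys.
    # Call this in list context; sort is only meaningful there.
    return sort { $key_for{$a} cmp $key_for{$b} } @strings;
}

# Default collator; any tailoring options would be an additional assumption.
my $collator = Unicode::Collate->new();

# "\x{e9}tude" is "etude" with an acute accent on the first letter.
my @sorted = sort_with_cached_keys($collator,
    'zebra', "\x{e9}tude", 'apple', 'Eden');
```

With a plain codepoint sort the accented entry would land after 'zebra' and 'Eden' before 'apple'; with the cached keys the order should match what $collator->sort would give, without calling the collator's comparison for every pair.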

OK, I'll propose a change. It would be simpler to avoid any use of
Unicode::Collate by using getSortKey too.

-- 
Pat