use getSortKey in Unicode::Collate

Gavin Smith Mon, 13 Feb 2023 00:47:50 -0800

> Other than that I do not have much other idea than disabling it, for
> instance if documentlanguage is en.  The result with Unicode::Collate is
> better for accented letters, but not so useful in english.  There could
> even be a customization variable to use Unicode::Collate even in
> english.


Another possibility is to use getSortKey:

       "$sortKey = $Collator->getSortKey($string)"
           -- see 4.3 Form Sort Key, UTS #10.

           Returns a sort key.

           You compare the sort keys using a binary comparison and get the
           result of the comparison of the strings using UCA.

              $Collator->getSortKey($a) cmp $Collator->getSortKey($b)

                 is equivalent to

              $Collator->cmp($a, $b)

From the perlperf man page:

       Using a subroutine as part of your sort is a powerful way to get
       exactly what you want, but will usually be slower than the built-in
       alphabetic "cmp" and numeric "<=>" sort operators.  It is possible to
       make multiple passes over your data, building indices to make the
       upcoming sort more efficient, and to use what is known as the "OM"
       (Orcish Maneuver) to cache the sort keys in advance.  The cache lookup,
       while a good idea, can itself be a source of slowdown by enforcing a
       double pass over the data - once to setup the cache, and once to sort
       the data.  Using "pack()" to extract the required sort key into a
       consistent string can be an efficient way to build a single string to
       compare, instead of using multiple sort keys, which makes it possible
       to use the standard, written in "c" and fast, perl "sort()" function on
       the output, and is the basis of the "GRT" (Guttman Rossler Transform).
       Some string combinations can slow the "GRT" down, by just being too
       plain complex for its own good.

We could try caching the sort keys and see whether that is fast enough.
If so, we could keep using Unicode::Collate everywhere, without needing
a customization variable for it.
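
A minimal sketch of what caching the keys might look like, assuming the
index entries are plain strings.  The @entries list and the %sort_key
hash are made up for illustration and are not taken from the texi2any
code.  Each key is computed once with getSortKey, and the sort itself
then uses only the built-in "cmp":

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

my $collator = Unicode::Collate->new();

# Hypothetical index entries, for illustration only.
my @entries = ('zebra', 'Éclair', 'apple', 'eclair', 'Apple');

# Compute each sort key exactly once, before sorting.
my %sort_key;
$sort_key{$_} = $collator->getSortKey($_) for @entries;

# The comparison block only does a binary cmp on the cached keys,
# so the collator is not called again during the sort.
my @sorted = sort { $sort_key{$a} cmp $sort_key{$b} } @entries;

binmode(STDOUT, ':encoding(UTF-8)');
print "$_\n" for @sorted;
```

Whether this is fast enough on a large index would still have to be
measured, but it does keep the number of getSortKey calls linear in
the number of entries.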
