On Thu, May 10, 2018 at 06:56:39PM +0300, Alexander Monakov wrote:
> Overall the new implementation is roughly 30% faster than glibc qsort,
> with a 2x or greater speedup for cases with tiny element counts. I see
> one instance where the new approach is significantly (1.5x) slower: it is
> ipa-icf.c:sort_congruence_class_groups_by_decl_uid. It sorts a big array
> (1500 entries) and needs 14 indirect loads just to reach the values to
> compare, so when branch prediction manages to guess correctly, it lets
> the CPU overlap the execution of two comparators and better hide their
> cache-miss latency.
> 
> Overall GCC spends about 0.3% of its time under qsort, so a 30% speedup
> there naively suggests about a 0.1% overall improvement, but the actual
> effect may be larger or smaller depending on how the new code affects
> cache behavior and branch predictors in other code, and it's not trivial
> to measure precisely.
> 
> I can go into more detail about measured stats if there's interest :)
> 
> Makefile.in changes are separated into patch 2 in the hope that it makes
> review easier, but the two patches will need to be applied together.
> 
> Bootstrapped/regtested on x86-64, OK for trunk?

Have you gathered some statistics on the element sizes and how often they
appear in qsort calls (perhaps weighted by n*log(n) of the element count)
across bootstrap+regtest?
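
For concreteness, one hypothetical way to collect such numbers (the shim
below, its name and its output format are invented for illustration, not
something in the patch) would be to log every call from the sort entry
point and post-process the build logs of a bootstrap+regtest run:

/* Hypothetical instrumentation sketch: log each sort invocation so
   element sizes can be histogrammed offline.  Weighting by n*log2(n)
   approximates the expected number of comparisons, so large sorts
   count proportionally more in the aggregate.  */

#include <stdio.h>
#include <math.h>
#include <stddef.h>

static void
record_sort_call (size_t n, size_t size)
{
  double weight = n > 1 ? (double) n * log2 ((double) n) : 0.0;
  fprintf (stderr, "sort-stat: n=%zu size=%zu weight=%.0f\n",
           n, size, weight);
}

Grepping the sort-stat: lines out of the logs and summing the weights per
element size would then show which sizes actually dominate.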

glibc uses indirect sorting (it sorts pointers to the elements, or indexes,
and just reshuffles the array at the end) and has a special case for the
most commonly used small element sizes (4/8 bytes).  With C++ templates you
could achieve that even without macros, just by instantiating the mergesort
and its helpers for the few common cases.  Or is that not worth it (as in,
we never sort really large (say > 32 byte) element sizes, and the 4 and 8
byte element sizes aren't common enough to see a benefit from using
constant-size memcpy for those cases)?
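
To make the template idea concrete, here is a minimal sketch (not the
proposed patch; cmp_fn, mergesort_fixed and sort_dispatch are invented
names, and the generic fallback and error handling are omitted).  Because
Size is a template parameter, every memcpy on the hot path has a
compile-time-constant length the compiler can expand inline:

#include <string.h>
#include <stdlib.h>
#include <stddef.h>

typedef int cmp_fn (const void *, const void *);

/* Mergesort instantiated for a compile-time element size.  TMP must
   provide room for n/2 elements.  */

template <size_t Size>
static void
mergesort_fixed (char *base, char *tmp, size_t n, cmp_fn *cmp)
{
  if (n <= 1)
    return;
  size_t n1 = n / 2;
  char *mid = base + n1 * Size, *end = base + n * Size;

  mergesort_fixed<Size> (base, tmp, n1, cmp);
  mergesort_fixed<Size> (mid, tmp, n - n1, cmp);

  /* Evacuate the first half to TMP, then merge back into BASE; the
     output cursor never overtakes the unread part of the second half.  */
  memcpy (tmp, base, n1 * Size);
  const char *a = tmp, *a_end = tmp + n1 * Size, *b = mid;
  char *dst = base;
  while (a < a_end)
    {
      if (b < end && cmp (b, a) < 0)
        {
          memcpy (dst, b, Size);  /* constant-size copy */
          b += Size;
        }
      else
        {
          memcpy (dst, a, Size);  /* constant-size copy */
          a += Size;
        }
      dst += Size;
    }
  /* Leftover second-half elements are already in their final place.  */
}

/* Dispatch the common small sizes to fixed-size instantiations;
   anything else would take a generic variable-size path (omitted).  */

static void
sort_dispatch (void *base, size_t n, size_t size, cmp_fn *cmp)
{
  if (n <= 1)
    return;
  char *tmp = (char *) malloc ((n / 2) * size);
  if (!tmp)
    return;
  if (size == 4)
    mergesort_fixed<4> ((char *) base, tmp, n, cmp);
  else if (size == 8)
    mergesort_fixed<8> ((char *) base, tmp, n, cmp);
  free (tmp);
}

The indirect variant would instead instantiate this for
Size == sizeof (void *) over an array of pointers (or over indexes) and
apply the resulting permutation to the original elements once at the end.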

        Jakub
