On 15/10/23 11:59, Bruno Haible wrote:
> With the new benchmark in place, I measured the run time of
> - the glibc 2.35 implementation of totalorder,
> - the gnulib implementation (picked by configuring with
> gl_cv_func_totalorder_in_libm=no gl_cv_func_totalorder_no_libm=no \
> gl_cv_func_totalorderf_in_libm=no gl_cv_func_totalorderf_no_libm=no \
> gl_cv_func_totalorderl_in_libm=no gl_cv_func_totalorderl_no_libm=no \
> - the gnulib implementation with some disabled NaN tests.
> This change (see attached patch) is correct: it still passes the unit
> tests.
>
> Here are the running times (on x86_64) of "./bench-totalorder fdl 1000000":
>
>                       f        d        l
>
> glibc               1.816    1.671    2.078
> gnulib              1.445    1.425    8.690
> gnulib with patch   1.798    1.974   14.032
>
> Conclusion:
> * My patch is a slowdown. It apparently "optimized" the fast path away. :-D
> * The gnulib implementation is significantly faster than glibc, except for
> the long-double case. I'll redo the measurements on various CPU types and
> then tell the glibc people...
Most of the old math implementations in libm do not take into consideration
recent compiler optimizations, such as builtins for NaN/inf checks; they also
tend to favor integer code over floating point. Recent implementations, like
the ones provided by ARM optimized routines and hypot/fmod/exp10, try to
improve on this by leveraging the compiler, using better algorithms, and
favoring FP.
Also, did you use the same compiler flags / environment as distros usually do?
On simple algorithms like this, -fstack-protector can be quite a hit, as can
PLT overhead.
I also checked the resulting code and it is larger than the glibc one for double
(375 vs 375 using the same compiler and flags), but it should not really matter.
>
> Kudos to you, Paul, for an implementation that is not only standards-compliant
> and portable, but also faster than glibc!
If you can, it would be good to have such an improvement in glibc as well. For
math code we have some benchmarks in benchtests.
>
> Bruno
>