https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776
--- Comment #20 from Matthias Kretz (Vir) <mkretz at gcc dot gnu.org> ---
Thanks, I'd be very happy if such a relatively clear implementation could make
it!
> branchfree code is always better.
Don't say it like that. Smart branching, making use of how static
branch-prediction works, can speed up code significantly. You don't want to
compute everything when 99.9% of the inputs need only a fraction of the work.
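
To illustrate what I mean by smart branching, here is a minimal sketch. It is not the code under discussion in this PR: the name hypot3_branchy and the thresholds 1e-150 / 1e150 are assumptions made up for the example. The common case is marked as expected via GCC's __builtin_expect, so the plain formula ends up on the fall-through path and the expensive handling of extreme inputs stays out of the way.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Hypothetical example, not the implementation proposed in this PR.
inline double hypot3_branchy(double x, double y, double z)
{
    x = std::fabs(x);
    y = std::fabs(y);
    z = std::fabs(z);
    const double hi = std::max({x, y, z});

    // Assumed safe window: if the largest magnitude lies in [1e-150, 1e150],
    // the sum of squares cannot overflow and the result cannot be dominated
    // by underflow error.
    constexpr double lo_limit = 1e-150;
    constexpr double hi_limit = 1e150;

    // Fast path, hinted as the likely one. A NaN in hi makes both
    // comparisons false and therefore takes the slow path.
    if (__builtin_expect(hi >= lo_limit && hi <= hi_limit, 1))
        return std::sqrt(x * x + y * y + z * z);

    // Slow path: infinities win over NaNs, as for std::hypot.
    if (std::isinf(x) || std::isinf(y) || std::isinf(z))
        return std::numeric_limits<double>::infinity();
    if (std::isnan(x) || std::isnan(y) || std::isnan(z))
        return x + y + z; // propagates the NaN
    if (hi == 0)
        return 0;

    // Scale by the largest magnitude so the squares stay in range.
    x /= hi;
    y /= hi;
    z /= hi;
    return hi * std::sqrt(x * x + y * y + z * z);
}
```

With inputs like (3, 4, 12) only the first branch runs; something like (1e200, 0, 0) pays for the divisions, but only then.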
TYPE                        Latency        Speedup      Throughput     Speedup
                            [cycles/call]  [per value]  [cycles/call]  [per value]
float, simd_abi::scalar     48.1           1            17             1
float, std::hypot           43.3           1.11         12.3           1.39
float, hypot3_scale         31.7           1.52         22.3           0.764
float, hypot3_exp           83.9           0.574        84.5           0.201
--------------------------------------------------------------------------------------
TYPE                        Latency        Speedup      Throughput     Speedup
                            [cycles/call]  [per value]  [cycles/call]  [per value]
double, simd_abi::scalar    54.7           1            15             1
double, std::hypot          53.8           1.02         19             0.79
double, hypot3_scale        44             1.24         24             0.625
double, hypot3_exp          91.3           0.599        91             0.165
and with -ffast-math:
TYPE                        Latency        Speedup      Throughput     Speedup
                            [cycles/call]  [per value]  [cycles/call]  [per value]
float, simd_abi::scalar     48.9           1            9.15           1
float, std::hypot           53.2           0.918        8.31           1.1
float, hypot3_scale         31.3           1.56         14             0.652
float, hypot3_exp           55.9           0.874        21.5           0.425
--------------------------------------------------------------------------------------
TYPE                        Latency        Speedup      Throughput     Speedup
                            [cycles/call]  [per value]  [cycles/call]  [per value]
double, simd_abi::scalar    54.8           1            9.07           1
double, std::hypot          61.5           0.891        11.3           0.805
double, hypot3_scale        40.8           1.34         12.1           0.753
double, hypot3_exp          64.2           0.853        23.3           0.39
I have not tested correctness or precision yet. Also, the benchmark only uses
inputs that do not require anything other than √(x²+y²+z²) (which, I believe,
is the common case and thus the one to optimize for).
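
For readers unfamiliar with the two columns in the tables above: assuming the usual meaning, latency is measured with each call depending on the previous result, while throughput is measured with independent calls that the CPU may overlap. The sketch below is a hypothetical harness, not the benchmark that produced the numbers above; latency_ns and throughput_ns are names invented here, and it reports nanoseconds per call rather than cycles.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical harness; reports ns per call, not cycles per call.

// Latency: every call consumes the previous result, so calls cannot overlap.
template <class F>
double latency_ns(F f, std::size_t n, double& sink)
{
    using clock = std::chrono::steady_clock;
    double x = 1.0;
    const auto t0 = clock::now();
    for (std::size_t i = 0; i < n; ++i)
        x = f(x, 1.0, 2.0) * 0.5; // the factor keeps x bounded
    const auto t1 = clock::now();
    sink = x; // keep the result observable so the loop is not removed
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / n;
}

// Throughput: the inputs are independent, so consecutive calls may overlap.
// Expects in.size() >= 3; consumes the values in groups of three.
template <class F>
double throughput_ns(F f, const std::vector<double>& in, double& sink)
{
    using clock = std::chrono::steady_clock;
    double sum = 0;
    const auto t0 = clock::now();
    for (std::size_t i = 0; i + 2 < in.size(); i += 3)
        sum += f(in[i], in[i + 1], in[i + 2]);
    const auto t1 = clock::now();
    sink = sum;
    const std::size_t calls = in.size() / 3;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / calls;
}
```

Either function can be called with a lambda wrapping the candidate, e.g. `[](double a, double b, double c) { return std::sqrt(a * a + b * b + c * c); }`. The speedup columns would then be the baseline's time divided by the candidate's time.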