https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085
--- Comment #4 from Steven Munroe <munroesj at gcc dot gnu.org> ---
I am seeing a similar problem with union transfers from __float128 to
__int128.
  static inline unsigned __int128
  vec_xfer_bin128_2_int128t (__binary128 f128)
  {
    __VF_128 vunion;

    vunion.vf1 = f128;

    return (vunion.ui1);
  }
and
  unsigned __int128
  test_xfer_bin128_2_int128 (__binary128 f128)
  {
    return vec_xfer_bin128_2_int128t (f128);
  }
generates:
0000000000000030 <test_xfer_bin128_2_int128>:
  30:   57 12 42 f0     xxswapd vs34,vs34
  34:   20 00 20 39     li      r9,32
  38:   d0 ff 41 39     addi    r10,r1,-48
  3c:   99 4f 4a 7c     stxvd2x vs34,r10,r9
  40:   f0 ff 61 e8     ld      r3,-16(r1)
  44:   f8 ff 81 e8     ld      r4,-8(r1)
  48:   20 00 80 4e     blr
For POWER8 this should use mfvsrd/xxpermdi/mfvsrd instead of storing the
vector to the stack and immediately reloading it into GPRs.
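
For anyone trying to reproduce this without the original headers, the transfer
above can be sketched as a self-contained unit. The definition of __VF_128 is
not shown in this comment, so the union below is an assumed stand-in that
carries only the two members used above; the names ending in _sketch are
hypothetical. The helper that returns the upper doubleword is only there to
make the result easy to check and is not part of the original test case.

```c
#include <stdint.h>

/* Assumed stand-in for __VF_128, whose definition is not shown in this
   comment. Only the two members used by the transfer are reproduced. */
typedef union
{
  __float128 vf1;         /* binary128 view */
  unsigned __int128 ui1;  /* 128-bit integer view of the same bytes */
} vf128_sketch_t;

/* Same shape as vec_xfer_bin128_2_int128t: write one union member,
   read the other. */
static inline unsigned __int128
xfer_bin128_2_int128_sketch (__float128 f128)
{
  vf128_sketch_t vunion;

  vunion.vf1 = f128;
  return vunion.ui1;
}

/* Upper 64 bits of the image: sign bit, 15-bit biased exponent and the
   top 48 bits of the fraction. */
static inline uint64_t
bin128_high_dword_sketch (__float128 f128)
{
  return (uint64_t) (xfer_bin128_2_int128_sketch (f128) >> 64);
}
```

Compiling a wrapper around xfer_bin128_2_int128_sketch with -O3 -mcpu=power8
and disassembling it should show whether the store/reload sequence is still
generated.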
This looks like the root cause of the poor performance of __float128
soft-float on POWER8. A simple benchmark using __float128 in C code, calling
libgcc for -mcpu=power8 and then using hardware instructions for
-mcpu=power9, shows the difference:
P8 Target P8AT14, uses libgcc __addkf3_sw and __mulkf3_sw:
  test_time_f128 f128 CC tb delta = 52589, sec = 0.000102713
P9 Target P8AT14, uses libgcc __addkf3_hw and __mulkf3_hw:
  test_time_f128 f128 CC tb delta = 18762, sec = 3.66445e-05
P9 Target P9AT14, inline hardware binary128 float:
  test_time_f128 f128 CC tb delta = 3809, sec = 7.43945e-06
I used Valgrind Itrace and Sim-ppc and perfstat analysis. Every call to libgcc
__add/sub/mul/divkf3 takes a load-hit-store flush. This explains why
__float128 is 13.8X slower on P8 than on P9.
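
The 13.8X figure follows directly from the timebase deltas reported above. A
minimal sketch of that arithmetic is below; the variable and function names
are hypothetical and only the three tb delta values come from the runs in this
comment.

```c
/* Timebase deltas copied from the three runs reported above. */
static const double tb_p8_libgcc_sw = 52589.0; /* P8, __addkf3_sw/__mulkf3_sw */
static const double tb_p9_libgcc_hw = 18762.0; /* P9, __addkf3_hw/__mulkf3_hw */
static const double tb_p9_inline_hw = 3809.0;  /* P9, inline binary128 */

/* How many times slower the first run is than the second. */
static double
slowdown_sketch (double slower_tb, double faster_tb)
{
  return slower_tb / faster_tb;
}

/* P8 soft-float against P9 inline hardware: about 13.8. */
static double
p8_sw_vs_p9_inline (void)
{
  return slowdown_sketch (tb_p8_libgcc_sw, tb_p9_inline_hw);
}

/* P8 soft-float against P9 going through libgcc: about 2.8. */
static double
p8_sw_vs_p9_libgcc (void)
{
  return slowdown_sketch (tb_p8_libgcc_sw, tb_p9_libgcc_hw);
}

/* Cost of the libgcc call path on P9 against inline code: about 4.9. */
static double
p9_libgcc_vs_p9_inline (void)
{
  return slowdown_sketch (tb_p9_libgcc_hw, tb_p9_inline_hw);
}
```

The third ratio suggests that the call path through libgcc is expensive even
when the arithmetic itself runs in hardware.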