https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97147
--- Comment #1 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Richard Biener from comment #0)
> typedef double v2df __attribute__((vector_size(16)));
> double foo (v2df x, double y)
> {
>   return x[0] + x[1] + y;
> }
> double bar (v2df x, double y)
> {
>   return y + x[0] + x[1];
> }
>
> with -O2 -mavx2 -mtune=znver2 ends up generating
>
> foo:
> .LFB0:
>         .cfi_startproc
>         vhaddpd %xmm0, %xmm0, %xmm0
>         vaddsd %xmm1, %xmm0, %xmm0
>         ret
>
> bar:
> .LFB1:
>         .cfi_startproc
>         vmovapd %xmm0, %xmm2
>         vaddsd %xmm1, %xmm0, %xmm0
>         vunpckhpd %xmm2, %xmm2, %xmm2
>         vaddsd %xmm2, %xmm0, %xmm0
>         ret
>
> where bar should be a _lot_ better according to Agner which says
> that vhaddpd has a 4 uops, a latency of 7 cycles and a throughput of only
> one per two cycles while both vunpckhpd and vaddsd fare a lot better here.
> Coffee-lake isn't much better here.
>
> Maybe we want to disable the V2DF instructions for most tunings somehow?

Bar is also better on CLX and ICL.
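
For illustration only, here is a minimal sketch of how the unpack-plus-scalar-add sequence could be written by hand for foo while keeping its association order, (x[0] + x[1]) + y. It uses SSE2 intrinsics rather than the vector extension from the testcase, so it assumes an x86 target; the name foo_unpck is made up for this example. The compiler is still free to select a different instruction sequence (including vhaddpd) for it, so this shows the intended data flow, not guaranteed codegen.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; assumes an x86/x86-64 target */

/* Hypothetical helper, not part of the original testcase.
   Computes (x[0] + x[1]) + y, the same association order as foo.  */
double foo_unpck (__m128d x, double y)
{
  __m128d hi  = _mm_unpackhi_pd (x, x);  /* both lanes now hold x[1] */
  __m128d sum = _mm_add_sd (x, hi);      /* low lane = x[0] + x[1] */
  return _mm_cvtsd_f64 (sum) + y;        /* scalar add of y */
}
```

Calling foo_unpck (_mm_set_pd (2.0, 1.0), 4.0) adds lanes 1.0 and 2.0 first and then y.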