https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97147
--- Comment #1 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Richard Biener from comment #0)
> typedef double v2df __attribute__((vector_size(16)));
> double foo (v2df x, double y)
> {
> return x[0] + x[1] + y;
> }
> double bar (v2df x, double y)
> {
> return y + x[0] + x[1];
> }
>
> with -O2 -mavx2 -mtune=znver2 ends up generating
>
> foo:
> .LFB0:
>         .cfi_startproc
>         vhaddpd %xmm0, %xmm0, %xmm0
>         vaddsd  %xmm1, %xmm0, %xmm0
>         ret
>
> bar:
> .LFB1:
>         .cfi_startproc
>         vmovapd   %xmm0, %xmm2
>         vaddsd    %xmm1, %xmm0, %xmm0
>         vunpckhpd %xmm2, %xmm2, %xmm2
>         vaddsd    %xmm2, %xmm0, %xmm0
>         ret
>
> where bar should be a _lot_ better according to Agner which says
> that vhaddpd has 4 uops, a latency of 7 cycles and a throughput of only
> one per two cycles while both vunpckhpd and vaddsd fare a lot better here.
> Coffee-lake isn't much better here.
>
> Maybe we want to disable the V2DF instructions for most tunings somehow?
bar is also better on CLX (Cascade Lake) and ICL (Ice Lake).
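
For anyone who wants to reproduce the comparison, here is a minimal self-contained sketch of the testcase from comment #0. The foo and bar functions are copied from the report. The check_small_integers helper is hypothetical and is added here only as a sanity check. It uses small integer inputs so that the two summation orders give the same exact result. In general the orders are not interchangeable under default floating-point rules.

```c
#include <assert.h>

typedef double v2df __attribute__((vector_size(16)));

/* Evaluated as (x[0] + x[1]) + y: the two lanes are summed first,
   which is the form that was matched to vhaddpd in the report.  */
double foo (v2df x, double y)
{
  return x[0] + x[1] + y;
}

/* Evaluated as (y + x[0]) + x[1]: a chain of scalar adds, which
   produced vaddsd/vunpckhpd/vaddsd in the report.  */
double bar (v2df x, double y)
{
  return y + x[0] + x[1];
}

/* Hypothetical helper, not part of the original testcase.  The
   inputs are small integers, so every intermediate sum is exact
   and both orders of evaluation must agree.  */
void check_small_integers (void)
{
  v2df x = { 1.0, 2.0 };
  assert (foo (x, 3.0) == 6.0);
  assert (bar (x, 3.0) == 6.0);
}
```

Compiling this with -O2 -mavx2 -mtune=znver2 -S, the options given in comment #0, should allow the two instruction sequences to be compared directly. The exact output may differ between GCC versions.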