https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97147

--- Comment #1 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Richard Biener from comment #0)
> typedef double v2df __attribute__((vector_size(16)));
> double foo (v2df x, double y)
> {
>   return x[0] + x[1] + y;
> }
> double bar (v2df x, double y)
> {
>   return y + x[0] + x[1];
> }
> 
> with -O2 -mavx2 -mtune=znver2 ends up generating
> 
> foo:
> .LFB0:
>         .cfi_startproc
>         vhaddpd %xmm0, %xmm0, %xmm0
>         vaddsd  %xmm1, %xmm0, %xmm0
>         ret
> 
> bar:
> .LFB1:
>         .cfi_startproc
>         vmovapd %xmm0, %xmm2
>         vaddsd  %xmm1, %xmm0, %xmm0
>         vunpckhpd       %xmm2, %xmm2, %xmm2
>         vaddsd  %xmm2, %xmm0, %xmm0
>         ret
> 
> where bar should be a _lot_ better according to Agner which says
> that vhaddpd has 4 uops, a latency of 7 cycles and a throughput of only
> one per two cycles while both vunpckhpd and vaddsd fare a lot better here.
> Coffee-lake isn't much better here.
> 
> Maybe we want to disable the V2DF instructions for most tunings somehow?

bar is also better on CLX (Cascade Lake) and ICL (Ice Lake).
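
For reference, a minimal sketch (not part of the original report) of why the
two functions get different code in the first place. foo adds x[0] + x[1]
first, which is the pattern that maps onto vhaddpd, while bar adds y + x[0]
first. Floating-point addition is not associative, so without
-ffast-math / -fassociative-math the compiler has to keep each source order.
The helper name self_check and the input values are made up for illustration,
and the sketch assumes IEEE double arithmetic in SSE/AVX registers (the x86-64
default).

```c
#include <assert.h>

typedef double v2df __attribute__((vector_size(16)));

/* Same bodies as in the report: (x[0] + x[1]) + y */
double foo (v2df x, double y)
{
  return x[0] + x[1] + y;
}

/* Same bodies as in the report: (y + x[0]) + x[1] */
double bar (v2df x, double y)
{
  return y + x[0] + x[1];
}

/* Hypothetical helper, for illustration only. */
void self_check (void)
{
  /* Exactly representable inputs: both orders give the same sum. */
  v2df a = { 1.0, 2.0 };
  assert (foo (a, 3.0) == 6.0);
  assert (bar (a, 3.0) == 6.0);

  /* Cancellation: foo computes (1e16 - 1e16) + 0.5 = 0.5, while bar
     computes 0.5 + 1e16, which rounds to 1e16, and then subtracts 1e16.  */
  v2df b = { 1e16, -1e16 };
  assert (foo (b, 0.5) == 0.5);
  assert (bar (b, 0.5) == 0.0);
}
```

Calling self_check () from main and building with the flags from the report
(-O2 -mavx2 -mtune=znver2) should pass silently; adding -S shows the two
instruction sequences quoted above.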
