https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97147
--- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Richard Biener from comment #3)
> (In reply to Hongtao.liu from comment #2)
> > Disabling (define_insn "*sse3_haddv2df3_low") and (define_insn
> > "*sse3_hsubv2df3_low") seems to be ok.
> > But for foo1:
> >
> > v2df foo1 (v2df x, v2df y)
> > {
> >   v2df a;
> >   a[0] = x[0] + x[1];
> >   a[1] = y[0] + y[1];
> >   return a;
> > }
> >
> > it's
> >
> >         vhaddpd %xmm1, %xmm0, %xmm0
> >         ret
> >
> > vs
> >
> >         movapd   xmm2, xmm0
> >         unpckhpd xmm2, xmm2
> >         addsd    xmm0, xmm2
> >         movapd   xmm2, xmm1
> >         unpckhpd xmm1, xmm1
> >         addsd    xmm1, xmm2
> >         unpcklpd xmm0, xmm1
> >         ret
> >
> > and note that w/o vhaddpd, codegen can be optimized to
> >
> >         movapd   xmm2, xmm0
> >         unpcklpd xmm2, xmm1
> >         unpckhpd xmm0, xmm1
> >         addpd    xmm0, xmm2
> >         ret
> >
> > Guess maybe it's better done at the GIMPLE level?
>
> On GIMPLE we see the testcase basically unchanged from what the source does:
>
>   _1 = BIT_FIELD_REF <x_7(D), 64, 0>;
>   _2 = BIT_FIELD_REF <x_7(D), 64, 64>;
>   _3 = _1 + _2;
>   a_9 = BIT_INSERT_EXPR <a_8(D), _3, 0>;
>   _4 = BIT_FIELD_REF <y_10(D), 64, 0>;
>   _5 = BIT_FIELD_REF <y_10(D), 64, 64>;
>   _6 = _4 + _5;
>   a_11 = BIT_INSERT_EXPR <a_9, _6, 64>;
>   return a_11;
>
> Vectorization fails in SLP discovery because we essentially see two lanes
> operating on different vectors and we don't implement a way to shuffle
> them together.
>
> I think the full hadd define_insns are OK to keep, they really have special
> arrangements (esp. the SFmode variants).  But the reductions to scalar
> (*_low) seem unnecessary and penalizing (maybe we can guard use of those
> with a -mtune-ctl?).

Yes, I'm adding a tune option to enable the v2df vector reduction patterns,
disabled by default for all processors.

> I also see we're missing patterns for h{add,sub}ps (not sure if we can
> manage to get combine to synthesize it).

You mean (define_insn "sse3_h<insn>v4sf3")?
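For reference, here is a minimal sketch (not from the bug report) of the
shuffle-based formulation Richard describes, i.e. what SLP discovery would
have to synthesize when the two lanes come from different input vectors.
It uses GCC's generic vector extensions; the function name foo1_shuffled and
the v2di mask typedef are illustrative assumptions, only foo1 appears in the
report.  At -O2 this is expected to lower to roughly the
unpcklpd + unpckhpd + addpd sequence quoted above.

  /* Illustrative rewrite of foo1 using explicit cross-vector shuffles.  */
  typedef double    v2df __attribute__ ((vector_size (16)));
  typedef long long v2di __attribute__ ((vector_size (16)));

  v2df
  foo1_shuffled (v2df x, v2df y)
  {
    v2df lo = __builtin_shuffle (x, y, (v2di) { 0, 2 });  /* { x[0], y[0] } */
    v2df hi = __builtin_shuffle (x, y, (v2di) { 1, 3 });  /* { x[1], y[1] } */
    return lo + hi;  /* one addpd instead of two scalar addsd */
  }

In other words, the missing piece in SLP discovery is recognizing that the
two lanes read from different vectors and inserting these shuffles itself.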