https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97147
--- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Richard Biener from comment #3)
> (In reply to Hongtao.liu from comment #2)
> > Disabling (define_insn "*sse3_haddv2df3_low") and (define_insn
> > "*sse3_hsubv2df3_low") seems to be ok.
> > But for foo1:
> >
> > v2df foo1 (v2df x, v2df y)
> > {
> >   v2df a;
> >   a[0] = x[0] + x[1];
> >   a[1] = y[0] + y[1];
> >   return a;
> > }
> >
> > it's
> >
> >         vhaddpd %xmm1, %xmm0, %xmm0
> >         ret
> >
> > vs
> >
> >         movapd   xmm2, xmm0
> >         unpckhpd xmm2, xmm2
> >         addsd    xmm0, xmm2
> >         movapd   xmm2, xmm1
> >         unpckhpd xmm1, xmm1
> >         addsd    xmm1, xmm2
> >         unpcklpd xmm0, xmm1
> >         ret
> >
> > and note that w/o vhaddpd, codegen can be optimized to
> >
> >         movapd   xmm2, xmm0
> >         unpcklpd xmm2, xmm1
> >         unpckhpd xmm0, xmm1
> >         addpd    xmm0, xmm2
> >         ret
> >
> > Guess maybe it's better done at the GIMPLE level?
>
> On GIMPLE we see the testcase basically unchanged from what the source does:
>
>   _1 = BIT_FIELD_REF <x_7(D), 64, 0>;
>   _2 = BIT_FIELD_REF <x_7(D), 64, 64>;
>   _3 = _1 + _2;
>   a_9 = BIT_INSERT_EXPR <a_8(D), _3, 0>;
>   _4 = BIT_FIELD_REF <y_10(D), 64, 0>;
>   _5 = BIT_FIELD_REF <y_10(D), 64, 64>;
>   _6 = _4 + _5;
>   a_11 = BIT_INSERT_EXPR <a_9, _6, 64>;
>   return a_11;
>
> Vectorization fails in SLP discovery because we essentially see two lanes
> operating on different vectors and we don't implement a way to shuffle
> them together.
>
> I think the full hadd define_insns are OK to keep, they really have special
> arrangements (esp. the SFmode variants).  But the reductions to scalar
> (*_low) seem unnecessary and penalizing (maybe we can guard use of those
> with a -mtune-ctl?).

Yes, I'm adding a tune option to enable the v2df vector reduction patterns,
disabled by default for all processors.

> I also see we're missing patterns for h{add,sub}ps (not sure if we can
> manage to get combine to synthesize it).

You mean (define_insn "sse3_h<insn>v4sf3")?
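For reference, here is a minimal sketch (not from the bug report) of the
shuffle-based formulation Richard describes, i.e. what SLP discovery would
have to synthesize when the two lanes come from different input vectors.
It uses GCC's generic vector extensions; the function name foo1_shuffled and
the v2di mask typedef are illustrative assumptions, only foo1 appears in the
report.  At -O2 this is expected to lower to roughly the
unpcklpd + unpckhpd + addpd sequence quoted above.

  /* Illustrative rewrite of foo1 using explicit cross-vector shuffles.  */
  typedef double    v2df __attribute__ ((vector_size (16)));
  typedef long long v2di __attribute__ ((vector_size (16)));

  v2df
  foo1_shuffled (v2df x, v2df y)
  {
    v2df lo = __builtin_shuffle (x, y, (v2di) { 0, 2 });  /* { x[0], y[0] } */
    v2df hi = __builtin_shuffle (x, y, (v2di) { 1, 3 });  /* { x[1], y[1] } */
    return lo + hi;  /* one addpd instead of two scalar addsd */
  }

In other words, the missing piece in SLP discovery is recognizing that the
two lanes read from different vectors and inserting these shuffles itself.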