https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81904

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #5)
> (In reply to Richard Biener from comment #1)
> > Hmm, I think the issue is we see
> > 
> > f (__m128d x, __m128d y, __m128d z)
> > {
> >   vector(2) double _4;
> >   vector(2) double _6;
> > 
> >   <bb 2> [100.00%]:
> >   _4 = x_2(D) * y_3(D);
> >   _6 = __builtin_ia32_addsubpd (_4, z_5(D)); [tail call]
> We can fold the builtin into .VEC_ADDSUB, and optimize MUL + VEC_ADDSUB ->
> VEC_FMADDSUB in match.pd?
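For reference, the GIMPLE quoted from comment #1 corresponds to source
along these lines (a reconstruction on my side, not necessarily the
exact testcase):

  /* Compile with -msse3; with -mfma we would like a single
     vfmaddsub*pd instead of mulpd + addsubpd.  */
  #include <immintrin.h>

  __m128d
  f (__m128d x, __m128d y, __m128d z)
  {
    return _mm_addsub_pd (_mm_mul_pd (x, y), z);
  }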

I think MUL + .VEC_ADDSUB can be handled in the FMA pass.  For my example
above we get, early (before FMA recognition):

  _4 = x_2(D) * y_3(D);
  tem2_7 = _4 + z_6(D);
  tem3_8 = _4 - z_6(D);
  _9 = VEC_PERM_EXPR <tem2_7, tem3_8, { 0, 3 }>;

we could recognize that as .VEC_ADDSUB.  I think we want to avoid doing
this too early.  I am not sure doing it within the FMA pass itself will
work, since we key FMAs on the mult but would need to key the addsub
on the VEC_PERM (we are walking stmts from BB start to end).  Looking
at the code it seems changing the walking order should work.
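As a scalar sketch (mine, for illustration only) of the combined
operation the FMA pass would then form from the mult plus the
add/sub/VEC_PERM triple, with the lane roles taken from the { 0, 3 }
permute above:

  /* Lane 0 is selected from tem2 (_4 + z), lane 1 from tem3 (_4 - z),
     so the fused result is an fma in lane 0 and an fms in lane 1.  */
  static void
  fmaddsub_sketch (double res[2], const double x[2], const double y[2],
                   const double z[2])
  {
    res[0] = x[0] * y[0] + z[0];
    res[1] = x[1] * y[1] - z[1];
  }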

Note matching

  tem2_7 = _4 + z_6(D);
  tem3_8 = _4 - z_6(D);
  _9 = VEC_PERM_EXPR <tem2_7, tem3_8, { 0, 3 }>;

to .VEC_ADDSUB possibly loses FP exceptions (the vectorizer now
directly creates .VEC_ADDSUB when possible).
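To spell that out with a scalar sketch (again only a model of the
open-coded form): both the add and the sub are evaluated in every lane
and half of the results are thrown away by the permute, so an FP
exception raised in a discarded lane is visible before the transform
but not after it, when only one operation per lane is performed:

  /* Open-coded form as in the dump above.  E.g. an overflow in
     tem2[1] or tem3[0] raises an exception here even though those
     lanes are discarded; a single addsub operation would not compute
     them at all.  */
  static void
  open_coded (double res[2], const double a[2], const double z[2])
  {
    double tem2[2] = { a[0] + z[0], a[1] + z[1] };
    double tem3[2] = { a[0] - z[0], a[1] - z[1] };
    res[0] = tem2[0];   /* VEC_PERM index 0 */
    res[1] = tem3[1];   /* VEC_PERM index 3 */
  }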
