https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81904
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #5)
> (In reply to Richard Biener from comment #1)
> > Hmm, I think the issue is we see
> >
> > f (__m128d x, __m128d y, __m128d z)
> > {
> >   vector(2) double _4;
> >   vector(2) double _6;
> >
> >   <bb 2> [100.00%]:
> >   _4 = x_2(D) * y_3(D);
> >   _6 = __builtin_ia32_addsubpd (_4, z_5(D)); [tail call]
> We can fold the builtin into .VEC_ADDSUB, and optimize MUL + VEC_ADDSUB ->
> VEC_FMADDSUB in match.pd?

I think MUL + .VEC_ADDSUB can be handled in the FMA pass.  For my example
above we get, early on (before FMA recognition):

  _4 = x_2(D) * y_3(D);
  tem2_7 = _4 + z_6(D);
  tem3_8 = _4 - z_6(D);
  _9 = VEC_PERM_EXPR <tem2_7, tem3_8, { 0, 3 }>;

which we could recognize as .VEC_ADDSUB.  I think we want to avoid doing this
too early.  I am not sure doing it within the FMA pass itself will work, since
we key FMAs on the multiplication but would need to key the addsub on the
VEC_PERM_EXPR (we walk statements from BB start to end).  Looking at the code,
it seems changing the walking order should work.

Note that matching

  tem2_7 = _4 + z_6(D);
  tem3_8 = _4 - z_6(D);
  _9 = VEC_PERM_EXPR <tem2_7, tem3_8, { 0, 3 }>;

to .VEC_ADDSUB possibly loses exceptions (the vectorizer now directly creates
.VEC_ADDSUB when possible).
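
For reference, a minimal C sketch of the source pattern under discussion (not
taken from the bug report itself; it assumes the usual x86 intrinsics from
immintrin.h and an -mfma target).  Built with -O2 -mfma, the goal would be for
the FMA pass to contract the multiply and the addsub in addsub_version into
the single vfmaddsub instruction that fma_version reaches directly:

  #include <immintrin.h>

  /* Should produce GIMPLE like the quote above:
       _4 = x_2(D) * y_3(D);
       _6 = __builtin_ia32_addsubpd (_4, z_5(D));  */
  __m128d
  addsub_version (__m128d x, __m128d y, __m128d z)
  {
    return _mm_addsub_pd (_mm_mul_pd (x, y), z);
  }

  /* The contracted form (a single vfmaddsub132pd with -mfma),
     i.e. what .VEC_FMADDSUB would expand to.  */
  __m128d
  fma_version (__m128d x, __m128d y, __m128d z)
  {
    return _mm_fmaddsub_pd (x, y, z);
  }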