https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101929

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
It's interesting to note that in

-  _820 = {_187, _189, _187, _189};
-  vect_t2_188.65_821 = VIEW_CONVERT_EXPR<vector(4) int>(_820);
-  vect__200.67_823 = vect_t0_184.64_819 - vect_t2_188.65_821;
-  vect__191.66_822 = vect_t0_184.64_819 + vect_t2_188.65_821;
-  _824 = VEC_PERM_EXPR <vect__191.66_822, vect__200.67_823, { 0, 1, 6, 7 }>;

we only need part of the CTOR for the add/sub operations (because the
blend ignores some of their lanes).  That might even make it possible to
elide the final composition of the low/high parts and expose some more
insn-level parallelism.

Of course that looks quite difficult to achieve.

--

Note your CTOR cost estimates might be off given the CTORs are mostly
regular like

{ _181, _181, _181, _181, _262, _262, _262, _262, _343, _343, _343, _343, _48,
_48, _48, _48 }

and thus could use 4 splats to xmm plus 4 inserts?  For the V4SI
vectorization we unfortunately decide to do

t.c:37:9: note:   Using a splat of the uniform operand
t.c:37:9: note:   Using a splat of the uniform operand
t.c:37:9: note:   Building parent vector operands from scalars instead

and thus end up with { _49, _50, _49, _50 }.  That said, I don't think
the backend gets easy access to the actual CTOR layout yet to improve
costing (similarly to permutes, where it doesn't see the actual permute
mask).

--

It's difficult (if not impossible) for the vectorizer to second-guess
the follow-up FRE; we're a long way from doing loop + SLP vectorization
in one go and discovering that we can elide the vector store.
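The pass-ordering problem concerns source shapes like the following sketch
(illustrative only, not the PR's testcase; `sum_through_temp` is a made-up
name). The first loop gets vectorized and stores to `t[]`; the second loop
reloads those values immediately, so a later FRE pass could forward the
stored lanes and make the vector store dead, but the vectorizer has already
committed to emitting it:

```c
/* Store-then-reload pattern: loop vectorization keeps the store to
   t[], and only the follow-up FRE could notice the reload makes it
   redundant.  Assumes n <= 64.  */
static int sum_through_temp(const int *a, int n)
{
    int t[64];
    for (int i = 0; i < n; ++i)
        t[i] = a[i] * 2;      /* vectorized loop stores to memory */
    int s = 0;
    for (int i = 0; i < n; ++i)
        s += t[i];            /* immediate reload FRE can forward */
    return s;
}
```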
