https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101929
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
It's interesting to note that in

-  _820 = {_187, _189, _187, _189};
-  vect_t2_188.65_821 = VIEW_CONVERT_EXPR<vector(4) int>(_820);
-  vect__200.67_823 = vect_t0_184.64_819 - vect_t2_188.65_821;
-  vect__191.66_822 = vect_t0_184.64_819 + vect_t2_188.65_821;
-  _824 = VEC_PERM_EXPR <vect__191.66_822, vect__200.67_823, { 0, 1, 6, 7 }>;

we only need parts of the CTOR for the add/sub operands, because the blend
ignores some of their lanes.  That might even allow eliding the final
compose of the low/high parts and expose some more insn parallelism.  Of
course that looks quite difficult to achieve.

--

Note your CTOR cost estimates might be off, given the CTORs are mostly
regular like

  { _181, _181, _181, _181, _262, _262, _262, _262,
    _343, _343, _343, _343, _48, _48, _48, _48 }

and thus could use 4 splats to xmm and 4 inserts?

For the V4SI vectorization we unfortunately decide to do

t.c:37:9: note: Using a splat of the uniform operand
t.c:37:9: note: Using a splat of the uniform operand
t.c:37:9: note: Building parent vector operands from scalars instead

and thus end up with { _49, _50, _49, _50 }.  That said, I don't think the
backend gets easy access to the actual CTOR layout yet to improve costing
(similarly to how it only sees permutes without the actual permute mask).

--

It's difficult (if not impossible) for the vectorizer to second-guess the
followup FRE; we're a long way from doing loop + SLP vectorization in one
go and discovering that we can elide the vector store.