https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583
--- Comment #3 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)
> Another example this shows is for gcc.dg/vect/slp-42.c - we definitely can
> do the interleaving scheme as non-SLP vectorization shows.
>
> gcc.dg/vect/slp-42.c also shows we're not yet "lowering" all SLP load
> permutes.
> The original SLP attempt still has
>
>   node 0x45d5050 (max_nunits=4, refcnt=2) vector([4,4]) int
>   op template: _2 = q[_1];
>       stmt 0 _2 = q[_1];
>       stmt 1 _8 = q[_7];
>       stmt 2 _14 = q[_13];
>       stmt 3 _20 = q[_19];
>       load permutation { 0 1 2 3 }
>   node 0x45d50e8 (max_nunits=4, refcnt=2) vector([4,4]) int
>   op template: _4 = q[_3];
>       stmt 0 _4 = q[_3];
>       stmt 1 _10 = q[_9];
>       stmt 2 _16 = q[_15];
>       stmt 3 _22 = q[_21];
>       load permutation { 4 5 6 7 }
>
> instead of a single contiguous load and two VEC_PERM_EXPR nodes to extract
> the lo/hi parts (which is also extract even/odd, but with a larger mode
> encompassing 4 elements).
>
> I'd say for VLA operation this is one of the major blockers for all-SLP.

I'll take a look if Richard hasn't yet once I finish early break transition :)
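
For anyone following along, the lowering described in the quoted comment (one contiguous load of the whole group, then two permutes picking lanes { 0 1 2 3 } and { 4 5 6 7 }) can be sketched in plain scalar C roughly as below. This is only an illustration, not what the vectorizer emits: the names GROUP, HALF, vec_perm, load_lo_hi and demo_lo_hi are made up for the sketch, and a fixed group size of 8 ints is assumed, so it says nothing about the VLA case.

```c
#include <assert.h>
#include <string.h>

/* Assumed for illustration: a load group of 8 ints, split into two
   4-lane halves, matching the two load permutations in the dump.  */
enum { GROUP = 8, HALF = 4 };

/* Hypothetical scalar stand-in for a single-input VEC_PERM_EXPR:
   dst[i] = src[sel[i]].  */
void vec_perm(int *dst, const int *src, const int *sel, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[sel[i]];
}

/* One contiguous load of the whole group, followed by two permutes
   that extract the lo and hi halves.  */
void load_lo_hi(const int *q, int lo[HALF], int hi[HALF])
{
    static const int sel_lo[HALF] = { 0, 1, 2, 3 };
    static const int sel_hi[HALF] = { 4, 5, 6, 7 };
    int whole[GROUP];

    memcpy(whole, q, sizeof whole);      /* the single contiguous load */
    vec_perm(lo, whole, sel_lo, HALF);   /* lo part */
    vec_perm(hi, whole, sel_hi, HALF);   /* hi part */
}

/* Checks that the halves equal the two separate permuted loads
   q[0..3] and q[4..7]; returns 1 when every lane matches.  */
int demo_lo_hi(void)
{
    int q[GROUP] = { 10, 11, 12, 13, 14, 15, 16, 17 };
    int lo[HALF], hi[HALF];

    load_lo_hi(q, lo, hi);
    for (int i = 0; i < HALF; i++) {
        assert(lo[i] == q[i]);
        assert(hi[i] == q[i + HALF]);
    }
    return 1;
}
```

Viewing the 8 ints as two elements of a 4-int-wide mode, the lo half is element 0 (even) and the hi half is element 1 (odd), which is the "extract even/odd with a larger mode" reading in the quoted text.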