https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115340
--- Comment #5 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 8 Jan 2025, rdapp.gcc at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115340
>
> --- Comment #4 from rdapp.gcc at gmail dot com ---
> > That said - if DR analysis could, say, "force" a particular VF where it
> > knows that gaps are closed we might "virtually" unroll this and thus
> > detect it as a group of contiguous 16 stores. Now we'd need to do the
> > same virtual unrolling for all other stmts of course.
> >
> > I think it would be easier if we'd somehow detect this situation beforehand
> > and actually perform the unrolling - we might want to do it with a
> > if (.LOOP_VECTORIZED (...)) versioning scheme though. I do wonder how
> > common such loops are though.
> >
> > It might be also possible to override cost considerations of early
> > unrolling with -O3 (aka when vectorization is enabled) and when the
> > number of iterations matches the gap of related DRs (but as said, it
> > looks like a very special thing to do).
> >
> > That said - I do plan to change the vectorizer from iterating over modes
> > to iterating over VFs which means we could perform the unrolling implied
> > by the VF on the vectorizer IL (SLP) and (re-)perform group discovery
> > afterwards.
> >
> > For a more general loop we'd essentially apply blocking with the desired
> > VF, unroll that blocking loop and apply BB vectorization.
> >
> > So to make the point - I don't like how handling this special case within
> > the current vectorizer framework pays off with the cost this will have
> > (I'm not sure it's really feasible to add even). Instead this looks
> > like in need of a vectorization enablement pre-transform to me.
>
> OK, sounds reasonable. And yeah, I wouldn't claim this kind of loop is
> common, it's obviously an x264 thing. Perhaps in other codecs but I
> haven't really checked.
> Another thought I had as we already know that SLP handles this more
> gracefully: Would it make sense to "just" defer to BB vectorization and
> have loop vectorization not do anything, provided we could detect the
> pattern with certainty? That would still be special casing the situation
> but potentially less intrusive than "Hail Mary" unrolling.

Yes, I would expect costing to ensure we don't loop vectorize it, but
then we don't (and can't easily IMO) compare loop vectorization to
basic-block vectorization after unrolling cost-wise, so ...