SLP vectorization possible inefficiency

rguenth at gcc dot gnu.org via Gcc-bugs Wed, 08 Jan 2025 02:26:18 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115340


--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
The issue is that when we treat this as a group, the same group in the next
iteration will overlap with it - this isn't something we support (we'd have
to alter dependence analysis to consider overlap with gaps as no overlap).

It's really a hard problem and much easier to BB vectorize when unrolled.

That said - if DR analysis could, say, "force" a particular VF where it
knows that gaps are closed we might "virtually" unroll this and thus
detect it as a group of contiguous 16 stores.  Now we'd need to do the
same virtual unrolling for all other stmts of course.
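
To illustrate the "gaps are closed" case, here is a minimal sketch.  The
kernel is hypothetical (it is not the testcase from the PR) and the names
store_gapped and store_unrolled are made up for illustration.  Each
iteration stores four elements that are 4 apart, so the group of the next
iteration lands in the gaps of the current one; unrolling by 4 and sorting
the stores by address gives 16 contiguous stores.

```c
/* Hypothetical kernel, NOT the testcase from the PR.  Iteration i
   stores to out[i], out[i+4], out[i+8] and out[i+12], so the store
   group of iteration i+1 falls into the gaps left by iteration i.  */
void
store_gapped (int *restrict out, const int *restrict in)
{
  for (int i = 0; i < 4; i++)
    {
      out[i]      = in[i] + 1;
      out[i + 4]  = in[i] + 2;
      out[i + 8]  = in[i] + 3;
      out[i + 12] = in[i] + 4;
    }
}

/* The same computation unrolled by 4 by hand, with the stores sorted
   by address.  The gaps are closed: out[0..15] is written as one
   contiguous run of 16 stores.  Reordering is valid here only because
   'restrict' promises that in and out do not alias.  */
void
store_unrolled (int *restrict out, const int *restrict in)
{
  out[0]  = in[0] + 1;  out[1]  = in[1] + 1;
  out[2]  = in[2] + 1;  out[3]  = in[3] + 1;
  out[4]  = in[0] + 2;  out[5]  = in[1] + 2;
  out[6]  = in[2] + 2;  out[7]  = in[3] + 2;
  out[8]  = in[0] + 3;  out[9]  = in[1] + 3;
  out[10] = in[2] + 3;  out[11] = in[3] + 3;
  out[12] = in[0] + 4;  out[13] = in[1] + 4;
  out[14] = in[2] + 4;  out[15] = in[3] + 4;
}
```

Both functions write the same 16 values; the second form is just the
straight-line shape that group discovery would see after unrolling.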

I think it would be easier if we'd somehow detect this situation beforehand
and actually perform the unrolling - we might want to do it with an
if (.LOOP_VECTORIZED (...)) versioning scheme though.  I do wonder how
common such loops are.

It might also be possible to override the cost considerations of early
unrolling at -O3 (i.e. when vectorization is enabled) when the number of
iterations matches the gap of related DRs (but as said, it looks like a
very special thing to do).

That said - I do plan to change the vectorizer from iterating over modes
to iterating over VFs which means we could perform the unrolling implied
by the VF on the vectorizer IL (SLP) and (re-)perform group discovery
afterwards.

For a more general loop we'd essentially apply blocking with the desired
VF, unroll that blocking loop and apply BB vectorization.
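
As a rough source-level sketch of that blocking step: the kernel below is a
toy one chosen only to show the loop structure, and VF = 4, add_plain and
add_blocked are assumptions for illustration, not anything the vectorizer
defines.  The loop is strip-mined by VF, the block body is fully unrolled
into straight-line code, and a scalar epilogue handles the remainder.

```c
#include <stddef.h>

enum { VF = 4 };  /* assumed vectorization factor, for illustration */

/* Reference form: the plain loop.  */
void
add_plain (int *restrict out, const int *restrict in, size_t n)
{
  for (size_t i = 0; i < n; i++)
    out[i] = in[i] + 1;
}

/* Blocked form: the outer loop steps by VF and the VF iterations of
   each block are written out as straight-line code, which is the
   shape basic-block vectorization works on.  */
void
add_blocked (int *restrict out, const int *restrict in, size_t n)
{
  size_t i = 0;

  /* Blocking loop over full blocks of VF iterations.  */
  for (; i + VF <= n; i += VF)
    {
      out[i + 0] = in[i + 0] + 1;
      out[i + 1] = in[i + 1] + 1;
      out[i + 2] = in[i + 2] + 1;
      out[i + 3] = in[i + 3] + 1;
    }

  /* Scalar epilogue for the remaining n % VF iterations.  */
  for (; i < n; i++)
    out[i] = in[i] + 1;
}
```

For any n the two functions produce the same result; only the loop
structure differs.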

So to make the point - I don't think handling this special case within
the current vectorizer framework pays off given the cost this will have
(I'm not sure it's even feasible to add).  Instead, this looks to me
like it needs a vectorization-enabling pre-transform.

[Bug tree-optimization/115340] Loop/SLP vectorization possible inefficiency