https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115340
--- Comment #5 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 8 Jan 2025, rdapp.gcc at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115340
>
> --- Comment #4 from rdapp.gcc at gmail dot com ---
> > That said - if DR analysis could, say, "force" a particular VF where it
> > knows that gaps are closed we might "virtually" unroll this and thus
> > detect it as a group of contiguous 16 stores. Now we'd need to do the
> > same virtual unrolling for all other stmts of course.
> >
> > I think it would be easier if we'd somehow detect this situation beforehand
> > and actually perform the unrolling - we might want to do it with a
> > if (.LOOP_VECTORIZED (...)) versioning scheme though. I do wonder how
> > common such loops are though.
> >
> > It might be also possible to override cost considerations of early
> > unrolling with -O3 (aka when vectorization is enabled) and when the
> > number of iterations matches the gap of related DRs (but as said, it
> > looks like a very special thing to do).
> >
> > That said - I do plan to change the vectorizer from iterating over modes
> > to iterating over VFs which means we could perform the unrolling implied
> > by the VF on the vectorizer IL (SLP) and (re-)perform group discovery
> > afterwards.
> >
> > For a more general loop we'd essentially apply blocking with the desired
> > VF, unroll that blocking loop and apply BB vectorization.
> >
> > So to make the point - I don't like how handling this special case within
> > the current vectorizer framework pays off with the cost this will have
> > (I'm not sure it's really feasible to add even). Instead this looks
> > like in need of a vectorization enablement pre-transform to me.
>
> OK, sounds reasonable. And yeah, I wouldn't claim this kind of loop is
> common, it's obviously an x264 thing. Perhaps in other codecs but I
> haven't really checked.
> Another thought I had as we already know that SLP handles this more
> gracefully: Would it make sense to "just" defer to BB vectorization and
> have loop vectorization not do anything, provided we could detect the
> pattern with certainty? That would still be special casing the situation
> but potentially less intrusive than "Hail Mary" unrolling.

Yes, I would expect costing to ensure we don't loop vectorize it, but
then we don't (and can't easily IMO) compare loop vectorization to
basic-block vectorization after unrolling cost-wise, so ...