https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115340

--- Comment #8 from Robin Dapp <rdapp at gcc dot gnu.org> ---
I went with your approach and performed some local testing.
What I did is add another "unrolling type" in cunrolli, UL_FOR_GAPS, and split
it off as a third cunrolli invocation.
Right now it analyses the loop for gaps and completely unrolls if the
preconditions are met.

For the loop I mentioned above this works, and it also appears to work fairly
well for the original x264 loop.  However, the code is as expected only with
512b vectors.  With e.g. 256b vectors the resulting vector code seems worse to
me.

I haven't done any further analysis or benchmarking but at first sight it looks
like non-loop SLP will produce more complicated (and less efficient) code than
when loop-vectorizing the non-unrolled loop.  I guess that's at least partially
expected, but it of course complicates "costing" of the approach.

One idea was to only allow "gap unrolling" if the data of the completely
unrolled loop fits in one vector register/mode (like the 16 ints here).  My worry
would be that this would make it very specific, even more so than before, and
barely ever trigger.
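
As a rough illustration of that precondition, here is a minimal sketch.  It is
not the actual cunrolli code: fits_one_vector and its parameters are
hypothetical names, and it assumes 32-bit ints, so that 16 ints occupy exactly
512 bits.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of GCC: return true if all the data
   touched by the completely unrolled loop fits in one vector register.
   n_elems     - number of elements accessed by the unrolled loop
   elem_bits   - width of one element in bits (assumed 32 for int)
   vector_bits - width of the target vector register/mode in bits  */
static bool
fits_one_vector (size_t n_elems, size_t elem_bits, size_t vector_bits)
{
  return n_elems * elem_bits <= vector_bits;
}
```

Under these assumptions, 16 ints pass the check for 512b vectors but fail it
for 256b vectors, which is the case where the unrolled code looked worse.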
