https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115340
--- Comment #8 from Robin Dapp <rdapp at gcc dot gnu.org> ---
I went with your approach and performed some local testing. What I did was add another "unrolling type" in cunrolli, UL_FOR_GAPS, and split it off as a third cunrolli invocation. Right now it analyses the loop for gaps and completely unrolls it if the preconditions are met.

For the loop I mentioned above this works, and it also appears to work fairly well for the original x264 loop. However, the code is as expected only with 512b vectors. With e.g. 256b vectors the resulting vector code seems worse to me. I haven't done any further analysis or benchmarking, but at first sight it looks like non-loop SLP produces more complicated (and less efficient) code than loop-vectorizing the non-unrolled loop does. I guess that's at least partially expected, but of course it complicates "costing" of the approach.

One idea was to only allow "gap unrolling" if the data of the completely unrolled loop fits one vector register/mode (like here 16 ints). My worry is that this would make it very specific, even more so than before, and barely ever trigger.
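
To make the "fits one vector register" idea concrete, here is a minimal sketch of the size check it implies. The helper name `fits_one_vector` and its parameters are hypothetical and purely illustrative. This is not code from cunrolli or from any actual patch, and a real implementation would presumably query the target's vector modes rather than take a raw bit width.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, NOT a GCC function: returns true if all elements
   accessed by the completely unrolled loop fit into a single vector
   register of VECTOR_BITS bits.  */
static bool
fits_one_vector (size_t total_elems, size_t elem_bits, size_t vector_bits)
{
  /* Total data size of the unrolled loop body, in bits.  */
  size_t data_bits = total_elems * elem_bits;
  return data_bits <= vector_bits;
}
```

Assuming 32-bit ints, the 16 ints mentioned above occupy 16 * 32 = 512 bits, so `fits_one_vector (16, 32, 512)` holds while `fits_one_vector (16, 32, 256)` does not, which lines up with the 512b case being the one that gives the expected code.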