https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
--- Comment #12 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 27 Feb 2024, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
>
> --- Comment #10 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> (In reply to Hongtao Liu from comment #9)
> > The original case is a little different from the one in the PR.
> But the issue is similar: after cunrolli, GCC fails to vectorize the
> outer loop.
>
> The interesting thing is that in estimated_unrolled_size the original
> unr_insns is 288, which is bigger than
> param_max_completely_peeled_insns (200), but unr_insns is then
> decreased by 1/3 due to
>
>    Loop body is likely going to simplify further, this is difficult
>    to guess, we just decrease the result by 1/3.  */
>
> In practice, this loop body is not simplified by a third of its
> instructions.
>
> Considering the unroll factor is 16, the resulting unr_insns is still
> large (192). I was wondering if we could add a heuristic to avoid
> complete unrolling here, because for such a big loop neither the loop
> vectorizer nor the BB vectorizer usually performs well.

There were several attempts at making the unroller guess less (that 1/3
reduction) and instead work out what will actually be simplified, so it
can shrink those numbers reliably (the arithmetic behind the current
estimate is sketched at the end of this comment).

My favorite (but never implemented) idea was to code-generate
optimistically while running value-numbering on-the-fly on the generated
code, cost the simplified unrolled code, and stop when we reach a limit
(scrapping the code accumulated so far). While reasonably "easy" for
unrolled code that ends up without branches, it gets complicated once
branches are involved (a toy model of the idea also follows below). My
most recent attempt at improving this only tracked what unrolling
estimates as ending up constant.

I think what might be the least controversial thing to do is to split
the instruction limit between the early cunrolli and the late cunroll
passes and to lower the limit for cunrolli a lot.
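
For reference, the arithmetic behind that estimate, as a standalone
sketch. It is modeled on estimated_unrolled_size in
tree-ssa-loop-ivcanon.cc but heavily simplified (the real function also
accounts for the last iteration separately), and the 18-insn body size
is just a made-up decomposition of the 288 figure from comment #10:

    #include <cstdio>

    /* Simplified model of estimated_unrolled_size: estimate the size
       of the completely unrolled loop, then assume a third of it will
       simplify away.  Not the exact GCC code.  */
    static unsigned
    estimated_unrolled_size (unsigned body_insns,
                             unsigned eliminated_by_peeling,
                             unsigned nunroll)
    {
      unsigned unr_insns = nunroll * (body_insns - eliminated_by_peeling);
      /* Loop body is likely going to simplify further, this is difficult
         to guess, we just decrease the result by 1/3.  */
      return unr_insns * 2 / 3;
    }

    int
    main ()
    {
      /* The numbers from comment #10: 16 * 18 = 288 raw insns, cut to
         192, which slips under param_max_completely_peeled_insns (200),
         so the loop is completely unrolled even though it barely
         simplifies in practice.  */
      printf ("%u\n", estimated_unrolled_size (18, 0, 16));
      return 0;
    }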
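
And a toy model of the optimistic code-generation idea, just to make
its shape concrete: emit the unrolled copies one statement at a time,
value-number each statement on the fly, charge only the statements the
VN could not fold, and bail out the moment the budget is exceeded.  All
names here (toy_stmt, the string-keyed table) are hypothetical
scaffolding, nothing like the real FRE/VN machinery, and it
deliberately sidesteps the hard part (branches and iteration-dependent
operands):

    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    struct toy_stmt
    {
      std::string lhs;
      std::string rhs;   /* stands in for a hashed RHS expression */
    };

    /* Return true if RHS was already computed (the statement
       value-numbers to an earlier result), otherwise record it as a
       new value.  */
    static bool
    vn_simplifies (const toy_stmt &s,
                   std::map<std::string, std::string> &vn_table)
    {
      if (vn_table.count (s.rhs))
        return true;
      vn_table[s.rhs] = s.lhs;
      return false;
    }

    /* Emit NUNROLL copies of BODY, costing only statements VN could
       not remove; give up as soon as the budget is exceeded.  */
    static bool
    unroll_within_budget (const std::vector<toy_stmt> &body,
                          unsigned nunroll, unsigned budget)
    {
      std::map<std::string, std::string> vn_table;
      unsigned cost = 0;
      for (unsigned i = 0; i < nunroll; ++i)
        for (const toy_stmt &s : body)
          if (!vn_simplifies (s, vn_table) && ++cost > budget)
            return false;  /* scrap the code accumulated so far */
      return true;
    }

    int
    main ()
    {
      /* Invariant statements cost only once; iteration-dependent ones
         would cost on every copy in a real implementation.  */
      std::vector<toy_stmt> body
        = { { "t1", "a + b" }, { "t2", "t1 * 4" } };
      printf ("fits: %d\n", unroll_within_budget (body, 16, 200) ? 1 : 0);
      return 0;
    }

The payoff of costing on the fly is that code generation stops exactly
when the budget is exhausted instead of paying for the full unrolling
and scrapping it afterwards.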