https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
--- Comment #12 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 27 Feb 2024, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
>
> --- Comment #10 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> (In reply to Hongtao Liu from comment #9)
> > The original case is a little different from the one in the PR.
> But the issue is similar: after cunrolli, GCC fails to vectorize the
> outer loop.
>
> The interesting thing is that in estimated_unrolled_size the original
> unr_insns is 288, which is bigger than
> param_max_completely_peeled_insns (200), but unr_insns is then
> decreased by 1/3 due to
>
>    Loop body is likely going to simplify further, this is difficult
>    to guess, we just decrease the result by 1/3.  */
>
> In practice, this loop body is not simplified by a third of its
> instructions.
>
> Considering the unroll factor is 16, the resulting unr_insns is still
> large (192). I was wondering if we could add a heuristic to avoid
> complete unrolling here, because for such a big loop neither the loop
> vectorizer nor the BB vectorizer usually performs well.

There were several attempts at making the unroller guess less (that 1/3
reduction) and instead work out what will actually be simplified, so it
can shrink those numbers reliably (the arithmetic behind the current
estimate is sketched at the end of this comment).

My favorite (but never implemented) idea was to code-generate
optimistically while running value-numbering on-the-fly on the generated
code, cost the simplified unrolled code, and stop when we reach a limit
(scrapping the code accumulated so far). While reasonably "easy" for
unrolled code that ends up without branches, it gets complicated once
branches are involved (a toy model of the idea also follows below). My
most recent attempt at improving this only tracked what unrolling
estimates as ending up constant.

I think what might be the least controversial thing to do is to split
the instruction limit between the early cunrolli and the late cunroll
passes and to lower the limit for cunrolli a lot.
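
For reference, the arithmetic behind that estimate, as a standalone
sketch. It is modeled on estimated_unrolled_size in
tree-ssa-loop-ivcanon.cc but heavily simplified (the real function also
accounts for the last iteration separately), and the 18-insn body size
is just a made-up decomposition of the 288 figure from comment #10:

    #include <cstdio>

    /* Simplified model of estimated_unrolled_size: estimate the size
       of the completely unrolled loop, then assume a third of it will
       simplify away.  Not the exact GCC code.  */
    static unsigned
    estimated_unrolled_size (unsigned body_insns,
                             unsigned eliminated_by_peeling,
                             unsigned nunroll)
    {
      unsigned unr_insns = nunroll * (body_insns - eliminated_by_peeling);
      /* Loop body is likely going to simplify further, this is difficult
         to guess, we just decrease the result by 1/3.  */
      return unr_insns * 2 / 3;
    }

    int
    main ()
    {
      /* The numbers from comment #10: 16 * 18 = 288 raw insns, cut to
         192, which slips under param_max_completely_peeled_insns (200),
         so the loop is completely unrolled even though it barely
         simplifies in practice.  */
      printf ("%u\n", estimated_unrolled_size (18, 0, 16));
      return 0;
    }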
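
And a toy model of the optimistic code-generation idea, just to make
its shape concrete: emit the unrolled copies one statement at a time,
value-number each statement on the fly, charge only the statements the
VN could not fold, and bail out the moment the budget is exceeded.  All
names here (toy_stmt, the string-keyed table) are hypothetical
scaffolding, nothing like the real FRE/VN machinery, and it
deliberately sidesteps the hard part (branches and iteration-dependent
operands):

    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    struct toy_stmt
    {
      std::string lhs;
      std::string rhs;   /* stands in for a hashed RHS expression */
    };

    /* Return true if RHS was already computed (the statement
       value-numbers to an earlier result), otherwise record it as a
       new value.  */
    static bool
    vn_simplifies (const toy_stmt &s,
                   std::map<std::string, std::string> &vn_table)
    {
      if (vn_table.count (s.rhs))
        return true;
      vn_table[s.rhs] = s.lhs;
      return false;
    }

    /* Emit NUNROLL copies of BODY, costing only statements VN could
       not remove; give up as soon as the budget is exceeded.  */
    static bool
    unroll_within_budget (const std::vector<toy_stmt> &body,
                          unsigned nunroll, unsigned budget)
    {
      std::map<std::string, std::string> vn_table;
      unsigned cost = 0;
      for (unsigned i = 0; i < nunroll; ++i)
        for (const toy_stmt &s : body)
          if (!vn_simplifies (s, vn_table) && ++cost > budget)
            return false;  /* scrap the code accumulated so far */
      return true;
    }

    int
    main ()
    {
      /* Invariant statements cost only once; iteration-dependent ones
         would cost on every copy in a real implementation.  */
      std::vector<toy_stmt> body
        = { { "t1", "a + b" }, { "t2", "t1 * 4" } };
      printf ("fits: %d\n", unroll_within_budget (body, 16, 200) ? 1 : 0);
      return 0;
    }

The payoff of costing on the fly is that code generation stops exactly
when the budget is exhausted instead of paying for the full unrolling
and scrapping it afterwards.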