[Bug tree-optimization/88760] GCC unrolling is suboptimal

wilco at gcc dot gnu.org Thu, 24 Jan 2019 05:28:04 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760


--- Comment #19 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to rguent...@suse.de from comment #18)

> > 1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's
> > suggestion)
> 
> If that helps, sure (I'd have guessed uarchs are going to split
> load-multiple into separate loads, but eventually it avoids
> load-port contention?)

Many CPUs execute LDP/STP as a single load/store, e.g. Cortex-A57 executes a
128-bit LDP in a single cycle (see its Optimization Guide).
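
As a rough illustration of why unrolling helps pair formation, here is a
minimal scalar sketch (the function names are made up for this example and
are not taken from any testcase in this bug). Once the loop is unrolled by
two, both loads in the body are adjacent offsets from the same base, which
is the pattern an AArch64 backend can merge into one LDP. Whether a given
compiler version actually does so depends on its options and cost model.

```c
#include <stddef.h>
#include <stdint.h>

/* Rolled form: one 64-bit load per iteration, nothing to pair. */
uint64_t sum_rolled(const uint64_t *a, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 2: a[i] and a[i + 1] are adjacent in memory, so the two
   loads are candidates for a single load-pair instruction.  */
uint64_t sum_unrolled2(const uint64_t *a, size_t n)
{
    uint64_t s = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        uint64_t x = a[i];
        uint64_t y = a[i + 1];
        s += x + y;
    }
    if (i < n)          /* Odd element count: handle the last element. */
        s += a[i];
    return s;
}
```

Unsigned integer addition is associative, so both forms return the same
value for any input.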

> > 2) Unrolling and breaking accumulator dependencies.
> 
> IIRC RTL unrolling can do this (as side-effect, not as main
> cost motivation) guarded with an extra switch.
> 
> > I think more general unrolling and the peeling associated with it can be
> > discussed independently of 1) and 2) once we collect more data on more
> > microarchitectures.
> 
> I think both of these can be "implemented" on the RTL unroller
> side.

You still need dependence analysis, alias info, and ivopts to run again. The
goals are to remove the increment of the index, use efficient addressing modes
(base+imm), and allow scheduling to move instructions between iterations. I
don't believe the RTL unroller supports any of this today.
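
To make the goal concrete, here is a hand-written sketch of the loop shape
I mean, written as source rather than as what any pass emits today. The
function name and the unroll factor of 4 are assumptions chosen for
illustration. Each access is a constant offset from one base pointer
(base+imm), the pointers are bumped once per unrolled body instead of once
per element, and the four accumulators are independent, so the multiply-adds
of one iteration do not wait on each other.

```c
#include <stddef.h>

/* Dot product unrolled by 4 with separate accumulators.
   Note: splitting a floating-point accumulator reassociates the sum, so a
   compiler may only do this itself when reassociation is permitted.  */
double dot_unrolled4(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    const double *end4 = a + (n & ~(size_t)3);   /* Last full group of 4. */

    while (a != end4) {
        s0 += a[0] * b[0];      /* Constant offsets from the base pointers. */
        s1 += a[1] * b[1];
        s2 += a[2] * b[2];
        s3 += a[3] * b[3];
        a += 4;                 /* One pointer update per 4 elements. */
        b += 4;
    }

    double s = (s0 + s1) + (s2 + s3);

    /* Remainder loop for the final n % 4 elements. */
    for (size_t i = 0; i < (n & 3); i++)
        s += a[i] * b[i];
    return s;
}
```

Getting from the rolled source to this form automatically is what needs the
dependence and alias information mentioned above.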
