[Bug tree-optimization/88760] GCC unrolling is suboptimal

wilco at gcc dot gnu.org Thu, 24 Jan 2019 06:18:35 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760


--- Comment #21 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to rguent...@suse.de from comment #20)
> On Thu, 24 Jan 2019, wilco at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> > 
> > --- Comment #19 from Wilco <wilco at gcc dot gnu.org> ---
> > (In reply to rguent...@suse.de from comment #18)
> > 
> > > > 1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's
> > > > suggestion)
> > > 
> > > If that helps, sure (I'd have guessed uarchs are going to split
> > > load-multiple into separate loads, but eventually it avoids
> > > load-port contention?)
> > 
> > Many CPUs execute LDP/STP as a single load/store, eg. Cortex-A57 executes a
> > 128-bit LDP in a single cycle (see Optimization Guide).
> > 
> > > > 2) Unrolling and breaking accumulator dependencies.
> > > 
> > > IIRC RTL unrolling can do this (as side-effect, not as main
> > > cost motivation) guarded with an extra switch.
> > > 
> > > > I think more general unrolling and the peeling associated with it can be
> > > > discussed independently of 1) and 2) once we collect more data on more
> > > > microarchitectures.
> > > 
> > > I think both of these can be "implemented" on the RTL unroller
> > > side.
> > 
> > You still need dependence analysis, alias info, ivopt to run again. The
> > goal is to remove the increment of the index, use efficient addressing
> > modes (base+imm) and allow scheduling to move instructions between
> > iterations. I don't believe the RTL unroller supports any of this today.
> 
> There's no way to encode load-multiple on GIMPLE that wouldn't be
> awkward to other GIMPLE optimizers.

I don't think anyone wants LDP/STP directly in GIMPLE - that doesn't seem
useful. We don't even form LDP until quite late in RTL. The key to forming
LDP/STP is using base+imm addressing modes and having correct alias info (so
loads/stores from different iterations can be interleaved and then combined
into LDP/STP). The main thing a backend would need to do is tune address costs
to take future LDP formation into account (and yes, the existing cost models
need to be improved anyway).

> Yes, the RTL unroller supports scheduling (sched runs after unrolling)
> and scheduling can do dependence analysis.  Yes, the RTL unroller
> does _not_ do dependence analysis at the moment, so if you want to
> know beforehand whether you can interleave iterations you have to
> actually perform dependence analysis.  Of course you can do that
> on RTL.  And of course you can do IVOPTs on RTL (yes, we don't do that
> at the moment).

Sure, we *could* duplicate all high-level loop optimizations to work on RTL.
However, is that worth the effort given that we already have them at tree level?

> Note I'm not opposed to having IVOPTs on GIMPLE itself perform
> unrolling (I know Bin was against this given IVOPTs is already
> so complex) and do accumulator breakup.  But I don't see how
> to do the load-multiple thing (yes, you could represent it
> as a vector load plus N element extracts on GIMPLE and it
> would be easy to teach SLP vectorization to perform this
> transform on its own if it is really profitable - which I
> doubt you can reasonably argue before RA, let alone on GIMPLE).

Let's forget about load-multiple in GIMPLE. Kyrill's example shows that
unrolling at the high level means the existing loop optimizations and analysis
work as expected and we end up with good addressing modes, LDPs and
interleaving of different iterations. With the existing RTL unroller this just
isn't happening.
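
For reference, point 2) quoted above (unrolling and breaking accumulator
dependencies) can be sketched by hand as below. This is an assumed, simplified
example rather than anything taken from the bug report: the function names and
the choice of two accumulators are hypothetical. It uses unsigned integers so
that the reassociation is exact; for floating point the same transformation
changes rounding and so needs the user to permit reassociation.

```c
#include <stddef.h>

/* Single accumulator: every add depends on the previous one, so the
   loop is limited by the latency of the add chain. */
unsigned long sum_single(const unsigned long *a, size_t n)
{
    unsigned long acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += a[i];
    return acc;
}

/* Unrolled by 2 with two independent accumulators (factor chosen for
   illustration only): the two add chains do not depend on each other,
   so they can be scheduled in parallel. */
unsigned long sum_split2(const unsigned long *a, size_t n)
{
    unsigned long acc0 = 0, acc1 = 0;
    size_t i = 0;

    for (; i + 2 <= n; i += 2) {
        acc0 += a[i];
        acc1 += a[i + 1];
    }

    /* At most one leftover element when n is odd. */
    if (i < n)
        acc0 += a[i];

    /* Combine the partial sums once, after the loop. */
    return acc0 + acc1;
}
```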
