https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #20 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 24 Jan 2019, wilco at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
>
> --- Comment #19 from Wilco <wilco at gcc dot gnu.org> ---
> (In reply to rguent...@suse.de from comment #18)
>
> > > 1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's
> > > suggestion)
> >
> > If that helps, sure (I'd have guessed uarchs are going to split
> > load-multiple into separate loads, but eventually it avoids
> > load-port contention?)
>
> Many CPUs execute LDP/STP as a single load/store, eg. Cortex-A57 executes a
> 128-bit LDP in a single cycle (see Optimization Guide).
>
> > > 2) Unrolling and breaking accumulator dependencies.
> >
> > IIRC RTL unrolling can do this (as side-effect, not as main
> > cost motivation) guarded with an extra switch.
> >
> > > I think more general unrolling and the peeling associated with it can be
> > > discussed independently of 1) and 2) once we collect more data on more
> > > microarchitectures.
> >
> > I think both of these can be "implemented" on the RTL unroller
> > side.
>
> You still need dependence analysis, alias info, ivopt to run again. The goal is
> to remove the increment of the index, use efficient addressing modes (base+imm)
> and allow scheduling to move instructions between iterations. I don't believe
> the RTL unroller supports any of this today.

There's no way to encode load-multiple on GIMPLE that wouldn't be
awkward to other GIMPLE optimizers.

Yes, the RTL unroller supports scheduling (sched runs after unrolling)
and scheduling can do dependence analysis.  Yes, the RTL unroller
does _not_ do dependence analysis at the moment, so if you want to
know beforehand whether you can interleave iterations you have to
actually perform dependence analysis.  Of course you can do that
on RTL.  And of course you can do IVOPTs on RTL (yes, we don't do
that at the moment).

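
For illustration only, here is a hand-written C sketch of the source-level effect being discussed in item 1): unrolling by two so that the two loads in the body are adjacent (base+0 and base+8 for 8-byte elements) and the index is incremented once per pair. The function names `sum_rolled` and `sum_unrolled2` are made up for this example and do not come from the bug report. Whether a compiler actually forms an LDP from the unrolled body depends on the target, the options, and the passes involved, so treat this as a picture of the intent rather than of what GCC does.

```c
#include <stddef.h>

/* Rolled loop: one load and one index increment per element. */
long sum_rolled(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Hand-unrolled by two (hypothetical example, not GCC output).
   a[i] and a[i + 1] are adjacent in memory, so on AArch64 they are
   candidates for a single load-pair, and the index is incremented
   only once per two elements. */
long sum_unrolled2(const long *a, size_t n)
{
    long s = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        long x = a[i];      /* base + 0 */
        long y = a[i + 1];  /* base + 8 for 8-byte long */
        s += x;
        s += y;
    }
    if (i < n)              /* odd element left over after the pairs */
        s += a[i];
    return s;
}
```

Both functions add the elements in the same order, so they return the same value for any input; only the loop shape differs.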
Note I'm not opposed to having IVOPTs on GIMPLE itself perform unrolling (I know Bin was against this given IVOPTs is already so complex) and do accumulator breakup. But I don't see how to do the load-multiple thing there (yes, you could represent it as a vector load plus N element extracts on GIMPLE, and it would be easy to teach SLP vectorization to perform this transform on its own if it is really profitable, but I doubt you can reasonably argue profitability before RA, let alone on GIMPLE).
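
As a concrete picture of what "accumulator breakup" means at the source level, here is a minimal C sketch. The names `dot_one_acc` and `dot_two_acc` are hypothetical and chosen only for this example. The second version splits the single accumulator into two independent ones, so the additions of consecutive iterations no longer form one serial dependency chain. For floating point this changes the order of the additions, which is why a compiler can only do it on its own when reassociation is permitted (for example under GCC's `-fassociative-math`).

```c
#include <stddef.h>

/* Single accumulator: every addition depends on the previous one. */
double dot_one_acc(const double *a, const double *b, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Unrolled by two with the accumulator split (hypothetical example).
   acc0 and acc1 form two independent dependency chains that are
   combined once after the loop. */
double dot_two_acc(const double *a, const double *b, size_t n)
{
    double acc0 = 0.0, acc1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += a[i] * b[i];
        acc1 += a[i + 1] * b[i + 1];
    }
    if (i < n)                    /* leftover element for odd n */
        acc0 += a[i] * b[i];
    return acc0 + acc1;
}
```

With inputs whose partial sums are exactly representable the two versions agree bit for bit; in general the floating-point results may differ in the last bits because the summation order is different.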