https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #19 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to rguent...@suse.de from comment #18)
> > 1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's
> > suggestion)
>
> If that helps, sure (I'd have guessed uarchs are going to split
> load-multiple into separate loads, but eventually it avoids
> load-port contention?)

Many CPUs execute LDP/STP as a single load/store, eg. Cortex-A57 executes a
128-bit LDP in a single cycle (see Optimization Guide).

> > 2) Unrolling and breaking accumulator dependencies.
>
> IIRC RTL unrolling can do this (as side-effect, not as main
> cost motivation) guarded with an extra switch.
>
> > I think more general unrolling and the peeling associated with it can be
> > discussed independently of 1) and 2) once we collect more data on more
> > microarchitectures.
>
> I think both of these can be "implemented" on the RTL unroller
> side.

You still need dependence analysis, alias info, ivopt to run again. The goal is
to remove the increment of the index, use efficient addressing modes (base+imm)
and allow scheduling to move instructions between iterations. I don't believe
the RTL unroller supports any of this today.
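
For illustration only, here is a hand-written C sketch of the loop shape described above. It is not output from GCC or from any existing pass. The function names and the unroll factor of 4 are assumptions chosen for the example. The unrolled version bumps the pointer once per four elements, reads each element at a constant offset from that base (the source-level analogue of base+imm addressing), and keeps four independent accumulators so that the adds no longer form a single dependency chain.

```c
#include <stddef.h>

/* Baseline: one accumulator, so each add depends on the previous one,
 * and the index is incremented once per element. */
long sum_simple(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Hand-unrolled by 4 (hypothetical factor, picked for illustration).
 * - One pointer bump per 4 elements instead of one increment per element.
 * - p[0]..p[3] are constant offsets from a single base pointer.
 * - s0..s3 are independent, so a scheduler is free to overlap the adds. */
long sum_unrolled(const int *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    const int *p = a;
    const int *end = a + (n - n % 4);   /* end of the part divisible by 4 */

    for (; p != end; p += 4) {
        s0 += p[0];
        s1 += p[1];
        s2 += p[2];
        s3 += p[3];
    }

    /* Remainder loop for the last n % 4 elements. */
    for (const int *tail = a + n; p != tail; p++)
        s0 += *p;

    return (s0 + s1) + (s2 + s3);
}
```

Splitting the accumulator is safe here because integer addition is associative. For a floating-point reduction the same transformation changes rounding, so GCC would only be allowed to do it under something like -fassociative-math. The adjacent loads p[0]/p[1] and p[2]/p[3] are also the kind of accesses that could be merged into LDP on AArch64, though whether that happens depends on the compiler and target.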