https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80283
--- Comment #22 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> --- (In reply to rsand...@gcc.gnu.org from comment #21) > Created attachment 43646 [details] > Patch to reduce spills for Armv7 > > (In reply to rsand...@gcc.gnu.org from comment #20) > > (In reply to Wilco from comment #12) > > > There are 2 separate issues in the ARMv7 case. One is scheduling, the -S > > > output goes down from 437 lines to 305 lines with -fno-schedule-insns > > > (stack > > > size 276 rather than 448 bytes). So basically the "register pressure > > > aware" > > > scheduler introduces lots of unnecessary spills. > > > > This is kind-of expected in general, though almost certainly wrong in this > > case. The default "weighted" algorithm tended to overemphasise decreasing > > spills (at the cost of decreasing ILP) and slowed down some important > > benchmarks for which some spilling was better. The "model" algorithm was > > supposed to be a compromise. > > > > I'll have a look to see whether there's an easy way of handling this case > > better without regressing others. (I'm not assigning myself since it's > > unrelated to the x86 problem.) > > SCHED_PRESSURE_MODEL first tries to create a "model" schedule > that keeps register down as far as possible and then uses that > to guide the "real" schedule. It looks like the model schedule > goes catastrophically wrong in this case though: the original > order had a VFP_REGS pressure of 56 (against a capacity of 64) > while the model schedule had a pressure of 76(!). > > I think the problem is that the algorithm was tuned on load/store > style loops, where it was beneficial to keep the model schedule > narrow and try to reach the eventual store (so killing off a > whole chain). It doesn't cope well with so many accumulators, > where completing the chain never leads to a reduction in pressure. > > The attached patch is a proof of concept that tries to handle > this kind of situation better. The model schedule now gives > a VFP_REGS pressure of 52 instead of 76, which is 4 below the > unscheduled code. I'll try to give it more wider testing when > I have time. > > Although the patch removes some of the spills, the scheduler > still thinks that it's better to keep others. And in that > sense it's working as intended, since as far as GCC's view > of the pipeline is concerned, the spills give faster code. > > This can be seen by grepping for "total time" in the sched2 > dumps, which include the effect of all the spill code. > The times for the inner loop in this test are: > > 307 cycles for the unpatched compiler (most spills) > 355 cycles for the patched compiler (some spills) > 398 cycles with -fno-schedule-insns (no spills) > > These were all with "-mcpu=cortex-a15 -O2" but the > results are similar with other -mcpu options. > > So on GCC's own terms, using its model of the CPU, > the current mega-spill code seems like a 25% win over > the spill-free code. That's probably not true in practice, > but the scheduler can only work within the description > it's given. Sorry, forgot to say that all the above was with Wilco's vdup_n_f32 modification, to work around the arm_neon.h problem.