8 Regression] bad SIMD register allocation

rsandifo at gcc dot gnu.org Tue, 13 Mar 2018 15:27:06 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80283


--- Comment #22 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> 
---
(In reply to rsand...@gcc.gnu.org from comment #21)
> Created attachment 43646 [details]
> Patch to reduce spills for Armv7
> 
> (In reply to rsand...@gcc.gnu.org from comment #20)
> > (In reply to Wilco from comment #12)
> > > There are 2 separate issues in the ARMv7 case. One is scheduling, the -S
> > > output goes down from 437 lines to 305 lines with -fno-schedule-insns 
> > > (stack
> > > size 276 rather than 448 bytes). So basically the "register pressure 
> > > aware"
> > > scheduler introduces lots of unnecessary spills.
> > 
> > This is kind-of expected in general, though almost certainly wrong in this
> > case.  The default "weighted" algorithm tended to overemphasise decreasing
> > spills (at the cost of decreasing ILP) and slowed down some important
> > benchmarks for which some spilling was better.  The "model" algorithm was
> > supposed to be a compromise.
> > 
> > I'll have a look to see whether there's an easy way of handling this case
> > better without regressing others.  (I'm not assigning myself since it's
> > unrelated to the x86 problem.)
> 
> SCHED_PRESSURE_MODEL first tries to create a "model" schedule
> that keeps register down as far as possible and then uses that
> to guide the "real" schedule.  It looks like the model schedule
> goes catastrophically wrong in this case though: the original
> order had a VFP_REGS pressure of 56 (against a capacity of 64)
> while the model schedule had a pressure of 76(!).
> 
> I think the problem is that the algorithm was tuned on load/store
> style loops, where it was beneficial to keep the model schedule
> narrow and try to reach the eventual store (so killing off a
> whole chain).  It doesn't cope well with so many accumulators,
> where completing the chain never leads to a reduction in pressure.
> 
> The attached patch is a proof of concept that tries to handle
> this kind of situation better.  The model schedule now gives
> a VFP_REGS pressure of 52 instead of 76, which is 4 below the
> unscheduled code.  I'll try to give it more wider testing when
> I have time.
> 
> Although the patch removes some of the spills, the scheduler
> still thinks that it's better to keep others.  And in that
> sense it's working as intended, since as far as GCC's view
> of the pipeline is concerned, the spills give faster code.
> 
> This can be seen by grepping for "total time" in the sched2
> dumps, which include the effect of all the spill code.
> The times for the inner loop in this test are:
> 
> 307 cycles for the unpatched compiler (most spills)
> 355 cycles for the patched compiler (some spills)
> 398 cycles with -fno-schedule-insns (no spills)
> 
> These were all with "-mcpu=cortex-a15 -O2" but the
> results are similar with other -mcpu options.
> 
> So on GCC's own terms, using its model of the CPU,
> the current mega-spill code seems like a 25% win over
> the spill-free code.  That's probably not true in practice,
> but the scheduler can only work within the description
> it's given.

Sorry, forgot to say that all the above was with Wilco's
vdup_n_f32 modification, to work around the arm_neon.h problem.

[Bug middle-end/80283] [6/7/8 Regression] bad SIMD register allocation

Reply via email to