https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80283

rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #20 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
(In reply to Wilco from comment #12)
> There are 2 separate issues in the ARMv7 case. One is scheduling, the -S
> output goes down from 437 lines to 305 lines with -fno-schedule-insns (stack
> size 276 rather than 448 bytes). So basically the "register pressure aware"
> scheduler introduces lots of unnecessary spills.

This is kind of expected in general, though almost certainly wrong in this
case.  The default "weighted" algorithm tended to overemphasise decreasing
spills (at the cost of decreasing ILP) and slowed down some important
benchmarks for which some spilling was better.  The "model" algorithm was
supposed to be a compromise.

I'll have a look to see whether there's an easy way of handling this case
better without regressing others.  (I'm not assigning myself since it's
unrelated to the x86 problem.)

> The 2nd issue is related to use of single-element operations within vectors.
> If I change the define to do an explicit dup, eg. vmulq_f32((b),
> vdupq_n_f32(a)), I get 211 lines and no spills at all. Switching scheduling
> on again gives 326 lines so it's spilling like crazy.

Yeah, the way arm_neon.h handles vmulq_n_f32 seems to leave lots of
uninitialised pseudo registers, which means that the VFP registers appear to
start the loop almost twice oversubscribed.  Do you know if we have a
separate PR for that?
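
For reference, the two source-level forms being compared look roughly like
the sketch below.  This is only an illustration, not the testcase from the
PR: the function names scale_n, scale_dup and forms_agree are made up here,
and the scalar fallback is an assumption added so that the snippet also
builds on targets without NEON.  Both forms should compute the same lanes;
the difference discussed above is only in how arm_neon.h expands them.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: scale four floats by a scalar using the "_n"
   intrinsic form (plain scalar loop on targets without NEON).  */
void scale_n(float out[4], const float in[4], float a)
{
#if defined(__ARM_NEON)
    vst1q_f32(out, vmulq_n_f32(vld1q_f32(in), a));
#else
    for (size_t i = 0; i < 4; i++)
        out[i] = in[i] * a;
#endif
}

/* Hypothetical helper: same computation, but with the scalar explicitly
   duplicated into a vector before the multiply.  */
void scale_dup(float out[4], const float in[4], float a)
{
#if defined(__ARM_NEON)
    vst1q_f32(out, vmulq_f32(vld1q_f32(in), vdupq_n_f32(a)));
#else
    for (size_t i = 0; i < 4; i++)
        out[i] = in[i] * a;
#endif
}

/* Returns 1 if both forms produce identical lanes, 0 otherwise.  */
int forms_agree(const float in[4], float a)
{
    float x[4], y[4];

    scale_n(x, in, a);
    scale_dup(y, in, a);
    for (size_t i = 0; i < 4; i++)
        if (x[i] != y[i])
            return 0;
    return 1;
}
```

Building each helper with -S, with and without -fno-schedule-insns, is one
way to compare the generated code for the two forms.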
