On Fri, Dec 23, 2011 at 12:46 PM, Richard Sandiford <richard.sandif...@linaro.org> wrote:
> So it looks like two pieces of work related to scheduling and register
> pressure are being posted close together. This one is an RFC for a less
> aggressive form of -fsched-pressure. I think it should complement
> rather than compete with Bernd's IRA patch. It seems like a good idea
> to take register pressure into account during the first scheduling pass,
> where we can still easily look at things like instruction latencies
> and pipeline utilisation. Better rematerialisation in the register
> allocator would obviously be a good thing too though.
>
> This patch started when we (Linaro) saw a big drop in performance
> from vectorising an RGB to YIQ filter on ARM. The first scheduling
> pass was overly aggressive in creating a "wide" schedule, and caused
> the newly-vectorised loop to contain lots of spills. The loop grew so
> big that it even had a constant pool in the middle of it.
>
> -fsched-pressure did a very good job on this loop, creating far fewer
> spills and consequently avoiding the constant pool. However, it seemed
> to make several other cases significantly worse. The idea was therefore
> to try to come up with a form of -fsched-pressure that could be turned
> on for ARM by default.
>
> Current -fsched-pressure works by assigning an "excess (pressure) cost change"
> to each instruction; here I'll write that as ECC(X). -fsched-pressure also
> changes the way that the main list scheduler handles stalls due to data
> dependencies. If an instruction would stall for N cycles, the scheduler
> would normally add it to the "now+N" queue, then add it to the ready queue
> after N cycles. With -fsched-pressure, it instead adds the instruction
> to the ready queue immediately, while still recording that the instruction
> would require N stalls. I'll write the number of stalls on X as delay(X).
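For reference, the queueing difference described here can be sketched roughly as follows. The structures and names are invented for illustration only; GCC's real implementation in haifa-sched.c looks quite different:

```c
#include <assert.h>

#define MAX_INSNS 16

struct insn {
  int id;
  int delay;  /* recorded stall cycles, delay(X) */
};

struct sched_state {
  struct insn ready[MAX_INSNS];   /* the ready queue */
  int n_ready;
  struct insn queued[MAX_INSNS];  /* stand-in for the "now+N" queues */
  int n_queued;
};

/* Add an instruction that would stall for N_STALLS cycles.  Without
   -fsched-pressure it waits out the stall in a "now+N" queue; with
   -fsched-pressure it becomes ready immediately, but with
   delay(X) = N_STALLS recorded for later use in ranking.  */
static void
add_insn (struct sched_state *s, int id, int n_stalls, int sched_pressure)
{
  struct insn x = { id, n_stalls };
  if (n_stalls > 0 && !sched_pressure)
    s->queued[s->n_queued++] = x;
  else
    s->ready[s->n_ready++] = x;
}
```

The point of the second policy is that a stalled instruction stays visible to the ranking heuristics, so a deliberate stall can be traded off against a pressure increase rather than being decided implicitly by the queueing.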
>
> This arrangement allows the scheduler to choose between increasing register
> pressure and introducing a deliberate stall. Instructions are ranked by:
>
> (a) lowest ECC(X) + delay(X)
> (b) lowest delay(X)
> (c) normal list-scheduler ranking (based on things like INSN_PRIORITY)
>
> Note that since delay(X) is measured in cycles, ECC(X) is effectively
> measured in cycles too.
>
> Several things seemed to be causing the degradations we were seeing
> with -fsched-pressure:
>
> (1) The -fsched-pressure schedule is based purely on instruction latencies
>     and issue rate; it doesn't take the DFA into account. This means that
>     we attempt to "dual issue" things like vector operations, loads and
>     stores on Cortex A8 and A9. In the examples I looked at, these sorts
>     of inaccuracy seemed to accumulate, so that the value of delay(X)
>     became based on slightly unrealistic cycle times.
>
>     Note that this also affects code that doesn't have any pressure
>     problems; it isn't limited to code that does.
>
>     This may simply be historical. It became much easier to use the
>     DFA here after Bernd's introduction of prune_ready_list, but the
>     original -fsched-pressure predates that.
>
> (2) We calculate ECC(X) by walking the unscheduled part of the block
>     in its original order, then recording the pressure at each instruction.
>     This seemed to make ECC(X) quite sensitive to that original order.
>     I saw blocks that started out naturally "narrow" (not much ILP,
>     e.g. from unrolled loops) and others that started naturally "wide"
>     (a lot of ILP, such as in the libav h264 code), and walking the
>     block in order meant that the two styles would be handled differently.
>
> (3) When calculating the pressure of the original block (as described
>     in (2)), we ignore the deaths of registers that are used by more
>     than one unscheduled instruction. This tended to hurt long(ish)
>     loops in things like filters, where the same value is often used
>     as an input to two calculations.
>     The effect was that instructions
>     towards the end of the block would appear to have very high pressure.
>     This in turn made the algorithm very conservative; it wouldn't
>     promote instructions from later in the block because those
>     instructions seemed to have a prohibitively large cost.
>
>     I asked Vlad about this, and he confirmed that it was a deliberate
>     decision. He'd tried honouring REG_DEAD notes instead, but it
>     produced worse results on x86. I'll return to this at the end.
>
> (4) ECC(X) is based on the pressure over and above ira_available_class_regs
>     (the number of allocatable registers in a given class). ARM has 14
>     allocatable GENERAL_REGS: 16 minus the stack pointer and program
>     counter. So if 14 integer variables are live across a loop but
>     not referenced within it, we end up scheduling that loop in a context
>     of permanent pressure. Pressure becomes the overriding concern,
>     and we don't get much ILP.
>
>     I suppose there are at least two ways of viewing this:
>
>     (4a) We're giving an equal weight to:
>
>          (Ra) registers that are live across a loop but not used within it
>               (and which should therefore require no spill code in the
>               loop itself)
>
>          (Rb) registers that hold loop invariants (and so should only
>               require loads in the loop itself)
>
>          (Rc) registers that are used and set within the loop (and so would
>               require both loads and stores in the loop itself)
>
>          We effectively treat everything as (Rc).
>
>     (4b) We always work to ira_available_class_regs, rather than to
>          an estimate of what maximum pressure is realistically achievable.
>          If a block has something like:
>
>            A: R2 = *R1
>            ...
>            B: R3 = *(R1 + 4)
>            ...
>            C: R1 = R1 + 8
>            ...
>            D: R4 = R2 + R3
>            ...
>
>          then it can't help but have three registers live at once.
>
> The first thing I tried was to change just (4a) in isolation.
> Taking (Ra) out of the pressure estimation produced some benefits,
> but not enough.
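The (Ra)/(Rb)/(Rc) distinction in (4a) amounts to a simple classification. A sketch, with invented predicates (deciding them for real would of course need the loop's def-use information):

```c
#include <assert.h>

enum reg_kind {
  REG_RA,  /* live across the loop, untouched inside: no spill code needed */
  REG_RB,  /* loop invariant: spilling costs only reloads inside the loop */
  REG_RC   /* set and used inside: spilling costs both loads and stores */
};

/* Classify a register with respect to a given loop.  USED_IN_LOOP and
   SET_IN_LOOP are hypothetical precomputed flags.  */
static enum reg_kind
classify_reg (int used_in_loop, int set_in_loop)
{
  if (!used_in_loop && !set_in_loop)
    return REG_RA;
  if (!set_in_loop)
    return REG_RB;
  return REG_RC;
}
```

The complaint in (4a) is then that the current cost model behaves as if classify_reg always returned REG_RC.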
> I then tried a "staging" of costs, so that:
>
> (Ra) had the cost of a load and store in the outer loop (if any)
> (Rb) had the cost of a load in the inner loop (or the cost of (Ra) if larger)
> (Rc) had the cost of a load and store in the inner loop ( " " )
>
> These costs were still based on MEMORY_MOVE_COST, just like the current
> ones are. However, MEMORY_MOVE_COST is defined to be relative to
> REGISTER_MOVE_COST, and REGISTER_MOVE_COST is defined to be 2 for a
> "simple" move. Since ECC(X) is effectively a cycle count, it seemed
> better to divide the cost by 2.
>
> This again produced some benefit, but not enough. I think part of the
> problem is that ARM's MEMORY_MOVE_COST is very high: it says that both
> loads and stores cost the equivalent of 5 register moves, making the cost
> of a full spill equivalent to 10 register moves. ARM also makes these
> costs independent of the mode, so that a single SImode load and store
> has the same cost as a 4-quad-vector load and store. This might be
> something that's worth looking at, especially since the same costs
> influence spilling decisions in the register allocator.
>
> However, even reducing the cost of a load to 4 moves and a store to 2
> moves didn't bring enough benefit (although it was better than 5 and 5).
> The division of costs above is also a little artificial. Because we don't
> have an integrated register allocator and scheduler, there's not really
> any telling how many times the block will need to load or store a value.
> An invariant might be used many times, and in such a way that several
> loads are needed, while another variable might be involved in a single
> read-modify-write sequence. The true cost also depends on how easily
> the spill instructions fit into the surrounding code.
>
> On a different tack, I tried to tackle (1) by taking the DFA into account.
> If delay(X) is zero, but X cannot issue due to resource hazards, I set
> delay(X) to 1.
> Again this wasn't enough in itself, although it does
> still form an important part of the "final" proposed version.
>
> I tried the algorithms from a few papers, but they too tended to be
> overly conservative or to suffer from degenerate behaviour in filters.
>
> In the end I tried an ad-hoc approach in an attempt to do something
> about (2), (3) and (4b). The idea was to construct a preliminary
> "model" schedule in which the only objective is to keep register
> pressure to a minimum. This schedule ignores pipeline characteristics,
> latencies, and the number of available registers. The maximum pressure
> seen in this initial model schedule (MP) is then the benchmark for ECC(X).
>
> During the main scheduling, an instruction X that occurs at or before
> the next point of maximum pressure in the model schedule is measured
> based on the current register pressure. If X doesn't increase the
> current pressure beyond the current maximum, its ECC(X) is zero,
> otherwise ECC(X) is the cost of going from MP to the new maximum.
> The idea is that the final net pressure of scheduling a given set of
> instructions is going to be the same regardless of the order; we simply
> want to keep the intermediate pressure under control. An ECC(X) of zero
> usually[*] means that scheduling X next won't send the rest of the
> sequence beyond the current maximum pressure.
>
> [*] but not always. There's more about this in the patch comments.
>
> If an instruction X occurs _after_ the next point of maximum pressure,
> X is measured based on that maximum pressure. If the current maximum
> pressure is MP', and X increases pressure by dP, ECC(X) is the cost of
> going from MP to MP' + dP.
>
> Of course, this all depends on how good a value MP is, and therefore
> on how good the model schedule is. I tried a few variations before
> settling on the one in the patch (which I hope makes conceptual sense).
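The ECC(X) rule described here can be condensed into a small sketch. This uses a fixed cost of 1 per excess register, matching the cost the patch eventually settled on, and all parameter names are invented for illustration:

```c
#include <assert.h>

/* Compute ECC(X) against the model schedule's maximum pressure MP.  */
static int
ecc (int before_max_point, /* is X at or before the model's next max? */
     int new_pressure,     /* pressure if X were scheduled now */
     int cur_max,          /* maximum pressure seen so far, MP' */
     int model_max,        /* MP, from the model schedule */
     int dp)               /* pressure increase caused by X */
{
  int new_max;

  if (before_max_point)
    {
      /* X doesn't push pressure beyond the current maximum: free.  */
      if (new_pressure <= cur_max)
        return 0;
      new_max = new_pressure;
    }
  else
    /* X occurs after the next point of maximum pressure, so measure it
       against that maximum: the new maximum would be MP' + dP.  */
    new_max = cur_max + dp;

  /* Cost of going from MP to the new maximum, at 1 per register.  */
  return new_max > model_max ? new_max - model_max : 0;
}
```

So as long as the main schedule stays at or below the pressure the model schedule already proved achievable, pressure never penalises an instruction; only excursions above MP cost anything.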
>
> I initially stayed with the idea above about assigning different costs to
> (Ra), (Rb) and (Rc). This produced some good results, but was still a
> little too conservative in general, in that other tests were still worse
> with -fsched-pressure than without. I described some of the problems
> with these costs above. Another is that if an instruction X has a spill
> cost of 6, say, then:
>
> ECC(X) + delay(X)
>
> will only allow X to be scheduled if the next instruction without
> a spill cost has a delay of 6 cycles or more. This is overly harsh,
> especially seeing as few ARM instructions have such a high latency.
> The benefit of spilling is often to avoid a series of short
> (e.g. single-cycle) stalls, rather than to avoid a single long one.
>
> I then adjusted positive ECC(X) values based on the priority of X
> relative to the highest-priority zero-cost instruction. This was
> better, but a DES filter in particular still suffered from the
> "lots of short stalls" problem.
>
> Then, as an experiment, I tried ignoring MEMORY_MOVE_COST altogether
> and simply treating each extra spill as having a cost of 1. Somewhat
> to my surprise, there was only one test in which some improvement was
> lost; that test was now only slightly better than without -fsched-pressure.
> But the fixed cost of 1 per spill achieved the goal of being as good as
> or better than -fno-sched-pressure in almost all cases, while still being
> significantly better in some of the cases we care about.
>
> Assigning a cost of just 1 to each excess spill is clearly going to
> interfere with the normal list scheduler less often than assigning each
> spill a higher cost. Given that this approach also gives the most important
> benefits, it felt like a good default.
>
> All in all, the patch is a complicated way of being quite timid,
> but I hope it could also be used as a basis for more aggressive
> versions in future.
> Things I haven't looked at include whether
> it would make sense to disable the priority-based adjustment of
> ECC for -Os. (That's a question of whether this patch can improve
> size still further over -fno-sched-pressure.)
>
> Being timid ought to make it fit quite well with Bernd's patch,
> although I haven't had a chance to try the two together.
>
> I talked with Vlad about this a couple of months ago (thanks again for
> the help, Vlad). He explained that some of the things I was seeing --
> especially (3) -- were deliberate decisions that had improved the
> results for x86. And I think it's probably true that the patch below
> is only going to work well on fairly regular RISCy machines.
>
> For that reason, the patch adds an -fsched-pressure-algorithm=
> option to choose between the two versions. For historical reasons
> I called the current algorithm "weighted" and the proposed one "model",
> but I'm sure there are better names. I've kept the option undocumented
> because it's really an internal thing. "weighted" remains the default.
>
> Of course, adding an alternative like this would only be acceptable if
> -fsched-pressure and -fsched-pressure-algorithm=model became the default
> for ARM (or another target). That's a decision for the ARM maintainers;
> I've sent Ramana the results I got, which unfortunately I can't share
> more widely.
>
> If anyone does have time to test this on their favourite port,
> I'd really appreciate it. I already know that it doesn't perform
> well on SPECFP for rs6000 because of the choice of pressure classes;
> I'll send a separate note about that so that it doesn't get lost in
> this long essay.
>
> Also, print_pseudo_costs segfaults when -fsched-pressure and dumping
> are enabled. I'm planning to send a patch for that at some point;
> it's a regression from 4.6, so seems like stage 3 material.
> In the meantime, a simple workaround is to stick "return;"
> at the beginning of the function.
Without looking at the patch itself, I think that addressing (2)
independently of the sched-pressure algorithm would be a good idea.

That said, even on x86_64 -fsched-pressure isn't an overall win - is
it enabled by default on any architecture?

I too think a magic constant of '1' for each spill is good (though
maybe we'd want to have a target hook that specifies spill cost [for
a specific mode] in terms of cycles).

Richard.

> Richard
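A possible shape for the per-mode spill-cost hook suggested above. This is purely hypothetical: no such hook exists in GCC's target vector, and the mode identifiers here are invented stand-ins for real machine modes:

```c
#include <assert.h>

/* Invented identifiers standing in for GCC's machine modes.  */
enum spill_mode { SPILL_SI, SPILL_V4SI };

struct sched_pressure_hooks {
  /* Cost, in cycles, of one excess spill of the given mode.  */
  int (*spill_cost) (enum spill_mode);
};

/* Default matching the fixed cost of 1 per spill from the patch;
   a target could override this to reflect its real memory costs.  */
static int
default_spill_cost (enum spill_mode mode)
{
  (void) mode;
  return 1;
}

static const struct sched_pressure_hooks default_hooks
  = { default_spill_cost };
```

A target whose vector spills are genuinely more expensive than integer ones could then return a larger value for the wider modes, while keeping ECC(X) in cycle units.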