On Fri, Dec 23, 2011 at 12:46 PM, Richard Sandiford <richard.sandif...@linaro.org> wrote:
> So it looks like two pieces of work related to scheduling and register
> pressure are being posted close together. This one is an RFC for a less
> aggressive form of -fsched-pressure. I think it should complement
> rather than compete with Bernd's IRA patch. It seems like a good idea
> to take register pressure into account during the first scheduling pass,
> where we can still easily look at things like instruction latencies
> and pipeline utilisation. Better rematerialisation in the register
> allocator would obviously be a good thing too though.
>
> This patch started when we (Linaro) saw a big drop in performance
> from vectorising an RGB to YIQ filter on ARM. The first scheduling
> pass was overly aggressive in creating a "wide" schedule, and caused
> the newly-vectorised loop to contain lots of spills. The loop grew so
> big that it even had a constant pool in the middle of it.
>
> -fsched-pressure did a very good job on this loop, creating far fewer
> spills and consequently avoiding the constant pool. However, it seemed
> to make several other cases significantly worse. The idea was therefore
> to try to come up with a form of -fsched-pressure that could be turned
> on for ARM by default.
>
> Current -fsched-pressure works by assigning an "excess (pressure) cost change"
> to each instruction; here I'll write that as ECC(X). -fsched-pressure also
> changes the way that the main list scheduler handles stalls due to data
> dependencies. If an instruction would stall for N cycles, the scheduler
> would normally add it to the "now+N" queue, then add it to the ready queue
> after N cycles. With -fsched-pressure, it instead adds the instruction
> to the ready queue immediately, while still recording that the instruction
> would require N stalls. I'll write the number of stalls on X as delay(X).
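For reference, the queueing difference described here can be sketched roughly as follows. The structures and names are invented for illustration only; GCC's real implementation in haifa-sched.c looks quite different:

```c
#include <assert.h>

#define MAX_INSNS 16

struct insn {
  int id;
  int delay;  /* recorded stall cycles, delay(X) */
};

struct sched_state {
  struct insn ready[MAX_INSNS];   /* the ready queue */
  int n_ready;
  struct insn queued[MAX_INSNS];  /* stand-in for the "now+N" queues */
  int n_queued;
};

/* Add an instruction that would stall for N_STALLS cycles.  Without
   -fsched-pressure it waits out the stall in a "now+N" queue; with
   -fsched-pressure it becomes ready immediately, but with
   delay(X) = N_STALLS recorded for later use in ranking.  */
static void
add_insn (struct sched_state *s, int id, int n_stalls, int sched_pressure)
{
  struct insn x = { id, n_stalls };
  if (n_stalls > 0 && !sched_pressure)
    s->queued[s->n_queued++] = x;
  else
    s->ready[s->n_ready++] = x;
}
```

The point of the second policy is that a stalled instruction stays visible to the ranking heuristics, so a deliberate stall can be traded off against a pressure increase rather than being decided implicitly by the queueing.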
>
> This arrangement allows the scheduler to choose between increasing register
> pressure and introducing a deliberate stall. Instructions are ranked by:
>
> (a) lowest ECC(X) + delay(X)
> (b) lowest delay(X)
> (c) normal list-scheduler ranking (based on things like INSN_PRIORITY)
>
> Note that since delay(X) is measured in cycles, ECC(X) is effectively
> measured in cycles too.
>
> Several things seemed to be causing the degradations we were seeing
> with -fsched-pressure:
>
> (1) The -fsched-pressure schedule is based purely on instruction latencies
>     and issue rate; it doesn't take the DFA into account. This means that
>     we attempt to "dual issue" things like vector operations, loads and
>     stores on Cortex A8 and A9. In the examples I looked at, these sorts
>     of inaccuracy seemed to accumulate, so that the value of delay(X)
>     became based on slightly unrealistic cycle times.
>
>     Note that this also affects code that doesn't have any pressure
>     problems; it isn't limited to code that does.
>
>     This may simply be historical. It became much easier to use the
>     DFA here after Bernd's introduction of prune_ready_list, but the
>     original -fsched-pressure predates that.
>
> (2) We calculate ECC(X) by walking the unscheduled part of the block
>     in its original order, then recording the pressure at each instruction.
>     This seemed to make ECC(X) quite sensitive to that original order.
>     I saw blocks that started out naturally "narrow" (not much ILP,
>     e.g. from unrolled loops) and others that started naturally "wide"
>     (a lot of ILP, such as in the libav h264 code), and walking the
>     block in order meant that the two styles would be handled differently.
>
> (3) When calculating the pressure of the original block (as described
>     in (2)), we ignore the deaths of registers that are used by more
>     than one unscheduled instruction. This tended to hurt long(ish)
>     loops in things like filters, where the same value is often used
>     as an input to two calculations.
>     The effect was that instructions
>     towards the end of the block would appear to have very high pressure.
>     This in turn made the algorithm very conservative; it wouldn't
>     promote instructions from later in the block because those
>     instructions seemed to have a prohibitively large cost.
>
>     I asked Vlad about this, and he confirmed that it was a deliberate
>     decision. He'd tried honouring REG_DEAD notes instead, but it
>     produced worse results on x86. I'll return to this at the end.
>
> (4) ECC(X) is based on the pressure over and above ira_available_class_regs
>     (the number of allocatable registers in a given class). ARM has 14
>     allocatable GENERAL_REGS: 16 minus the stack pointer and program
>     counter. So if 14 integer variables are live across a loop but
>     not referenced within it, we end up scheduling that loop in a context
>     of permanent pressure. Pressure becomes the overriding concern,
>     and we don't get much ILP.
>
>     I suppose there are at least two ways of viewing this:
>
>     (4a) We're giving an equal weight to:
>
>          (Ra) registers that are live across a loop but not used within it
>               (and which should therefore require no spill code in the
>               loop itself)
>
>          (Rb) registers that hold loop invariants (and so should only
>               require loads in the loop itself)
>
>          (Rc) registers that are used and set within the loop (and so would
>               require both loads and stores in the loop itself)
>
>          We effectively treat everything as (Rc).
>
>     (4b) We always work to ira_available_class_regs, rather than to
>          an estimate of what maximum pressure is realistically achievable.
>          If a block has something like:
>
>            A: R2 = *R1
>            ...
>            B: R3 = *(R1 + 4)
>            ...
>            C: R1 = R1 + 8
>            ...
>            D: R4 = R2 + R3
>            ...
>
>          then it can't help but have three registers live at once.
>
> The first thing I tried was to change just (4a) in isolation.
> Taking (Ra) out of the pressure estimation produced some benefits,
> but not enough.
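The (Ra)/(Rb)/(Rc) distinction in (4a) amounts to a simple classification. A sketch, with invented predicates (deciding them for real would of course need the loop's def-use information):

```c
#include <assert.h>

enum reg_kind {
  REG_RA,  /* live across the loop, untouched inside: no spill code needed */
  REG_RB,  /* loop invariant: spilling costs only reloads inside the loop */
  REG_RC   /* set and used inside: spilling costs both loads and stores */
};

/* Classify a register with respect to a given loop.  USED_IN_LOOP and
   SET_IN_LOOP are hypothetical precomputed flags.  */
static enum reg_kind
classify_reg (int used_in_loop, int set_in_loop)
{
  if (!used_in_loop && !set_in_loop)
    return REG_RA;
  if (!set_in_loop)
    return REG_RB;
  return REG_RC;
}
```

The complaint in (4a) is then that the current cost model behaves as if classify_reg always returned REG_RC.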
> I then tried a "staging" of costs, so that:
>
> (Ra) had the cost of a load and store in the outer loop (if any)
> (Rb) had the cost of a load in the inner loop (or the cost of (Ra) if larger)
> (Rc) had the cost of a load and store in the inner loop ( " " )
>
> These costs were still based on MEMORY_MOVE_COST, just like the current
> ones are. However, MEMORY_MOVE_COST is defined to be relative to
> REGISTER_MOVE_COST, and REGISTER_MOVE_COST is defined to be 2 for a
> "simple" move. Since ECC(X) is effectively a cycle count, it seemed
> better to divide the cost by 2.
>
> This again produced some benefit, but not enough. I think part of the
> problem is that ARM's MEMORY_MOVE_COST is very high: it says that both
> loads and stores cost the equivalent of 5 register moves, making the cost
> of a full spill equivalent to 10 register moves. ARM also makes these
> costs independent of the mode, so that a single SImode load and store
> has the same cost as a 4-quad-vector load and store. This might be
> something that's worth looking at, especially since the same costs
> influence spilling decisions in the register allocator.
>
> However, even reducing the cost of a load to 4 moves and a store to 2
> moves didn't bring enough benefit (although it was better than 5 and 5).
> The division of costs above is also a little artificial. Because we don't
> have an integrated register allocator and scheduler, there's not really
> any telling how many times the block will need to load or store a value.
> An invariant might be used many times, and in such a way that several
> loads are needed, while another variable might be involved in a single
> read-modify-write sequence. The true cost also depends on how easily
> the spill instructions fit into the surrounding code.
>
> On a different tack, I tried to tackle (1) by taking the DFA into account.
> If delay(X) is zero, but X cannot issue due to resource hazards, I set
> delay(X) to 1.
> Again this wasn't enough in itself, although it does
> still form an important part of the "final" proposed version.
>
> I tried the algorithms from a few papers, but they too tended to be
> overly conservative or to suffer from degenerate behaviour in filters.
>
> In the end I tried an ad-hoc approach in an attempt to do something
> about (2), (3) and (4b). The idea was to construct a preliminary
> "model" schedule in which the only objective is to keep register
> pressure to a minimum. This schedule ignores pipeline characteristics,
> latencies, and the number of available registers. The maximum pressure
> seen in this initial model schedule (MP) is then the benchmark for ECC(X).
>
> During the main scheduling, an instruction X that occurs at or before
> the next point of maximum pressure in the model schedule is measured
> based on the current register pressure. If X doesn't increase the
> current pressure beyond the current maximum, its ECC(X) is zero,
> otherwise ECC(X) is the cost of going from MP to the new maximum.
> The idea is that the final net pressure of scheduling a given set of
> instructions is going to be the same regardless of the order; we simply
> want to keep the intermediate pressure under control. An ECC(X) of zero
> usually[*] means that scheduling X next won't send the rest of the
> sequence beyond the current maximum pressure.
>
> [*] but not always. There's more about this in the patch comments.
>
> If an instruction X occurs _after_ the next point of maximum pressure,
> X is measured based on that maximum pressure. If the current maximum
> pressure is MP', and X increases pressure by dP, ECC(X) is the cost of
> going from MP to MP' + dP.
>
> Of course, this all depends on how good a value MP is, and therefore
> on how good the model schedule is. I tried a few variations before
> settling on the one in the patch (which I hope makes conceptual sense).
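The ECC(X) rule described here can be condensed into a small sketch. This uses a fixed cost of 1 per excess register, matching the cost the patch eventually settled on, and all parameter names are invented for illustration:

```c
#include <assert.h>

/* Compute ECC(X) against the model schedule's maximum pressure MP.  */
static int
ecc (int before_max_point, /* is X at or before the model's next max? */
     int new_pressure,     /* pressure if X were scheduled now */
     int cur_max,          /* maximum pressure seen so far, MP' */
     int model_max,        /* MP, from the model schedule */
     int dp)               /* pressure increase caused by X */
{
  int new_max;

  if (before_max_point)
    {
      /* X doesn't push pressure beyond the current maximum: free.  */
      if (new_pressure <= cur_max)
        return 0;
      new_max = new_pressure;
    }
  else
    /* X occurs after the next point of maximum pressure, so measure it
       against that maximum: the new maximum would be MP' + dP.  */
    new_max = cur_max + dp;

  /* Cost of going from MP to the new maximum, at 1 per register.  */
  return new_max > model_max ? new_max - model_max : 0;
}
```

So as long as the main schedule stays at or below the pressure the model schedule already proved achievable, pressure never penalises an instruction; only excursions above MP cost anything.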
>
> I initially stayed with the idea above about assigning different costs to
> (Ra), (Rb) and (Rc). This produced some good results, but was still a
> little too conservative in general, in that other tests were still worse
> with -fsched-pressure than without. I described some of the problems
> with these costs above. Another is that if an instruction X has a spill
> cost of 6, say, then:
>
> ECC(X) + delay(X)
>
> will only allow X to be scheduled if the next instruction without
> a spill cost has a delay of 6 cycles or more. This is overly harsh,
> especially seeing as few ARM instructions have such a high latency.
> The benefit of spilling is often to avoid a series of short
> (e.g. single-cycle) stalls, rather than to avoid a single long one.
>
> I then adjusted positive ECC(X) values based on the priority of X
> relative to the highest-priority zero-cost instruction. This was
> better, but a DES filter in particular still suffered from the
> "lots of short stalls" problem.
>
> Then, as an experiment, I tried ignoring MEMORY_MOVE_COST altogether
> and simply treating each extra spill as having a cost of 1. Somewhat
> to my surprise, there was only one test in which some improvement was
> lost; that test was now only slightly better than without -fsched-pressure.
> But the fixed cost of 1 per spill achieved the goal of being as good as
> or better than -fno-sched-pressure in almost all cases, while still being
> significantly better in some of the cases we care about.
>
> Assigning a cost of just 1 to each excess spill is clearly going to
> interfere with the normal list scheduler less often than assigning each
> spill a higher cost. Given that this approach also gives the most important
> benefits, it felt like a good default.
>
> All in all, the patch is a complicated way of being quite timid,
> but I hope it could also be used as a basis for more aggressive
> versions in future.
> Things I haven't looked at include whether
> it would make sense to disable the priority-based adjustment of
> ECC for -Os. (That's a question of whether this patch can improve
> size still further over -fno-sched-pressure.)
>
> Being timid ought to make it fit quite well with Bernd's patch,
> although I haven't had a chance to try the two together.
>
> I talked with Vlad about this a couple of months ago (thanks again for
> the help, Vlad). He explained that some of the things I was seeing --
> especially (3) -- were deliberate decisions that had improved the
> results for x86. And I think it's probably true that the patch below
> is only going to work well on fairly regular RISCy machines.
>
> For that reason, the patch adds an -fsched-pressure-algorithm=
> option to choose between the two versions. For historical reasons
> I called the current algorithm "weighted" and the proposed one "model",
> but I'm sure there are better names. I've kept the option undocumented
> because it's really an internal thing. "weighted" remains the default.
>
> Of course, adding an alternative like this would only be acceptable if
> -fsched-pressure and -fsched-pressure-algorithm=model became the default
> for ARM (or another target). That's a decision for the ARM maintainers;
> I've sent Ramana the results I got, which unfortunately I can't share
> more widely.
>
> If anyone does have time to test this on their favourite port,
> I'd really appreciate it. I already know that it doesn't perform
> well on SPECFP for rs6000 because of the choice of pressure classes;
> I'll send a separate note about that so that it doesn't get lost in
> this long essay.
>
> Also, print_pseudo_costs segfaults when -fsched-pressure and dumping
> are enabled. I'm planning to send a patch for that at some point;
> it's a regression from 4.6, so seems like stage 3 material.
> In the meantime, a simple workaround is to stick "return;"
> at the beginning of the function.
Without looking at the patch itself, I think that addressing (2)
independently of the sched-pressure algorithm would be a good idea.

That said, even on x86_64 -fsched-pressure isn't an overall win - is
it enabled by default on any architecture?

I too think a magic constant of '1' for each spill is good (though
maybe we'd want to have a target hook that specifies spill cost [for
a specific mode] in terms of cycles).

Richard.

> Richard
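A possible shape for the per-mode spill-cost hook suggested above. This is purely hypothetical: no such hook exists in GCC's target vector, and the mode identifiers here are invented stand-ins for real machine modes:

```c
#include <assert.h>

/* Invented identifiers standing in for GCC's machine modes.  */
enum spill_mode { SPILL_SI, SPILL_V4SI };

struct sched_pressure_hooks {
  /* Cost, in cycles, of one excess spill of the given mode.  */
  int (*spill_cost) (enum spill_mode);
};

/* Default matching the fixed cost of 1 per spill from the patch;
   a target could override this to reflect its real memory costs.  */
static int
default_spill_cost (enum spill_mode mode)
{
  (void) mode;
  return 1;
}

static const struct sched_pressure_hooks default_hooks
  = { default_spill_cost };
```

A target whose vector spills are genuinely more expensive than integer ones could then return a larger value for the wider modes, while keeping ECC(X) in cycle units.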