Hi Vlad, I added you directly in case you hadn't spotted my original post.

A simple example for AArch64 trunk is as follows:

// Compile with: -O2 -fomit-frame-pointer -ffixed-d8 -ffixed-d9 -ffixed-d10
// -ffixed-d11 -ffixed-d12 -ffixed-d13 -ffixed-d14 -ffixed-d15 -f(no-)caller-saves

void g(void);

float f(float x)
{
  x += 3.0;
  g();
  x *= 3.0;
  return x;
}

It seems that reload only ever considers rematerialization of spilled
liveranges, not caller-saved ones. That means the caller-save code should
either reject constants outright or the memory spill cost for these should
always be lower than that of a caller-save (given memory_move_cost=4 and
register_move_cost=2 as commonly used by targets, anything that can be
rematerialized should have less than half the cost of being spilled or
caller-saved).

Wilco

> -----Original Message-----
> From: Wilco Dijkstra [mailto:wdijk...@arm.com]
> Sent: 27 August 2014 17:25
> To: 'gcc@gcc.gnu.org'
> Subject: Register allocation: caller-save vs spilling
>
> Hi,
>
> I'm investigating various register allocation inefficiencies. The first
> thing that stands out is that GCC both supports caller-saves as well as
> spilling. Spilling seems to spill all definitions and all uses of a
> liverange. This means you often end up with multiple reloads close
> together, while it would be more efficient to do a single load and then
> reuse the loaded value several times. Caller-save does better in that
> case, but it is inefficient in that it repeatedly stores registers across
> every call even if unchanged. If both were fixed to minimise the number of
> loads/stores I can't see how one could beat the other, so you'd no longer
> need both.
>
> Anyway due to the current implementation there are clearly cases where
> caller-save is best and cases where spilling is best. However I do not see
> it making the correct decision despite trying to account for the costs -
> some code is significantly faster with -fno-caller-saves, other code wins
> with -fcaller-saves.
> As an example, I see code like this on AArch64:
>
>       ldr     s4, .LC20
>       fmul    s0, s0, s4
>       str     s4, [x29, 104]
>       bl      f
>       ldr     s4, [x29, 104]
>       fmul    s0, s0, s4
>
> With -fno-caller-saves it spills and rematerializes the constant as you'd
> expect:
>
>       ldr     s1, .LC20
>       fmul    s0, s0, s1
>       bl      f
>       ldr     s5, .LC20
>       fmul    s0, s0, s5
>
> So given this, is the cost calculation correct and does it include
> rematerialization? The spill code understands how to rematerialize so it
> should take this into account in the costs. I did find some code in
> ira-costs.c in scan_one_insn() that attempts something that looks like an
> adjustment for rematerialization but it doesn't appear to handle all cases
> (simple immediates, 2-instruction immediates, address-constants,
> non-aliased loads such as literal pool and const data loads).
>
> Also the hook CALLER_SAVE_PROFITABLE appears to have disappeared - overall
> performance improves significantly if I add this (basically the default
> heuristic used on instruction frequencies):
>
> --- a/gcc/ira-costs.c
> +++ b/gcc/ira-costs.c
> @@ -2230,6 +2230,8 @@ ira_tune_allocno_costs (void)
>                            * ALLOCNO_FREQ (a)
>                            * IRA_HARD_REGNO_ADD_COST_MULTIPLIER (regno) / 2);
>  #endif
> +                 if (ALLOCNO_FREQ (a) < 4 * ALLOCNO_CALL_FREQ (a))
> +                   cost = INT_MAX;
>                 }
>               if (INT_MAX - cost < reg_costs[j])
>                 reg_costs[j] = INT_MAX;
>
> If such a simple heuristic can beat the costs, they can't be quite right.

Note if (ALLOCNO_FREQ (a) < 2 * ALLOCNO_CALL_FREQ (a)) turns out to be best
overall.

> Is there anyone who understands the cost calculations?
>
> Wilco
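
For anyone who wants to play with the threshold outside of GCC, here is a
minimal standalone sketch of what the check in the patch does. It is not the
IRA code: the function names, the `factor` parameter and the frequency values
in the usage note are all made up for illustration, and the only parts taken
from the patch are the comparison of ALLOCNO_FREQ against a multiple of
ALLOCNO_CALL_FREQ and the INT_MAX saturation guard.

```c
#include <limits.h>

/* Hypothetical stand-in for the check in the patch: nonzero means
   "do not keep this value in a call-clobbered register".
   factor is 4 in the patch and 2 in the follow-up note.  */
static int
caller_save_unprofitable (int allocno_freq, int call_freq, int factor)
{
  return allocno_freq < factor * call_freq;
}

/* Saturating add modelled on the INT_MAX guard in the patch context.
   Assumes both costs are non-negative.  */
static int
add_cost_saturating (int reg_cost, int cost)
{
  if (INT_MAX - cost < reg_cost)
    return INT_MAX;
  return reg_cost + cost;
}

/* Toy version of the tuned register cost: start from an assumed
   save/restore cost, force it to INT_MAX when the heuristic fires,
   then fold it into the existing register cost.  */
static int
tuned_reg_cost (int reg_cost, int save_restore_cost,
                int allocno_freq, int call_freq, int factor)
{
  int cost = save_restore_cost;

  if (caller_save_unprofitable (allocno_freq, call_freq, factor))
    cost = INT_MAX;
  return add_cost_saturating (reg_cost, cost);
}
```

With made-up frequencies of 3000 for the allocno and 1000 for the calls it
crosses, a factor of 4 rejects the call-clobbered register
(tuned_reg_cost (10, 8, 3000, 1000, 4) gives INT_MAX), while a factor of 2
leaves the ordinary cost in place (the same call with factor 2 gives 18).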