https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782
--- Comment #18 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
(In reply to Tamar Christina from comment #17)
> > On “CALL_FREQ grows much quicker than BB_FREQ”: for r104, the
> > ALLOCNO_FREQ ought in principle to be fixed for a given loop iteration
> > count.  It shouldn't grow or shrink based on the value of SPILLED.
> > That's because every execution of the loop body involves exactly one
> > reference to r104.  SPILLED specifies the probability that that single
> > reference is the “call” use rather than the “non-call” use, but it
> > doesn't change the total number of references per iteration.
> > 
> > So I think the only reason we see the different ALLOCNO_FREQs in:
> > 
> >   ALLOCNO_FREQ 989, …
> > 
> > vs:
> > 
> >   ALLOCNO_FREQ 990, …
> > 
> > is round-off error.  If the values had more precision, I think we'd
> > have a fixed ALLOCNO_FREQ and a varying ALLOCNO_CALL_FREQ.
> 
> Yeah, that's plausible.  As far as I can tell, the FREQs are always scaled
> by REG_FREQ_FROM_EDGE_FREQ into [0, BB_FREQ_MAX], and that indeed does an
> integer division.  The general problem is that the IPA frequencies don't
> really seem to have any bounded range, so it always needs to scale.
> 
> So I think you're always going to have this error one way or another,
> which may or may not work to your advantage on any given program.
> 
> Maybe we need a way to be a bit more tolerant of this rounding error
> instead?

I guess the thing is: with the costs that the target is giving IRA,
there isn't much of a difference between the cost of the “0.5 way” and
the cost of the “0.51 way” for values of SPILLED around the 0.5 mark.
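To make the round-off point above concrete, here is a toy Python sketch of how scaling each block's frequency with an integer truncation can make the summed ALLOCNO_FREQ wobble between 989 and 990, even though every loop iteration contains exactly one reference.  The function names and the simple `int()` truncation are mine for illustration; they stand in for the REG_FREQ_FROM_EDGE_FREQ scaling, not GCC's actual code:

```python
# Illustrative sketch (not GCC code): per-block integer scaling can change
# the total reference frequency depending on how the probability splits.

def scaled_block_freq(loop_freq, prob):
    # Stand-in for the REG_FREQ_FROM_EDGE_FREQ-style scaling, which ends
    # in an integer division and so truncates.
    return int(loop_freq * prob)

def allocno_freq(loop_freq, spilled):
    # One reference in the "call" block, one in the "non-call" block,
    # so the true total is always loop_freq, independent of SPILLED.
    call_block = scaled_block_freq(loop_freq, spilled)
    non_call_block = scaled_block_freq(loop_freq, 1 - spilled)
    return call_block + non_call_block

print(allocno_freq(990, 0.50))  # 990: the split divides evenly
print(allocno_freq(990, 0.51))  # 989: truncation loses one unit
```

With more precision both calls would return the same total and only the call/non-call split (the ALLOCNO_CALL_FREQ analogue) would vary, matching the observation above.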
For spilling r104 we have:

  Factor            Cost type  What
  ======            =========  ====
  1                 store      store outside the loop when R is set
  1024×SPILLED      load       load inside the loop when R is used (call)
  1024×(1-SPILLED)  load       load inside the loop when R is used (non-call)

For allocating a call-clobbered register to r104 we have:

  Factor            Cost type  What
  ======            =========  ====
  1024×SPILLED      store      store before the call
  1024×SPILLED      load       load after the call

When loads and stores have equal weight, that weight cancels out, giving:

  1025 vs. 2048×SPILLED

So the costs are equal for SPILLED=1025/2048.  Above that, spilling is
cheaper; below that, allocating a call-clobbered register is cheaper.
IRA has done well to find the exact tipping point (based on the
assumptions it has been told to use), but for values of SPILLED around
the tipping point, the two costs are very close.  So with the current
cost model, we don't fall off a cliff cost-wise when we hit the tipping
point, or when rounding error nudges us across it.  The costs coming in
give the impression that there isn't much to choose between the two
approaches when SPILLED is roughly half-and-half.

Now if SPILLED is completely off the mark (e.g. we say it has a
probability of 0.51 but the actual runtime probability is more like 0.1),
then clearly we're in trouble.  I'm not sure what we can do about that,
though.  Perhaps if the edge frequencies (and derived block frequencies)
are pure guesses, we should weight based on the approximate cost of the
block instead.  I.e. the call block gets a low frequency because calls
are expensive, while the non-call block gets a high frequency because it
does simple arithmetic.  Thus any new instructions added to the call
block are likely to have a proportionately lower effect than any new
instructions added to the non-call block.  I hope we don't have to do
that though :-)

> > > > which is cheaper than both the current approaches.
> > > > We don't do that optimisation yet though, so the current costing
> > > > seems to reflect what we currently generate.
> > > 
> > > In many (if not most) arches, stores are significantly cheaper than
> > > the loads though.  So the store before the call doesn't end up making
> > > that much of a difference, but yes, it adds up if you have many of
> > > them.
> > 
> > Yeah.  Could we fix the problem that way instead?  The only reason IRA
> > is treating loads and stores as equal cost is because aarch64 asked it
> > to :-)
> 
> I tried a quick check and it does fix the testcase but not the benchmark,
> which is not entirely unexpected thinking about it, because x86 does
> correctly model the store costs.

Yeah, I was afraid that would be the case.

> I can try fixing the costs correctly and try reducing again.  It looks
> like it still thinks spilling to memory is cheaper than caller-save
> reloads.

Thanks, I think that would help.  The current aarch64 costs are likely to
distort things if we try to reduce using them.  IMO the initial reduction
has been useful because it gives a nice crystallised example of why the
load/store cost ratio matters.
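For what it's worth, the cost model in the tables above is easy to play with.  The following Python sketch (function names are mine; the 1024 loop weight and both cost tables come straight from the comment above) computes the two costs with explicit load/store weights, reproduces the SPILLED=1025/2048 tipping point for equal weights, and shows that making stores cheaper than loads raises the tipping point, i.e. widens the range of SPILLED for which caller-save reloads beat spilling:

```python
# Sketch of the cost model from the tables above, with the load/store
# weights made explicit.  LOOP_FREQ and both cost formulas are from the
# thread; the function names are illustrative.

LOOP_FREQ = 1024

def spill_cost(w_load, w_store, spilled):
    # One store outside the loop, plus one load per iteration
    # (on either the call path or the non-call path).
    return (w_store
            + LOOP_FREQ * spilled * w_load
            + LOOP_FREQ * (1 - spilled) * w_load)

def caller_save_cost(w_load, w_store, spilled):
    # Save before and restore after the call, only on the call path.
    return LOOP_FREQ * spilled * (w_store + w_load)

def tipping_point(w_load, w_store):
    # SPILLED value at which the two costs are equal.
    return (w_store + LOOP_FREQ * w_load) / (LOOP_FREQ * (w_store + w_load))

print(tipping_point(1, 1))    # 1025/2048 ≈ 0.50049, as derived above
print(tipping_point(1, 0.5))  # ≈ 0.667: cheaper stores favour caller-save
```

With stores at half the cost of loads, caller-save reloads stay cheaper up to SPILLED ≈ 0.667 instead of ≈ 0.5005, which illustrates why the equal load/store costs that aarch64 reports can tip IRA toward spilling.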