> > +/* Implement TARGET_IRA_CALLEE_SAVED_REGISTER_COST_SCALE. */ > > + > > +static int > > +ix86_ira_callee_saved_register_cost_scale (int) > > +{ > > + return 1; > > +} > > +
> > return cl; > > } > > +int > > +default_ira_callee_saved_register_cost_scale (int) > > +{ > > + return (optimize_size > > + ? 1 > > + : REG_FREQ_FROM_BB (ENTRY_BLOCK_PTR_FOR_FN (cfun))); > > +} > > + I am not sure how this makes sense - why x86 would be significantly different from other targets? I think the only bit non-standard thing is that prologue/epilogue code can use push/pop that is shorter than mov used to save caller saved registers. I went through few testcases: void d(); void a() { int b; asm ("use %0":"=r" (b)); d(); asm volatile (""::"r" (b)); } compiler with -O2 -fira-verbose=10000 gives: Popping a0(r99,l0) -- (0=12000,12000) (1=12000,12000) (2=12000,12000) (4=12000,12000) (5=12000,12000) (36=12000,12000) (37=12000,12000) (38=12000,12000) (39=12000,12000) (3=11000,11000) (6=11000,11000) (40=11000,11000) (41=11000,11000) (42=11000,11000) (43=11000,11000) load and save costs are 6. So spill pair is 12 weighted by 1000 that is REG_FREQ_MAX. Register 0 (EAX) has cost 12000 which makes sense to me: - load and save costs are 6, combined spill pair is 12 - REG_FREQ_MAX is 1000 and since function has only one BB, it has maximal frequency, so we get 12000. Register 3 (first caller saved) has cost 11000. This comes from: add_cost = ((ira_memory_move_cost[mode][rclass][0] + ira_memory_move_cost[mode][rclass][1]) * saved_nregs / hard_regno_nregs (hard_regno, mode) - 1) ^^ here * (optimize_size ? 1 : REG_FREQ_FROM_BB (ENTRY_BLOCK_PTR_FOR_FN (cfun))); There is no comment why -1, but I suppose it is there to biass costs to use prologue/epilogue instad of caller save sequence when runtime cost estimate is even. Now for void d(); void a() { for (int i = 0; i < 100; i++) d(); int b; asm ("use %0":"=r" (b)); d(); asm volatile (""::"r" (b)); } I get Popping a0(r100,l0) -- (0=120,120) (1=120,120) (2=120,120) (4=120,120) (5=120,120) (36=120,120) (37=120,120) (38=120,120) (39=120,120) (3=0,0) (6=110,110) (40=110,110) (41=110,110) (42=110,110) (43=110,110) This also makes sense to me, since there is loop the basic block has lower frequency of 10, thus costs are scaled down. void d(); int cnd; void a() { int b; asm ("use %0":"=r" (b)); if (__builtin_expect_with_probability (cnd, 1, 0.8)) d(); asm volatile (""::"r" (b)); } I get Popping a0(r100,l0) -- (0=9600,9600) (1=9600,9600) (2=9600,9600) (4=9600,9600) (5=9600,9600) (36=9600,9600) (37=9600,9600) (38=9600,9600) (39=9600,9600) (3=11000,11000) (6=11000,11000) (40=11000,11000) (41=11000,11000) (42=11000,11000) (43=11000,11000) which seems also correct. It is better to use caller saved registr since call to d() has lower frequency then the entry basic block. This is what gcc 13 and this patch gets wrong Popping a0(r100,l0) -- (1=9600,9600) (2=9600,9600) (4=9600,9600) (5=9600,9600) (36=9600,9600) (37=9600,9600) (38=9600,9600) (39=9600,9600) (3=11,11) (6=11,11) (40=11,11) (41=11,11) (42=11,11) (43=11,11) Due to missing scaling factor we think that using callee saved registr is win while it is not. GCC13 gets this wrong even for probability 0. Looking into PRs referneced in the patch: PR111673 is the original bug that motivated correcting the cost (adding the scale by entry block frequency) PR115932 is cris-elf I don't know how to bencmark easily. PR116028 seems to be about shrink wrapping in void f(int *i) { if (!i) return; else { __builtin_printf("Hi"); *i=0; } } here I see tha tthe cost model misses the fact that epilogue will be shrink-wrapped so both caller and callee saving will result in one spill after the early exit. PR117081 is about regression in povray. The reducted testcase: void foo (void); void bar (void); int test (int a) { int r; if (r = -a) foo (); else bar (); return r; } shows that we now use caller saved register (EAX) to hold the return value which yields longer code. The costs are Popping a0(r98,l0) -- (0=13000,13000) (3=15000,15000) (6=15000,15000) (40=15000,15000) (41=15000,15000) (42=15000,15000) (43=15000,15000) here 15000 is 11000+4000 where I think 4000 is cost of 2 reg-reg moves multiplied by REG_FREQ_MAX. This seems correct. GCC 13 uses callee saved register and produces: 0000000000000000 <test>: 0: 53 push %rbx <--- callee save 1: 89 fb mov %edi,%ebx <--- move 1 3: f7 db neg %ebx 5: 74 09 je 10 <test+0x10> 7: e8 00 00 00 00 call c <test+0xc> c: 89 d8 mov %ebx,%eax <--- callee restore e: 5b pop %rbx f: c3 ret 10: e8 00 00 00 00 call 15 <test+0x15> 15: 89 d8 mov %ebx,%eax <--- move 2 17: 5b pop %rbx <--- callee restore 18: c3 ret Mainline used EAX since it has costs 13000. It is not 100% clear to me why. - 12000 is the spilling (which is emitted twice but executed just once) - I would have expected 2000 for the move from edi to eax. However even if cost is 14000 we will choose EAX. The code is: 0: 89 f8 mov %edi,%eax <--- move1 2: 48 83 ec 18 sub $0x18,%rsp <--- stack frame creation 6: f7 d8 neg %eax 8: 89 44 24 0c mov %eax,0xc(%rsp) <--- spill out c: 85 ff test %edi,%edi e: 74 10 je 20 <test+0x20> 10: e8 00 00 00 00 call 15 <test+0x15> 15: 8b 44 24 0c mov 0xc(%rsp),%eax <--- spill in 19: 48 83 c4 18 add $0x18,%rsp <--- stack frame 1d: c3 ret 1e: 66 90 xchg %ax,%ax 20: e8 00 00 00 00 call 25 <test+0x25> 25: 8b 44 24 0c mov 0xc(%rsp),%eax <--- spill in 29: 48 83 c4 18 add $0x18,%rsp <--- stack frame 2d: c3 ret This sequence really saves one move at the expense of of stack frame allocation (which is not modelled by the cost model) and longer spill code (also no modelled). PR117082 is about noreturn function: __attribute__ ((noreturn)) void f3 (void) { int y0 = x0; int y1 = x1; f1 (); f2 (y0, y1); while (1); } Here the cost model is really wrong by assuming that entry and exit block have same frequencies. This can be fixed quite easilly (though it is a rare case) PR118497 seems to be ixed. So overall I think 1) we can fix scaling of epilogue by exit block frequency to get noreturns right. 2) we should drop the check for optimize_size. Since with -Os REG_FREQ_FROM_BB always returns 1000 everything should be scaled same way 3) we currently have wire in "-1" to biass the cost metric for callee saved registers. It may make sense to allow targets to control this, since i.e. x86 has push/pop that is shorter. -3 would solve the testcase with neg and would express that push/pop is still cheaper with extra reg-reg move. 4) cost model misses shring wrapping, the fact that if register is callee saved it may be used by multiple allocnos and also that push/pop sequence may avoid need for manual RSP adjustments. Those seems bit harder things to fit in though. So if we want to go with the target hook, I think it should adjust the cost before scalling (since targets may have special tricks for prologues) rather than the scale factor (which is target independent part of cost model). Honza