Re: [PATCH v2] ira: Add a target hook for callee-saved register cost scale

Jan Hubicka Thu, 06 Feb 2025 14:35:53 -0800

> > +/* Implement TARGET_IRA_CALLEE_SAVED_REGISTER_COST_SCALE.  */
> > + 
> > +static int
> > +ix86_ira_callee_saved_register_cost_scale (int)
> > +{
> > +  return 1;
> > +}
> > +


> >     return cl;
> >   }
> > +int
> > +default_ira_callee_saved_register_cost_scale (int)
> > +{
> > +  return (optimize_size
> > +     ? 1
> > +     : REG_FREQ_FROM_BB (ENTRY_BLOCK_PTR_FOR_FN (cfun)));
> > +}
> > +

I am not sure how this makes sense - why would x86 be significantly
different from other targets?

I think the only non-standard bit is that prologue/epilogue code
can use push/pop, which is shorter than the mov used to save caller saved
registers.

I went through a few testcases:

void d();
void a()
{       
        int b;
        asm ("use %0":"=r" (b));
        d();
        asm volatile (""::"r" (b));
}
Compiled with -O2 -fira-verbose=10000 this gives:

Popping a0(r99,l0)  -- (0=12000,12000) (1=12000,12000) (2=12000,12000) 
(4=12000,12000) (5=12000,12000) (36=12000,12000) (37=12000,12000) 
(38=12000,12000) (39=12000,12000) (3=11000,11000) (6=11000,11000) 
(40=11000,11000) (41=11000,11000) (42=11000,11000) (43=11000,11000)

Load and save costs are 6 each, so a spill pair costs 12, weighted by
1000, which is REG_FREQ_MAX.

Register 0 (EAX) has cost 12000 which makes sense to me:
  - load and save costs are 6, combined spill pair is 12
  - REG_FREQ_MAX is 1000 and since function has only one BB, it has
    maximal frequency, so we get 12000.

Register 3 (first callee saved) has cost 11000.  This comes from:
            add_cost = ((ira_memory_move_cost[mode][rclass][0]
                         + ira_memory_move_cost[mode][rclass][1])
                        * saved_nregs / hard_regno_nregs (hard_regno,
                                                          mode) - 1)
                                                                  ^^
                                                                  here

                       * (optimize_size ? 1 :
                          REG_FREQ_FROM_BB (ENTRY_BLOCK_PTR_FOR_FN (cfun)));

There is no comment explaining the -1, but I suppose it is there to bias
costs towards the prologue/epilogue instead of the caller save sequence
when the runtime cost estimate is even.
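
The two numbers above can be reproduced with a small sketch.  The helper
names are made up for illustration and are not IRA functions; the
constants (memory move costs of 6 and REG_FREQ_MAX of 1000) are the
values quoted above, and the second helper simply mirrors the add_cost
expression, including its integer arithmetic.

```c
#include <assert.h>

/* Values quoted in this mail for the testcase; not read from GCC.  */
#define REG_FREQ_MAX 1000
#define MEM_STORE_COST 6
#define MEM_LOAD_COST 6

/* Hypothetical helper: cost of spilling a caller saved register around
   one call that executes with frequency CALL_FREQ.  */
int
caller_saved_cost (int call_freq)
{
  assert (call_freq >= 0 && call_freq <= REG_FREQ_MAX);
  return (MEM_STORE_COST + MEM_LOAD_COST) * call_freq;
}

/* Hypothetical helper mirroring the add_cost expression: one
   save/restore pair in the prologue/epilogue, biased by -1, then
   scaled by the entry block frequency.  */
int
callee_saved_cost (int saved_nregs, int nregs, int entry_freq)
{
  assert (nregs > 0 && entry_freq >= 0);
  return ((MEM_STORE_COST + MEM_LOAD_COST) * saved_nregs / nregs - 1)
         * entry_freq;
}
```

With a single basic block, caller_saved_cost (REG_FREQ_MAX) is 12000 and
callee_saved_cost (1, 1, REG_FREQ_MAX) is 11000, matching the dump.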

Now for
void d();
void a()
{
        for (int i = 0; i < 100; i++)
          d();
        int b;
        asm ("use %0":"=r" (b));
        d();
        asm volatile (""::"r" (b));
}

I get

      Popping a0(r100,l0)  -- (0=120,120) (1=120,120) (2=120,120) (4=120,120) 
(5=120,120) (36=120,120) (37=120,120) (38=120,120) (39=120,120) (3=0,0) 
(6=110,110) (40=110,110) (41=110,110) (42=110,110) (43=110,110)

This also makes sense to me: since there is a loop, the basic block has
a lower frequency of 10, thus costs are scaled down.

void d();
int cnd;
void a()
{
        int b;
        asm ("use %0":"=r" (b));

        if (__builtin_expect_with_probability (cnd, 1, 0.8))
          d();
        asm volatile (""::"r" (b));
}

I get

     Popping a0(r100,l0)  -- (0=9600,9600) (1=9600,9600) (2=9600,9600) 
(4=9600,9600) (5=9600,9600) (36=9600,9600) (37=9600,9600) (38=9600,9600) 
(39=9600,9600) (3=11000,11000) (6=11000,11000) (40=11000,11000) 
(41=11000,11000) (42=11000,11000) (43=11000,11000)

which also seems correct.  It is better to use a caller saved register
since the call to d() has lower frequency than the entry basic block.
This is what gcc 13 and this patch get wrong:

     Popping a0(r100,l0)  -- (1=9600,9600) (2=9600,9600) (4=9600,9600) 
(5=9600,9600) (36=9600,9600) (37=9600,9600) (38=9600,9600) (39=9600,9600) 
(3=11,11) (6=11,11) (40=11,11) (41=11,11) (42=11,11) (43=11,11)

Due to the missing scaling factor we think that using a callee saved
register is a win while it is not.  GCC 13 gets this wrong even for
probability 0.
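
To make the effect of the missing scale concrete, here is a sketch of
the comparison for this testcase.  The function and its parameters are
hypothetical, not IRA code; the frequencies (800 for the call, 1000 for
the entry block) and the costs (12 and 11) are taken from the dumps
above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper, not IRA code.  Returns true when the modelled
   cost of a callee saved register is lower than that of a caller saved
   one.  ENTRY_SCALE is the factor applied to the prologue/epilogue
   cost: the entry block frequency with the scaling, 1 without it.  */
bool
callee_saved_looks_cheaper (int call_freq, int entry_scale)
{
  const int spill_pair = 6 + 6;              /* store + load around the call */
  const int prologue_pair = spill_pair - 1;  /* wired-in -1 bias */

  assert (call_freq >= 0 && entry_scale > 0);
  return prologue_pair * entry_scale < spill_pair * call_freq;
}
```

callee_saved_looks_cheaper (800, 1000) is false (11000 against 9600),
while callee_saved_looks_cheaper (800, 1) is true (11 against 9600),
which is the wrong choice shown in the last dump.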

Looking into the PRs referenced in the patch:
PR111673 is the original bug that motivated correcting the cost (adding
         the scale by entry block frequency)
PR115932 is cris-elf, which I don't know how to benchmark easily.
PR116028 seems to be about shrink wrapping in

void f(int *i)
{
        if (!i)
                return;
        else
        {
                __builtin_printf("Hi");
                *i=0;
        }
}

Here I see that the cost model misses the fact that the epilogue will be
shrink-wrapped, so both caller and callee saving will result in one spill
after the early exit.

PR117081 is about a regression in povray. The reduced testcase:

void foo (void);
void bar (void);

int
test (int a)
{
  int r;

  if (r = -a)
    foo ();
  else
    bar ();

  return r;
}

shows that we now use a caller saved register (EAX) to hold the return
value, which yields longer code.  The costs are
Popping a0(r98,l0)  -- (0=13000,13000) (3=15000,15000) (6=15000,15000) 
(40=15000,15000) (41=15000,15000) (42=15000,15000) (43=15000,15000)

Here 15000 is 11000+4000, where I think 4000 is the cost of 2 reg-reg
moves multiplied by REG_FREQ_MAX.  This seems correct. GCC 13 uses a
callee saved register and produces:

0000000000000000 <test>:
   0:   53                      push   %rbx             <--- callee save
   1:   89 fb                   mov    %edi,%ebx        <--- move 1
   3:   f7 db                   neg    %ebx
   5:   74 09                   je     10 <test+0x10>
   7:   e8 00 00 00 00          call   c <test+0xc>
   c:   89 d8                   mov    %ebx,%eax        <--- move 2
   e:   5b                      pop    %rbx             <--- callee restore
   f:   c3                      ret
  10:   e8 00 00 00 00          call   15 <test+0x15>
  15:   89 d8                   mov    %ebx,%eax        <--- move 2
  17:   5b                      pop    %rbx             <--- callee restore
  18:   c3                      ret

Mainline uses EAX since it has cost 13000.  It is not 100% clear to me
why.
 - 12000 is the spilling (which is emitted twice but executed just once)
 - I would have expected 2000 for the move from edi to eax.
However, even if the cost were 14000 we would choose EAX.  The code is:

   0:   89 f8                   mov    %edi,%eax        <--- move1
   2:   48 83 ec 18             sub    $0x18,%rsp       <--- stack frame creation
   6:   f7 d8                   neg    %eax
   8:   89 44 24 0c             mov    %eax,0xc(%rsp)   <--- spill out
   c:   85 ff                   test   %edi,%edi
   e:   74 10                   je     20 <test+0x20>
  10:   e8 00 00 00 00          call   15 <test+0x15>
  15:   8b 44 24 0c             mov    0xc(%rsp),%eax   <--- spill in
  19:   48 83 c4 18             add    $0x18,%rsp       <--- stack frame
  1d:   c3                      ret
  1e:   66 90                   xchg   %ax,%ax
  20:   e8 00 00 00 00          call   25 <test+0x25>
  25:   8b 44 24 0c             mov    0xc(%rsp),%eax   <--- spill in
  29:   48 83 c4 18             add    $0x18,%rsp       <--- stack frame
  2d:   c3                      ret

This sequence really saves one move at the expense of stack frame
allocation (which is not modelled by the cost model) and longer spill
code (also not modelled).

PR117082 is about a noreturn function:
__attribute__ ((noreturn))
void
f3 (void)
{
  int y0 = x0;
  int y1 = x1;
  f1 ();
  f2 (y0, y1);
  while (1);
}

Here the cost model is really wrong in assuming that the entry and exit
blocks have the same frequency.  This can be fixed quite easily (though
it is a rare case).
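
A sketch of what scaling the epilogue separately could look like.  This
is not existing IRA code: the function and its parameters are
hypothetical, the costs of 6 are the x86 values from the dumps earlier
in this mail, and the -1 bias is left out to keep the arithmetic
visible.

```c
#include <assert.h>

/* Hypothetical helper: the save in the prologue runs with the entry
   block frequency, the restore in the epilogue with the exit block
   frequency.  */
int
prologue_epilogue_cost (int store_cost, int load_cost,
                        int entry_freq, int exit_freq)
{
  assert (entry_freq >= 0 && exit_freq >= 0);
  return store_cost * entry_freq + load_cost * exit_freq;
}
```

For an ordinary function prologue_epilogue_cost (6, 6, 1000, 1000) is
12000, the same as assuming equal frequencies.  For a noreturn function
such as f3 the exit block is never reached, so
prologue_epilogue_cost (6, 6, 1000, 0) is 6000.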

PR118497 seems to be fixed.

So overall I think
 1) we can fix scaling of epilogue by exit block frequency
    to get noreturns right.
 2) we should drop the check for optimize_size.  Since with -Os
    REG_FREQ_FROM_BB always returns 1000, everything should be scaled
    the same way
 3) we currently have a wired-in "-1" to bias the cost metric for callee
    saved registers.
    It may make sense to allow targets to control this, since e.g. x86
    has push/pop that is shorter. -3 would solve the testcase with neg
    and would express that push/pop is still cheaper with an extra
    reg-reg move.
 4) the cost model misses shrink wrapping, the fact that if a register is
    callee saved it may be used by multiple allocnos, and also that a
    push/pop sequence may avoid the need for manual RSP adjustments.

    Those seem a bit harder to fit in though.

So if we want to go with the target hook, I think it should adjust the
cost before scaling (since targets may have special tricks for
prologues) rather than the scale factor (which is the target independent
part of the cost model).
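
A sketch of that split, with hypothetical names throughout and not meant
as a proposed patch: the hook returns an unscaled cost for one
save/restore pair, and the target independent code applies the
frequency.  The -1 and -3 biases are the ones discussed in point 3; the
hook signature itself is an assumption.

```c
#include <assert.h>

/* Hypothetical hook type: given the memory store+load cost for one
   register, return the unscaled prologue/epilogue cost.  */
typedef int (*callee_saved_cost_hook) (int mem_pair_cost);

/* Mirrors the currently wired-in -1 bias.  */
int
default_callee_saved_cost (int mem_pair_cost)
{
  return mem_pair_cost - 1;
}

/* Illustrative x86-like variant: push/pop is shorter, so bias by -3.  */
int
pushpop_callee_saved_cost (int mem_pair_cost)
{
  return mem_pair_cost - 3;
}

/* Target independent part: scale whatever the hook returns by the
   entry block frequency.  */
int
scaled_callee_saved_cost (callee_saved_cost_hook hook, int mem_pair_cost,
                          int entry_freq)
{
  assert (hook != 0 && entry_freq >= 0);
  return hook (mem_pair_cost) * entry_freq;
}
```

With the costs from the neg testcase (spill pair 12, entry frequency
1000), the default variant gives 11000 and the -3 variant gives 9000;
adding the 4000 for the two reg-reg moves yields 15000 and 13000
respectively.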

Honza
