https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65862

--- Comment #5 from Robert Suchanek <robert.suchanek at imgtec dot com> ---
Sorry for the late reply; I was on vacation.

> The costs are equal if cost of moving general regs to/from fp regs or
> memory are equal.  So it looks ok to me.
> 
> r218 spilled in IRA is reassigned to a fp reg in *LRA*.  

> But I could try to use preferred class in LRA (after checking how it
> affects x86/x86-64 performance), if such solution is ok for you.

Indeed, the above test case only shows the problem in LRA. If the preferred
class ends up being the winner, that is fine with me. However, there are still
some issues with IRA, and I have another testcase to show it.

> I am not sure, that the result code is better as we access memory 3
> times instead of access to $f20.

On one hand, yes, the code looks good, but it is not always desirable to use FP
regs unless absolutely necessary. For instance, having the dynamic linker
compiled to use FP regs does not seem right.

I had another thought about spilling into registers and how we could guarantee
spilling into the desired class. In the majority of cases where integers end up
in floating-point registers, I see the following in the dumps:
...
        Reassigning non-reload pseudos
                 Assign 52 to r217 (freq=46)
...

This is where the use of FP registers is introduced (in lra-assigns.c):
...
  if (n != 0 && lra_dump_file != NULL)
    fprintf (lra_dump_file, "  Reassigning non-reload pseudos\n");
  qsort (sorted_pseudos, n, sizeof (int), pseudo_compare_func);
  for (i = 0; i < n; i++)
    {
      regno = sorted_pseudos[i];
      hard_regno = find_hard_regno_for (regno, &cost, -1, false);
      if (hard_regno >= 0)
        ...
      else
        ...
    }
...

find_hard_regno_for chooses the FP registers freely because the allocno class
is ALL_REGS.

A quick hack in the if conditional, skipping the body for pseudos spilled to
memory:

        ...
        if (hard_regno >= 0 && ! in_mem_p (regno))
        ...

forces the use of the TARGET_SPILL_CLASS hook and resolves spilling to FP regs
in over 95% of cases, though not entirely. In terms of code size, this change
gave a minor improvement in the average case. Would this approach be the
correct way to guarantee spilling to the desired class?
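
For reference, the kind of TARGET_SPILL_CLASS definition I have in mind for
MIPS would look roughly like the following. This is only a sketch, not the
actual mips.c implementation; the function name and the exact condition are
invented:

static reg_class_t
sketch_spill_class (reg_class_t rclass ATTRIBUTE_UNUSED, machine_mode mode)
{
  /* Purely illustrative: no register spill class for integer modes,
     so a spilled integer pseudo goes to memory rather than to an FP
     register.  Other modes could still offer FP_REGS as a spill
     location if that were ever wanted.  */
  if (GET_MODE_CLASS (mode) == MODE_INT)
    return NO_REGS;
  return FP_REGS;
}

#undef TARGET_SPILL_CLASS
#define TARGET_SPILL_CLASS sketch_spill_class

Together with the in_mem_p check above, the intent is that an integer pseudo
spilled by IRA stays in memory instead of being reassigned to an FP register.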

In the remaining 5% of cases, IRA assigns FP regs, with LRA blindly following
IRA's decisions, as in the following reduced case:

int a, b, d, e, j, k, n, o;
unsigned c, h, i, l, m, p;
int *f;
int *g;
int fn1(int p1) { return p1 - a; }

int fn2() {
  b = b + 1 - a;
  e = 1 + o + 1518500249;
  d = d + n;
  c = (int)c + g[0];
  b = b + m + 1;
  d = d + p + 1518500249;
  d = d + k - 1;
  c = fn1(c + j + 1518500249);
  e = fn1(e + i + 1);
  d = d + h + 1859775393 - a;
  c = fn1(c + (d ^ 1 ^ b) + g[1] + 1);
  b = fn1(b + m + 3);
  d = fn1(d + l + 1);
  b = b + (c ^ 1) + p + 1;
  e = fn1(e + (b ^ c ^ d) + n + 1);
  d = o;
  b = 0;
  e = e + k + 1859775393;
  f[0] = e;
}

I'm not sure how this could be fixed in LRA; again, this is related to
ALL_REGS for allocnos. Perhaps changing the class for reloads to the spill
class in LRA would do the trick, but it may cause other problems.
My last attempt was to increase the cost of FP_REGS in IRA for integral modes
(which would have a similar effect to increasing the costs of moving FP<->GR
in the backend; see the sketch below), but the cost pass looks complicated and
I'm not entirely sure where to tweak it.
Any suggestions/ideas?
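
For concreteness, the backend-side alternative (making integer moves between
the FP and GP register files look expensive) would be something along these
lines. Again, this is only a sketch rather than the real mips.c hook; the
function name and the cost constants are invented:

static int
sketch_register_move_cost (machine_mode mode, reg_class_t from_i,
                           reg_class_t to_i)
{
  enum reg_class from = (enum reg_class) from_i;
  enum reg_class to = (enum reg_class) to_i;

  /* Does the move cross between the FP and the integer register file?  */
  bool crosses = (reg_classes_intersect_p (from, FP_REGS)
                  != reg_classes_intersect_p (to, FP_REGS));

  /* Penalize cross-file moves for integer modes so that the cost pass
     stops treating FP_REGS as a cheap home for integer pseudos.  */
  if (crosses && GET_MODE_CLASS (mode) == MODE_INT)
    return 8;

  return 2;
}

#undef TARGET_REGISTER_MOVE_COST
#define TARGET_REGISTER_MOVE_COST sketch_register_move_cost

Tweaking the hook itself is straightforward; my concern is the IRA cost pass,
where I do not see an obvious place to apply the equivalent change.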

> I tried reverting the ALL_REGS patch and I don't see any regressions - in
> fact allocations are slightly better (fewer registers with ALL_REGS
> preference which is what we need - a strong decision to allocate to either
> FP or int regs). So what was the motivation for it?

AFAICS, the aim was to fix the code generation regression for x86, and x86
does not seem to be affected as much as other targets. I did not notice any
code size differences with -O2 and the default arch for the
x86_64-unknown-linux-gnu triplet on the CSiBE benchmark; -Os showed some minor
improvements/regressions, with the largest difference in mpeg2dec-0.3.1
yielding a ~0.3% improvement. I haven't evaluated performance changes.

For MIPS, I also saw allocation improvements, though more erratic than on x86,
with an improvement of about 0.5% on average. Reverting the patch does bring
the old issue back, but I wonder what its impact is and whether the revert is
justifiable, i.e. whether its benefits outweigh the disadvantages. Or maybe
the original problem could be fixed differently?
