On 02/14/2013 03:28 AM, Vladimir Makarov wrote:
On 13-02-13 6:36 PM, Michael Eager wrote:

[snip]

I thought about register pressure causing this, but I think that should cause spilling of one of the registers which were not used in this long sequence,
rather than causing a large number of additional loads.

Longer living pseudos has more probability to be spilled as they usually conflicts with more pseudos during their lives and spilling them frees a hard reg for many conflicting pseudos. That how RA heuristics work (in the old RA log (live range span) was used. The bigger the value, the more probability for spilling).
Perhaps the cost analysis has a problem.

I've checked it and it looks ok to me.
RA focused on generation of faster code. Looking at the fragment you provided it, it is hard to say something about it. I tried -Os for gcc4.8 and it generates desirable code for the fragment in question (by the way the peak register pressure decreased to 66 in this case).

It's both larger and slower, since the additional loads take much longer. I'll take a
look at -Os.

It looks like the values of p++ are being pre-calculated and stored on the stack. This results in
a load, rather than an increment of a register.

If it is so. It might be another optimization which moves p++ calculations higher. IRA does not do it (more correctly a new IRA feature implemented by Bernd Schmidt in gcc4.7 can move insns downwards in CFG graph to decrease reg pressure).

I checked all rtl passes these calcualtions are not created by any RTL pass. So it is probably some tree-ssa optimization.
Any industrial RA uses heuristic algorithms, in some cases better heuristics can work worse than worse heuristics. So you should probably check is there any progress moving from gcc4.1 to gcc4.6 with performance point of view for variety benchmarks. Introducing IRA improves code for x86 4% on SPEC2000. Subsequent improving (like using dynamic register classes) made further performance
improvements.

My impression is that the performance is worse. Reports I've seen are that the code is
substantially larger, which means more instructions.

I'm skeptical about comparisons between x86 and RISC processors. What works well
for one may not work well for the other.

IRA improved code for many RISC processors. Although tetter RA has smaller effect for these processors as they have more registers.
Looking at the test code, I can make some conclusions for myself:

o We need a common pass decreasing reg pressure (I already expressed this in the past) as optimizations become more aggressive. Some progress was made to make few optimizations aware about RA (reg-pressure scheduling, loop-invariant motions, and code hoisting) but there are too many passes and it is wrong and impossible to make them all aware of RA. Some register pressure decreasing heuristics are difficult to implement in RA (like insn rearrangements or complex
rematerialization) and this pass could focus on them.

That might be useful.

o Implement RA live range splitting in regions different from loops or BB (now IRA makes splitting only on loop bounds and LRA in BB, the old RA had no live range splitting at all).

Each of the blocks of code is in it's own BB. I haven't checked, but I'd guess that most of the registers are in use on entry and still live on exit, so the
block has no registers to allocate.

Splitting in BB scope this case is not profitable.
I'd also recommend to try the following options concerning RA: -fira-loop-pressure, -fsched-pressure, -fira-algorithm=CB|priority, -fira-region=one,all,mixed. Actually -fira-algorithm=priority + -fira-region=one is analog of what the old RA did.



I am reading this thread and getting more and more puzzled. The RA stuff is very complicated, having many constraints and many dependencies with other passes. Taking this into account, it seems that no heuristic algorithm can even get close to an optimal register allocation. A heuristic algorithm can't take all effects and side-effects into account
simultaneously.

Considering all that, why GCC does not use generic optimization algorithms for RA? A generic optimization can take all issues into account, simultaneously. I am talking about ILP/MIP (Integer Linear Programming/Mixed IP), SAT and CSP [1] (Constraint Satisfaction Problem). There has been a lot of progress in these areas, and the
solvers are much faster (orders of magnitude) than they were 10 years ago.

So why ILP/SAT/CSP aren't used in RA? I don't think they will work much slower than what RA does today ( they will be somewhat slower but not much). I do believe that there is a good chance to get much better results from RA with these technologies.

I'd like to invest some time into a feasibility check, if someone is willing to work with me on modeling the RA problem (or at least several examples, such as the above).

Michael

[1] Even though CSP and SAT solving algorithms (systematic or stochastic) are not strictly optimization algorithms, they still can be used where optimization is needed. It is very difficult/time consuming to get an optimal solution with them, yet it is possible
      to get a good-enough solution in reasonable time.

Reply via email to