[Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression

matz at gcc dot gnu dot org Sun, 13 Dec 2009 15:48:34 -0800


------- Comment #25 from matz at gcc dot gnu dot org  2009-12-13 23:48 -------
The reason that the testcase still is slow (and that the inner loop isn't
unrolled or vectorized) is still the calculation of countm1.  The division
therein stays in the second inner loop, whereas with GCC 4.3 it can be moved
into the outer loop.  In this specific testcase it's a pass ordering problem:
we start with (at .vrp1) (only parts shown):


<bb 2>:
  D.1564_45 = *n_9(D);
  if (D.1564_45 > 1)
   ...
<bb 6>:
  D.1572_60 = *n_9(D);
  if (D.1572_60 > 0)
    goto <bb 7>;
  else
    goto <bb 8>;

Here _45 and _60 are equivalent, but VRP doesn't know this, hence it doesn't
detect the goto <bb 8> as dead.  The equivalence is only detected after PRE
(not by PRE, though :-/ ), which means VRP2 does detect the jump as dead,
and hence leaves only the step>0 case in the code.  But this is too late for
the late PRE (running before VRP2 and the loop optimizers) to move the
dependent division to the outer loop.

As the division isn't moved to the outer loop as a loop invariant, this also
means that the loop count determination doesn't work, hence no unrolling.

But the slowness itself is due to the div instruction in the second loop,
instead of in the outer loop as with 4.3.
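
To make the effect concrete, here is a rough C sketch of the loop shape
described above.  It is only a hypothetical analogue, not the testcase from
this PR: the names n, nnd, countm1 and the struct result helper are made up
for illustration, and the real countm1 computation in the Fortran front end
differs in detail.  The first function keeps the step-sign check and the
division in the middle loop (roughly what 4.4/4.5 end up with), the second
has the division hoisted to the outer loop (roughly what 4.3 achieves).

```c
/* Hypothetical analogue of the loop nest discussed above; NOT the PR's
   testcase.  All names here are illustrative assumptions. */
struct result { long total; long divs; };

/* Division stays in the middle (j) loop, behind a step-sign check. */
static struct result div_in_middle_loop(const long *n, long nnd)
{
    struct result r = {0, 0};
    if (*n > 1) {                        /* first load of *n */
        for (long i = 2; i <= *n; i++) {
            for (long j = 1; j < i; j++) {
                long step = *n;          /* second load of *n */
                long countm1;
                if (step > 0)            /* redundant if both loads are equal */
                    countm1 = (nnd - i) / step;
                else
                    countm1 = (i - nnd) / -step;
                r.divs++;                /* one division per (i, j) pair */
                for (long c = 0; c <= countm1; c++)
                    r.total++;           /* stand-in for the real loop body */
            }
        }
    }
    return r;
}

/* Division hoisted to the outer (i) loop: it does not depend on j. */
static struct result div_in_outer_loop(const long *n, long nnd)
{
    struct result r = {0, 0};
    long step = *n;                      /* single load, so step > 1 is known */
    if (step > 1) {
        for (long i = 2; i <= step; i++) {
            long countm1 = (nnd - i) / step;
            r.divs++;                    /* one division per i */
            for (long j = 1; j < i; j++)
                for (long c = 0; c <= countm1; c++)
                    r.total++;
        }
    }
    return r;
}
```

Both functions do the same amount of inner work for the same inputs
(assuming nnd >= *n and nothing stores to *n in between); only the number of
divisions executed differs, which is the cost described above.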


-- 

matz at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |matz at gcc dot gnu dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108

[Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
