------- Comment #25 from matz at gcc dot gnu dot org 2009-12-13 23:48 -------
The reason that the testcase is still slow (and that the inner loop isn't unrolled or vectorized) is still the calculation of countm1. The division therein stays in the second inner loop, whereas with GCC 4.3 it can be moved into the outer loop. In this specific testcase it's a pass ordering problem. We start with (at .vrp1, only parts shown):

<bb 2>:
  D.1564_45 = *n_9(D);
  if (D.1564_45 > 1)
  ...
<bb 6>:
  D.1572_60 = *n_9(D);
  if (D.1572_60 > 0)
    goto <bb 7>;
  else
    goto <bb 8>;

Here _45 and _60 are equivalent, but VRP doesn't know this, hence it doesn't detect the goto <bb 8> as dead. The equivalence is only detected after PRE (not by PRE, though :-/ ), which means VRP2 does detect the jump as dead, and hence leaves only the step>0 case in the code. But this is too late for the late PRE (which runs before VRP2 and the loop optimizers) to move the dependent division to the outer loop.

As the division isn't moved as a loop invariant to the outer loop, the loop count determination doesn't work either, hence no unrolling. But the slowness itself is due to the div instruction in the second loop, instead of in the outer loop as with 4.3.

--

matz at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |matz at gcc dot gnu dot org

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108
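
To make the cost of the unhoisted division concrete, here is a minimal C sketch. It is not the testcase from this PR, and it is not what the compiler emits. The name countm1 is taken from the comment above; the helper's shape (a branch on the sign of step followed by one division), the loop bounds, and the divisions counter are assumptions made purely for illustration. The inner loop count depends only on n and i, so it is invariant with respect to j and can be computed once per outer iteration.

```c
/* Counts how often the division is executed (illustration only). */
static unsigned long divisions;

/* Hypothetical helper: number of iterations minus one of a counted loop
   running from `from` to `to` in increments of `step`.  Assumes step != 0
   and that the loop executes at least once. */
static unsigned countm1(int from, int to, int step)
{
    divisions++;
    if (step > 0)
        return ((unsigned)to - (unsigned)from) / (unsigned)step;
    else
        return ((unsigned)from - (unsigned)to) / (0u - (unsigned)step);
}

/* The division sits in the second loop: it is redone for every j,
   although its operands only change with i. */
static unsigned long work_unhoisted(int n)
{
    unsigned long total = 0;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++) {
            unsigned c = countm1(1, n, i);  /* loop 1..n with step i */
            total += c + 1;                 /* stands in for the innermost loop */
        }
    return total;
}

/* The same computation with the division moved to the outer loop. */
static unsigned long work_hoisted(int n)
{
    unsigned long total = 0;
    for (int i = 1; i <= n; i++) {
        unsigned c = countm1(1, n, i);      /* once per outer iteration */
        for (int j = 1; j <= n; j++)
            total += c + 1;
    }
    return total;
}
```

For n = 4 both functions return 36, but the first performs 16 divisions and the second only 4. If only the step>0 arm of the helper can be reached, the division has a single form that is easy to recognize as invariant, which is roughly the situation the comment describes after the dead jump is removed.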