https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95018

--- Comment #19 from Richard Biener <rguenth at gcc dot gnu.org> ---
Is libgfortran built with -O2 -funroll-loops or with -O3 (IIRC -O3?).  Note we
see

Estimating sizes for loop 3
 BB: 14, after_exit: 0
  size:   1 _20 = count[n_95];
  size:   1 _21 = _20 + 1;
  size:   1 count[n_95] = _21;
  size:   1 _22 = stride[n_95];
  size:   0 _23 = (long unsigned int) _22;
  size:   1 _44 = _23 - _82;
  size:   1 _45 = _44 * 4;
  size:   1 src_62 = src_85 + _45;
  size:   1 _25 = extent[n_95];
  size:   2 if (_21 == _25)
 BB: 20, after_exit: 1
 BB: 13, after_exit: 0
  size:   1 count[n_95] = 0;
  size:   1 _18 = _22 * _25;
  size:   0 _19 = (long unsigned int) _18;
  size:   1 n_60 = n_95 + 1;
   Induction variable computation will be folded away.
  size:   2 if (dim_43 == n_60)
   Exit condition will be eliminated in last copy.
size: 15-1, last_iteration: 15-3
  Loop size: 15
  Estimated size after unrolling: 129
Making edge 13->20 impossible by redistributing probability to other edges.
../../../trunk/libgfortran/generated/in_pack_i4.c:100:14: optimized: loop with
13 iterations completely unrolled (header execution count 23565294)
Last iteration exit edge was proved true.
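
As a sanity check on the numbers in this dump, the sketch below reproduces the
"Estimated size after unrolling" figure from the "size: 15-1, last_iteration:
15-3" line. The formula is an assumption, not copied from the GCC sources: it
takes the per-copy size minus what peeling eliminates, adds the last iteration,
and scales by 2/3 for expected later cleanup. The function and parameter names
are made up for illustration.

```c
/* Hypothetical helper: an assumed model of the unroll size estimate.
   body/body_elim come from "size: A-B", last/last_elim from
   "last_iteration: C-D", copies is the number of unrolled iterations.  */
long
estimate_unrolled_size (long body, long body_elim,
                        long last, long last_elim, long copies)
{
  long insns = copies * (body - body_elim) + (last - last_elim);

  /* Assumed 2/3 discount for simplification after unrolling.  */
  insns = insns * 2 / 3;
  return insns > 0 ? insns : 1;
}
```

With the values above and 13 copies this gives (13 * 14 + 12) * 2 / 3 = 129,
which matches the dump.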

Note even with the rs6000 limits turned back to default I see the loop
unrolled (with -O3 or -O2 -funroll-loops).

Checking on x86_64 the file is compiled with -O2 only and we have

size: 17-1, last_iteration: 10-3
  Loop size: 17
  Estimated size after unrolling: 154
Not unrolling loop 3: size would grow.

so what's special about POWER?  Unrolling that grows code should only be
allowed with -O3.  Given we have only a guessed profile (and that does not
detect the inner loop as completely cold) we're allowing growth then.  GCC has
no idea the outer loop iterates more than the inner.

Note re-structuring the loop to use down-counting count[] from extent[] to zero
would be worth experimenting with, likewise "peeling" the dim == 0 loop
and not making the outermost loop key on 'src' (can 'src' be NULL on entry?).
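
A minimal sketch of what such a restructuring could look like, assuming a
simplified pack routine. This is not the in_pack_i4.c code: the name
pack_downcount, the MAX_RANK bound and the element-unit strides are all
hypothetical. It keeps a separate innermost (dimension 0) loop, counts the
outer dimensions down from extent[] to zero so the exit test is a compare
against zero, and terminates with an explicit return rather than keying on
'src'.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RANK 15   /* hypothetical upper bound on the rank */

/* Copy a strided rank-'dim' int array into contiguous 'dest'.
   Strides are in elements; all extents are assumed positive.  */
void
pack_downcount (int *dest, const int *src, int dim,
                const ptrdiff_t *extent, const ptrdiff_t *stride)
{
  ptrdiff_t remaining[MAX_RANK];
  ptrdiff_t base = 0;   /* offset of the current innermost row */

  assert (dim >= 1 && dim <= MAX_RANK);
  for (int n = 0; n < dim; n++)
    {
      assert (extent[n] > 0);
      remaining[n] = extent[n];
    }

  for (;;)
    {
      /* Separate dimension-0 loop, counting down to zero.  */
      ptrdiff_t off = base;
      for (ptrdiff_t i = extent[0]; i > 0; i--)
        {
          *dest++ = src[off];
          off += stride[0];
        }

      /* Advance the outer dimensions with down-counting counters.  */
      for (int n = 1; ; n++)
        {
          if (n == dim)
            return;                     /* all dimensions exhausted */
          base += stride[n];
          if (--remaining[n] != 0)
            break;                      /* still inside dimension n */
          base -= stride[n] * extent[n];  /* rewind and carry */
          remaining[n] = extent[n];
        }
    }
}
```

Whether this actually helps would need measuring; the sketch only shows the
shape of the loop nest.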

Anyway, completely peeling this loop looks useless - the only benefit
might be better branch prediction (each dimension gets its own entry
in the predictor cache).

If POWER cannot cope with large loops then I wonder why POWER people
increased limits (though even the default limits would unroll the loop).

Thomas - where did you measure the slowness?  For which dimensionality?
I'm quite sure the loop structure will be sub-optimal for certain
input shapes... (stride0 == 1 could even use memcpy for the inner dimension).
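
To illustrate the stride0 == 1 remark: a sketch of an innermost-dimension copy
that falls back to memcpy when the elements are contiguous. The helper name
copy_inner_dim and its signature are hypothetical, not part of libgfortran, and
the stride is assumed to be in elements.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy 'extent0' ints from 'src' (element stride 'stride0') to 'dest'.  */
void
copy_inner_dim (int *dest, const int *src,
                ptrdiff_t extent0, ptrdiff_t stride0)
{
  assert (extent0 >= 0);

  if (stride0 == 1)
    {
      /* Contiguous inner dimension: one block copy.  */
      memcpy (dest, src, (size_t) extent0 * sizeof *dest);
      return;
    }

  /* General strided case.  */
  for (ptrdiff_t i = 0; i < extent0; i++)
    dest[i] = src[i * stride0];
}
```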
