https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95018
--- Comment #19 from Richard Biener <rguenth at gcc dot gnu.org> ---

Is libgfortran built with -O2 -funroll-loops or with -O3 (IIRC -O3?).

Note we see

  Estimating sizes for loop 3
   BB: 14, after_exit: 0
    size: 1 _20 = count[n_95];
    size: 1 _21 = _20 + 1;
    size: 1 count[n_95] = _21;
    size: 1 _22 = stride[n_95];
    size: 0 _23 = (long unsigned int) _22;
    size: 1 _44 = _23 - _82;
    size: 1 _45 = _44 * 4;
    size: 1 src_62 = src_85 + _45;
    size: 1 _25 = extent[n_95];
    size: 2 if (_21 == _25)
   BB: 20, after_exit: 1
   BB: 13, after_exit: 0
    size: 1 count[n_95] = 0;
    size: 1 _18 = _22 * _25;
    size: 0 _19 = (long unsigned int) _18;
    size: 1 n_60 = n_95 + 1;
     Induction variable computation will be folded away.
    size: 2 if (dim_43 == n_60)
     Exit condition will be eliminated in last copy.
  size: 15-1, last_iteration: 15-3
    Loop size: 15
    Estimated size after unrolling: 129
  Making edge 13->20 impossible by redistributing probability to other edges.
  ../../../trunk/libgfortran/generated/in_pack_i4.c:100:14: optimized: loop with 13 iterations completely unrolled (header execution count 23565294)
  Last iteration exit edge was proved true.

Note even with the rs6000 limits turned back to default I see the loop unrolled (with -O3 or -O2 -funroll-loops).

Checking on x86_64 the file is compiled with -O2 only and we have

  size: 17-1, last_iteration: 10-3
    Loop size: 17
    Estimated size after unrolling: 154
  Not unrolling loop 3: size would grow.

so what's the speciality on POWER? Code growth should trigger with -O3 only. Given we have only a guessed profile (and that does not detect the inner loop as completely cold) we're allowing growth then. GCC has no idea the outer loop iterates more than the inner.

Note re-structuring the loop to use down-counting count[] from extent[] to zero would be worth experimenting with, likewise "peeling" the dim == 0 loop and not making the outermost loop key on 'src' (can 'src' be NULL on entry?).
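To make the restructuring idea concrete, here is a rough sketch of what a down-counting variant might look like. It is not the code from in_pack_i4.c: the function name pack_i4_downcount, the MAX_RANK constant, the plain int element type and the index-based addressing are all assumptions made for illustration. The dimension 0 loop is peeled into its own simple loop, count[] runs from extent[] down to zero, and the outer loop terminates on the dimension counter rather than on 'src'.

```c
#include <stddef.h>

#define MAX_RANK 15  /* hypothetical upper bound on the rank, for this sketch only */

/* Pack a strided rank-`dim` array of int into contiguous `dest`.
   Strides are in elements. Simplified illustration, not libgfortran code. */
void pack_i4_downcount(int *dest, const int *src, int dim,
                       const ptrdiff_t *extent, const ptrdiff_t *stride)
{
    ptrdiff_t count[MAX_RANK];

    if (dim < 1 || dim > MAX_RANK)
        return;
    for (int n = 0; n < dim; n++) {
        if (extent[n] <= 0)
            return;               /* empty array: nothing to copy */
        count[n] = extent[n];     /* counters start at extent and count down */
    }

    ptrdiff_t base = 0;           /* element offset of the current innermost row */
    for (;;) {
        /* Peeled dimension 0 loop: a plain counted loop with no carry logic. */
        ptrdiff_t off = base;
        for (ptrdiff_t i = extent[0]; i != 0; i--) {
            *dest++ = src[off];
            off += stride[0];
        }

        /* Advance the outer dimensions like an odometer. */
        int n = 1;
        for (;;) {
            if (n == dim)
                return;           /* every dimension wrapped: done */
            base += stride[n];
            if (--count[n] != 0)
                break;            /* compare against zero, not against extent[n] */
            count[n] = extent[n]; /* wrap: reset the counter and rewind the offset */
            base -= stride[n] * extent[n];
            n++;
        }
    }
}
```

For example, with extent = {3, 2} and stride = {2, 6} the sketch copies source elements 0, 2, 4, 6, 8 and 10 in that order. Whether this shape actually helps the unroller's size estimate or the generated code on POWER would still need to be measured.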
Anyway, completely peeling this loop looks useless - the only benefit might be better branch prediction (each dimension gets its own entry in the predictor cache). If POWER cannot cope with large loops then I wonder why POWER people increased limits (though even the default limits would unroll the loop).

Thomas - where did you measure the slowness? For which dimensionality? I'm quite sure the loop structure will be sub-optimal for certain input shapes... (stride0 == 1 could even use memcpy for the inner dimension).
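As an illustration of the stride0 == 1 remark, the inner dimension could be handled along these lines. This is only a sketch: copy_inner_row is a hypothetical helper, not an existing libgfortran function, and it assumes int elements, strides counted in elements, and non-overlapping source and destination.

```c
#include <stddef.h>
#include <string.h>

/* Copy one innermost row of `extent0` ints from `src` into contiguous `dest`.
   Hypothetical helper for illustration; strides are in elements. */
void copy_inner_row(int *dest, const int *src,
                    ptrdiff_t extent0, ptrdiff_t stride0)
{
    if (extent0 <= 0)
        return;

    if (stride0 == 1) {
        /* The row is already contiguous: one block copy replaces the loop. */
        memcpy(dest, src, (size_t)extent0 * sizeof *dest);
        return;
    }

    /* General case: gather the strided elements one by one. */
    for (ptrdiff_t i = 0; i < extent0; i++)
        dest[i] = src[i * stride0];
}
```

The check on stride0 is made once per row, so its cost should be small compared to the copy for all but very short rows; where the break-even point lies is something to measure rather than assume.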