https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95018
--- Comment #30 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Thomas Koenig from comment #29)
> It is also interesting that this variant
>
> --- a/libgfortran/generated/in_pack_i4.c
> +++ b/libgfortran/generated/in_pack_i4.c
> @@ -88,7 +88,7 @@ internal_pack_4 (gfc_array_i4 * source)
>        count[0]++;
>        /* Advance to the next source element.  */
>        index_type n = 0;
> -      while (count[n] == extent[n])
> +      while (n < dim && count[n] == extent[n])
>          {
>            /* When we get to the end of a dimension, reset it and increment
>               the next dimension.  */
> @@ -100,7 +100,6 @@ internal_pack_4 (gfc_array_i4 * source)
>            if (n == dim)
>              {
>                src = NULL;
> -              break;
>              }
>            else
>              {
>
> does not get peeled.

More optimal would be

       count[0]--;
       /* Advance to the next source element.  */
       index_type n = 0;
       while (count[n] == 0)
         {
           ...
         }

Note that completely peeling the inner loop isn't as bad as it looks; it
basically turns the whole loop into

  while (1)
    {
      for (count[0] = 0; count[0] < extent[0]; ++count[0])
        {
          /* Copy the data.  */
          *(dest++) = *src;
          /* Advance to the next element.  */
          src += stride0;
        }
      if (dim == 1)
        break;
      count[0] = 0;
      src -= stride[0] * extent[0];
      count[1]++;
      if (count[1] != extent[1])
        continue;
      if (dim == 2)
        break;
      count[1] = 0;
      src -= stride[1] * extent[1];
      count[2]++;
      if (count[2] != extent[2])
        continue;
      if (dim == 3)
        break;
      ...
    }

which should be quite optimal for speed (branch-prediction wise). One
might want to try to optimize code size a bit, sure. One option that
sacrifices a bit of speed at the loop exit is setting extent[n > dim] = 1
and dropping the if (dim == N) break; checks, leaving just the last.
Likewise, counting down from extent[N] to zero instead of up might make
the tests smaller. Then, as commented in the code, pre-computing the
products might help as well, at the cost of one additional load.
Interleaving the extent and product data arrays would help cache locality.

Note that writing the loop as above will make GCC recognize it as a loop
nest.
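
For anyone who wants to experiment with the peeled form outside of libgfortran, here is a minimal, self-contained sketch of it, capped at three dimensions. The names pack_peeled and MAX_DIM are hypothetical and not taken from in_pack_i4.c. The sketch uses an integer offset instead of moving the src pointer, and it assumes that the carry into the next dimension also advances by stride[n], a step the abbreviated loop above leaves out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DIM 3   /* hypothetical cap for this sketch */

/* Copy a strided array of rank DIM (1..MAX_DIM) from BASE into the
   contiguous buffer DEST, using the fully peeled carry chain.
   Returns the number of elements copied.  */
size_t
pack_peeled (int *dest, const int *base, int dim,
             const ptrdiff_t *extent, const ptrdiff_t *stride)
{
  ptrdiff_t count[MAX_DIM] = { 0 };
  ptrdiff_t off = 0;            /* offset into BASE instead of a moving pointer */
  size_t copied = 0;

  assert (dim >= 1 && dim <= MAX_DIM);
  for (int n = 0; n < dim; n++)
    if (extent[n] <= 0)
      return 0;                 /* empty array, nothing to copy */

  while (1)
    {
      /* Innermost dimension: a plain counted loop.  */
      for (count[0] = 0; count[0] < extent[0]; ++count[0])
        {
          dest[copied++] = base[off];
          off += stride[0];
        }
      if (dim == 1)
        break;

      /* Carry into dimension 1.  */
      off -= stride[0] * extent[0];
      count[1]++;
      off += stride[1];         /* assumed step, elided in the sketch above */
      if (count[1] != extent[1])
        continue;
      if (dim == 2)
        break;

      /* Carry into dimension 2.  */
      count[1] = 0;
      off -= stride[1] * extent[1];
      count[2]++;
      off += stride[2];
      if (count[2] != extent[2])
        continue;
      break;                    /* dim == MAX_DIM: all elements copied */
    }
  return copied;
}
```

Each carry step is a straight-line test that is almost always not taken, which is the branch-prediction argument made above. Supporting more dimensions means repeating the carry block once per extra dimension, and that repetition is where the code-size concern comes from.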