https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95018
--- Comment #30 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Thomas Koenig from comment #29)
> It is also interesting that this variant
>
> --- a/libgfortran/generated/in_pack_i4.c
> +++ b/libgfortran/generated/in_pack_i4.c
> @@ -88,7 +88,7 @@ internal_pack_4 (gfc_array_i4 * source)
> count[0]++;
> /* Advance to the next source element. */
> index_type n = 0;
> - while (count[n] == extent[n])
> + while (n < dim && count[n] == extent[n])
> {
> /* When we get to the end of a dimension, reset it and increment
> the next dimension. */
> @@ -100,7 +100,6 @@ internal_pack_4 (gfc_array_i4 * source)
> if (n == dim)
> {
> src = NULL;
> - break;
> }
> else
> {
>
> does not get peeled.
More optimal would be

  count[0]--;
  /* Advance to the next source element. */
  index_type n = 0;
  while (count[n] == 0)
    {
      ...
    }
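
To make the count-down idea concrete, here is a minimal standalone
sketch. It is not the libgfortran source: pack_countdown, MAX_RANK and
the plain int element type are hypothetical names chosen for the
example, strides are in elements, and all extents are assumed to be
greater than zero. The counters start at extent[n] and the carry test
compares against zero instead of reloading extent[n].

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RANK 3   /* Assumption for this sketch only.  */

/* Hypothetical helper: copy a strided array of rank DIM into
   contiguous storage, counting each dimension down to zero.  */
void
pack_countdown (int *dest, const int *src, int dim,
                const ptrdiff_t *extent, const ptrdiff_t *stride)
{
  ptrdiff_t count[MAX_RANK];

  assert (src != NULL && dim >= 1 && dim <= MAX_RANK);
  for (int n = 0; n < dim; n++)
    count[n] = extent[n];

  while (src)
    {
      /* Copy the data.  */
      *(dest++) = *src;
      /* Advance to the next element.  */
      src += stride[0];
      count[0]--;
      /* Advance to the next source element.  */
      int n = 0;
      while (count[n] == 0)
        {
          /* End of a dimension: reset it and step the next one.  */
          count[n] = extent[n];
          src -= stride[n] * extent[n];
          n++;
          if (n == dim)
            {
              src = NULL;
              break;
            }
          count[n]--;
          src += stride[n];
        }
    }
}
```

Whether the comparison against zero actually yields smaller or faster
code depends on the target, so this is something to measure rather
than assume.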
Note that completely peeling the inner loop isn't as bad as it looks;
it basically turns the whole loop into
  while (1)
    {
      for (count[0] = 0; count[0] < extent[0]; ++count[0])
        {
          /* Copy the data. */
          *(dest++) = *src;
          /* Advance to the next element. */
          src += stride0;
        }
      if (dim == 1)
        break;
      count[0] = 0;
      src -= stride[0] * extent[0];
      count[1]++;
      if (count[1] != extent[1])
        continue;
      if (dim == 2)
        break;
      count[1] = 0;
      src -= stride[1] * extent[1];
      count[2]++;
      if (count[2] != extent[2])
        continue;
      if (dim == 3)
        break;
      ...
    }
which should be quite optimal for speed (branch-prediction wise). One
might want to optimize code size a bit, sure. At the cost of a bit of
speed at the loop exit, one could set extent[n] = 1 for n >= dim and
drop the if (dim == N) break; checks, leaving just the last one.
Likewise, changing the iteration to count down from extent[N] to zero
might make the tests smaller. Then, as commented in the code,
pre-computing the stride[n] * extent[n] products might help as well,
though you get one additional load of course. Interleaving the extent
and product data arrays would help cache locality.
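
A rough sketch of how the padded-extent and pre-computed-product ideas
could fit together follows. Everything named here (pack_padded,
struct dim_desc, its rewind field, MAX_RANK) is hypothetical and only
serves the illustration; strides are in elements and all extents are
assumed to be greater than zero. Unused dimensions get extent 1 and
stride 0, so only the final exit test remains, and each dimension's
extent sits next to its stride * extent product in one array.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RANK 3   /* Assumption for this sketch only.  */

/* Hypothetical per-dimension record, interleaving the extent with
   the pre-computed product used to rewind the source pointer.  */
struct dim_desc
{
  ptrdiff_t extent;
  ptrdiff_t stride;
  ptrdiff_t rewind;   /* stride * extent */
};

void
pack_padded (int *dest, const int *src, int dim,
             const ptrdiff_t *extent, const ptrdiff_t *stride)
{
  struct dim_desc d[MAX_RANK];
  ptrdiff_t count[MAX_RANK] = { 0, 0, 0 };

  assert (dim >= 1 && dim <= MAX_RANK);
  for (int n = 0; n < MAX_RANK; n++)
    {
      /* Dimensions at or beyond DIM are padded with extent 1.  */
      d[n].extent = n < dim ? extent[n] : 1;
      d[n].stride = n < dim ? stride[n] : 0;
      d[n].rewind = d[n].stride * d[n].extent;
    }

  while (1)
    {
      for (count[0] = 0; count[0] < d[0].extent; ++count[0])
        {
          *(dest++) = *src;
          src += d[0].stride;
        }
      src -= d[0].rewind;
      count[1]++;
      src += d[1].stride;
      if (count[1] != d[1].extent)
        continue;
      count[1] = 0;
      src -= d[1].rewind;
      count[2]++;
      src += d[2].stride;
      if (count[2] != d[2].extent)
        continue;
      /* Only the exit test for the last dimension is left.  */
      break;
    }
}
```

For a rank-1 array the padded dimensions each carry once and fall
through to the final break, which is the small extra cost at loop
exit mentioned above.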
Note that writing the loop as above will make GCC recognize it as a
loop nest.
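
For experimenting with this, a compilable version of the peeled loop
for up to three dimensions might look like the sketch below. The
function name pack_peeled, the int element type and the three
dimension limit are assumptions made for the example. Unlike the
outline above, it also includes the src += stride[n] step that the
original carry loop performs when moving to the next dimension.
Strides are in elements and all extents are assumed to be greater
than zero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not libgfortran's internal_pack_4: copy a
   strided array of rank 1 to 3 into contiguous storage.  */
void
pack_peeled (int *dest, const int *src, int dim,
             const ptrdiff_t *extent, const ptrdiff_t *stride)
{
  ptrdiff_t count[3] = { 0, 0, 0 };
  ptrdiff_t stride0 = stride[0];

  assert (dim >= 1 && dim <= 3);
  while (1)
    {
      for (count[0] = 0; count[0] < extent[0]; ++count[0])
        {
          /* Copy the data.  */
          *(dest++) = *src;
          /* Advance to the next element.  */
          src += stride0;
        }
      if (dim == 1)
        break;
      /* Rewind dimension 0 and step dimension 1.  */
      src -= stride[0] * extent[0];
      count[1]++;
      src += stride[1];
      if (count[1] != extent[1])
        continue;
      if (dim == 2)
        break;
      /* Rewind dimension 1 and step dimension 2.  */
      count[1] = 0;
      src -= stride[1] * extent[1];
      count[2]++;
      src += stride[2];
      if (count[2] != extent[2])
        continue;
      break;
    }
}
```

Compiling this with -O2 and -fdump-tree-optimized is one way to check
how GCC treats the loop structure on a given version.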