https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95018

--- Comment #30 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Thomas Koenig from comment #29)
> It is also interesting that this variant
> 
> --- a/libgfortran/generated/in_pack_i4.c
> +++ b/libgfortran/generated/in_pack_i4.c
> @@ -88,7 +88,7 @@ internal_pack_4 (gfc_array_i4 * source)
>        count[0]++;
>        /* Advance to the next source element.  */
>        index_type n = 0;
> -      while (count[n] == extent[n])
> +      while (n < dim && count[n] == extent[n])
>          {
>            /* When we get to the end of a dimension, reset it and increment
>               the next dimension.  */
> @@ -100,7 +100,6 @@ internal_pack_4 (gfc_array_i4 * source)
>            if (n == dim)
>              {
>                src = NULL;
> -              break;
>              }
>            else
>              {
> 
> does not get peeled.

More optimal would be

        count[0]--;
        /* Advance to the next source element.  */
        index_type n = 0;
        while (count[n] == 0)
          {
...
          }
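
For illustration, here is a minimal self-contained sketch of such a count-down walk. It is not the libgfortran code: the function name walk_countdown, the limit of 8 dimensions and the out array are assumptions made only for this example. Each counter starts at its extent and is reset to the extent when it reaches zero, so the carry test compares against the constant 0 instead of loading extent[n].

```c
#include <stddef.h>

/* Hypothetical helper (not part of libgfortran): walk a rank-`dim` array
   with count-down counters and record the element offset of every visited
   element in `out`.  Returns the number of elements visited.
   Assumption for this sketch: 1 <= dim <= 8.  */
size_t
walk_countdown (const ptrdiff_t *extent, const ptrdiff_t *stride,
                int dim, ptrdiff_t *out)
{
  ptrdiff_t count[8];
  ptrdiff_t off = 0;
  size_t visited = 0;

  if (dim < 1 || dim > 8)
    return 0;
  for (int n = 0; n < dim; n++)
    {
      if (extent[n] <= 0)
        return 0;               /* empty array, nothing to visit */
      count[n] = extent[n];     /* counters run from extent[n] down to 0 */
    }

  for (;;)
    {
      out[visited++] = off;
      /* Advance to the next source element.  */
      off += stride[0];
      count[0]--;
      int n = 0;
      while (count[n] == 0)     /* compare against zero, no extent load */
        {
          /* End of this dimension: reset it and carry into the next.  */
          count[n] = extent[n];
          off -= stride[n] * extent[n];
          n++;
          if (n == dim)
            return visited;
          count[n]--;
          off += stride[n];
        }
    }
}
```

With extents {2, 3} and strides {1, 10} this visits the offsets in the same order as the count-up loop would.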

Note that completely peeling the inner loop isn't as bad as it looks; it
basically turns the whole loop into

  while (1)
    {
      for (count[0] = 0; count[0] < extent[0]; ++count[0])
        {
          /* Copy the data.  */
          *(dest++) = *src;
          /* Advance to the next element.  */
          src += stride0;
        }
      if (dim == 1)
        break;
      count[0] = 0;
      src -= stride[0] * extent[0];
      count[1]++;
      if (count[1] != extent[1])
        continue;
      if (dim == 2)
        break;
      count[1] = 0;
      src -= stride[1] * extent[1];
      count[2]++;
      if (count[2] != extent[2])
        continue;
      if (dim == 3)
        break;
...
    }

which should be quite optimal for speed (branch-prediction wise).  One
might want to try to optimize code size a bit, sure.  Sacrificing a bit
of speed at the loop exit, one could set extent[n > dim] = 1 and
drop the if (dim == N) break; checks, leaving just the last.
Likewise, changing the iteration to count from extent[N] down to zero
might make the tests smaller.  Then, as commented in the code, pre-computing
the products might help as well - you get one additional load, of course.
Interleaving the extent and product data arrays would help cache
locality.
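
A rough sketch of how these ideas could fit together, under stated assumptions: the rank limit SKETCH_RANK of 3, the struct dim_info layout and the function pack_padded are all hypothetical and not taken from libgfortran. The caller is assumed to pad unused dimensions with extent 1 (stride and rewind 0), so the per-dimension if (dim == N) checks disappear and only the last exit test remains. The counters run down to zero, and the product stride * extent is stored next to the extent.

```c
#include <stddef.h>

#define SKETCH_RANK 3   /* hypothetical fixed maximum rank for this sketch */

/* Hypothetical per-dimension record.  Keeping the extent next to the
   pre-computed product means both are found in the same place in memory.  */
struct dim_info
{
  ptrdiff_t extent;   /* number of elements; 1 for padded dimensions */
  ptrdiff_t stride;   /* step between elements; 0 for padded dimensions */
  ptrdiff_t rewind;   /* pre-computed stride * extent */
};

/* Copy a strided array described by d[0..SKETCH_RANK-1] into the
   contiguous buffer dest.  Returns the number of elements copied.  */
size_t
pack_padded (int *dest, const int *src, const struct dim_info *d)
{
  ptrdiff_t count1 = d[1].extent;
  ptrdiff_t count2 = d[2].extent;
  ptrdiff_t off = 0;
  size_t k = 0;

  for (int n = 0; n < SKETCH_RANK; n++)
    if (d[n].extent <= 0)
      return 0;                 /* empty array */

  while (1)
    {
      for (ptrdiff_t i = 0; i < d[0].extent; i++)
        {
          /* Copy the data and advance to the next element.  */
          dest[k++] = src[off];
          off += d[0].stride;
        }
      off -= d[0].rewind;
      off += d[1].stride;
      if (--count1 != 0)
        continue;
      count1 = d[1].extent;
      off -= d[1].rewind;
      off += d[2].stride;
      if (--count2 != 0)
        continue;
      break;                    /* the only remaining exit test */
    }
  return k;
}
```

For a rank-2 source the third record would be { 1, 0, 0 }; the walk then falls through the padded dimension once at the very end, which is the small exit cost mentioned above.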

Note that writing the loop as above will make GCC recognize it as a loop
nest.
