https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118028

            Bug ID: 118028
           Summary: A better vectorized reduction across multi-level
                    loop-nest
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

Consider the following case:

  int foo(const int *array)
  {
    int sum = 0;

    #pragma GCC unroll 0
    for (int i = 0; i < 32; ++i) {

      #pragma GCC unroll 0
      for (int j = 0; j < 32; ++j) {
        sum += array[i * 64 + j];
      }
    }

    return sum;
  }

The accumulation into "sum" is vectorized in the inner loop. Since the outer
loop remains scalar, the current approach reduces the vectorized "sum" back to
a scalar via .REDUC_PLUS each time execution returns to the outer loop, as:

  int sum = 0;

  for (int i = 0; i < 32; ++i) {

    vector(4) int v_sum = { sum, 0, 0, 0 };

    for (int j = 0; j < 32; j += 4) {
      v_sum += *(vector(4) int *)(&array[i * 64 + j]);
    }

    sum = .REDUC_PLUS(v_sum);
  }

Because "sum" has no use other than carrying its value into the next round of
accumulation in the inner loop, the vector-to-scalar reduction of "sum" in the
outer loop is not really needed. A more efficient approach is to move that
computation to a point after the exit of the whole loop nest, as:

  vector(4) int v_sum = { 0, 0, 0, 0 };

  for (int i = 0; i < 32; ++i) {

    for (int j = 0; j < 32; j += 4) {
      v_sum += *(vector(4) int *)(&array[i * 64 + j]);
    }
  }

  sum = .REDUC_PLUS(v_sum);
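
For reference, the two forms can be written out by hand and compared. The
following is only a sketch, assuming the GCC/Clang vector_size extension; the
names v4si, reduc_plus, sum_scalar, sum_reduce_per_row and sum_reduce_once are
hypothetical and chosen for illustration, and reduc_plus is a plain lane-wise
add standing in for .REDUC_PLUS, not the internal function itself.

```c
#include <string.h>

/* GCC/Clang vector extension: four 32-bit ints in one 16-byte vector.
   (Assumption: this stands in for "vector(4) int" in the report.) */
typedef int v4si __attribute__((vector_size(16)));

/* Hypothetical stand-in for .REDUC_PLUS: add the four lanes together. */
static int reduc_plus(v4si v)
{
  return v[0] + v[1] + v[2] + v[3];
}

/* Load four ints without assuming 16-byte alignment of the source. */
static v4si load4(const int *p)
{
  v4si v;
  memcpy(&v, p, sizeof v);
  return v;
}

/* Scalar reference, same loop nest as foo() in the report. */
int sum_scalar(const int *array)
{
  int sum = 0;
  for (int i = 0; i < 32; ++i)
    for (int j = 0; j < 32; ++j)
      sum += array[i * 64 + j];
  return sum;
}

/* Current form: one vector-to-scalar reduction per outer iteration.
   The running total is seeded into lane 0, so the reduction result
   replaces "sum" rather than being added to it. */
int sum_reduce_per_row(const int *array)
{
  int sum = 0;
  for (int i = 0; i < 32; ++i) {
    v4si v_sum = { sum, 0, 0, 0 };
    for (int j = 0; j < 32; j += 4)
      v_sum += load4(&array[i * 64 + j]);
    sum = reduc_plus(v_sum);
  }
  return sum;
}

/* Proposed form: keep the vector accumulator live across the outer
   loop and reduce once after the whole loop nest. */
int sum_reduce_once(const int *array)
{
  v4si v_sum = { 0, 0, 0, 0 };
  for (int i = 0; i < 32; ++i)
    for (int j = 0; j < 32; j += 4)
      v_sum += load4(&array[i * 64 + j]);
  return reduc_plus(v_sum);
}
```

All three functions should return the same value for inputs that do not
overflow "int"; the hand-written proposed form performs 1 reduction instead of
32.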
