https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #17 from ktkachov at gcc dot gnu.org ---
I played around with the source to do some conservative 2x manual unrolling in
the two hottest functions in 510.parest_r (3 more-or-less identical tight FMA
loops). This was to try out Richard's thinking suggestion in #c10 about
unrolling for forming load-pairs, and also to break the accumulator dependency.

So the initial testcase now became:
unsigned int *colnums;
double *val;

struct foostruct
{
  unsigned int rows;
  unsigned int *colnums;
  unsigned int *rowstart;
};

struct foostruct *cols;

void
foo (double * __restrict__ dst, const double *__restrict__ src)
{
  const unsigned int n_rows = cols->rows;
  const double *val_ptr = &val[cols->rowstart[0]];
  const unsigned int *colnum_ptr = &cols->colnums[cols->rowstart[0]];  

  double *dst_ptr = dst;
  for (unsigned int row=0; row<n_rows; ++row)
    {
      double s = 0.;
      const double *const val_end_of_row = &val[cols->rowstart[row+1]];
      __PTRDIFF_TYPE__ diff = val_end_of_row - val_ptr;

      if (diff & 1) // Peel the odd iteration.
        s += *val_ptr++ * src[*colnum_ptr++];

      double s1 = 0.; // Second accumulator
      while (val_ptr != val_end_of_row)
        {
          s += val_ptr[0] * src[colnum_ptr[0]];
          s1 += val_ptr[1] * src[colnum_ptr[1]];
          val_ptr += 2;
          colnum_ptr += 2;
        }
      *dst_ptr++ = s + s1;
    }
}

This transformed the initial loop from:
.L4:
        ldr     w3, [x7, x2, lsl 2]
        cmp     x6, x2
        ldr     d2, [x5, x2, lsl 3]
        add     x2, x2, 1
        ldr     d1, [x1, x3, lsl 3]
        fmadd   d0, d2, d1, d0
        bne     .L4

into:
.L5:
        ldp     w6, w5, [x3] // LDP
        add     x3, x3, 8
        ldp     d5, d3, [x2]  // LDP
        add     x2, x2, 16
        ldr     d4, [x1, x6, lsl 3]
        cmp     x4, x2
        ldr     d2, [x1, x5, lsl 3]
        fmadd   d0, d5, d4, d0
        fmadd   d1, d3, d2, d1
        bne     .L5

In parest itself a few of the loops transformed this way did not form LDPs due
to unrelated LDP-forming inefficiencies but the most did.
This transformation gave a 3% improvement on a Cortex-A72. There are more
similar loops in the 3rd, 4th and 5th hottest functions in that benchmark, so I
suspect if we do something intelligent there as well we'll get even more
sizeable gains.

So rather than solving general "unrolling", how about we break this down into
two desirable transformations:
1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's
suggestion)
2) Unrolling and breaking accumulator dependencies.

I think more general unrolling and the peeling associated with it can be
discussed independently of 1) and 2) once we collect more data on more
microarchitectures.

Reply via email to