https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82189

            Bug ID: 82189
           Summary: Two level SLP needed
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pinskia at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

Take:
void f(float *restrict a, float * restrict b, float * restrict c, float t)
{
  int i = 0 ;
  a[i] = b[i]/t;
  a[i+1] = b[i+1]/t;
  a[i+2] = c[i]/t;
  a[i+3] = c[i+1]/t;
}

Right now we do SLP once (at -O3) and produce:
f:
        dup     v2.2s, v0.s[0]
        ldr     d1, [x1]
        ldr     d0, [x2]
        fdiv    v1.2s, v1.2s, v2.2s
        fdiv    v0.2s, v0.2s, v2.2s
        stp     d1, d0, [x0]
        ret

But it might be better do:
f:
        dup     v2.4s, v0.s[0]
        ldr     d0, [x1]
        ldr     d1, [x2]
        ins     v0.2d[1], v1.2d[0]
        fdiv    v0.4s, v0.4s, v2.4s
        str     q0, [x0]
        ret

Mainly because two div is usually not pipelined.

Reply via email to