https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82189
            Bug ID: 82189
           Summary: Two level SLP needed
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pinskia at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

Take:
void f(float *restrict a, float * restrict b, float * restrict c, float t)
{
  int i = 0 ;
  a[i] = b[i]/t;
  a[i+1] = b[i+1]/t;
  a[i+2] = c[i]/t;
  a[i+3] = c[i+1]/t;
}

Right now we do SLP once (at -O3) and produce:
f:
        dup     v2.2s, v0.s[0]
        ldr     d1, [x1]
        ldr     d0, [x2]
        fdiv    v1.2s, v1.2s, v2.2s
        fdiv    v0.2s, v0.2s, v2.2s
        stp     d1, d0, [x0]
        ret

But it might be better to do:
f:
        dup     v2.4s, v0.s[0]
        ldr     d0, [x1]
        ldr     d1, [x2]
        ins     v0.2d[1], v1.2d[0]
        fdiv    v0.4s, v0.4s, v2.4s
        str     q0, [x0]
        ret

Mainly because the two divides are usually not pipelined, so a single
4-lane fdiv might be cheaper than two 2-lane ones.
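
For illustration only, here is a portable C sketch of the second-level
grouping written out by hand. The names f_two_level and check_variants are
hypothetical and are not part of GCC or of the original test case. The
comments map each step to the instruction it is meant to correspond to in
the proposed sequence; whether a given compiler actually emits that sequence
for this source is not something the sketch guarantees.

```c
#include <assert.h>
#include <string.h>

/* The function from the report, unchanged. */
void f(float *restrict a, float *restrict b, float *restrict c, float t)
{
  int i = 0;
  a[i] = b[i]/t;
  a[i+1] = b[i+1]/t;
  a[i+2] = c[i]/t;
  a[i+3] = c[i+1]/t;
}

/* Hypothetical hand-written form of the "two level" grouping: gather the
   two 2-element loads into one 4-element temporary, divide all four lanes
   together, then store the result as a single 4-element block.  */
void f_two_level(float *restrict a, const float *restrict b,
                 const float *restrict c, float t)
{
  float v[4];
  memcpy(v, b, 2 * sizeof(float));      /* ldr d0, [x1] */
  memcpy(v + 2, c, 2 * sizeof(float));  /* ldr d1, [x2] and the ins */
  for (int k = 0; k < 4; k++)
    v[k] /= t;                          /* intended as one 4-lane fdiv */
  memcpy(a, v, sizeof v);               /* str q0, [x0] */
}

/* Checks that both variants compute the same four results.  The inputs
   are chosen so that every quotient is exactly representable.  */
void check_variants(void)
{
  float b[2] = { 1.0f, 2.0f };
  float c[2] = { 3.0f, 4.0f };
  float r1[4], r2[4];

  f(r1, b, c, 2.0f);
  f_two_level(r2, b, c, 2.0f);
  for (int k = 0; k < 4; k++)
    assert(r1[k] == r2[k]);
}
```

Both variants perform the same four divisions, so they should agree lane for
lane; the only difference is that the second one makes the 4-wide group
explicit in the source instead of relying on SLP to find it.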