https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #17 from ktkachov at gcc dot gnu.org ---
I played around with the source to do some conservative 2x manual unrolling in
the two hottest functions in 510.parest_r (3 more-or-less identical tight FMA
loops). This was to try out Richard's suggestion in #c10 about unrolling for
forming load-pairs, and also to break the accumulator dependency.

So the initial testcase now became:

unsigned int *colnums;
double *val;

struct foostruct
{
  unsigned int rows;
  unsigned int *colnums;
  unsigned int *rowstart;
};

struct foostruct *cols;

void
foo (double * __restrict__ dst, const double *__restrict__ src)
{
  const unsigned int n_rows = cols->rows;
  const double *val_ptr = &val[cols->rowstart[0]];
  const unsigned int *colnum_ptr = &cols->colnums[cols->rowstart[0]];

  double *dst_ptr = dst;
  for (unsigned int row = 0; row < n_rows; ++row)
    {
      double s = 0.;
      const double *const val_end_of_row = &val[cols->rowstart[row + 1]];
      __PTRDIFF_TYPE__ diff = val_end_of_row - val_ptr;
      if (diff & 1) // Peel the odd iteration.
        s += *val_ptr++ * src[*colnum_ptr++];

      double s1 = 0.; // Second accumulator
      while (val_ptr != val_end_of_row)
        {
          s += val_ptr[0] * src[colnum_ptr[0]];
          s1 += val_ptr[1] * src[colnum_ptr[1]];
          val_ptr += 2;
          colnum_ptr += 2;
        }
      *dst_ptr++ = s + s1;
    }
}

This transformed the initial loop from:

.L4:
        ldr     w3, [x7, x2, lsl 2]
        cmp     x6, x2
        ldr     d2, [x5, x2, lsl 3]
        add     x2, x2, 1
        ldr     d1, [x1, x3, lsl 3]
        fmadd   d0, d2, d1, d0
        bne     .L4

into:

.L5:
        ldp     w6, w5, [x3]    // LDP
        add     x3, x3, 8
        ldp     d5, d3, [x2]    // LDP
        add     x2, x2, 16
        ldr     d4, [x1, x6, lsl 3]
        cmp     x4, x2
        ldr     d2, [x1, x5, lsl 3]
        fmadd   d0, d5, d4, d0
        fmadd   d1, d3, d2, d1
        bne     .L5

In parest itself a few of the loops transformed this way did not form LDPs due
to unrelated LDP-forming inefficiencies, but most did. This transformation
gave a 3% improvement on a Cortex-A72. There are more similar loops in the
3rd, 4th and 5th hottest functions in that benchmark, so I suspect that doing
something intelligent there as well would give even more sizeable gains.

So rather than solving general "unrolling", how about we break this down into
two desirable transformations:
1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's
   suggestion)
2) Unrolling and breaking accumulator dependencies.

I think more general unrolling and the peeling associated with it can be
discussed independently of 1) and 2) once we collect more data on more
microarchitectures.
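
For reference, here is a minimal standalone sketch of transformation 2) on a
plain dot product, stripped of the parest-specific indexing (the dot /
dot_unrolled names are hypothetical, not from the benchmark): the single
accumulator in dot serialises the FMAs, while the 2x-unrolled version with two
accumulators gives two independent dependency chains and adjacent loads that
can be paired.

/* Single-accumulator reduction: each FMA depends on the previous one.  */
double
dot (const double *a, const double *b, unsigned int n)
{
  double s = 0.;
  for (unsigned int i = 0; i < n; ++i)
    s += a[i] * b[i];
  return s;
}

/* 2x unrolled with two accumulators, peeling the odd iteration as in the
   testcase above.  */
double
dot_unrolled (const double *a, const double *b, unsigned int n)
{
  double s0 = 0., s1 = 0.;
  unsigned int i = 0;
  if (n & 1) // Peel the odd iteration.
    {
      s0 += a[i] * b[i];
      ++i;
    }
  for (; i < n; i += 2)
    {
      s0 += a[i] * b[i];         // first dependency chain
      s1 += a[i + 1] * b[i + 1]; // second, independent chain
    }
  return s0 + s1;
}

Note that splitting a floating-point accumulator like this reassociates the
additions and so changes rounding, so the compiler can only do 2)
automatically under -fassociative-math (e.g. -ffast-math / -Ofast);
transformation 1) by itself has no such constraint.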