https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #18 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 24 Jan 2019, ktkachov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
>
> --- Comment #17 from ktkachov at gcc dot gnu.org ---
> I played around with the source to do some conservative 2x manual unrolling in
> the two hottest functions in 510.parest_r (3 more-or-less identical tight FMA
> loops). This was to try out Richard's suggestion in #c10 about unrolling for
> forming load-pairs, and also to break the accumulator dependency.
>
> So the initial testcase now became:
>
> unsigned int *colnums;
> double *val;
>
> struct foostruct
> {
>   unsigned int rows;
>   unsigned int *colnums;
>   unsigned int *rowstart;
> };
>
> struct foostruct *cols;
>
> void
> foo (double * __restrict__ dst, const double *__restrict__ src)
> {
>   const unsigned int n_rows = cols->rows;
>   const double *val_ptr = &val[cols->rowstart[0]];
>   const unsigned int *colnum_ptr = &cols->colnums[cols->rowstart[0]];
>
>   double *dst_ptr = dst;
>   for (unsigned int row = 0; row < n_rows; ++row)
>     {
>       double s = 0.;
>       const double *const val_end_of_row = &val[cols->rowstart[row+1]];
>       __PTRDIFF_TYPE__ diff = val_end_of_row - val_ptr;
>
>       if (diff & 1) // Peel the odd iteration.
>         s += *val_ptr++ * src[*colnum_ptr++];
>
>       double s1 = 0.; // Second accumulator
>       while (val_ptr != val_end_of_row)
>         {
>           s += val_ptr[0] * src[colnum_ptr[0]];
>           s1 += val_ptr[1] * src[colnum_ptr[1]];
>           val_ptr += 2;
>           colnum_ptr += 2;
>         }
>       *dst_ptr++ = s + s1;
>     }
> }
>
> This transformed the initial loop from:
>
> .L4:
>         ldr     w3, [x7, x2, lsl 2]
>         cmp     x6, x2
>         ldr     d2, [x5, x2, lsl 3]
>         add     x2, x2, 1
>         ldr     d1, [x1, x3, lsl 3]
>         fmadd   d0, d2, d1, d0
>         bne     .L4
>
> into:
>
> .L5:
>         ldp     w6, w5, [x3]    // LDP
>         add     x3, x3, 8
>         ldp     d5, d3, [x2]    // LDP
>         add     x2, x2, 16
>         ldr     d4, [x1, x6, lsl 3]
>         cmp     x4, x2
>         ldr     d2, [x1, x5, lsl 3]
>         fmadd   d0, d5, d4, d0
>         fmadd   d1, d3, d2, d1
>         bne     .L5
>
> In parest itself a few of the loops transformed this way did not form LDPs due
> to unrelated LDP-forming inefficiencies, but most did.
> This transformation gave a 3% improvement on a Cortex-A72. There are more
> similar loops in the 3rd, 4th and 5th hottest functions in that benchmark, so I
> suspect if we do something intelligent there as well we'll get even more
> sizeable gains.
>
> So rather than solving general "unrolling", how about we break this down into
> two desirable transformations:
> 1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's
> suggestion)

If that helps, sure (I'd have guessed uarchs are going to split
load-multiple into separate loads, but perhaps it avoids load-port
contention?)

> 2) Unrolling and breaking accumulator dependencies.

IIRC RTL unrolling can do this (as a side-effect, not as the main cost
motivation), guarded with an extra switch.

> I think more general unrolling and the peeling associated with it can be
> discussed independently of 1) and 2) once we collect more data on more
> microarchitectures.

I think both of these can be "implemented" on the RTL unroller side.
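
For reference, transformation 2) reduces, on a plain dense reduction, to the
sketch below.  This is only an illustration of the accumulator split, not code
from the comments above; the dense dot-product and the mention of
-fvariable-expansion-in-unroller (which I believe is the "extra switch"
referred to) are my own assumptions.

/* Before: one accumulator, so every fmadd depends on the previous one.  */
double
dot_before (const double *a, const double *b, unsigned int n)
{
  double s = 0.;
  for (unsigned int i = 0; i < n; ++i)
    s += a[i] * b[i];
  return s;
}

/* After: two independent accumulators break the FMA dependency chain.
   n is assumed even here; otherwise peel one iteration first, as in the
   sparse testcase quoted above.  This is the shape the RTL unroller with
   variable expansion (-funroll-loops -fvariable-expansion-in-unroller)
   should be able to produce automatically.  */
double
dot_after (const double *a, const double *b, unsigned int n)
{
  double s0 = 0., s1 = 0.;
  for (unsigned int i = 0; i < n; i += 2)
    {
      s0 += a[i] * b[i];
      s1 += a[i + 1] * b[i + 1];
    }
  return s0 + s1;
}

Note that this reassociates the FP reduction, so doing it automatically is only
valid under -fassociative-math/-ffast-math; the manual source change in the
testcase above sidesteps that by fixing the association in the source itself.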