Hi Richard, on 2021/7/8 下午8:38, Richard Sandiford via Gcc-patches wrote: > Quoting from the final patch in the series: > > ------------------------------------------------------------------------ > This patch adds support for reusing a main loop's reduction accumulator > in an epilogue loop. This in turn lets the loops share a single piece > of vector->scalar reduction code. > > The patch has the following restrictions: > > (1) The epilogue reduction can only operate on a single vector > (e.g. ncopies must be 1 for non-SLP reductions, and the group size > must be <= the element count for SLP reductions). > > (2) Both loops must use the same vector mode for their accumulators. > This means that the patch is restricted to targets that support > --param vect-partial-vector-usage=1. > > (3) The reduction must be a standard “tree code” reduction. > > However, these restrictions could be lifted in future. For example, > if the main loop operates on 128-bit vectors and the epilogue loop > operates on 64-bit vectors, we could in future reduce the 128-bit > vector by one stage and use the 64-bit result as the starting point > for the epilogue result. > > The patch tries to handle chained SLP reductions, unchained SLP > reductions and non-SLP reductions. It also handles cases in which > the epilogue loop is entered directly (rather than via the main loop) > and cases in which the epilogue loop can be skipped. > ------------------------------------------------------------------------ > > However, it ended up being difficult to do that without some preparatory > clean-ups. Some of them could probably stand on their own, but others > are a bit “meh” without the final patch to justify them. > > The diff below shows the effect of the patch when compiling: > > unsigned short __attribute__((noipa)) > add_loop (unsigned short *x, int n) > { > unsigned short res = 0; > for (int i = 0; i < n; ++i) > res += x[i]; > return res; > } > > with -O3 --param vect-partial-vector-usage=1 on an SVE target: > > add_loop: add_loop: > .LFB0: .LFB0: > .cfi_startproc .cfi_startproc > mov x4, x0 < > cmp w1, 0 cmp w1, 0 > ble .L7 ble .L7 > cnth x0 | cnth x4 > sub w2, w1, #1 sub w2, w1, #1 > sub w3, w0, #1 | sub w3, w4, #1 > cmp w2, w3 cmp w2, w3 > bcc .L8 bcc .L8 > sub w0, w1, w0 | sub w4, w1, w4 > mov x3, 0 mov x3, 0 > cnth x5 cnth x5 > mov z0.b, #0 mov z0.b, #0 > ptrue p0.b, all ptrue p0.b, all > .p2align 3,,7 .p2align 3,,7 > .L4: .L4: > ld1h z1.h, p0/z, [x4, x3, | ld1h z1.h, p0/z, [x0, x3, > mov x2, x3 mov x2, x3 > add x3, x3, x5 add x3, x3, x5 > add z0.h, z0.h, z1.h add z0.h, z0.h, z1.h > cmp w0, w3 | cmp w4, w3 > bcs .L4 bcs .L4 > uaddv d0, p0, z0.h < > umov w0, v0.h[0] < > inch x2 inch x2 > and w0, w0, 65535 < > cmp w1, w2 cmp w1, w2 > beq .L2 | beq .L6 > .L3: .L3: > sub w1, w1, w2 sub w1, w1, w2 > mov z1.b, #0 | add x2, x0, w2, uxtw 1 > whilelo p0.h, wzr, w1 whilelo p0.h, wzr, w1 > add x2, x4, w2, uxtw 1 | ld1h z1.h, p0/z, [x2] > ptrue p1.b, all | add z0.h, p0/m, z0.h, z1. > ld1h z0.h, p0/z, [x2] | .L6: > sel z0.h, p0, z0.h, z1.h | ptrue p0.b, all > uaddv d0, p1, z0.h | uaddv d0, p0, z0.h > fmov x1, d0 | umov w0, v0.h[0] > add w0, w0, w1, uxth < > and w0, w0, 65535 and w0, w0, 65535 > .L2: < > ret ret > .p2align 2,,3 .p2align 2,,3 > .L7: .L7: > mov w0, 0 mov w0, 0 > ret ret > .L8: .L8: > mov w2, 0 mov w2, 0 > mov w0, 0 | mov z0.b, #0 > b .L3 b .L3 > .cfi_endproc .cfi_endproc > > Kewen, could you give this a spin on Power 10 to see whether it > works/helps there? I've attached a combined diff. >
Thanks for the combined diff file. I'm sorry that the current length based partial vector doesn't support reduction, there are no conditional operations for length, we have to preprocess the inactive lanes for the intermediate operations or final reduction operations as operation types since the inactive lane value is supposed to be undefined, this seems to require an efficient way to turn length to a mask vector, Power10 doesn't have the corresponding instruction so we have to do some tricks, it's still on my TODO list. I did a hacking to relax the check in vectorizable_operation for operations involved for reduction, I can see this patch series takes effect for length based partial vector, so I believe it will help length based partial vector once we enable it for reduction later. Thanks for improving this! This patch series was bootstrapped and regress-tested on Power10, also benchmarked with SPEC2017 based on r12-2179 at Ofast unroll, no remarkable regression and improvement was observed. BR, Kewen