https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118028
Bug ID: 118028
Summary: A better vectorized reduction across multi-level loop-nest
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: fxue at os dot amperecomputing.com
Target Milestone: ---

Consider this case:

  int foo(const int *array)
  {
    int sum = 0;

  #pragma GCC unroll 0
    for (int i = 0; i < 32; ++i) {
  #pragma GCC unroll 0
      for (int j = 0; j < 32; ++j) {
        sum += array[i * 64 + j];
      }
    }

    return sum;
  }

The accumulation into "sum" is vectorized in the inner loop. Because the
outer loop remains scalar, the vectorizer currently reduces the vector "sum"
back to a scalar via .REDUC_PLUS each time execution returns to the outer
loop:

  int sum = 0;

  for (int i = 0; i < 32; ++i) {
    vector(4) int v_sum = { sum, 0, 0, 0 };

    for (int j = 0; j < 32; j += 4) {
      v_sum += *(vector(4) int *)(&array[i * 64 + j]);
    }

    sum = .REDUC_PLUS(v_sum);
  }

Because "sum" has no use other than carrying its value into the next round of
accumulation in the inner loop, the vector-to-scalar conversion of "sum" in
the outer loop is not needed. It is more efficient to move the reduction to a
point after the exit of the whole loop nest:

  vector(4) int v_sum = { 0, 0, 0, 0 };

  for (int i = 0; i < 32; ++i) {
    for (int j = 0; j < 32; j += 4) {
      v_sum += *(vector(4) int *)(&array[i * 64 + j]);
    }
  }

  sum = .REDUC_PLUS(v_sum);
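The two shapes can be compared in plain C with a minimal sketch like the one below. It is only an illustration, not what the vectorizer emits: the four vector lanes are emulated with an int array rather than real SIMD, and the names sum_scalar, reduc_plus, sum_reduce_per_row and sum_reduce_once are hypothetical. In the per-row variant, lane 0 is seeded with the running "sum", so the result of the horizontal add is assigned to "sum" rather than added to it.

```c
enum { ROWS = 32, COLS = 32, STRIDE = 64, LANES = 4 };

/* Reference: the scalar loop nest from the report. */
int sum_scalar(const int *array)
{
    int sum = 0;
    for (int i = 0; i < ROWS; ++i)
        for (int j = 0; j < COLS; ++j)
            sum += array[i * STRIDE + j];
    return sum;
}

/* Horizontal add over the emulated lanes; stands in for .REDUC_PLUS. */
int reduc_plus(const int v[LANES])
{
    int s = 0;
    for (int k = 0; k < LANES; ++k)
        s += v[k];
    return s;
}

/* Current shape: one horizontal reduction per outer iteration. */
int sum_reduce_per_row(const int *array)
{
    int sum = 0;
    for (int i = 0; i < ROWS; ++i) {
        /* Lane 0 carries the running scalar sum into the inner loop. */
        int v_sum[LANES] = { sum, 0, 0, 0 };
        for (int j = 0; j < COLS; j += LANES)
            for (int k = 0; k < LANES; ++k)
                v_sum[k] += array[i * STRIDE + j + k];
        sum = reduc_plus(v_sum);
    }
    return sum;
}

/* Proposed shape: keep the lanes live across the whole nest and
   reduce exactly once after the outer loop exits. */
int sum_reduce_once(const int *array)
{
    int v_sum[LANES] = { 0, 0, 0, 0 };
    for (int i = 0; i < ROWS; ++i)
        for (int j = 0; j < COLS; j += LANES)
            for (int k = 0; k < LANES; ++k)
                v_sum[k] += array[i * STRIDE + j + k];
    return reduc_plus(v_sum);
}
```

All three functions return the same value as long as the sum does not overflow, since integer addition can be reassociated freely. The second variant performs 32 horizontal reductions and the third performs one.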