https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116684
--- Comment #4 from Feng Xue <fxue at os dot amperecomputing.com> ---
(In reply to Richard Biener from comment #3)
> Since the reduction opportunity is in the unrolled scalar inner loop we'd
> have to know how DOT_PROD combines lanes which we do not specify but instead
> expect the whole vector to be reduced to a single lane.

Yes, VeGen's scheme relies on the target-specific lane-combining strategy of
dot-product (every 4 elements are combined into 1), which is not described in
the middle-end.

> I think Feng works on related areas, not sure whether exactly covering this
> one.

It could not cover this one. For this case, gcc will in all likelihood unroll
the inner loop while retaining the outer one. One complication is that
widen_op_pattern may break the isomorphism of the 4 inner add-mult statements,
which would affect discovery of the SLP reduction. An alternative is to match
the 4 inner add-mult statements as a compound operation that corresponds
exactly to the semantics of combining every 4 elements into 1, as:

    for (int i = 0; i < 16; i++) {
      output[i] += COMBINE_LANE_4x1(data[0-3], kernel[i][0-3]);
    }

and then recognize it as a dot-product pattern in the tree-vect-pattern stage.
Both steps happen before SLP.

> Implementation wise this is a (SLP) pattern to recognize.
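
For readers following along, here is a minimal C sketch of what the compound operation in the comment would mean at the source level. It is only an illustration: COMBINE_LANE_4x1 is the comment's notation, not an existing GCC internal function or builtin, and the lowercase helper combine_lane_4x1, the element types (int8_t inputs, int32_t accumulators), and the 16x4 array shapes are all assumptions made for the example. The first function mirrors the unrolled inner loop with four isomorphic add-mult statements; the second expresses the same computation through the 4-to-1 combining helper.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: combines 4 products into 1 value, i.e. the
   "every 4 elements to 1" semantics described in the comment.  */
int32_t combine_lane_4x1(const int8_t a[4], const int8_t b[4])
{
    int32_t sum = 0;
    for (int k = 0; k < 4; k++)
        sum += (int32_t)a[k] * (int32_t)b[k];
    return sum;
}

/* Outer loop kept, inner loop fully unrolled into 4 add-mult statements
   (the shape the comment expects after unrolling).  */
void unrolled_form(int32_t output[16], const int8_t data[4],
                   int8_t kernel[16][4])
{
    for (int i = 0; i < 16; i++) {
        output[i] += data[0] * kernel[i][0];
        output[i] += data[1] * kernel[i][1];
        output[i] += data[2] * kernel[i][2];
        output[i] += data[3] * kernel[i][3];
    }
}

/* Same computation written with the compound operation.  */
void combined_form(int32_t output[16], const int8_t data[4],
                   int8_t kernel[16][4])
{
    for (int i = 0; i < 16; i++)
        output[i] += combine_lane_4x1(data, kernel[i]);
}

/* Checks that both forms agree on a small, arbitrary input.  */
void check_forms_agree(void)
{
    int8_t data[4], kernel[16][4];
    int32_t out_a[16], out_b[16];

    for (int k = 0; k < 4; k++)
        data[k] = (int8_t)(k - 2);
    for (int i = 0; i < 16; i++) {
        out_a[i] = out_b[i] = i;
        for (int k = 0; k < 4; k++)
            kernel[i][k] = (int8_t)(i * 4 + k - 30);
    }

    unrolled_form(out_a, data, kernel);
    combined_form(out_b, data, kernel);

    for (int i = 0; i < 16; i++)
        assert(out_a[i] == out_b[i]);
}
```

Calling check_forms_agree() from a test driver confirms the two forms compute the same result for this input. It says nothing about how the vectorizer would actually match or lower the pattern.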