[Bug tree-optimization/116684] [vectorization][x86-64] dot_16x1x16_uint8_int8_int32 could be better optimized

fxue at os dot amperecomputing.com via Gcc-bugs Fri, 13 Sep 2024 21:24:34 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116684


--- Comment #4 from Feng Xue <fxue at os dot amperecomputing.com> ---
(In reply to Richard Biener from comment #3)
> Since the reduction opportunity is in the unrolled scalar inner loop we'd
> have
> to know how DOT_PROD combines lanes which we do not specify but instead
> expect the whole vector to be reduced to a single lane.

Yes, VeGen's scheme relies on the target-specific lane-combining strategy of
dot-product (every 4 elements reduced to 1), which is not described in the
middle-end.
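
To make the "every 4 elements to 1" grouping concrete, here is a minimal scalar
sketch of how such a target instruction might combine lanes. The function name,
the 128-bit width, and the unsigned-by-signed operand types are assumptions
chosen to match the bug title. This is not a description of how the middle-end
defines DOT_PROD.

```c
#include <stdint.h>

/* Hypothetical scalar model of a 128-bit dot-product instruction:
   16 unsigned bytes times 16 signed bytes, accumulated into 4 int32 lanes.
   Result lane N only ever sees input elements 4*N .. 4*N+3.  */
static void dot_prod_model_u8s8(int32_t acc[4],
                                const uint8_t a[16],
                                const int8_t b[16])
{
  for (int lane = 0; lane < 4; lane++) {
    int32_t sum = 0;
    /* Combine the 4 adjacent products that belong to this lane.  */
    for (int k = 0; k < 4; k++)
      sum += (int32_t)a[4 * lane + k] * (int32_t)b[4 * lane + k];
    acc[lane] += sum;
  }
}
```

A generic reduction that only promises "the whole vector sums to one value"
would be free to place the partial sums in any lane, which is why the fixed
grouping above has to be known to use the instruction for this case.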

> 
> I think Feng works on related areas, not sure whether exactly covering this
> one.

No, it does not cover this case.


For this case, GCC will in all likelihood unroll the inner loop while retaining
the outer one. One complication is that widen_op_pattern may break the
isomorphism of the 4 inner add-mult statements, which would hinder discovery of
the SLP reduction.

An alternative is to match the 4 inner add-mult statements as a compound
operation that corresponds exactly to the semantics of combining every 4
elements into 1, as:

   for (int i = 0; i < 16; i++) {
      output[i] += COMBINE_LANE_4x1(data[0-3], kernel[i][0-3]);
   }

and then recognize it as a dot-product pattern in the tree-vect-pattern stage;
both steps happen before SLP.
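
For illustration, the COMBINE_LANE_4x1 form can be written out in plain C as
below. This is only a sketch: the helper combine_lane_4x1, the function name
dot_16x1x16_sketch, and the exact parameter types (uint8_t data, int8_t kernel,
int32_t output, inferred from the bug title) are assumptions, not the testcase
from the report or any GCC internal.

```c
#include <stdint.h>

/* Hypothetical stand-in for COMBINE_LANE_4x1: the 4 inner add-mult
   statements folded into a single compound operation.  */
static inline int32_t combine_lane_4x1(const uint8_t d[4], const int8_t k[4])
{
  return (int32_t)d[0] * k[0] + (int32_t)d[1] * k[1]
       + (int32_t)d[2] * k[2] + (int32_t)d[3] * k[3];
}

/* Outer loop retained; each iteration is one "4 elements to 1" combine,
   which is the shape a dot-product pattern would need to recognize.  */
void dot_16x1x16_sketch(const uint8_t data[4],
                        int8_t kernel[16][4],
                        int32_t output[16])
{
  for (int i = 0; i < 16; i++)
    output[i] += combine_lane_4x1(data, kernel[i]);
}
```

Written this way, the 16 iterations are isomorphic by construction, so the
widening of individual multiplies no longer gets a chance to make the 4 inner
statements look different from one another.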

> 
> Implementation wise this is a (SLP) pattern to recognize.
