https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116684
Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |victorldn at gcc dot gnu.org

--- Comment #2 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to ktkachov from comment #1)
> Indeed. Curiously, for aarch64 at -O2 GCC is smart enough to recognise a
> USDOT instruction but at -O3 (-mcpu=neoverse-v2) it all gets synthesised

Looks like SLP discovery fails to notice it's a reduction. We do have code to
find + reductions with SLP, but it seems the issue here is that the store is
used to start the discovery.

The same happens with a normal dotprod:

#include <stdint.h>

void dot_16x1x16_uint8_int8_int32(
    uint8_t data[restrict 4],
    uint8_t kernel[restrict 16][4],
    int32_t output[restrict 16])
{
  for (int i = 0; i < 16; i++)
    for (int k = 0; k < 4; k++)
      output[i] += data[k] * kernel[i][k];
}

> The O3 version does fully unroll the loop so it's probably better but maybe
> it could do a better job of using USDOT?

Yeah, we could get the same effect by implementing the
vect_recog_widen_sum_pattern using dotprod accumulating into a zero register,
and then combine should be able to do the right things. Victor had a patch at
some point I think...

But the real fix is teaching SLP discovery that there's a reduction here.
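
For illustration only, here is a scalar sketch of the widen-sum idea mentioned above: a widening sum is the same as a dot product against an all-ones vector, accumulated into zero. This is not Victor's patch and not vectorizer code. The helper names (udot4_model, widen_sum4, widen_sum4_via_dot, dot_16x1x16_model, self_check) are made up for this example, and udot4_model is only assumed to model one 32-bit lane of a UDOT-style operation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of one 32-bit lane of a UDOT-style operation:
   acc + a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3].  */
static int32_t udot4_model(int32_t acc, const uint8_t a[4], const uint8_t b[4])
{
  for (int k = 0; k < 4; k++)
    acc += a[k] * b[k];
  return acc;
}

/* Widening sum of four bytes, written directly.  */
static int32_t widen_sum4(const uint8_t a[4])
{
  int32_t sum = 0;
  for (int k = 0; k < 4; k++)
    sum += a[k];
  return sum;
}

/* The same widening sum as a dot product against all-ones,
   accumulating into a zero accumulator.  */
static int32_t widen_sum4_via_dot(const uint8_t a[4])
{
  static const uint8_t ones[4] = { 1, 1, 1, 1 };
  return udot4_model(0, a, ones);
}

/* The testcase restated: each output element is one 4-way dot product
   accumulated into the existing value, i.e. a reduction over k.  */
static void dot_16x1x16_model(uint8_t data[4], uint8_t kernel[16][4],
                              int32_t output[16])
{
  for (int i = 0; i < 16; i++)
    output[i] = udot4_model(output[i], data, kernel[i]);
}

/* Compare the model against the loop nest from the testcase.  */
static void self_check(void)
{
  uint8_t data[4] = { 1, 2, 3, 255 };
  uint8_t kernel[16][4];
  int32_t expected[16], actual[16];

  for (int i = 0; i < 16; i++)
    {
      for (int k = 0; k < 4; k++)
        kernel[i][k] = (uint8_t) (i + k);
      expected[i] = actual[i] = i;
    }

  /* Reference: the original loop nest.  */
  for (int i = 0; i < 16; i++)
    for (int k = 0; k < 4; k++)
      expected[i] += data[k] * kernel[i][k];

  dot_16x1x16_model(data, kernel, actual);

  for (int i = 0; i < 16; i++)
    assert(expected[i] == actual[i]);
  assert(widen_sum4(data) == widen_sum4_via_dot(data));
}
```

Calling self_check() exercises both equivalences. It says nothing about what the vectorizer actually emits, only that the two formulations compute the same values.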