[Bug tree-optimization/116684] [vectorization][x86-64] dot_16x1x16_uint8_int8_int32 could be better optimized

tnfchris at gcc dot gnu.org via Gcc-bugs Wed, 11 Sep 2024 10:28:25 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116684


Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |victorldn at gcc dot gnu.org

--- Comment #2 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to ktkachov from comment #1)
> Indeed. Curiously, for aarch64 at -O2 GCC is smart enough to recognise a
> USDOT instruction but at -O3 (-mcpu=neoverse-v2) it all gets synthesised

Looks like SLP discovery fails to notice that this is a reduction. We do have
code to find a + reduction with SLP, but the issue here seems to be that the
store is used to start the discovery.

The same happens with a normal dotprod:

#include <stdint.h>

void
dot_16x1x16_uint8_int8_int32(
   uint8_t data[restrict 4],
   uint8_t kernel[restrict 16][4],
   int32_t output[restrict 16])
{
  for (int i = 0; i < 16; i++)
    for (int k = 0; k < 4; k++)
      output[i] += data[k] * kernel[i][k];
}

> The O3 version does fully unroll the loop so it's probably better but maybe
> it could do a better job of using USDOT?

Yeah, we could get the same effect by implementing the
vect_recog_widen_sum_pattern using dotprod accumulating into a zero register,
and then combine should be able to do the right thing.
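
For illustration only, here is a scalar sketch of the shape described above:
each output element gets a 4-element dot product accumulated into a fresh zero,
and that result is then added to the existing output value. This is one
possible reading of the suggestion, not GCC's implementation; the helper name
dot4_u8 and the function name dot_16x1x16_zero_acc are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: a 4-element dot product whose accumulator starts
   at zero, mirroring "dotprod accumulating into a zero register".  */
static inline int32_t
dot4_u8 (const uint8_t a[4], const uint8_t b[4])
{
  int32_t acc = 0;
  for (int k = 0; k < 4; k++)
    acc += a[k] * b[k];
  return acc;
}

/* Same arithmetic as the dotprod testcase, with the reduction made
   explicit: the add into output[i] is separate from the dot product.  */
void
dot_16x1x16_zero_acc (uint8_t data[restrict 4],
                      uint8_t kernel[restrict 16][4],
                      int32_t output[restrict 16])
{
  for (int i = 0; i < 16; i++)
    output[i] += dot4_u8 (data, kernel[i]);
}
```

Written this way, the reduction over k is visible on its own, and the final
add into output[i] is the separate operation that a later pass could fold back
into the accumulator.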

Victor had a patch at some point I think...

But the real fix is teaching SLP discovery that there's a reduction here.
