[Bug tree-optimization/116684] [vectorization][x86-64] dot_16x1x16_uint8_int8_int32 could be better optimized

ktkachov at gcc dot gnu.org via Gcc-bugs Wed, 11 Sep 2024 09:54:24 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116684


ktkachov at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2024-09-11
                 CC|                            |ktkachov at gcc dot gnu.org,
                   |                            |tnfchris at gcc dot gnu.org

--- Comment #1 from ktkachov at gcc dot gnu.org ---
Indeed. Curiously, for aarch64 at -O2 GCC is smart enough to recognise the USDOT
pattern, but at -O3 (with -mcpu=neoverse-v2 in both cases) it all gets synthesised
with muls and widening adds.
-O2:
.L2:
        ldr     s29, [x2, x3]
        ld1b    z27.b, p7/z, [x1, x3]
        sel     z27.b, p7, z27.b, z30.b
        fmov    s28, s29
        movprfx z29, z31
        insr    z29.s, s28
        ld1b    z28.b, p7/z, [x0]
        usdot   z29.s, z28.b, z27.b
        uaddv   d29, p6, z29.s
        str     s29, [x2, x3]
        add     x3, x3, 4
        cmp     x3, 64
        bne     .L2

-O3:
        ld4     {v24.16b - v27.16b}, [x1]
        ldrb    w3, [x0]
        ldrb    w1, [x0, 1]
        ldp     q29, q28, [x2]
        dup     v4.4h, w3
        ldp     q31, q30, [x2, 32]
        dup     v5.4h, w1
        ldrb    w1, [x0, 2]
        sxtl    v16.8h, v24.8b
        sxtl2   v24.8h, v24.16b
        ldrb    w0, [x0, 3]
        sxtl    v17.8h, v25.8b
        sxtl2   v25.8h, v25.16b
        sxtl    v18.8h, v26.8b
        dup     v6.4h, w1
        sxtl2   v26.8h, v26.16b
        sxtl    v19.8h, v27.8b
        mul     v24.8h, v24.8h, v4.h[0]
        dup     v7.4h, w0
        mul     v20.8h, v16.8h, v4.h[0]
        sxtl2   v27.8h, v27.16b
        mul     v21.8h, v17.8h, v5.h[0]
        mul     v25.8h, v25.8h, v5.h[0]
        saddw   v31.4s, v31.4s, v24.4h
        mul     v23.8h, v18.8h, v6.h[0]
        saddw2  v30.4s, v30.4s, v24.8h
        saddw   v29.4s, v29.4s, v20.4h
        mul     v26.8h, v26.8h, v6.h[0]
        saddw2  v28.4s, v28.4s, v20.8h
        mul     v24.8h, v19.8h, v7.h[0]
        saddw   v29.4s, v29.4s, v21.4h
        saddw2  v28.4s, v28.4s, v21.8h
        saddw   v31.4s, v31.4s, v25.4h
        mul     v27.8h, v27.8h, v7.h[0]
        saddw2  v30.4s, v30.4s, v25.8h
        saddw   v29.4s, v29.4s, v23.4h
        saddw2  v28.4s, v28.4s, v23.8h
        saddw   v31.4s, v31.4s, v26.4h
        saddw2  v30.4s, v30.4s, v26.8h
        saddw   v29.4s, v29.4s, v24.4h
        saddw2  v28.4s, v28.4s, v24.8h
        saddw   v31.4s, v31.4s, v27.4h
        saddw2  v30.4s, v30.4s, v27.8h
        stp     q29, q28, [x2]
        stp     q31, q30, [x2, 32]

The -O3 version does fully unroll the loop, so it's probably better, but maybe it
could do a better job of using USDOT?
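
For readers without the original report to hand, here is a minimal C sketch of
the kind of kernel the two listings appear to implement. The shape is inferred
from the assembly rather than copied from the bug's test case: 16 int32
accumulators, each updated with a 4-element dot product of unsigned bytes
(zero-extended by ldrb) and signed bytes (sign-extended by sxtl). The function
name and parameter names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical reconstruction, inferred from the listings above:
   output[i] += sum over k of data[k] * kernel[i][k], for 16 outputs
   and 4 byte pairs per output. This is the unsigned-by-signed,
   four-bytes-per-32-bit-lane pattern that USDOT computes. */
void dot_u8_s8_s32_sketch(uint8_t data[4],
                          int8_t kernel[16][4],
                          int32_t output[16])
{
    for (int i = 0; i < 16; i++) {
        int32_t acc = 0;
        for (int k = 0; k < 4; k++) {
            /* Widen explicitly so the mixed-sign product is unambiguous. */
            acc += (int32_t)data[k] * (int32_t)kernel[i][k];
        }
        output[i] += acc;
    }
}
```

Under this reading, the -O2 loop handles one output per iteration with a single
usdot plus a uaddv reduction, while the -O3 code computes all 16 outputs at once
using 16-bit multiplies and saddw/saddw2 accumulation.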
