https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116684
ktkachov at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2024-09-11
                 CC|                            |ktkachov at gcc dot gnu.org,
                   |                            |tnfchris at gcc dot gnu.org

--- Comment #1 from ktkachov at gcc dot gnu.org ---
Indeed. Curiously, for aarch64 at -O2 GCC is smart enough to recognise a USDOT
instruction but at -O3 (-mcpu=neoverse-v2) it all gets synthesised with muls
and widening adds.

-O2:
.L2:
        ldr     s29, [x2, x3]
        ld1b    z27.b, p7/z, [x1, x3]
        sel     z27.b, p7, z27.b, z30.b
        fmov    s28, s29
        movprfx z29, z31
        insr    z29.s, s28
        ld1b    z28.b, p7/z, [x0]
        usdot   z29.s, z28.b, z27.b
        uaddv   d29, p6, z29.s
        str     s29, [x2, x3]
        add     x3, x3, 4
        cmp     x3, 64
        bne     .L2

-O3:
        ld4     {v24.16b - v27.16b}, [x1]
        ldrb    w3, [x0]
        ldrb    w1, [x0, 1]
        ldp     q29, q28, [x2]
        dup     v4.4h, w3
        ldp     q31, q30, [x2, 32]
        dup     v5.4h, w1
        ldrb    w1, [x0, 2]
        sxtl    v16.8h, v24.8b
        sxtl2   v24.8h, v24.16b
        ldrb    w0, [x0, 3]
        sxtl    v17.8h, v25.8b
        sxtl2   v25.8h, v25.16b
        sxtl    v18.8h, v26.8b
        dup     v6.4h, w1
        sxtl2   v26.8h, v26.16b
        sxtl    v19.8h, v27.8b
        mul     v24.8h, v24.8h, v4.h[0]
        dup     v7.4h, w0
        mul     v20.8h, v16.8h, v4.h[0]
        sxtl2   v27.8h, v27.16b
        mul     v21.8h, v17.8h, v5.h[0]
        mul     v25.8h, v25.8h, v5.h[0]
        saddw   v31.4s, v31.4s, v24.4h
        mul     v23.8h, v18.8h, v6.h[0]
        saddw2  v30.4s, v30.4s, v24.8h
        saddw   v29.4s, v29.4s, v20.4h
        mul     v26.8h, v26.8h, v6.h[0]
        saddw2  v28.4s, v28.4s, v20.8h
        mul     v24.8h, v19.8h, v7.h[0]
        saddw   v29.4s, v29.4s, v21.4h
        saddw2  v28.4s, v28.4s, v21.8h
        saddw   v31.4s, v31.4s, v25.4h
        mul     v27.8h, v27.8h, v7.h[0]
        saddw2  v30.4s, v30.4s, v25.8h
        saddw   v29.4s, v29.4s, v23.4h
        saddw2  v28.4s, v28.4s, v23.8h
        saddw   v31.4s, v31.4s, v26.4h
        saddw2  v30.4s, v30.4s, v26.8h
        saddw   v29.4s, v29.4s, v24.4h
        saddw2  v28.4s, v28.4s, v24.8h
        saddw   v31.4s, v31.4s, v27.4h
        saddw2  v30.4s, v30.4s, v27.8h
        stp     q29, q28, [x2]
        stp     q31, q30, [x2, 32]

The O3 version does fully unroll the loop so it's probably better but maybe it
could do a better job of using USDOT?
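
For readers less familiar with the instruction, here is a rough scalar model of what a single USDOT step computes. It is not the testcase from this report: the function name `usdot_ref` and its parameters are made up for illustration, and it assumes the usual semantics where each 32-bit accumulator lane gains the sum of four unsigned-byte by signed-byte products.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of USDOT (illustration only, not from the PR).
   For each 32-bit lane i, accumulate four products of an unsigned byte
   taken from u and a signed byte taken from s. */
static void usdot_ref(int32_t *acc, const uint8_t *u, const int8_t *s,
                      size_t lanes)
{
    assert(acc != NULL && u != NULL && s != NULL);
    for (size_t i = 0; i < lanes; i++) {
        int32_t sum = 0;
        for (size_t k = 0; k < 4; k++) {
            /* Widen before multiplying so the mixed-sign product is exact. */
            sum += (int32_t)u[4 * i + k] * (int32_t)s[4 * i + k];
        }
        acc[i] += sum;
    }
}
```

The -O2 loop above appears to match this shape: one operand is loaded as unsigned bytes, the other as signed bytes, and the result is folded into a 32-bit accumulator. The -O3 sequence seems to reach an equivalent result by widening to 16 bits, multiplying, and then using widening adds.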