https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93141

--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Just for reference here is aarch64 assembly for the loop:
.L4:
        ldr     x4, [x9, x5]
        ldr     x3, [x8, x5]
        add     x5, x5, 8
        mul     x6, x4, x3
        umulh   x3, x4, x3
        adds    x0, x6, x0
        adcs    x1, x3, x1
        cinc    x7, x7, cs
        cmp     x5, 16384
        bne     .L4
--- CUT ----
addcs might be faster than cinc on some cores, but the difference is only 1 cycle
vs 2 cycles, and the latency would be hidden, so it might not matter in the end.
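
For readers without the testcase at hand, here is a rough C sketch of a loop
that could plausibly produce the assembly above. It is reconstructed from the
listing only and is not the testcase attached to the bug: the names acc192 and
mul_accumulate are made up for illustration, and unsigned __int128 is assumed
to be available (a GCC/Clang extension). The loop bound of 16384 bytes in the
listing would correspond to n = 2048 64-bit elements.

```c
#include <stddef.h>
#include <stdint.h>

typedef unsigned __int128 u128; /* assumption: GCC/Clang extension */

/* Hypothetical accumulator: a 128-bit sum plus a count of overflows. */
struct acc192 {
    u128 sum;       /* would live in x0 (low) and x1 (high) */
    uint64_t carry; /* would live in x7 */
};

struct acc192 mul_accumulate(const uint64_t *a, const uint64_t *b, size_t n)
{
    struct acc192 r = { 0, 0 };
    for (size_t i = 0; i < n; i++) {
        u128 prod = (u128)a[i] * b[i]; /* mul + umulh */
        r.sum += prod;                 /* adds + adcs */
        r.carry += (r.sum < prod);     /* carry out of the 128-bit add: cinc ..., cs */
    }
    return r;
}
```

The unsigned comparison (r.sum < prod) is the usual portable way to detect
wraparound of the 128-bit add; in the listing it shows up as the carry flag
left by adcs being consumed by cinc.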
