https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93141
--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Just for reference, here is the aarch64 assembly for the loop:
.L4:
        ldr     x4, [x9, x5]
        ldr     x3, [x8, x5]
        add     x5, x5, 8
        mul     x6, x4, x3
        umulh   x3, x4, x3
        adds    x0, x6, x0
        adcs    x1, x3, x1
        cinc    x7, x7, cs
        cmp     x5, 16384
        bne     .L4

--- CUT ----
addcs might be faster than cinc on some cores, but the difference is 1 cycle vs
2 cycles, and the latency would be hidden, so it might not matter in the end.
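For readers following along, here is a rough C sketch of what the loop above appears to compute, reconstructed from the assembly only. It is not the test case attached to the bug. The names acc192 and dot_accumulate are made up for illustration, and it assumes a compiler with unsigned __int128 and __builtin_add_overflow (GCC or Clang on a 64-bit target). The mul/umulh pair forms the 128-bit product, adds/adcs add it into the x0:x1 accumulator, and cinc counts the carry-outs in x7. The index steps by 8 bytes up to 16384, so the loop would cover 2048 elements.

```c
#include <stddef.h>
#include <stdint.h>

typedef unsigned __int128 u128;

/* Hypothetical accumulator: 128-bit running sum plus a count of carry-outs
   (x0:x1 and x7 in the assembly above). */
typedef struct {
    u128 sum;
    uint64_t carries;
} acc192;

/* Hypothetical helper: sums a[i] * b[i] as full 128-bit products. */
static acc192 dot_accumulate(const uint64_t *a, const uint64_t *b, size_t n)
{
    acc192 r = { 0, 0 };
    for (size_t i = 0; i < n; i++) {
        /* 64x64 -> 128-bit multiply (mul + umulh). */
        u128 prod = (u128)a[i] * b[i];
        /* 128-bit add (adds + adcs); the returned overflow flag is the
           carry out of the high half, which cinc adds to the counter. */
        r.carries += __builtin_add_overflow(r.sum, prod, &r.sum);
    }
    return r;
}
```

Calling dot_accumulate(a, b, 2048) on two 2048-element arrays should correspond to the loop bounds seen above, though the exact code GCC generates will depend on the version and flags.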