https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118861
            Bug ID: 118861
            Summary: 32bit loop transformed into 64bit loop on Aarch32
            Product: gcc
            Version: 14.2.1
             Status: UNCONFIRMED
           Severity: normal
           Priority: P3
          Component: middle-end
           Assignee: unassigned at gcc dot gnu.org
           Reporter: phdiv at fastmail dot fm
   Target Milestone: ---

Consider the following code:

#include <stdint.h>

extern volatile uint64_t reg64;

void f(){
    for(uint32_t i=0; i<10000; ++i)
        reg64 = (static_cast<uint64_t>(i) << 32) | i;
}

For 32-bit ARM, this can be compiled in a straightforward way:

        ldr     r2, .LCPI0_0
        mov     r0, #0
        movw    r3, #10000
        ldr     r2, [pc, r2]
.LBB0_1:
        mov     r1, r0
        strd    r0, r1, [r2]
        add     r0, r0, #1
        cmp     r0, r3
        bne     .LBB0_1
        bx      lr
.LCPI0_0:
        .long   reg64(GOT_PREL)-((.LPC0_0+8)-.Ltmp5)

(That's what Clang does.)

GCC somehow converts the 32-bit loop into a 64-bit loop, incrementing and
comparing the upper and lower halves independently, with several extra
register moves:

        str     lr, [sp, #-4]!
        movw    lr, #:lower16:reg64
        movt    lr, #:upper16:reg64
        mov     r2, #0
        mov     r3, #0
        movw    r1, #10000
.L2:
        adds    ip, r2, #1
        strd    r2, [lr]
        adc     r0, r3, #1
        mov     r2, ip
        mov     r3, r0
        cmp     r0, r1
        cmpeq   ip, r1
        bne     .L2
        ldr     pc, [sp], #4

It seems some optimization went the wrong way somewhere.

(https://godbolt.org/z/exaaGo3rY)