https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118861

            Bug ID: 118861
           Summary: 32bit loop transformed into 64bit loop on Aarch32
           Product: gcc
           Version: 14.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: phdiv at fastmail dot fm
  Target Milestone: ---

Consider the following code:

#include <stdint.h>
extern volatile uint64_t reg64;

void f(){
    for(uint32_t i=0; i<10000; ++i)
        reg64 = (static_cast<uint64_t>(i) << 32) | i;
}

For 32-bit ARM, this can be compiled in a straightforward way:

        ldr     r2, .LCPI0_0
        mov     r0, #0
        movw    r3, #10000
        ldr     r2, [pc, r2]
.LBB0_1:
        mov     r1, r0
        strd    r0, r1, [r2]
        add     r0, r0, #1
        cmp     r0, r3
        bne     .LBB0_1
        bx      lr
.LCPI0_0:
        .long   reg64(GOT_PREL)-((.LPC0_0+8)-.Ltmp5)

(That's what Clang does.)

GCC somehow converts the 32-bit loop into a 64-bit loop, incrementing
and comparing the upper and lower halves separately, with several extra
register moves:

        str     lr, [sp, #-4]!
        movw    lr, #:lower16:reg64
        movt    lr, #:upper16:reg64
        mov     r2, #0
        mov     r3, #0
        movw    r1, #10000
.L2:
        adds    ip, r2, #1
        strd    r2, [lr]
        adc     r0, r3, #1
        mov     r2, ip
        mov     r3, r0
        cmp     r0, r1
        cmpeq   ip, r1
        bne     .L2
        ldr     pc, [sp], #4

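Reading the assembly, this looks roughly equivalent to the following
source-level rewrite (the function name f_widened is made up, purely for
illustration): the 32-bit counter has been replaced by a 64-bit induction
variable holding (i << 32) | i, bumped by 0x100000001 per iteration (the
adds/adc pair) and compared half by half against 10000 (the cmp/cmpeq pair):

#include <stdint.h>
extern volatile uint64_t reg64;

// Rough source-level equivalent of the GCC output above (illustrative
// only; f_widened is a hypothetical name, not anything GCC generates).
void f_widened(){
    uint64_t iv = 0;                               // kept in the r2/r3 pair
    do {
        reg64 = iv;                                // strd of both halves
        iv += 0x100000001ull;                      // adds #1 / adc #1
    } while (iv != ((10000ull << 32) | 10000ull)); // cmp / cmpeq against #10000
}
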
It seems some optimization went the wrong way somewhere: widening the
32-bit counter into a 64-bit induction variable is counterproductive on a
32-bit target.

(https://godbolt.org/z/exaaGo3rY)
