https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118331
Bug ID: 118331
Summary: Poor code when passing small structs around on 32-bit ARM
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: david at westcontrol dot com
Target Milestone: ---
I have recently been looking at how small structs are passed around for the
32-bit ARM port (in particular, for Cortex-M devices). This all applies to C
and C++, though small structs are more common in C++. My test code, which
compares gcc to clang, is on Godbolt:
<https://godbolt.org/z/aeKrcMb64>
(I've tried to write the code in a way that works for C and C++, in case it is
helpful.)
One common missed optimisation that I see repeatedly is that gcc is making a
stack frame unnecessarily when it is returning a small struct. For example,
given:
#include <stdint.h>
typedef struct A2 { uint16_t a; uint16_t b; } A2;
A2 makeA2(void) { A2 x = { 1, 0 }; return x; }
gcc generates:
makeA2:
sub sp, sp, #8
movs r0, #1
add sp, sp, #8
bx lr
The stack pointer manipulation is superfluous.
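To illustrate why no stack space should be needed here: under the AAPCS, a
composite type no larger than 4 bytes is returned in r0, and A2 is exactly one
32-bit word. The sketch below checks the size at compile time. a2_as_word is a
hypothetical helper, not part of the test case, that shows the word a
little-endian target would place in r0.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct A2 { uint16_t a; uint16_t b; } A2;

/* A2 has no padding and is exactly one 32-bit word wide. */
static_assert(sizeof(A2) == sizeof(uint32_t), "A2 should fit in one word");

A2 makeA2(void) { A2 x = { 1, 0 }; return x; }

/* Hypothetical helper (not in the original test case): copy the bytes of
   an A2 into a single word. memcpy avoids any aliasing problems. */
uint32_t a2_as_word(A2 x)
{
    uint32_t w;
    memcpy(&w, &x, sizeof w);
    return w;
}
```

On a little-endian target, a2_as_word(makeA2()) is 1, which matches the single
"movs r0, #1" in the output above.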
Even worse stack manipulations can occur when passing structs as parameters:
#include <stdint.h>
typedef struct B2 { uint32_t a; uint32_t b; } B2;
B2 makeB2(void) { B2 x = { 1, 0 }; return x; }
void sinkB2(B2 x);
void callB2() { B2 x = makeB2(); sinkB2(x); }
gives this with gcc:
callB2:
sub sp, sp, #8
movs r2, #1
movs r3, #0
strd r2, [sp]
ldrd r0, r1, [sp]
add sp, sp, #8
b sinkB2
and this with clang:
callB2:
movs r0, #1
movs r1, #0
b sinkB2
gcc is making a stack frame, putting the data in registers, storing that on the
stack, then loading it back into registers again!
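As a sketch of the register image that callB2 has to produce, assuming the
AAPCS rule that an 8-byte, 4-byte-aligned struct argument occupies two
consecutive core registers (r0 and r1 here). b2_words is a hypothetical helper,
not part of the test case.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct B2 { uint32_t a; uint32_t b; } B2;

/* B2 is two 32-bit words with no padding. */
static_assert(sizeof(B2) == 2 * sizeof(uint32_t), "B2 should be two words");

B2 makeB2(void) { B2 x = { 1, 0 }; return x; }

/* Hypothetical helper (not in the original test case): copy a B2 into the
   two words that would be passed in r0 (words[0]) and r1 (words[1]). */
void b2_words(B2 x, uint32_t words[2])
{
    memcpy(words, &x, sizeof x);
}
```

For makeB2() the two words are 1 and 0, so loading r0 and r1 with immediates,
as clang does, is all that should be required.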
Another example of strange code pessimisation came when I was trying to use
vectors to get return values in four general-purpose registers (instead of the
usual one or two):
typedef uint32_t C1 __attribute__((vector_size(16)));
C1 makeC1(void) { C1 x = { 1 }; return x; }
__attribute__((pcs("aapcs"))) C1 makeC1b(void) { C1 x = { 1 }; return x; }
gcc gives:
makeC1:
movs r1, #0
movs r0, #1
mov r2, r1
mov r3, r1
vmov d0, r0, r1 @ int
vmov d1, r2, r3 @ int
bx lr
I have the vector registers enabled (with "-mcpu=cortex-m7 -mfloat-abi=hard
-mfpu=fpv5-d16" - needed to get good hardware floating point on that target),
so I think it is correct that the SIMD registers d0 and d1 are used here. But
then it is unnecessary to put the data in r0:r3 as well. With the "pcs"
attribute to disable returning in SIMD registers, gcc returns the data in r0:r3
as expected. However, the code to do so is far from expected:
makeC1b:
push {r4, r5, r6, r7}
movs r7, #0
movs r0, #1
movs r1, #0
movs r2, #0
mov r3, r7
pop {r4, r5, r6, r7}
bx lr
clang just loads r0-r3 with immediate values in both cases. For makeC1, I
suspect that is incorrect according to the ABI, even though it is actually
nicer code.
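For reference, a small sketch of what the C1 initialiser means, using the
gcc/clang vector extension: "{ 1 }" sets lane 0 to 1 and zero-fills the other
three lanes, which is why r1-r3 are all zero in the listings above. c1_lane is a
hypothetical helper, not part of the test case, and the "pcs" attribute is left
out so that the sketch also builds on non-ARM targets.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t C1 __attribute__((vector_size(16)));

/* Four 32-bit lanes, 16 bytes in total. */
static_assert(sizeof(C1) == 16, "C1 should be four 32-bit lanes");

C1 makeC1(void) { C1 x = { 1 }; return x; }

/* Hypothetical helper (not in the original test case): read one lane of
   the vector returned by makeC1. Valid indices are 0..3. */
uint32_t c1_lane(int i)
{
    C1 v = makeC1();
    return v[i];
}
```

The whole return value is therefore the four constants 1, 0, 0, 0, so none of
the moves between registers should be needed to build it.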