https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99078
Bug ID: 99078
Summary: Optimizer moves struct initialization into loop
Product: gcc
Version: 10.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: magiblot at hotmail dot com
Target Milestone: ---
Consider the following piece of code (https://godbolt.org/z/WhTcbd):
> struct S
> {
>     char c[24];
> };
>
> void copy(S *dest, unsigned count)
> {
>     S s {};
>     for (int i = 0; i < 7; ++i)
>         s.c[i] = i;
>     for (int i = 8; i < 15; ++i)
>         s.c[i] = i;
>     for (int i = 16; i < 23; ++i)
>         s.c[i] = i;
>     while (count--)
>         *dest++ = s;
> }
The generated assembly with -O2 looks like this:
> copy(S*, unsigned int):
>     mov QWORD PTR [rsp-24], 0
>     pxor xmm0, xmm0
>     movups XMMWORD PTR [rsp-40], xmm0
>     test esi, esi
>     je .L1
>     mov esi, esi
>     lea rax, [rsi+rsi*2]
>     lea rdx, [rdi+rax*8]
> .L3:
>     mov eax, 1541
>     mov ecx, 3340
>     mov esi, 5396
>     mov DWORD PTR [rsp-39], 67305985
>     mov WORD PTR [rsp-35], ax
>     add rdi, 24
>     mov DWORD PTR [rsp-32], 185207048
>     mov WORD PTR [rsp-28], cx
>     mov BYTE PTR [rsp-26], 14
>     movdqu xmm1, XMMWORD PTR [rsp-40]
>     mov DWORD PTR [rsp-24], 319951120
>     mov WORD PTR [rsp-20], si
>     mov BYTE PTR [rsp-18], 22
>     mov rax, QWORD PTR [rsp-24]
>     movups XMMWORD PTR [rdi-24], xmm1
>     mov QWORD PTR [rdi-8], rax
>     cmp rdi, rdx
>     jne .L3
> .L1:
>     ret
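The zeroing of s stays outside the loop, but the stores that fill s.c (the
writes to [rsp-39] through [rsp-18]) have been moved into the loop body at .L3
and are repeated on every iteration, which is a severe pessimization.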
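As a sanity check, the immediates stored inside .L3 can be decoded to confirm
that they are the values written by the three initialization loops. The sketch
below assumes little-endian byte order (x86-64) and that s starts at [rsp-40],
as the listing suggests; pack_le is a helper made up for this illustration and
is not part of the test case.

```cpp
#include <cstdint>

// Hypothetical helper: pack the n consecutive byte values first, first+1, ...
// into one integer, lowest byte first (little-endian, as on x86-64).
constexpr std::uint32_t pack_le(unsigned first, unsigned n)
{
    std::uint32_t v = 0;
    for (unsigned k = 0; k < n; ++k)
        v |= static_cast<std::uint32_t>(first + k) << (8 * k);
    return v;
}

// s.c[1..4] -> DWORD at [rsp-39], s.c[5..6] -> WORD at [rsp-35]
static_assert(pack_le(1, 4) == 67305985u, "bytes 1,2,3,4");
static_assert(pack_le(5, 2) == 1541u, "bytes 5,6");
// s.c[8..11] -> DWORD at [rsp-32], s.c[12..13] -> WORD at [rsp-28]
static_assert(pack_le(8, 4) == 185207048u, "bytes 8,9,10,11");
static_assert(pack_le(12, 2) == 3340u, "bytes 12,13");
// s.c[16..19] -> DWORD at [rsp-24], s.c[20..21] -> WORD at [rsp-20]
static_assert(pack_le(16, 4) == 319951120u, "bytes 16,17,18,19");
static_assert(pack_le(20, 2) == 5396u, "bytes 20,21");
```

The two remaining BYTE stores (14 at [rsp-26] and 22 at [rsp-18]) correspond to
s.c[14] and s.c[22]. s.c[0] is 0 and is already covered by the zeroing done
before the loop.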
The issue cannot be reproduced if the struct is initialized this way:
> S s;
> memset(&s, 0, sizeof(s));
But the following still reproduces the issue:
> S s {};
> memset(&s, 0, sizeof(s));
Replacing the assignment inside the loop with memcpy does not affect the
result.
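The memcpy variant is not spelled out above. One plausible form, given here
only as a sketch, is the following. The function name copy_memcpy is made up
for this illustration, and the exact memcpy call is an assumption.

```cpp
#include <cstring>

struct S
{
    char c[24];
};

// Same as copy() from the test case, except that the struct assignment in
// the final loop is replaced by memcpy. The name copy_memcpy is hypothetical.
void copy_memcpy(S *dest, unsigned count)
{
    S s {};
    for (int i = 0; i < 7; ++i)
        s.c[i] = i;
    for (int i = 8; i < 15; ++i)
        s.c[i] = i;
    for (int i = 16; i < 23; ++i)
        s.c[i] = i;
    while (count--)
        std::memcpy(dest++, &s, sizeof(s));
}
```

Both forms copy the same 24 bytes per iteration, so the observable behavior is
identical; only the way the copy is written differs.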
According to Godbolt, the generated assembly has not changed since GCC 7.2. GCC
7.1 does not use vector registers but still initializes the struct inside the
loop. GCC 6.4 and earlier do not use vector registers either but do initialize
the struct outside the loop, as expected.
EXPECTED RESULT
Ideally, the loop body would be optimized into something like this:
>     movdqu xmm1, XMMWORD PTR [rsp-40]
>     mov rax, QWORD PTR [rsp-24]
> .L3:
>     add rdi, 24
>     movups XMMWORD PTR [rdi-24], xmm1
>     mov QWORD PTR [rdi-8], rax
>     cmp rdi, rdx
>     jne .L3
> .L1:
>     ret
Thank you.