https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99078
Bug ID: 99078
Summary: Optimizer moves struct initialization into loop
Product: gcc
Version: 10.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: magiblot at hotmail dot com
Target Milestone: ---

Consider the following piece of code (https://godbolt.org/z/WhTcbd):

> struct S
> {
>     char c[24];
> };
>
> void copy(S *dest, unsigned count)
> {
>     S s {};
>     for (int i = 0; i < 7; ++i)
>         s.c[i] = i;
>     for (int i = 8; i < 15; ++i)
>         s.c[i] = i;
>     for (int i = 16; i < 23; ++i)
>         s.c[i] = i;
>     while (count--)
>         *dest++ = s;
> }

The generated assembly with -O2 looks like this:

> copy(S*, unsigned int):
>         mov     QWORD PTR [rsp-24], 0
>         pxor    xmm0, xmm0
>         movups  XMMWORD PTR [rsp-40], xmm0
>         test    esi, esi
>         je      .L1
>         mov     esi, esi
>         lea     rax, [rsi+rsi*2]
>         lea     rdx, [rdi+rax*8]
> .L3:
>         mov     eax, 1541
>         mov     ecx, 3340
>         mov     esi, 5396
>         mov     DWORD PTR [rsp-39], 67305985
>         mov     WORD PTR [rsp-35], ax
>         add     rdi, 24
>         mov     DWORD PTR [rsp-32], 185207048
>         mov     WORD PTR [rsp-28], cx
>         mov     BYTE PTR [rsp-26], 14
>         movdqu  xmm1, XMMWORD PTR [rsp-40]
>         mov     DWORD PTR [rsp-24], 319951120
>         mov     WORD PTR [rsp-20], si
>         mov     BYTE PTR [rsp-18], 22
>         mov     rax, QWORD PTR [rsp-24]
>         movups  XMMWORD PTR [rdi-24], xmm1
>         mov     QWORD PTR [rdi-8], rax
>         cmp     rdi, rdx
>         jne     .L3
> .L1:
>         ret

It can be seen that the struct initialization has been moved into the loop,
which is a severe pessimization.

The issue cannot be reproduced if the struct is initialized this way:

>     S s;
>     memset(&s, 0, sizeof(s));

But the following still reproduces the issue:

>     S s {};
>     memset(&s, 0, sizeof(s));

Replacing the assignment inside the loop with memcpy does not affect the result.

According to Godbolt, the generated assembly has not changed since GCC 7.2.
GCC 7.1 does not use vector registers but still initializes the struct inside
the loop. GCC 6.4 and earlier do not use vector registers either but do
initialize the struct outside the loop, as expected.
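For completeness, here is a sketch of the whole function written with the memset-only initialization, which is the variant reported above as not reproducing the issue. The name copy_memset is hypothetical and used only to distinguish it from the original copy(); whether a given GCC version keeps the stores out of the loop for it should be checked on that version rather than assumed.

```cpp
#include <cstring>

struct S
{
    char c[24];
};

// Hypothetical variant of copy(): identical except that s is zeroed
// with memset alone instead of being value-initialized with "S s {}".
void copy_memset(S *dest, unsigned count)
{
    S s;
    std::memset(&s, 0, sizeof(s));
    for (int i = 0; i < 7; ++i)
        s.c[i] = i;
    for (int i = 8; i < 15; ++i)
        s.c[i] = i;
    for (int i = 16; i < 23; ++i)
        s.c[i] = i;
    while (count--)
        *dest++ = s;
}
```

Both versions should produce the same bytes in dest; only the generated code is expected to differ.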
EXPECTED RESULT

Ideally, the loop body would be optimized into something like this:

>         movdqu  xmm1, XMMWORD PTR [rsp-40]
>         mov     rax, QWORD PTR [rsp-24]
> .L3:
>         add     rdi, 24
>         movups  XMMWORD PTR [rdi-24], xmm1
>         mov     QWORD PTR [rdi-8], rax
>         cmp     rdi, rdx
>         jne     .L3
> .L1:
>         ret

Thank you.