https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103393
Bug ID: 103393 Summary: [ 12 Regression ] Auto vectorizer generating 256bit register usage with -mprefer-avx128 -mprefer-vector-width=128 Product: gcc Version: 12.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: jschoen4 at gmail dot com Target Milestone: --- gcc -v Using built-in specs. COLLECT_GCC=/gcc_build/bin/gcc COLLECT_LTO_WRAPPER=/gcc_build/bin/../libexec/gcc/x86_64-pc-linux-gnu/12.0.0/lto-wrapper Target: x86_64-pc-linux-gnu Configured with: ../configure --prefix=/gcc_build --include=/gcc_build/include --disable-multilib --enable-rpath --enable-__cxa_atexit --enable-nls --disable-checking --disable-libunwind-exceptions --enable-bootstrap --enable-shared --enable-static --enable-threads=posix --with-gcc --with-gnu-as --with-gnu-ld --with-system-zlib --enable-languages=c,c++,fortran,go,objc,obj-c++ --enable-lto --enable-stage1-languages=c Thread model: posix Supported LTO compression algorithms: zlib gcc version 12.0.0 20211123 (experimental) (GCC) Branch: trunk, w/ a latest commit of 721d8b9e26bf8205c1f2125c2626919a408cdbe4 =========== =TEST CODE= =========== # cat test.cpp struct TestData { float arr[8]; }; void cpy( TestData& s1, TestData& s2 ) { for(int i=0; i<8; ++i) { s1.arr[i] = s2.arr[i]; } } =========== =cmd = =========== gcc -S -masm=intel -O2 -mavx -mprefer-avx128 -mprefer-vector-width=128 -Wall -Wextra test.cpp -o test.s =========== =BAD ASM = = GCC 12 = =========== cat test.s .file "test.cpp" .intel_syntax noprefix .text .p2align 4 .globl _Z3cpyR8TestDataS0_ .type _Z3cpyR8TestDataS0_, @function _Z3cpyR8TestDataS0_: .LFB0: .cfi_startproc vmovdqu ymm0, YMMWORD PTR [rsi] vmovdqu YMMWORD PTR [rdi], ymm0 vzeroupper ret .cfi_endproc .LFE0: .size _Z3cpyR8TestDataS0_, .-_Z3cpyR8TestDataS0_ .ident "GCC: (GNU) 12.0.0 20211123 (experimental)" .section .note.GNU-stack,"",@progbits =========== = GCC 11 = (GCC 10 generates identical asm) =========== cat test.s .file "test.cpp" .intel_syntax noprefix .text .p2align 4 .globl _Z3cpyR8TestDataS0_ .type _Z3cpyR8TestDataS0_, @function _Z3cpyR8TestDataS0_: .LFB0: .cfi_startproc mov edx, 32 jmp memmove .cfi_endproc .LFE0: .size _Z3cpyR8TestDataS0_, .-_Z3cpyR8TestDataS0_ .ident "GCC: (GNU) 11.2.0" .section .note.GNU-stack,"",@progbits ========= = GCC 9 = ========= cat test.s .file "test.cpp" .intel_syntax noprefix .text .p2align 4 .globl _Z3cpyR8TestDataS0_ .type _Z3cpyR8TestDataS0_, @function _Z3cpyR8TestDataS0_: .LFB0: .cfi_startproc xor eax, eax .p2align 4,,10 .p2align 3 .L2: vmovss xmm0, DWORD PTR [rsi+rax] vmovss DWORD PTR [rdi+rax], xmm0 add rax, 4 cmp rax, 32 jne .L2 ret .cfi_endproc .LFE0: .size _Z3cpyR8TestDataS0_, .-_Z3cpyR8TestDataS0_ .ident "GCC: (GNU) 9.3.0" .section .note.GNU-stack,"",@progbits The auto vectorizer is generating YMM / 256-bit vector instructions with -mprefer-avx128 and -mprefer-vector-width=128 flags specified. This is an issue for low latency software. Using registers 256-bit and wider causes jitter CPU problems on sky lake / cascade lake / ice lake chips. This is true even in cases where the instructions used are considered avx256-light instructions due to a "mix of instructions" being used to determine the power levels (this is also mentioned in intel's optimization manual). Auto vectorizer needs to respect the prefer width flags. Enabling/using newer instruction sets i.e. AVX/AVX2/AVX512 does not require usage of the wider register types.