https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113978
            Bug ID: 113978
           Summary: Misoptimize for long vector load operation
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: xjkp2283572185 at gmail dot com
  Target Milestone: ---

=== Compiler ===
Using built-in specs.
COLLECT_GCC=D:\Tools\gcc\bin\g++.exe
COLLECT_LTO_WRAPPER=D:/Tools/gcc/bin/../libexec/gcc/x86_64-w64-mingw32/14.0.1/lto-wrapper.exe
Target: x86_64-w64-mingw32
Configured with: ../configure --disable-werror --prefix=/home/luo/x86_64-w64-mingw32-native-gcc14 --host=x86_64-w64-mingw32 --target=x86_64-w64-mingw32 --enable-multilib --enable-languages=c,c++ --disable-sjlj-exceptions --enable-threads=win32
Thread model: win32
Supported LTO compression algorithms: zlib
gcc version 14.0.1 20240130 (experimental) (GCC)

=== Source Code ===
using v [[using gnu: vector_size(128)]] = char;
auto f(v* p) noexcept
{
    return *p;
}

=== Command ===
g++ test.cpp -Ofast -march=znver4

=== Result ===
_Z1fPDv128_c:
.LFB0:
        subq    $248, %rsp
        .seh_stackalloc 248
        .seh_endprologue
        vmovdqa64       (%rdx), %zmm0
        movq    %rcx, %rax
        vmovdqa64       %zmm0, (%rcx)
        vmovdqa64       64(%rdx), %zmm0
        vmovdqa64       %zmm0, 64(%rcx)
        vzeroupper
        addq    $248, %rsp
        ret

GCC generates extra stack operations here: the subq/addq pair reserves and
releases 248 bytes of stack that the function never uses. Clang generates
just two loads:

_Z1fPDv128_c:                           # @_Z1fPDv128_c
# %bb.0:
        vmovaps (%rcx), %zmm0
        vmovaps 64(%rcx), %zmm1
        retq
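
For anyone reproducing this, the following is a rough sketch of a behavioural sanity check around the same `f`. It only confirms that all 128 bytes are copied correctly; it does not detect the redundant stack adjustment, which still has to be checked in the assembly output. The helper name `f_copies_all_bytes` is hypothetical and not part of the original report, and the code assumes a compiler that supports the GNU `vector_size` attribute and element subscripting on vector types.

```cpp
#include <cassert>   // for the assert in the usage shown below
#include <cstring>

// Same type and function as in the report (GNU vector extension, 128 bytes).
using v [[using gnu: vector_size(128)]] = char;

static_assert(sizeof(v) == 128, "v is expected to be a 128-byte vector");

auto f(v* p) noexcept
{
    return *p;
}

// Hypothetical helper (not from the report): fills a vector with a known
// pattern, passes it through f, and compares input and output byte by byte.
bool f_copies_all_bytes()
{
    v in;
    for (int i = 0; i < 128; ++i)
        in[i] = static_cast<char>(i);   // element access via GNU vector subscripting

    v out = f(&in);
    return std::memcmp(&in, &out, sizeof(v)) == 0;
}
```

Calling `assert(f_copies_all_bytes());` from a `main` and building with the same `-Ofast -march=znver4` flags should pass on a machine that supports AVX-512; since the call is likely to be inlined there, the standalone `f` from the report remains the thing to inspect for the extra `subq`/`addq`.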