https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113978

            Bug ID: 113978
           Summary: Misoptimize for long vector load operation
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: xjkp2283572185 at gmail dot com
  Target Milestone: ---

===
Compiler
===
Using built-in specs.
COLLECT_GCC=D:\Tools\gcc\bin\g++.exe
COLLECT_LTO_WRAPPER=D:/Tools/gcc/bin/../libexec/gcc/x86_64-w64-mingw32/14.0.1/lto-wrapper.exe
Target: x86_64-w64-mingw32
Configured with: ../configure --disable-werror
--prefix=/home/luo/x86_64-w64-mingw32-native-gcc14 --host=x86_64-w64-mingw32
--target=x86_64-w64-mingw32 --enable-multilib --enable-languages=c,c++
--disable-sjlj-exceptions --enable-threads=win32
Thread model: win32
Supported LTO compression algorithms: zlib
gcc version 14.0.1 20240130 (experimental) (GCC)

===
Source Code
===
using v [[using gnu: vector_size(128)]] = char;
auto f(v* p) noexcept
{
    return *p;
}

===
Command
===
g++ test.cpp -Ofast -march=znver4

===
Result
===
_Z1fPDv128_c:
.LFB0:
        subq    $248, %rsp
        .seh_stackalloc 248
        .seh_endprologue
        vmovdqa64       (%rdx), %zmm0
        movq    %rcx, %rax
        vmovdqa64       %zmm0, (%rcx)
        vmovdqa64       64(%rdx), %zmm0
        vmovdqa64       %zmm0, 64(%rcx)
        vzeroupper
        addq    $248, %rsp
        ret

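GCC generates an extra stack adjustment: it reserves 248 bytes with subq and
releases them with addq, but never touches that stack space in between.
Clang just generates two loads: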
_Z1fPDv128_c:                           # @_Z1fPDv128_c
# %bb.0:
        vmovaps (%rcx), %zmm0
        vmovaps 64(%rcx), %zmm1
        retq
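
A self-contained variant of the test case may be handy when trying other
compiler versions or options. The following is only a sketch and not part of
the original report: load_v and roundtrip_ok are hypothetical names, the
vector type is spelled with the older __attribute__ syntax instead of
[[using gnu: ...]], and the check only confirms that the 128-byte load returns
the right bytes. Whether the stack adjustment is present still has to be
checked by reading the generated assembly.

```cpp
#include <cassert>

// Same 128-byte char vector as in the report, in the older GNU spelling.
typedef char v __attribute__((vector_size(128)));
static_assert(sizeof(v) == 128, "vector type should be 128 bytes wide");

// Hypothetical name; mirrors f() from the report with an explicit return type.
v load_v(v* p) noexcept
{
    return *p;
}

// Hypothetical helper: fill a vector, load it through load_v, and compare
// every element with what was stored.
bool roundtrip_ok()
{
    v in;
    for (int i = 0; i < 128; ++i)
        in[i] = static_cast<char>(i);

    v out = load_v(&in);

    for (int i = 0; i < 128; ++i)
        if (out[i] != static_cast<char>(i))
            return false;
    return true;
}
```

Adding -S to the command from the report (g++ test.cpp -S -Ofast -march=znver4)
writes the assembly to test.s, where the body of load_v can be compared with
the listings above.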
