https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97482

            Bug ID: 97482
           Summary: Optimized (-O3) XMM register load incorrectly uses
                    movdqu
           Product: gcc
           Version: 10.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vkkerrata at gmail dot com
  Target Milestone: ---

Created attachment 49396
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49396&action=edit
preprocessed f.c

The code pasted below reproduces this bug. f.c contains a commented-out line
that can replace the intrinsic call; it exhibits the same bug. I have only
encountered the bug at -O3. At lower optimization levels, loading two 64-bit
values into a 128-bit register is a two-step process, with movq and movhps
each handling one 64-bit half. gcc 10.1.0 can instead replace the pair with a
single movdqu, which places the halves in the opposite order from what is
intended.

Because the optimizer doesn't always choose movdqu, the issue may disappear
with seemingly unrelated changes. The code provided below is in two files
because I was unable to create a reproducer inside a single translation unit.

System Type: Linux, Intel(R) Core(TM) i5-8265U CPU @ 1.60GHz

Build Options: Default; package installed from ppa:ubuntu-toolchain-r/test

Compile Line: gcc-10 main.c f.c -O3 -save-temps -o movdqu-bug

Compiler Output: None ($? == 0)

$ gcc-10 -v
Using built-in specs.
COLLECT_GCC=gcc-10
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/10/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa:hsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu
10.1.0-2ubuntu1~18.04' --with-bugurl=file:///usr/share/doc/gcc-10/README.Bugs
--enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,m2 --prefix=/usr
--with-gcc-major-version-only --program-suffix=-10
--program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug
--enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new
--enable-gnu-unique-object --disable-vtable-verify --enable-plugin
--enable-default-pie --with-system-zlib --enable-libphobos-checking=release
--with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch
--disable-werror --with-arch-32=i686 --with-abi=m64
--with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic
--enable-offload-targets=nvptx-none=/build/gcc-10-eDoCEC/gcc-10-10.1.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-10-eDoCEC/gcc-10-10.1.0/debian/tmp-gcn/usr,hsa
--without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu
--host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 10.1.0 (Ubuntu 10.1.0-2ubuntu1~18.04) 

$ cat f.c 
#include <emmintrin.h>
#include <stdint.h>

uint64_t
f(const uint64_t *in)
{
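    // Load two 64-bit halves: in[0] into the high half, in[1] into the low
    // half (_mm_set_epi64x takes the high element first).
    // Bug: under -O3 this becomes a single movdqu, which loads the halves in
    // memory order instead. Both versions below do the wrong thing.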
    __m128i x = _mm_set_epi64x(in[0], in[1]);
    //__m128i x = {in[1], in[0]};

    // permute to illustrate change
    x = _mm_shuffle_epi32(x, _MM_SHUFFLE(1,2,3,0));

    // extract and return the low 64 bits
    return _mm_cvtsi128_si64x(x);
}

$ cat main.c
#include <inttypes.h>
#include <stdio.h>

uint64_t f(const uint64_t *);

int
main(void)
{
    // correct output: 4444444411111111
    // bug output:     2222222233333333
    uint64_t vec[2] = { 0x4444444433333333, 0x2222222211111111 };
    printf("%016"PRIx64"\n", f(vec));
    return 0;
}
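
To make the expected and observed values easier to check by hand, here is a
scalar sketch of what f() computes. It uses no SSE intrinsics, so it should not
be affected by the bug itself. The vec128 type and all helper names
(from_halves, load_intended, load_memory_order, shuffle_1230, low64,
model_expected, model_buggy) are hypothetical and exist only for this
illustration; the "memory order" load is my assumption about what the movdqu
effectively does here, not something taken from the generated assembly.

```c
#include <stdint.h>

/* Scalar stand-in for a 128-bit XMM register; lane[0] is the lowest 32 bits. */
typedef struct { uint32_t lane[4]; } vec128;

/* Build a register from an explicit low and high 64-bit half. */
static vec128 from_halves(uint64_t lo, uint64_t hi)
{
    vec128 v = {{ (uint32_t)lo, (uint32_t)(lo >> 32),
                  (uint32_t)hi, (uint32_t)(hi >> 32) }};
    return v;
}

/* Intended load: _mm_set_epi64x(in[0], in[1]) puts in[1] low, in[0] high. */
static vec128 load_intended(const uint64_t *in)
{
    return from_halves(in[1], in[0]);
}

/* Assumed effect of a single unaligned 128-bit load from `in`:
 * in[0] ends up in the low half, in[1] in the high half. */
static vec128 load_memory_order(const uint64_t *in)
{
    return from_halves(in[0], in[1]);
}

/* Model of _mm_shuffle_epi32(x, _MM_SHUFFLE(1,2,3,0)):
 * result lanes 0..3 are taken from source lanes 0, 3, 2, 1. */
static vec128 shuffle_1230(vec128 x)
{
    vec128 r = {{ x.lane[0], x.lane[3], x.lane[2], x.lane[1] }};
    return r;
}

/* Model of _mm_cvtsi128_si64x: return the low 64 bits. */
static uint64_t low64(vec128 x)
{
    return ((uint64_t)x.lane[1] << 32) | x.lane[0];
}

/* Value f() should return when the halves are loaded as written. */
uint64_t model_expected(const uint64_t *in)
{
    return low64(shuffle_1230(load_intended(in)));
}

/* Value f() returns if the halves are loaded in memory order. */
uint64_t model_buggy(const uint64_t *in)
{
    return low64(shuffle_1230(load_memory_order(in)));
}
```

Called with the vec array from main.c, model_expected() yields
0x4444444411111111 and model_buggy() yields 0x2222222233333333, matching the
two values noted in the comments in main.c.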
