https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97482
Bug ID: 97482
Summary: Optimized (-O3) XMM register load incorrectly uses movdqu
Product: gcc
Version: 10.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: vkkerrata at gmail dot com
Target Milestone: ---

Created attachment 49396
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49396&action=edit
preprocessed f.c

The code pasted below reproduces this bug. A commented-out line in f.c can replace the builtin function call; it exhibits the same bug. I have only encountered this bug with -O3 enabled.

At lower optimization levels, loading two 64-bit values into a 128-bit register is a two-step process, with movq and movhps each handling one 64-bit half. In gcc 10.1.0 at -O3, this can instead be replaced with a single movdqu, which places the halves in the reverse of the intended order. Because the optimizer doesn't always choose movdqu, the issue may disappear with seemingly unrelated changes.

The code provided below is in two files because I was unable to create a reproducer inside a single translation unit.

System Type: Linux, Intel(R) Core(TM) i5-8265U CPU @ 1.60GHz
Build Options: Default; apg installed from ppa:ubuntu-toolchain-r/test
Compile Line: gcc-10 main.c f.c -O3 -save-temps -o movdqu-bug
Compiler Output: None ($? == 0)

$ gcc-10 -v
Using built-in specs.
COLLECT_GCC=gcc-10
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/10/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa:hsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 10.1.0-2ubuntu1~18.04' --with-bugurl=file:///usr/share/doc/gcc-10/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-10 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-10-eDoCEC/gcc-10-10.1.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-10-eDoCEC/gcc-10-10.1.0/debian/tmp-gcn/usr,hsa --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 10.1.0 (Ubuntu 10.1.0-2ubuntu1~18.04)

$ cat f.c
#include <emmintrin.h>
#include <stdint.h>

uint64_t f(const uint64_t *in) {
    // load two 64-bit halves
    // bug: incorrect use of movdqu under -O3
    // Both versions below do the wrong thing.
    __m128i x = _mm_set_epi64x(in[0], in[1]);
    //__m128i x = {in[1], in[0]};

    // permute to illustrate change
    x = _mm_shuffle_epi32(x, _MM_SHUFFLE(1,2,3,0));

    // extract and return the low 64 bits
    return _mm_cvtsi128_si64x(x);
}

$ cat main.c
#include <inttypes.h>
#include <stdio.h>

uint64_t f(const uint64_t *);

int main(void) {
    // correct output: 4444444411111111
    // bug output:     2222222233333333
    uint64_t vec[2] = { 0x4444444433333333, 0x2222222211111111 };
    printf("%016"PRIx64"\n", f(vec));
    return 0;
}
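
For reference, the two values quoted in main.c can be reproduced by hand with a scalar model of f(). This is only a sketch and is not part of the original reproducer: model_f, expected_result and swapped_result are hypothetical names, and the model assumes the documented lane semantics of _mm_set_epi64x (first argument becomes the high 64 bits) and _mm_shuffle_epi32 (each result lane selects one source lane).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of f(); not from the original report.
 * lo and hi are the low and high 64-bit halves of the XMM register. */
static uint64_t model_f(uint64_t lo, uint64_t hi)
{
    /* View the 128-bit value as four 32-bit lanes, lane 0 lowest. */
    uint32_t e[4] = {
        (uint32_t)lo, (uint32_t)(lo >> 32),
        (uint32_t)hi, (uint32_t)(hi >> 32)
    };

    /* _MM_SHUFFLE(1,2,3,0): result lanes 0..3 take source lanes 0,3,2,1.
     * Only result lanes 0 and 1 reach the returned low 64 bits. */
    uint32_t r0 = e[0];
    uint32_t r1 = e[3];

    return ((uint64_t)r1 << 32) | r0;
}

/* Intended load: _mm_set_epi64x(in[0], in[1]) puts in[0] in the high half
 * and in[1] in the low half. */
uint64_t expected_result(const uint64_t *in)
{
    assert(in != NULL);
    return model_f(in[1], in[0]);
}

/* Swapped halves: in[0] lands in the low half, as a plain 16-byte load
 * of in[0..1] would place it. */
uint64_t swapped_result(const uint64_t *in)
{
    assert(in != NULL);
    return model_f(in[0], in[1]);
}
```

With the vec array from main.c, expected_result(vec) yields 0x4444444411111111 and swapped_result(vec) yields 0x2222222233333333, matching the "correct output" and "bug output" comments above.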