[Bug target/94663] New: [missed optimization] _mm512_dpbusds_epi32 generates excess vmovdqa64

2020-04-19 Thread gcc at kheafield dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94663

Bug ID: 94663
   Summary: [missed optimization] _mm512_dpbusds_epi32 generates
excess vmovdqa64
   Product: gcc
   Version: 9.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: gcc at kheafield dot com
  Target Milestone: ---

The _mm512_dpbusds_epi32 intrinsic generates extra vmovdqa64 instructions when
used inside a loop.  The underlying instruction, vpdpbusds, adds to an
accumulator, so it is commonly used in loops.  The compiler appears to be
unnecessarily using two registers for the accumulator by copying it.  

Example:

#include "immintrin.h"
__m512i Slow(const __m512i *a, const __m512i b0, const __m512i b1, std::size_t
count) {
  __m512i c0 = _mm512_setzero_epi32();
  __m512i c1 = _mm512_setzero_epi32();
  for (std::size_t i = 0; i < count; ++i) {
c0 = _mm512_dpbusds_epi32(c0, a[i], b0);
c1 = _mm512_dpbusds_epi32(c1, a[i], b1);
  }
  // Do not optimize away
  return _mm512_sub_epi32(c0, c1);
}

When compiled with g++ -O3 -mavx512vnni example.cc -S, the main loop is:

.L3:
vmovdqa64   (%rdi), %zmm6
vmovdqa64   %zmm3, %zmm0
vmovdqa64   %zmm4, %zmm2
addq$64, %rdi
vpdpbusds   %zmm5, %zmm6, %zmm0
vpdpbusds   %zmm1, %zmm6, %zmm2
vmovdqa64   %zmm0, %zmm3
vmovdqa64   %zmm2, %zmm4
cmpq%rdi, %rax
jne .L3

It's copying accumulator zmm3 to zmm0, accumulating in zmm0, then copying back
to zmm3.  It should have just used one register.  The same happens for zmm4 and
zmm2.  

Workaround: use inline assembly.  

__m512i Fast(const __m512i *a, const __m512i b0, const __m512i b1, std::size_t
count) {
  __m512i c0 = _mm512_setzero_epi32();
  __m512i c1 = _mm512_setzero_epi32();
  for (std::size_t i = 0; i < count; ++i) {
asm ("vpdpbusds %1, %2, %0" : "+x"(c0) : "mx"(a[i]), "x"(b0));
asm ("vpdpbusds %1, %2, %0" : "+x"(c1) : "mx"(a[i]), "x"(b1));
  }
  // Do not optimize away
  return _mm512_sub_epi32(c0, c1);
}

Here, the generated code is better, with no extra moves.  

.L10:
#APP
# 19 "example.cc" 1
vpdpbusds (%rdi), %zmm3, %zmm0
# 0 "" 2
# 20 "example.cc" 1
vpdpbusds (%rdi), %zmm1, %zmm2
# 0 "" 2
#NO_APP
addq$64, %rdi
cmpq%rax, %rdi
jne .L10

Reproduced on the following versions of g++:

g++ -v 
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/9.2.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with:
/var/tmp/portage/sys-devel/gcc-9.2.0-r2/work/gcc-9.2.0/configure
--host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --prefix=/usr
--bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/9.2.0
--includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/include
--datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/9.2.0
--mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/9.2.0/man
--infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/9.2.0/info
--with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/include/g++-v9
--with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/9.2.0/python
--enable-languages=c,c++,fortran --enable-obsolete --enable-secureplt
--disable-werror --with-system-zlib --enable-nls --without-included-gettext
--enable-checking=release --with-bugurl=https://bugs.gentoo.org/
--with-pkgversion='Gentoo 9.2.0-r2 p3' --disable-esp --enable-libstdcxx-time
--enable-shared --enable-threads=posix --enable-__cxa_atexit
--enable-clocale=gnu --enable-multilib --with-multilib-list=m32,m64
--disable-altivec --disable-fixed-point --enable-targets=all --enable-libgomp
--disable-libmudflap --disable-libssp --disable-systemtap
--enable-vtable-verify --enable-lto --without-isl --enable-default-pie
--enable-default-ssp
Thread model: posix
gcc version 9.2.0 (Gentoo 9.2.0-r2 p3) 

g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/8/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu
8.4.0-1ubuntu1~18.04' --with-bugurl=file:///usr/share/doc/gcc-8/README.Bugs
--enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr
--with-gcc-major-version-only --program-suffix=-8
--program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug
--enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new
--enable-gnu-unique-object --disable-vtable-verify --enable-libmpx
--enable-plugin --enable-default-pie --with-system-z

[Bug c++/94832] New: AVX512 scatter/gather macros lack parentheses when unoptimized

2020-04-28 Thread gcc at kheafield dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94832

Bug ID: 94832
   Summary: AVX512 scatter/gather macros lack parentheses when
unoptimized
   Product: gcc
   Version: 9.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: gcc at kheafield dot com
  Target Milestone: ---

This code behaves differently and produces a warning about void * arithmetic
when compiled without optimization:

#include 
void Fail(int *data) {
  _mm512_mask_i32scatter_epi32(data - 1, 0x, _mm512_set1_epi32(1),
_mm512_set1_epi32(1), 1);
}

Warning and writes are based at (void*)data - 1:

g++ -mavx512bw example.cc -c -o example.o
In file included from
/usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0/include/immintrin.h:55,
 from example.cc:1:
example.cc: In function ‘void Foo(int*)’:
example.cc:4:37: warning: pointer of type ‘void *’ used in arithmetic
[-Wpointer-arith]
4 |   _mm512_mask_i32scatter_epi32(data - 1, 0x, _mm512_set1_epi32(1),
_mm512_set1_epi32(1), 1);
  | ^

No warning and writes are based at (void*)(data - 1), the expected behavior:

g++ -mavx512bw example.cc -O3 -c -o example.o
# No output.

If we look at avx512fintrin.h, it becomes clear why:

#ifdef __OPTIMIZE__
/* ... */
extern __inline void
__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm512_mask_i32scatter_epi32 (void *__addr, __mmask16 __mask,
__m512i __index, __m512i __v1, int __scale)
{
  __builtin_ia32_scattersiv16si (__addr, __mask, (__v16si) __index,
 (__v16si) __v1, __scale);
}
/* ... */
#else
/* ... */
#define _mm512_mask_i32scatter_epi32(ADDR, MASK, INDEX, V1, SCALE)  \
  __builtin_ia32_scattersiv16si ((void *)ADDR, (__mmask16)MASK,   \
 (__v16si)(__m512i)INDEX,   \
 (__v16si)(__m512i)V1, (int)SCALE)
/* ... */
#endif

When compiled without optimization, the header uses a macro.  And data - 1 is
mapping to (void*)data - 1, producing a warning about type ‘void *’ used in
arithmetic as well as a different address calculation.  

Tested on two gcc versions.  

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/9.3.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /var/tmp/portage/sys-devel/gcc-9.3.0/work/gcc-9.3.0/configure
--host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --prefix=/usr
--bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/9.3.0
--includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0/include
--datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/9.3.0
--mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/9.3.0/man
--infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/9.3.0/info
--with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0/include/g++-v9
--with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/9.3.0/python
--enable-languages=c,c++,fortran --enable-obsolete --enable-secureplt
--disable-werror --with-system-zlib --enable-nls --without-included-gettext
--enable-checking=release --with-bugurl=https://bugs.gentoo.org/
--with-pkgversion='Gentoo 9.3.0 p2' --disable-esp --enable-libstdcxx-time
--enable-shared --enable-threads=posix --enable-__cxa_atexit
--enable-clocale=gnu --enable-multilib --with-multilib-list=m32,m64
--disable-altivec --disable-fixed-point --enable-targets=all --enable-libgomp
--disable-libmudflap --disable-libssp --disable-libada --disable-systemtap
--enable-vtable-verify --enable-lto --without-isl --enable-default-pie
--enable-default-ssp
Thread model: posix
gcc version 9.3.0 (Gentoo 9.3.0 p2) 

Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/8/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu
8.4.0-1ubuntu1~18.04' --with-bugurl=file:///usr/share/doc/gcc-8/README.Bugs
--enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr
--with-gcc-major-version-only --program-suffix=-8
--program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug
--enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new
--enable-gnu-unique-object --disable-vtable-verify --enable-libmpx
--enable-plugin --enable-default-pie --with-system-zlib
--with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch
--disable-werror --with-arch-32=i686 --with-abi=m64
--with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic
--enable-offload-targets=nvptx-none --without-cuda-driver
--enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu
--target=x86_64-linux-gnu
Thread model: posix
gcc version 8.4.0 (Ubuntu 8.4.0-1ubuntu1~18.04)

[Bug target/94832] AVX512 scatter/gather macros lack parentheses when unoptimized

2020-04-29 Thread gcc at kheafield dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94832

--- Comment #3 from Kenneth Heafield  ---
Being a macro some of the time also causes trouble with template commas and the
C preprocessor.  

#include 
template  int *TemplatedFunction();
void Fail() {
  _mm512_mask_i32scatter_epi32(TemplatedFunction(), 0x,
_mm512_set1_epi32(1), _mm512_set1_epi32(1), 1);
}

Without optimization, error because the template , is interpreted by the macro. 

g++ -mavx512f -c template.cc 
template.cc:6:118: error: macro "_mm512_mask_i32scatter_epi32" passed 6
arguments, but takes just 5
6 |   _mm512_mask_i32scatter_epi32(TemplatedFunction(), 0x,
_mm512_set1_epi32(1), _mm512_set1_epi32(1), 1);
  |
 ^
In file included from
/usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0/include/immintrin.h:55,
 from template.cc:1:
/usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0/include/avx512fintrin.h:10475: note:
macro "_mm512_mask_i32scatter_epi32" defined here
10475 | #define _mm512_mask_i32scatter_epi32(ADDR, MASK, INDEX, V1, SCALE) \
  | 
template.cc: In function ‘void Fail()’:
template.cc:6:3: error: ‘_mm512_mask_i32scatter_epi32’ was not declared in this
scope
6 |   _mm512_mask_i32scatter_epi32(TemplatedFunction(), 0x,
_mm512_set1_epi32(1), _mm512_set1_epi32(1), 1);
  |   ^~~~


With optimization, no output.  
g++ -mavx512f -O3 -c template.cc