[Bug c++/96709] New: cmov and vectorization

2020-08-19 Thread g.peterh...@t-online.de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96709

Bug ID: 96709
   Summary: cmov and vectorization
   Product: gcc
   Version: unknown
   URL: https://godbolt.org/z/GKnj17
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: g.peterh...@t-online.de
  Target Milestone: ---

Hello GCC team,
I noticed two problems:
1) the compiler does not generate cmov instructions
2) auto-vectorization is very unreliable

I would like to illustrate this with a stable shift-left, see
https://godbolt.org/z/GKnj17
I have implemented several variants of it.

Regarding 1)
Only silent::conditional_move generates a cmov; none of the other variants do.

Regarding 2)
- Auto-vectorization only works if the smaller of the two arrays (val and
bit) is at least as large as an SSE register, although the values could be
adjusted.
- If vectorization is used at all, often only 128-bit code is generated
(_mm_XXX) instead of 256-bit (AVX, _mm256_XXX) or wider.
- The 16-bit shift instructions from AVX512 (_mmXXX_sllv_epi16) are not used
even when a suitable architecture is selected.

The more complex shl_attempt_vectorize function works a little better, but not
100% either. Try varying the array size, the value/shift types and the
functions!

Best regards
Gero
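
For readers without access to the Compiler Explorer link, the idea of a
"stable" shift-left can be sketched roughly as follows. This is not the
reporter's code: the names shl_stable and shl_stable_all are hypothetical, and
the sketch assumes that "stable" means a shift count at or beyond the bit width
yields 0 instead of undefined behaviour. Whether GCC turns the select into a
cmov or vectorizes the loop depends on the version and flags and is not claimed
here.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical helper (not from the bug report): left shift that returns 0
// when the count is >= the bit width of T, instead of undefined behaviour.
template <typename T>
constexpr T shl_stable(T val, unsigned bit) noexcept
{
    static_assert(std::is_unsigned<T>::value, "sketch covers unsigned types only");
    // Shift in a type at least as wide as unsigned int, so integer promotion
    // of narrow types never results in a signed shift.
    using wide = typename std::common_type<T, unsigned>::type;
    constexpr unsigned digits = std::numeric_limits<T>::digits;
    // Clamp the count first so the shift itself is always well defined.
    const unsigned safe = bit < digits ? bit : 0u;
    const T shifted = static_cast<T>(static_cast<wide>(val) << safe);
    // Simple select: a candidate for cmov (scalar) or a blend (vector).
    return bit < digits ? shifted : T{0};
}

// Element-wise variant over two equally sized arrays (val and bit).
template <typename V, typename B, std::size_t N>
void shl_stable_all(std::array<V, N>& val, const std::array<B, N>& bit) noexcept
{
    for (std::size_t i = 0; i < N; ++i)
        val[i] = shl_stable(val[i], static_cast<unsigned>(bit[i]));
}
```

Compiling shl_stable_all for different element types and array sizes with
-O3 and a chosen -march= value is one way to compare the generated code.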

[Bug target/96709] cmov and vectorization

2020-08-24 Thread g.peterh...@t-online.de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96709

--- Comment #2 from g.peterh...@t-online.de ---
You can choose the Boost version on godbolt.org. The example uses 1.73, but
only needs the macros
#define BOOST_FORCEINLINE inline __attribute__ ((__always_inline__))
and
#define BOOST_NOINLINE __attribute__ ((__noinline__)).

[Bug target/90492] simple array-copy not use available simd-registers

2019-12-20 Thread g.peterh...@t-online.de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90492

g.peterh...@t-online.de changed:

   What|Removed |Added

  Known to fail||10.0

--- Comment #6 from g.peterh...@t-online.de ---
The bug is still present in gcc 10.0.0 20191210. When can I expect this to be
fixed?

[Bug c++/90491] New: simple operation with unsigned integer and conversion to float/double not vectorized

2019-05-15 Thread g.peterh...@t-online.de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90491

Bug ID: 90491
   Summary: simple operation with unsigned integer and conversion
to float/double not vectorized
   Product: gcc
   Version: 8.3.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: g.peterh...@t-online.de
  Target Milestone: ---

snip

#include 
#include 

int main(const int argc, const char** argv)
{
using value_type = float; // or double
using array_type = std::array; // size does not matter

array_type  a;

/*
 * this loop is not vectorized
 * an explicit conversion, a[i] = argc + int(i), works
 */
for (size_t i=0; ihttp://bugs.opensuse.org/ --with-pkgversion='SUSE Linux'
--with-slibdir=/lib64 --with-system-zlib --enable-libstdcxx-allocator=new
--disable-libstdcxx-pch --enable-version-specific-runtime-libs
--with-gcc-major-version-only --enable-linker-build-id --enable-linux-futex
--enable-gnu-indirect-function --program-suffix=-8 --without-system-libunwind
--enable-multilib --with-arch-32=x86-64 --with-tune=generic
--build=x86_64-suse-linux --host=x86_64-suse-linux
Thread model: posix
gcc version 8.3.1 20190226 [gcc-8-branch revision 269204] (SUSE Linux)
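
The loop in the snippet above is cut off, so the following is only a sketch of
the two variants the comment describes. It assumes the non-vectorized form adds
the size_t index directly (a[i] = argc + i), which makes the sum an unsigned
value before the conversion to float. All function names are hypothetical. The
two forms are not interchangeable for negative argc, which is one reason a
compiler cannot silently replace one with the other.

```cpp
#include <array>
#include <cstddef>

// Assumed form of the reported loop body: int + size_t is computed as an
// unsigned value, so an unsigned-to-float conversion is needed.
constexpr float from_unsigned_sum(int argc, std::size_t i)
{
    return static_cast<float>(argc + i);
}

// Workaround quoted in the report: narrow the index to int first, so the
// conversion to float starts from a signed int.
constexpr float from_signed_sum(int argc, std::size_t i)
{
    return static_cast<float>(argc + int(i));
}

template <std::size_t N>
void fill_unsigned(std::array<float, N>& a, int argc)
{
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] = from_unsigned_sum(argc, i);
}

template <std::size_t N>
void fill_signed(std::array<float, N>& a, int argc)
{
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] = from_signed_sum(argc, i);
}
```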

[Bug c++/90492] New: simple array-copy not use available simd-registers

2019-05-15 Thread g.peterh...@t-online.de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90492

Bug ID: 90492
   Summary: simple array-copy not use available simd-registers
   Product: gcc
   Version: 8.3.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: g.peterh...@t-online.de
  Target Milestone: ---

snip

#include 
#include 

int main(const int argc, const char** argv)
{
using value_type = int; // type does not matter
using array_type = std::array;

array_type  a, b;

// simple init
for (size_t i=0; ihttp://bugs.opensuse.org/ --with-pkgversion='SUSE Linux'
--with-slibdir=/lib64 --with-system-zlib --enable-libstdcxx-allocator=new
--disable-libstdcxx-pch --enable-version-specific-runtime-libs
--with-gcc-major-version-only --enable-linker-build-id --enable-linux-futex
--enable-gnu-indirect-function --program-suffix=-8 --without-system-libunwind
--enable-multilib --with-arch-32=x86-64 --with-tune=generic
--build=x86_64-suse-linux --host=x86_64-suse-linux
Thread model: posix
gcc version 8.3.1 20190226 [gcc-8-branch revision 269204] (SUSE Linux)

[Bug target/90492] simple array-copy not use available simd-registers

2019-05-15 Thread g.peterh...@t-online.de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90492

--- Comment #3 from g.peterh...@t-online.de ---
On 15.05.19 at 21:20, glisse at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90492
> 
> --- Comment #1 from Marc Glisse  ---
>> copy's use only sse-registers and never higher
> 
> What do you mean by that? Do you want AVX? Then you should let the compiler
> know that they are available (for instance -march=native).
> 

Yes, I use -march=native on a Ryzen 7 2700 (which has AVX/AVX2), or you can
compile with -march=skylake-avx512, but the copy operations use only SSE
registers in all cases.
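
To check which register width a given build picks, a reduced copy loop such as
the one below can be compiled to assembly and inspected. The function name
copy_elements and the array size of 16 are hypothetical choices, and the
commands in the comment are only suggestions: -mprefer-vector-width= is a
documented GCC x86 option, but whether it changes the code generated for this
report's test case has not been verified here.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical reduction of the report's copy loop.
// Suggested (untested) ways to inspect the result:
//   g++ -O3 -march=skylake-avx512 -S copy.cpp
//   g++ -O3 -march=skylake-avx512 -mprefer-vector-width=512 -S copy.cpp
template <typename T, std::size_t N>
void copy_elements(const std::array<T, N>& src, std::array<T, N>& dst)
{
    for (std::size_t i = 0; i < N; ++i)
        dst[i] = src[i];
}

// Explicit instantiation so the function is emitted and shows up in the -S output.
template void copy_elements<std::int64_t, 16>(const std::array<std::int64_t, 16>&,
                                              std::array<std::int64_t, 16>&);
```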

[Bug target/90492] simple array-copy not use available simd-registers

2019-05-15 Thread g.peterh...@t-online.de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90492

--- Comment #4 from g.peterh...@t-online.de ---
#include 
#include 

int main(const int argc, const char** argv)
{
using value_type = int64_t;
using array_type = std::array;

array_type  a, b;

for (size_t i=0; i:
0:  55  push   %rbp
1:  48 89 e5mov%rsp,%rbp
4:  41 54   push   %r12
6:  53  push   %rbx
7:  48 83 e4 c0 and$0xffc0,%rsp
b:  48 8d a4 24 c0 fe fflea-0x140(%rsp),%rsp
   12:  ff
   13:  62 f1 fd 48 6f 05 00vmovdqa64 0x0(%rip),%zmm0# 1d

   1a:  00 00 00
19: R_X86_64_PC32   .rodata-0x4
   1d:  48 8d 9c 24 c0 00 00lea0xc0(%rsp),%rbx
   24:  00
   25:  62 f1 fd 48 7f 44 24vmovdqa64 %zmm0,0x40(%rsp)
   2c:  01
   2d:  c5 f9 6f d0 vmovdqa %xmm0,%xmm2
   31:  62 f1 fd 48 6f 05 00vmovdqa64 0x0(%rip),%zmm0# 3b

   38:  00 00 00
37: R_X86_64_PC32   .rodata+0x3c
   3b:  4c 8d a4 24 40 01 00lea0x140(%rsp),%r12
   42:  00
   43:  62 f1 fd 48 7f 44 24vmovdqa64 %zmm0,0x80(%rsp)
   4a:  02
   4b:  62 f1 fd 08 6f 5c 24vmovdqa64 0x50(%rsp),%xmm3
   52:  05
   53:  62 f1 fd 08 6f 64 24vmovdqa64 0x60(%rsp),%xmm4
   5a:  06
   5b:  62 f1 fd 08 6f 6c 24vmovdqa64 0x70(%rsp),%xmm5
   62:  07
   63:  62 f1 fd 08 6f 74 24vmovdqa64 0x90(%rsp),%xmm6
   6a:  09
   6b:  62 f1 fd 08 6f 7c 24vmovdqa64 0xa0(%rsp),%xmm7
   72:  0a
   73:  62 f1 fd 08 6f 4c 24vmovdqa64 0xb0(%rsp),%xmm1
   7a:  0b
   7b:  62 f1 fd 08 7f 54 24vmovdqa64 %xmm2,0xc0(%rsp)
   82:  0c
   83:  62 f1 fd 08 7f 5c 24vmovdqa64 %xmm3,0xd0(%rsp)
   8a:  0d
   8b:  62 f1 fd 08 7f 64 24vmovdqa64 %xmm4,0xe0(%rsp)
   92:  0e
   93:  62 f1 fd 08 7f 44 24vmovdqa64 %xmm0,0x100(%rsp)
   9a:  10
   9b:  62 f1 fd 08 7f 6c 24vmovdqa64 %xmm5,0xf0(%rsp)
   a2:  0f
   a3:  62 f1 fd 08 7f 74 24vmovdqa64 %xmm6,0x110(%rsp)
   aa:  11
   ab:  62 f1 fd 08 7f 7c 24vmovdqa64 %xmm7,0x120(%rsp)
   b2:  12
   b3:  62 f1 fd 08 7f 4c 24vmovdqa64 %xmm1,0x130(%rsp)
   ba:  13
   bb:  0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
   c0:  48 8b 33mov(%rbx),%rsi
   c3:  bf 00 00 00 00  mov$0x0,%edi
c4: R_X86_64_32 std::cout
   c8:  48 83 c3 08 add$0x8,%rbx
   cc:  e8 00 00 00 00  callq  d1 
cd: R_X86_64_PLT32  std::ostream&
std::ostream::_M_insert(long)-0x4
   d1:  48 89 c7mov%rax,%rdi
   d4:  ba 01 00 00 00  mov$0x1,%edx
   d9:  c6 44 24 3f 20  movb   $0x20,0x3f(%rsp)
   de:  48 8d 74 24 3f  lea0x3f(%rsp),%rsi
   e3:  e8 00 00 00 00  callq  e8 
e4: R_X86_64_PLT32  std::basic_ostream >& std::__ostream_insert
>(std::basic_ostream >&, char const*, long)-0x4
   e8:  49 39 dccmp%rbx,%r12
   eb:  75 d3   jnec0 
   ed:  48 8d 65 f0 lea-0x10(%rbp),%rsp
   f1:  31 c0   xor%eax,%eax
   f3:  5b  pop%rbx
   f4:  41 5c   pop%r12
   f6:  5d  pop%rbp
   f7:  c3  retq
   f8:  0f 1f 84 00 00 00 00nopl   0x0(%rax,%rax,1)
   ff:  00

[Bug tree-optimization/90491] simple operation with unsigned integer and conversion to float/double not vectorized

2019-05-15 Thread g.peterh...@t-online.de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90491

--- Comment #2 from g.peterh...@t-online.de ---
example:

#include 
#include 

int main(const int argc, const char** argv)
{
using value_type = float;
using array_type = std::array;

array_type  a;

for (size_t i=0; i:
0:  55  push   %rbp
1:  48 63 ffmovslq %edi,%rdi
4:  53  push   %rbx
5:  48 8d 64 24 a8  lea-0x58(%rsp),%rsp
a:  48 85 fftest   %rdi,%rdi
d:  0f 88 b9 01 00 00   js 1cc 
   13:  c4 e1 fa 2a c7  vcvtsi2ss %rdi,%xmm0,%xmm0
   18:  c5 fa 11 44 24 10   vmovss %xmm0,0x10(%rsp)
   1e:  48 89 f8mov%rdi,%rax
   21:  48 83 c0 01 add$0x1,%rax
   25:  0f 88 2a 03 00 00   js 355 
   2b:  c4 e1 fa 2a c0  vcvtsi2ss %rax,%xmm0,%xmm0
   30:  c5 fa 11 44 24 14   vmovss %xmm0,0x14(%rsp)
   36:  48 89 f8mov%rdi,%rax
   39:  48 83 c0 02 add$0x2,%rax
   3d:  0f 88 f8 02 00 00   js 33b 
   43:  c4 e1 fa 2a c0  vcvtsi2ss %rax,%xmm0,%xmm0
   48:  c5 fa 11 44 24 18   vmovss %xmm0,0x18(%rsp)
   4e:  48 89 f8mov%rdi,%rax
   51:  48 83 c0 03 add$0x3,%rax
   55:  0f 88 c6 02 00 00   js 321 
   5b:  c4 e1 fa 2a c0  vcvtsi2ss %rax,%xmm0,%xmm0
   60:  c5 fa 11 44 24 1c   vmovss %xmm0,0x1c(%rsp)
   66:  48 89 f8mov%rdi,%rax
   69:  48 83 c0 04 add$0x4,%rax
   6d:  0f 88 94 02 00 00   js 307 
   73:  c4 e1 fa 2a c0  vcvtsi2ss %rax,%xmm0,%xmm0
   78:  c5 fa 11 44 24 20   vmovss %xmm0,0x20(%rsp)
   7e:  48 89 f8mov%rdi,%rax
   81:  48 83 c0 05 add$0x5,%rax
   85:  0f 88 62 02 00 00   js 2ed 
   8b:  c4 e1 fa 2a c0  vcvtsi2ss %rax,%xmm0,%xmm0
   90:  c5 fa 11 44 24 24   vmovss %xmm0,0x24(%rsp)
   96:  48 89 f8mov%rdi,%rax
   99:  48 83 c0 06 add$0x6,%rax
   9d:  0f 88 30 02 00 00   js 2d3 
   a3:  c4 e1 fa 2a c0  vcvtsi2ss %rax,%xmm0,%xmm0
   a8:  c5 fa 11 44 24 28   vmovss %xmm0,0x28(%rsp)
   ae:  48 89 f8mov%rdi,%rax
   b1:  48 83 c0 07 add$0x7,%rax
   b5:  0f 88 fe 01 00 00   js 2b9 
   bb:  c4 e1 fa 2a c0  vcvtsi2ss %rax,%xmm0,%xmm0
   c0:  c5 fa 11 44 24 2c   vmovss %xmm0,0x2c(%rsp)
   c6:  48 89 f8mov%rdi,%rax
   c9:  48 83 c0 08 add$0x8,%rax
   cd:  0f 88 cc 01 00 00   js 29f 
   d3:  c4 e1 fa 2a c0  vcvtsi2ss %rax,%xmm0,%xmm0
   d8:  c5 fa 11 44 24 30   vmovss %xmm0,0x30(%rsp)
   de:  48 89 f8mov%rdi,%rax
   e1:  48 83 c0 09 add$0x9,%rax
   e5:  0f 88 9a 01 00 00   js 285 
   eb:  c4 e1 fa 2a c0  vcvtsi2ss %rax,%xmm0,%xmm0
   f0:  c5 fa 11 44 24 34   vmovss %xmm0,0x34(%rsp)
   f6:  48 89 f8mov%rdi,%rax
   f9:  48 83 c0 0a add$0xa,%rax
   fd:  0f 88 68 01 00 00   js 26b 
  103:  c4 e1 fa 2a c0  vcvtsi2ss %rax,%xmm0,%xmm0
  108:  c5 fa 11 44 24 38   vmovss %xmm0,0x38(%rsp)
  10e:  48 89 f8mov%rdi,%rax
  111:  48 83 c0 0b add$0xb,%rax
  115:  0f 88 36 01 00 00   js 251 
  11b:  c4 e1 fa 2a c0  vcvtsi2ss %rax,%xmm0,%xmm0
  120:  c5 fa 11 44 24 3c   vmovss %xmm0,0x3c(%rsp)
  126:  48 89 f8mov%rdi,%rax
  129:  48 83 c0 0c add$0xc,%rax
  12d:  0f 88 04 01 00 00   js 237 
  133:  c4 e1 fa 2a c0  vcvtsi2ss %rax,%xmm0,%xmm0
  138:  c5 fa 11 44 24 40   vmovss %xmm0,0x40(%rsp)
  13e:  48 89 f8mov%rdi,%rax
  141:  48 83 c0 0d add$0xd,%rax
  145:  0f 88 d2 00 00 00   js 21d 
  14b:  c4 e1 fa 2a c0  vcvtsi2ss %rax,%xmm0,%xmm0
  150:  c5 fa 11 44 24 44   vmovss %xmm0,0x44(%rsp)
  156:  48 89 f8mov%rdi,%rax
  159:  48 83 c0 0e add$0xe,%rax
  15d:  0f 88 a0 00 00 00   js 203 
  163:  c4 e1 fa 2a c0  vcvtsi2ss %rax,%xmm0,%xmm0
  168:  c5 fa 11 44 24 48   vmovss %xmm0,0x48(%rsp)
  16e:  48 83 c7 0f add$0xf,%rdi
  172:  78 75   js 1e9 
  174:  c4 e1 fa 2a c7  vcvtsi2ss %rdi,%xmm0,%xmm0
  179:  c5 fa 11 44 24 4c   vmovss %xmm0,0x4c(%rsp)
  17f:  48 8d 5c 24 10  lea0x10(%rsp),%rbx
  184:  48 8d 6c 24 50  lea0x50(%rsp),%rbp
  189:  0f 1f 80 00 00 00 00nopl   0x0(%rax)
  190:  c5 fa 10 03 vmovss (%rbx),%xmm0
  194:  bf 00 00 00 00  mov$0x0,%edi
195: R_X86_64_32std::cout
  199:  c5 fa 5a c0 vcvtss2sd %xmm0,%xmm0,%xmm0
  19d:  48 83 c3 04 add$0x4,%rbx
  1a1:  e8 00 00 00 00  callq  1a6

[Bug target/90600] New: incompatible 64-bit-types in x86-intrinsics

2019-05-23 Thread g.peterh...@t-online.de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90600

Bug ID: 90600
   Summary: incompatible 64-bit-types in x86-intrinsics
   Product: gcc
   Version: 9.1.1
Status: UNCONFIRMED
  Keywords: ssemmx
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: g.peterh...@t-online.de
  Target Milestone: ---
  Host: x86-64
Target: x86-64
 Build: 9.1.1

COLLECT_GCC=gcc-9
COLLECT_LTO_WRAPPER=/usr/lib64/gcc/x86_64-suse-linux/9/lto-wrapper
OFFLOAD_TARGET_NAMES=hsa:nvptx-none
Target: x86_64-suse-linux
Configured with: ../configure --prefix=/usr --infodir=/usr/share/info
--mandir=/usr/share/man --libdir=/usr/lib64 --libexecdir=/usr/lib64
--enable-languages=c,c++,objc,fortran,obj-c++,ada,go,d
--enable-offload-targets=hsa,nvptx-none=/usr/nvptx-none, --without-cuda-driver
--disable-werror --with-gxx-include-dir=/usr/include/c++/9 --enable-ssp
--disable-libssp --disable-libvtv --disable-cet --disable-libcc1
--enable-plugin --with-bugurl=https://bugs.opensuse.org/
--with-pkgversion='SUSE Linux' --with-slibdir=/lib64 --with-system-zlib
--enable-libstdcxx-allocator=new --disable-libstdcxx-pch --enable-libphobos
--enable-version-specific-runtime-libs --with-gcc-major-version-only
--enable-linker-build-id --enable-linux-futex --enable-gnu-indirect-function
--program-suffix=-9 --without-system-libunwind --enable-multilib
--with-arch-32=x86-64 --with-tune=generic
--with-build-config=bootstrap-lto-lean --enable-link-mutex
--build=x86_64-suse-linux --host=x86_64-suse-linux
Thread model: posix
gcc version 9.1.1 20190520 [gcc-9-branch revision 271396] (SUSE Linux) 

snip 1:
using intrinsic_int64_t = decltype(_mm_cvtsi128_si64(__m128i{}));
std::cout<)<
false
true


snip 2:
uint64_t a, b, r;
uint8_t carry;
carry = _addcarry_u64(carry, a, b, &r);
->
error: invalid conversion from ‘uint64_t*’ {aka ‘long unsigned int*’} to ‘long
long unsigned int*’ [-fpermissive]


Hello,
you're using incompatible 64-bit types in the x86 intrinsics. Why are the
default types from "types.h" not used always and everywhere?
PS: Is there any hope that the completely outdated (from the '70s)
short/int/long types will be completely replaced by u/intX_t (as keywords!)?
With (signed/unsigned) char != u/int8_t, so that you can write:
uint8_t ui = 65;
int8_t si = 65;
unsigned char uc = 65;
signed char sc = 65;
char c = 65;
std::cout << ui ...
->
65
65
A
A
A

Regards
Gero
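
For context on the PS: on common platforms uint8_t and int8_t are typedefs for
unsigned char and signed char, so operator<< currently prints all five values
as the character A. A minimal sketch of the usual workaround, promoting to int
before streaming, is shown below. The helper names are hypothetical, and the
typedef relationship is an assumption about the platform rather than a
guarantee of the standard.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: returns the text that operator<< produces for value.
template <typename T>
std::string streamed(T value)
{
    std::ostringstream os;
    os << value;
    return os.str();
}

// Hypothetical helper: unary plus promotes narrow integer types to int,
// so 8-bit values are printed as numbers rather than characters.
template <typename T>
std::string streamed_as_number(T value)
{
    std::ostringstream os;
    os << +value;
    return os.str();
}
```

With these helpers, streamed(std::uint8_t{65}) gives the character form and
streamed_as_number(std::uint8_t{65}) gives the numeric form.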

[Bug target/90600] incompatible 64-bit-types in x86-intrinsics

2019-05-23 Thread g.peterh...@t-online.de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90600

--- Comment #2 from g.peterh...@t-online.de ---
On 23.05.19 at 19:04, jakub at gcc dot gnu.org wrote:
> Note, clang agrees with gcc here, and I don't think it is a good idea to 
> change
> this incompatibly.
I think it would be better if there were (on each platform) exactly one type
for each size, and if it were used consistently everywhere. Then such problems
could not occur at all.
PS: I miss the I/O routines for __int128 (u/int128_t); clang has them. Will
they be added? On 32-bit platforms (and smaller) these types do not exist
either. (Can clang do that?)
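
As a stopgap where the standard library in use offers no stream operators for
__int128, the value can be converted to text by hand. The sketch below is one
way to do that; the function names are hypothetical, and it assumes a target on
which the compiler provides __int128 at all (typically 64-bit targets).

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: decimal text for an unsigned 128-bit value.
inline std::string to_decimal_u128(unsigned __int128 value)
{
    if (value == 0)
        return "0";
    std::string digits;
    while (value != 0) {
        // Collect the digits least significant first.
        digits.push_back(static_cast<char>('0' + static_cast<int>(value % 10)));
        value /= 10;
    }
    std::reverse(digits.begin(), digits.end());
    return digits;
}

// Signed variant: negate in the unsigned domain so the minimum value
// does not overflow.
inline std::string to_decimal_i128(__int128 value)
{
    if (value >= 0)
        return to_decimal_u128(static_cast<unsigned __int128>(value));
    const unsigned __int128 magnitude =
        static_cast<unsigned __int128>(0) - static_cast<unsigned __int128>(value);
    return "-" + to_decimal_u128(magnitude);
}
```

The resulting std::string can then be written to any stream as usual.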

[Bug target/90600] incompatible 64-bit-types in x86-intrinsics

2019-05-23 Thread g.peterh...@t-online.de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90600

--- Comment #4 from g.peterh...@t-online.de ---
On 23.05.19 at 20:11, glisse at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90600
> 
> --- Comment #3 from Marc Glisse  ---
> Intel documents that it uses "unsigned __int64" but I don't see where they
> document what __int64 is. We could take a "void *out" argument and cast it
> inside the function, but that would lose useful diagnostics for people trying
> to pass a 32-bit type. We could overload in C++. Not sure any of that is worth
> the trouble, those interfaces are target-specific anyway.
> 
What else should "unsigned __int64" be other than a uint64_t (0..2^64-1)? Then
this would look exactly like this:

extern __inline uint8_t
__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_addcarry_u64 (uint8_t __CF, uint64_t __X, uint64_t __Y, uint64_t *__P)
{
return __builtin_ia32_addcarryx_u64 (__CF, __X, __Y, __P);
}

And I miss addcarry/subborrow for uint8/16/128. You could make that available
as a general __builtin :-)
Of course it would be better if such functions were included in the C/C++
standard ...
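
Until such a builtin or standard function exists, an add-with-carry for any
unsigned width can be approximated with the type-generic
__builtin_add_overflow that GCC and Clang provide. The following is a minimal
sketch under that assumption; the name add_carry is hypothetical, and no claim
is made that it compiles to an adc instruction.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical helper: computes sum = x + y + carry_in and returns the
// carry out. Works for any unsigned integer type accepted by the builtin.
template <typename T>
bool add_carry(bool carry_in, T x, T y, T& sum) noexcept
{
    static_assert(std::is_unsigned<T>::value, "sketch covers unsigned types only");
    T partial{};
    // First add the two operands, then add the incoming carry.
    const bool c1 = __builtin_add_overflow(x, y, &partial);
    const bool c2 = __builtin_add_overflow(partial, static_cast<T>(carry_in), &sum);
    // At most one of the two steps can overflow for a single-bit carry.
    return c1 || c2;
}
```

Because the output parameter has the same type T as the operands, a caller
holding uint64_t values passes them directly, which avoids the pointer-type
mismatch described in the report.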