[Bug bootstrap/58828] New: Problem compiling gcc 4.8.2 using gcc 4.4.6

2013-10-21 Thread bugzi...@poradnik-webmastera.com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58828

Bug ID: 58828
   Summary: Problem compiling gcc 4.8.2 using gcc 4.4.6
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: bootstrap
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com

I am trying to build gcc 4.8.2, and I got the following compilation error:

make[3]: Entering directory `[path]/gcc/obj/gcc'

g++ -g -fkeep-inline-functions -DIN_GCC -fno-exceptions -fno-rtti
-fasynchronous-unwind-tables -W -Wall -Wwrite-strings -Wcast-qual
-Wmissing-format-attribute -pedantic -Wno-long-long -Wno-variadic-macros
-Wno-overlength-strings -fno-common -DHAVE_CONFIG_H -DGENERATOR_FILE -o
build/genconstants \ build/genconstants.o build/read-md.o build/errors.o
../build-x86_64-unknown-linux-gnu/libiberty/libiberty.a

build/genconstants ../../gcc-4.8.2/gcc/config/i386/i386.md \
   > tmp-constants.h
/bin/sh ../../gcc-4.8.2/gcc/../move-if-change tmp-constants.h insn-constants.h
echo timestamp > s-constants

g++ -g -fkeep-inline-functions -DIN_GCC -fno-exceptions -fno-rtti
-fasynchronous-unwind-tables -W -Wall -Wwrite-strings -Wcast-qual
-Wmissing-format-attribute -pedantic -Wno-long-long -Wno-variadic-macros
-Wno-overlength-strings -fno-common -DHAVE_CONFIG_H -DGENERATOR_FILE -o
build/gengtype \ build/gengtype.o build/errors.o build/gengtype-lex.o
build/gengtype-parse.o build/gengtype-state.o build/version.o
../build-x86_64-unknown-linux-gnu/libiberty/libiberty.a

build/gengtype.o: In function `double_int::operator*=(double_int)':

[path]/gcc/obj/gcc/../../gcc-4.8.2/gcc/double-int.h:263: undefined reference to
`double_int::operator*(double_int) const'

build/gengtype.o: In function `double_int::operator+=(double_int)':

[path]/gcc/obj/gcc/../../gcc-4.8.2/gcc/double-int.h:270: undefined reference to
`double_int::operator+(double_int) const'

build/gengtype.o: In function `double_int::operator-=(double_int)':

[path]/gcc/obj/gcc/../../gcc-4.8.2/gcc/double-int.h:277: undefined reference to
`double_int::operator-(double_int) const'

collect2: ld returned 1 exit status
make[3]: *** [build/gengtype] Error 1
make[3]: Leaving directory `[path]/gcc/obj/gcc'

gcc is configured this way:

../gcc-4.8.2/configure --prefix=[myprefix] --enable-languages=c,c++
--disable-nls


I compile with sources for all needed tools and libs unpacked into the gcc dir.
Here are the versions:

binutils-2.23.2.tar.bz2
cloog-0.18.0.tar.gz
gcc-4.8.2.tar.bz2
gmp-5.1.3.tar.bz2
isl-0.11.1.tar.bz2
mpc-1.0.1.tar.gz
mpfr-3.1.2.tar.bz2

gcc --version
gcc (GCC) 4.4.6 20110731 (Red Hat 4.4.6-3)
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


[Bug bootstrap/58840] New: Problem compiling gcc 4.7.3 using gcc 4.4.6

2013-10-22 Thread bugzi...@poradnik-webmastera.com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58840

Bug ID: 58840
   Summary: Problem compiling gcc 4.7.3 using gcc 4.4.6
   Product: gcc
   Version: 4.7.3
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: bootstrap
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com

make[3]: Entering directory `[path]/gcc/obj/gcc'
build/gengtype  \
-S ../../gcc-4.7.3/gcc -I gtyp-input.list -w
tmp-gtype.state
../../gcc-4.7.3/gcc/../include/splay-tree.h:55: unidentified type `uintptr_t'
../../gcc-4.7.3/gcc/../include/splay-tree.h:56: unidentified type `uintptr_t'
make[3]: *** [s-gtype] Error 1
make[3]: Leaving directory `[path]/gcc/obj/gcc'
make[2]: *** [all-stage1-gcc] Error 2

GCC is configured in this way:
../gcc-4.7.3/configure --prefix=[myprefix] --enable-languages=c,c++
--disable-nls

Installed compiler:
gcc --version
gcc (GCC) 4.4.6 20110731 (Red Hat 4.4.6-3)
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


[Bug bootstrap/58828] Problem compiling gcc 4.8.2 using gcc 4.4.6

2013-10-23 Thread bugzi...@poradnik-webmastera.com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58828

--- Comment #2 from Daniel Fruzynski  ---
Thanks for the reply.

I checked, and this also happens when compiling using gcc 4.7.3, so it looks
like this is a more general problem.

File [path]/gcc/obj/gcc/config.status contains the following entry:

configured by [path]/gcc/gcc-4.8.2/gcc/configure, generated by GNU Autoconf
2.64,
  with options \"'--cache-file=./config.cache' '--with-gnu-as' '--with-gnu-ld'
'--prefix=[path]/gcc-4.8.2-linux/' '--disable-nls' '--enable-threads=posix'
'--enable-checking=release' '--enable-__cxa_atexit' '--with-tune=generic'
'--with-arch_32=i686' '--enable-languages=c,c,c++,lto'
'--program-transform-name=s,y,y,' '--disable-option-checking'
'--build=x86_64-redhat-linux' '--host=x86_64-redhat-linux'
'--target=x86_64-redhat-linux' '--srcdir=../../gcc-4.8.2/gcc'
'--disable-intermodule' '--enable-checking=release,types' '--disable-coverage'
'--enable-languages=c,c++,lto' 'build_alias=x86_64-redhat-linux'
'host_alias=x86_64-redhat-linux' 'target_alias=x86_64-redhat-linux'
'CC=x86_64-redhat-linux-gcc' 'CFLAGS=-g -fkeep-inline-functions' 'LDFLAGS= '
'CXX=x86_64-redhat-linux-g++' 'CXXFLAGS=-g -fkeep-inline-functions'
'GMPLIBS=-L[path]/gcc/obj/./gmp/.libs -L[path]/gcc/obj/./mpfr/src/.libs
-L[path]/gcc/obj/./mpc/src/.libs -lmpc -lmpfr -lgmp'
'GMPINC=-I[path]/gcc/obj/./gmp -I[path]/gcc/gcc-4.8.2/gmp
-I[path]/gcc/obj/./mpfr/src -I[path]/gcc/gcc-4.8.2/mpfr/src
-I[path]/gcc/gcc-4.8.2/mpc/src ' 'CLOOGLIBS=' 'CLOOGINC='\" 

So -fkeep-inline-functions was passed from outside. I checked
[path]/gcc/obj/config.status and found this:

S["stage1_cflags"]="-g -fkeep-inline-functions"

It looks like there is some issue with the top-level configure script.


[Bug bootstrap/58828] Problem compiling gcc 4.8.2 using gcc 4.4.6

2013-10-23 Thread bugzi...@poradnik-webmastera.com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58828

--- Comment #3 from Daniel Fruzynski  ---
OK, I found it. I used the symlink-tree script (distributed with binutils) to
create symlinks to binutils in the gcc source dir. This script removed some gcc
source files and replaced them with symlinks to the corresponding files in the
binutils dir. I assumed that it would help me, but it created more problems.

I am now building gcc without binutils symlinked, and the build is at stage 2.
It looks like it will complete successfully.

I think that a dedicated script to symlink all of binutils into the gcc dir
would be useful. Could you create one?


[Bug bootstrap/58840] Problem compiling gcc 4.7.3 using gcc 4.4.6

2013-10-23 Thread bugzi...@poradnik-webmastera.com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58840

--- Comment #2 from Daniel Fruzynski  ---
OK, I found this. I used the symlink-tree script to create symlinks to binutils
in the gcc src dir. This script replaced some files with symlinks to their
counterparts in the binutils dir, which caused this problem. gcc without these
symlinks compiles fine. So this is not an issue.


[Bug c/58988] New: -Werror=missing-include-dirs does not work

2013-11-04 Thread bugzi...@poradnik-webmastera.com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58988

Bug ID: 58988
   Summary: -Werror=missing-include-dirs does not work
   Product: gcc
   Version: 4.7.3
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com

I tried to pass the -Werror=missing-include-dirs option to gcc in order to find
all non-existing include dirs, and found that this option is broken. In gcc
4.5.2 this option is ignored: gcc does not print any message when a
non-existing include dir is specified. gcc 4.7.3 prints a warning only.

gcc (both tested versions) changes these warnings into errors when both the
-Wmissing-include-dirs and -Werror options are used, but this is not an option
for me because of other warnings in my code.

I tested this using the following command:
g++ -c test.cc -o test.o -I/a -Werror=missing-include-dirs


[Bug c/58988] -Werror=missing-include-dirs does not work

2013-11-05 Thread bugzi...@poradnik-webmastera.com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58988

--- Comment #1 from Daniel Fruzynski  ---
gcc 4.8.2 is also affected by this bug - it works in the same way as gcc 4.7.3.


[Bug target/88271] Omit test instruction after add

2018-12-07 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88271

--- Comment #9 from Daniel Fruzynski  ---
I have an idea for an alternate approach to this. gcc could try to look for
relations between the loop control statement and other statements which modify
variables used in that control statement. With such knowledge it could try to
reorganize the code to optimize it better. This approach would eliminate the
randomness here.

[Bug target/88271] Omit test instruction after add

2018-12-10 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88271

--- Comment #10 from Daniel Fruzynski  ---
Here is a possible code transformation to an equivalent form, where this
optimization can be applied easily. This change also has a somewhat surprising
side effect: the second nested while loop is unrolled.

[code]
void test2()
{
    int level = 0;
    int val = 1;
    while (1)
    {
        while(1)
        {
            val = data[level] << 1;
            ++level;
            if (val)
                continue;
            else
                break;
        }

        while(1)
        {
            --level;
            val = data[level];
            if (!val)
                continue;
            else
                break;
        }
    }
}
[/code]

[Bug c/88461] New: AVX512: gcc should keep value in kN registers if possible

2018-12-12 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88461

Bug ID: 88461
   Summary: AVX512: gcc should keep value in kN registers if
possible
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

I tried to write a piece of code that uses the new AVX512 logic instructions
which operate on kN registers. It turned out that gcc was moving intermediate
values back and forth between kN and eax, which resulted in very poor code.

The example was compiled using gcc 8.2 with -O3 -march=skylake-avx512

[code]
#include <immintrin.h>
#include <stdint.h>

int test(uint16_t* data, int a)
{
    __m128i v = _mm_load_si128((const __m128i*)data);
    __mmask8 m = _mm_testn_epi16_mask(v, v);
    m = _kshiftli_mask16(m, 1);
    m = _kandn_mask16(m, a);
    return m;
}
[/code]

[asm]
test(unsigned short*, int):
vmovdqa64   xmm0, XMMWORD PTR [rdi]
kmovw   k5, esi
vptestnmw   k1, xmm0, xmm0
kmovb   eax, k1
kmovw   k2, eax
kshiftlw    k0, k2, 1
kmovw   eax, k0
movzx   eax, al
kmovw   k4, eax
kandnw  k3, k4, k5
kmovw   eax, k3
movzx   eax, al
ret
[/asm]

[Bug target/88461] AVX512: gcc should keep value in kN registers if possible

2018-12-12 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88461

--- Comment #1 from Daniel Fruzynski  ---
For comparison, this is code generated by icc 19.0.1:

[asm]
test(unsigned short*, int):
vmovdqu   xmm0, XMMWORD PTR [rdi]   #6.48
vptestnmw k0, xmm0, xmm0            #7.18
kmovw     k2, esi                   #11.9
kshiftlw  k1, k0, 1                 #9.9
kandnw    k3, k1, k2                #11.9
kmovb     k4, k3                    #13.12
kmovw     eax, k4                   #13.12
ret                                 #13.12
[/asm]

[Bug c/81665] Please introduce flags attribute for enums which will mimic one from C#

2018-12-12 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81665

--- Comment #4 from Daniel Fruzynski  ---
@Jonathan Wakely: constexpr requires C++11. When I reported this bug, we were
still at C++98 with most of our codebase.

[Bug target/88461] AVX512: gcc should keep value in kN registers if possible

2018-12-12 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88461

--- Comment #3 from Daniel Fruzynski  ---
Good catch, mask should be 16-bit. Here is fixed version:

[code]
#include <immintrin.h>
#include <stdint.h>

int test(uint16_t* data, int a)
{
    __m128i v = _mm_load_si128((const __m128i*)data);
    __mmask16 m = _mm_testn_epi16_mask(v, v);
    m = _kshiftli_mask16(m, 1);
    m = _kandn_mask16(m, a);
    return m;
}
[/code]

[asm]
test(unsigned short*, int):
vmovdqa64   xmm0, XMMWORD PTR [rdi]
kmovw   k4, esi
vptestnmw   k1, xmm0, xmm0
kmovb   eax, k1
kmovw   k2, eax
kshiftlw    k0, k2, 1
kandnw  k3, k0, k4
kmovw   eax, k3
ret
[/asm]

This can still be optimized: there is no need to move the value from k1 to eax
and then to k2, because vptestnmw zeroes the upper bits of the k register.
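
The two extra "movzx eax, al" instructions in the first listing are consistent with the 16-bit shift result being narrowed back into an 8-bit __mmask8 variable. Below is a scalar sketch of that narrowing, using plain integers instead of mask registers. The helper names are hypothetical and are not part of any intrinsic API; this only models the arithmetic, not what gcc does internally.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar model: shift a mask value left by one in 16 bits,
// then store the result in an 8-bit mask variable (like __mmask8).
inline std::uint8_t shift_into_mask8(std::uint16_t m)
{
    return static_cast<std::uint8_t>(m << 1);   // bit 8 is truncated away
}

// Same shift, but the destination is 16 bits wide (like __mmask16).
inline std::uint16_t shift_into_mask16(std::uint16_t m)
{
    return static_cast<std::uint16_t>(m << 1);  // bit 8 survives
}

// Small self-check showing where the two versions differ.
inline void mask_width_demo()
{
    const std::uint16_t m = 0x0080;
    assert(shift_into_mask8(m) == 0x00);
    assert(shift_into_mask16(m) == 0x0100);
}
```

With an 8-bit destination the compiler has to clear bits 8 and above after each 16-bit operation; with a 16-bit destination no such step is needed.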

[Bug c/88465] New: AVX512: optimize loading of constant values to kN registers

2018-12-12 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88465

Bug ID: 88465
   Summary: AVX512: optimize loading of constant values to kN
registers
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

When a constant value is loaded into a kN register, gcc puts it into eax first,
and then moves it to the kN register:

[code]
#include <immintrin.h>
#include <stdint.h>

__mmask8 test(__mmask8 m)
{
    __mmask8 m2 = _kand_mask8(m, 3);
    return m2;
}
[/code]

[asm]
test(unsigned char):
mov eax, 3
kmovb   k1, eax
kmovb   k2, edi
kandb   k0, k1, k2
kmovb   eax, k0
ret
[/asm]

icc uses one instruction for this. https://godbolt.org/ displayed it as "null",
but this is most probably the wrong name:

[asm]
test(unsigned char):
vkmovb    k0, edi                   #6.19
null      k1, 3                     #6.19
kandb     k2, k0, k1                #6.19
vkmovb    eax, k2                   #6.19
ret                                 #7.12
[/asm]

You can also use instructions kxor and kxnor to load 0 and -1.
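
The kxor/kxnor idea relies on two identities: x XOR x is 0, and NOT (x XOR x) is all ones, whatever x holds. Below is a scalar sketch on 16-bit values standing in for mask registers. The function names are hypothetical and only model the bitwise result of the instructions.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of a 16-bit mask XOR (kxorw).
inline std::uint16_t kxor16(std::uint16_t a, std::uint16_t b)
{
    return static_cast<std::uint16_t>(a ^ b);
}

// Scalar model of a 16-bit mask XNOR (kxnorw): NOT (a XOR b).
inline std::uint16_t kxnor16(std::uint16_t a, std::uint16_t b)
{
    return static_cast<std::uint16_t>(~(a ^ b));
}

// Using the same source for both operands yields a constant,
// regardless of the value currently held in the register.
inline std::uint16_t mask_zero(std::uint16_t any)     { return kxor16(any, any); }
inline std::uint16_t mask_all_ones(std::uint16_t any) { return kxnor16(any, any); }

inline void mask_constant_demo()
{
    assert(mask_zero(0x1234) == 0x0000);
    assert(mask_all_ones(0x1234) == 0xffff);
}
```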

[Bug target/88465] AVX512: optimize loading of constant values to kN registers

2018-12-12 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88465

--- Comment #2 from Daniel Fruzynski  ---
I have logged an issue for Compiler Explorer to clarify this null instruction:
https://github.com/mattgodbolt/compiler-explorer/issues/1220

[Bug target/88465] AVX512: optimize loading of constant values to kN registers

2018-12-12 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88465

--- Comment #3 from Daniel Fruzynski  ---
This "null" is an icc bug. Matt Godbolt from Compiler Explorer filed a bug with
Intel: ref 03997020

[Bug target/88473] New: AVX512: constant folding on mask does not remove unnecessary instructions

2018-12-12 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88473

Bug ID: 88473
   Summary: AVX512: constant folding on mask does not remove
unnecessary instructions
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

[code]
#include <immintrin.h>

void test(void* data, void* data2)
{
    __m128i v = _mm_load_si128((__m128i const*)data);
    __mmask8 m = _mm_testn_epi16_mask(v, v);
    m = _kor_mask8(m, 0x0f);
    m = _kor_mask8(m, 0xf0);
    v = _mm_maskz_add_epi16(m, v, v);
    _mm_store_si128((__m128i*)data2, v);
}
[/code]

Code compiled using gcc 8.2 with -O3 -march=skylake-avx512. gcc was able to
fold the constant expressions and simplify the masked vector add to a
non-masked one. However, the original version of the folded expression is still
present in the output:

[asm]
test(void*, void*):
  vmovdqa64 xmm0, XMMWORD PTR [rdi]
  mov eax, 15
  vptestnmw k1, xmm0, xmm0
  kmovb k2, eax
  vpaddw xmm0, xmm0, xmm0
  mov eax, -16
  kmovb k3, eax
  vmovaps XMMWORD PTR [rsi], xmm0
  korb k0, k1, k2
  korb k0, k0, k3
  ret
[/asm]

clang properly cleaned it up:

[asm]
test(void*, void*): # @test(void*, void*)
  vmovdqa xmm0, xmmword ptr [rdi]
  vpaddw xmm0, xmm0, xmm0
  vmovdqa xmmword ptr [rsi], xmm0
  ret
[/asm]

[Bug middle-end/88476] New: Optimize expressions which uses vector, mask and general purpose registers

2018-12-12 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88476

Bug ID: 88476
   Summary: Optimize expressions which uses vector, mask and
general purpose registers
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

I was playing with Compiler Explorer to see how compilers can optimize various
pieces of code. I found that the next version of clang (version 8.0.0 (trunk
348905)) can optimize expressions which use vector, mask and general purpose
registers. Such an approach opens new optimization possibilities. Here are two
example functions which demonstrate this:

[code]
#include <immintrin.h>

void test1(void* data1, void* data2)
{
    __m128i v1 = _mm_load_si128((__m128i const*)data1);
    __m128i v2 = _mm_load_si128((__m128i const*)data2);
    __mmask8 m1 = _mm_testn_epi16_mask(v1, v1);
    __mmask8 m2 = _mm_testn_epi16_mask(v2, v2);
    __mmask8 m = (m1 | 3) & (m2 | 3);
    v1 = _mm_maskz_add_epi16(m, v1, v2);
    _mm_store_si128((__m128i*)data2, v1);
}

void test2(void* data1, void* data2)
{
    __m128i v1 = _mm_load_si128((__m128i const*)data1);
    __m128i v2 = _mm_load_si128((__m128i const*)data2);
    __mmask8 m1 = _mm_testn_epi16_mask(v1, v1);
    __mmask8 m2 = _mm_testn_epi16_mask(v2, v2);
    m1 = _kor_mask8(m1, 3);
    m2 = _kor_mask8(m2, 3);
    __mmask8 m = _kand_mask8(m1, m2);
    v1 = _mm_maskz_add_epi16(m, v1, v2);
    _mm_store_si128((__m128i*)data2, v1);
}
[/code]

When compiled using clang with -O3 -march=skylake-avx512, both are optimized to
the same code:

[asm]
test(void*, void*): # @test(void*, void*)
  vmovdqa xmm0, xmmword ptr [rdi]
  vmovdqa xmm1, xmmword ptr [rsi]
  vpor xmm2, xmm1, xmm0
  vptestnmw k0, xmm2, xmm2
  mov al, 3
  kmovd k1, eax
  korb k1, k0, k1
  vpaddw xmm0 {k1} {z}, xmm1, xmm0
  vmovdqa xmmword ptr [rsi], xmm0
  ret
[/asm]

gcc 9.0.0 20181211 (experimental) produces this:

[asm]
test1(void*, void*):
  vmovdqa64 xmm1, XMMWORD PTR [rsi]
  vmovdqa64 xmm0, XMMWORD PTR [rdi]
  vptestnmw k1, xmm1, xmm1
  vptestnmw k2{k1}, xmm0, xmm0
  kmovb eax, k2
  or eax, 3
  kmovb k3, eax
  vpaddw xmm0{k3}{z}, xmm0, xmm1
  vmovaps XMMWORD PTR [rsi], xmm0
  ret
test2(void*, void*):
  vmovdqa64 xmm0, XMMWORD PTR [rdi]
  vmovdqa64 xmm1, XMMWORD PTR [rsi]
  vptestnmw k1, xmm0, xmm0
  vptestnmw k3, xmm1, xmm1
  mov eax, 3
  kmovb k2, eax
  korb k1, k1, k2
  korb k0, k3, k2
  kandb k1, k1, k0
  vpaddw xmm0{k1}{z}, xmm0, xmm1
  vmovaps XMMWORD PTR [rsi], xmm0
  ret
[/asm]
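
The clang output appears to rely on two identities: (m1 | 3) & (m2 | 3) equals (m1 & m2) | 3, and "lane of v1 is zero AND lane of v2 is zero" equals "lane of (v1 | v2) is zero". Below is a scalar sketch that checks both together. The names Vec8, testn_mask, lane_or, mask_as_written and mask_as_clang are hypothetical; the code emulates the lane semantics in portable C++ and makes no claim about how clang derives the transformation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Vec8 = std::array<std::uint16_t, 8>;   // stand-in for 8 x 16-bit lanes

// Scalar model of a "test-not" mask: bit i is set when (a[i] & b[i]) == 0.
inline std::uint8_t testn_mask(const Vec8& a, const Vec8& b)
{
    std::uint8_t m = 0;
    for (std::size_t i = 0; i < 8; ++i)
        if ((a[i] & b[i]) == 0)
            m = static_cast<std::uint8_t>(m | (1u << i));
    return m;
}

// Lane-wise OR of two vectors (models vpor).
inline Vec8 lane_or(const Vec8& a, const Vec8& b)
{
    Vec8 r{};
    for (std::size_t i = 0; i < 8; ++i)
        r[i] = static_cast<std::uint16_t>(a[i] | b[i]);
    return r;
}

// Mask computed the way test1/test2 are written: two tests, two ORs, one AND.
inline std::uint8_t mask_as_written(const Vec8& v1, const Vec8& v2)
{
    const std::uint8_t m1 = testn_mask(v1, v1);
    const std::uint8_t m2 = testn_mask(v2, v2);
    return static_cast<std::uint8_t>((m1 | 3) & (m2 | 3));
}

// Mask computed the way the clang listing does: one vector OR, one test,
// one mask OR.
inline std::uint8_t mask_as_clang(const Vec8& v1, const Vec8& v2)
{
    const Vec8 both = lane_or(v1, v2);
    return static_cast<std::uint8_t>(testn_mask(both, both) | 3);
}

inline void mask_identity_demo()
{
    const Vec8 v1 = {{0, 1, 0, 0, 5, 0, 0, 7}};
    const Vec8 v2 = {{0, 0, 2, 0, 0, 0, 9, 0}};
    assert(mask_as_written(v1, v2) == mask_as_clang(v1, v2));
}
```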

[Bug target/88473] AVX512: constant folding on mask does not remove unnecessary instructions

2018-12-12 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88473

--- Comment #2 from Daniel Fruzynski  ---
I was playing with Compiler Explorer to see how compilers optimize various
pieces of code. I found that the next clang version (currently trunk) will be
able to analyze expressions which span vectors, masks and GPRs. I logged Bug
88476 to do something similar in gcc, please take a look. I think an approach
like the one in clang would be more beneficial.

In the past I also thought about a template-based library which would wrap
vector operations. One of its unique concepts was to create separate types to
hold a vector of bool values, and another one for int masks. With lazy
instantiation this should lead to faster resulting code. I have not tried to
write it yet, but overall this approach looks promising to me. With it, cases
such as the one in this bug can appear as a side effect of inlining.

[Bug middle-end/88487] New: union prevents autovectorization

2018-12-13 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88487

Bug ID: 88487
   Summary: union prevents autovectorization
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

When a pointer to data is inside a union, loops are not autovectorized. This
also happened when I removed the "i" field from the union, so it had only one
field. Code compiled with -O3 -mavx

[code]
struct S1
{
    union
    {
        double* __restrict__ * __restrict__ d;
        int* __restrict__ * __restrict__ i;
    } u;
};

struct S2
{
    double* __restrict__ * __restrict__ d;
};

void test1(S1* __restrict__ s1, S1* __restrict__ s2)
{
    for (int n = 0; n < 2; ++n)
    {
        s1->u.d[n][0] = s2->u.d[n][0];
        s1->u.d[n][1] = s2->u.d[n][1];
    }
}

void test2(S2* __restrict__ s1, S2* __restrict__ s2)
{
    for (int n = 0; n < 2; ++n)
    {
        s1->d[n][0] = s2->d[n][0];
        s1->d[n][1] = s2->d[n][1];
    }
}
[/code]

[asm]
test1(S1*, S1*):
mov rdx, QWORD PTR [rsi]
mov rax, QWORD PTR [rdi]
mov rsi, QWORD PTR [rdx]
mov rcx, QWORD PTR [rax]
mov rdx, QWORD PTR [rdx+8]
mov rax, QWORD PTR [rax+8]
vmovsd  xmm0, QWORD PTR [rsi]
vmovsd  QWORD PTR [rcx], xmm0
vmovsd  xmm0, QWORD PTR [rsi+8]
vmovsd  QWORD PTR [rcx+8], xmm0
vmovsd  xmm0, QWORD PTR [rdx]
vmovsd  QWORD PTR [rax], xmm0
vmovsd  xmm0, QWORD PTR [rdx+8]
vmovsd  QWORD PTR [rax+8], xmm0
ret
test2(S2*, S2*):
mov rdx, QWORD PTR [rsi]
mov rax, QWORD PTR [rdi]
mov rcx, QWORD PTR [rdx]
mov rdx, QWORD PTR [rdx+8]
vmovupd xmm0, XMMWORD PTR [rcx]
mov rcx, QWORD PTR [rax]
mov rax, QWORD PTR [rax+8]
vmovups XMMWORD PTR [rcx], xmm0
vmovupd xmm0, XMMWORD PTR [rdx]
vmovups XMMWORD PTR [rax], xmm0
ret
[/asm]

[Bug middle-end/88487] union prevents autovectorization

2018-12-13 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88487

--- Comment #1 from Daniel Fruzynski  ---
Update: when pointers to data are copied to local variables like below,
autovectorization starts working again.

[code]
void test3(S2* __restrict__ s1, S2* __restrict__ s2)
{
    double* __restrict__ * __restrict__ d1 = s1->d;
    double* __restrict__ * __restrict__ d2 = s2->d;
    for (int n = 0; n < 2; ++n)
    {
        d1[n][0] = d2[n][0];
        d1[n][1] = d2[n][1];
    }
}
[/code]

[Bug middle-end/88487] union prevents autovectorization

2018-12-13 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88487

--- Comment #2 from Daniel Fruzynski  ---
I spotted that test3 in the previous comment uses structure S2, which does not
have a union inside. When I changed it to use S1, I got non-vectorized code. So
this workaround does not work.

[Bug middle-end/88490] New: Missed autovectorization when indices are different

2018-12-13 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88490

Bug ID: 88490
   Summary: Missed autovectorization when indices are different
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

The code below reads and writes data using different indices, which is checked
by the "if" above the loop. This can be autovectorized, as the two memory areas
do not overlap. Code compiled with -O3 -march=skylake-avx512

[code]
struct S
{
    double* __restrict__ * __restrict__ d;
};

void test(S* __restrict__ s, int n, int k)
{
    if (n > k)
    {
        for (int n = 0; n < 2; ++n)
        {
            s->d[n][0] = s->d[k][0];
            s->d[n][1] = s->d[k][1];
        }
    }
}
[/code]

[asm]
test(S*, int, int):
cmp esi, edx
jle .L3
mov rcx, QWORD PTR [rdi]
movsx   rdx, edx
mov rax, QWORD PTR [rcx+rdx*8]
mov rdx, QWORD PTR [rcx]
vmovsd  xmm0, QWORD PTR [rax]
vmovsd  QWORD PTR [rdx], xmm0
vmovsd  xmm0, QWORD PTR [rax+8]
vmovsd  QWORD PTR [rdx+8], xmm0
vmovsd  xmm0, QWORD PTR [rax]
mov rdx, QWORD PTR [rcx+8]
vmovsd  QWORD PTR [rdx], xmm0
vmovsd  xmm0, QWORD PTR [rax+8]
vmovsd  QWORD PTR [rdx+8], xmm0
.L3:
ret
[/asm]

[Bug middle-end/88490] Missed autovectorization when indices are different

2018-12-13 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88490

--- Comment #1 from Daniel Fruzynski  ---
Ehh, small typo. This is the correct version, also not vectorized:

[code]
struct S
{
    double* __restrict__ * __restrict__ d;
};

void test(S* __restrict__ s, int n, int k)
{
    if (n > k)
    {
        for (int i = 0; i < 2; ++i)
        {
            s->d[n][0] = s->d[k][0];
            s->d[n][1] = s->d[k][1];
        }
    }
}
[/code]

[asm]
test(S*, int, int):
cmp esi, edx
jle .L3
mov rax, QWORD PTR [rdi]
movsx   rdx, edx
mov rdx, QWORD PTR [rax+rdx*8]
movsx   rsi, esi
vmovsd  xmm0, QWORD PTR [rdx]
mov rax, QWORD PTR [rax+rsi*8]
vmovsd  QWORD PTR [rax], xmm0
vmovsd  xmm0, QWORD PTR [rdx+8]
vmovsd  QWORD PTR [rax+8], xmm0
vmovsd  xmm0, QWORD PTR [rdx]
vmovsd  QWORD PTR [rax], xmm0
vmovsd  xmm0, QWORD PTR [rdx+8]
vmovsd  QWORD PTR [rax+8], xmm0
.L3:
ret
[/asm]

[Bug middle-end/88490] Missed autovectorization when indices are different

2018-12-14 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88490

--- Comment #3 from Daniel Fruzynski  ---
In this case s->d is a pointer to pointer to double, and both pointer levels
have the restrict qualifier. I wonder if you could add some tag saying that
s->d[n] and s->d[k] point to separate memory areas. This tag could later be
used to determine that s->d[n][0] and s->d[k][0] also do not overlap.

[Bug middle-end/88487] union prevents autovectorization

2018-12-14 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88487

--- Comment #4 from Daniel Fruzynski  ---
OK, I see. Is there any workaround for this? I tried assigning the pointer to a
local variable, both directly and with an intermediate cast via void*, but it
did not help. Casting S1* to S2* also does not work.

[Bug c/88540] New: Issues with vectorization of min/max operations

2018-12-18 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88540

Bug ID: 88540
   Summary: Issues with vectorization of min/max operations
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

1st issue:

[code]
#define SIZE 2

void test(double* __restrict d1, double* __restrict d2, double* __restrict d3)
{
    for (int n = 0; n < SIZE; ++n)
    {
        d3[n] = d1[n] < d2[n] ? d1[n] : d2[n];
    }
}
[/code]

When this is compiled for SSE2, gcc produces non-vectorized code:

[asm]
test(double*, double*, double*):
vmovsd  xmm0, QWORD PTR [rdi]
vminsd  xmm0, xmm0, QWORD PTR [rsi]
vmovsd  QWORD PTR [rdx], xmm0
vmovsd  xmm0, QWORD PTR [rdi+8]
vminsd  xmm0, xmm0, QWORD PTR [rsi+8]
vmovsd  QWORD PTR [rdx+8], xmm0
ret
[/asm]

When SIZE is changed to 3 or greater, the code gets vectorized properly. I
thought that this might be a workaround for some old CPU on which vector code
was slower, but this also happens when compiling with "-O3 -march=skylake". I
also checked with SIZE 6, and got 1 AVX op and 2 scalar SSE ones. It looks like
this is an off-by-one bug.

The same happens for code with other relational operators (>, <=, >=).
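
A side note on why the ternary in the test case can become a single vminsd: the x86 scalar min instruction returns its second source operand when either input is NaN, which is also what "a < b ? a : b" does. std::fmin behaves differently, since it returns the non-NaN operand. Below is a small sketch of the difference; the helper names ternary_min and ternary_min_demo are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// The exact expression used in the test case, for one element.
inline double ternary_min(double a, double b)
{
    return a < b ? a : b;   // any comparison with NaN is false, so b is returned
}

inline void ternary_min_demo()
{
    const double nan = std::numeric_limits<double>::quiet_NaN();

    // Ordinary values: all variants agree.
    assert(ternary_min(2.0, 3.0) == 2.0);
    assert(std::fmin(2.0, 3.0) == 2.0);

    // NaN in the first operand: the ternary returns the second operand.
    assert(ternary_min(nan, 1.0) == 1.0);

    // NaN in the second operand: the ternary returns NaN, fmin does not.
    assert(std::isnan(ternary_min(1.0, nan)));
    assert(std::fmin(1.0, nan) == 1.0);
}
```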

2nd issue: when compiling for AVX512, gcc does not use the new instructions
which use ZMM registers; it still generates code for YMM ones.

[Bug c/88542] New: Optimize symmetric range check

2018-12-18 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88542

Bug ID: 88542
   Summary: Optimize symmetric range check
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

[code]
#include <math.h>

bool test1(double d, double max)
{
    return (d < max) && (d > -max);
}

bool test2(double d, double max)
{
    return fabs(d) < max;
}
[/code]

When code checks whether some number d is inside (or outside) a symmetric range
like (-max, max), the code from test1() can be replaced with the one from
test2(). This of course assumes that the expression does not produce any side
effects. This can be done nicely for floating point numbers stored in IEEE
format, which leads to faster code:

[asm]
test1(double, double):
vcomisd xmm1, xmm0
jbe .L6
vxorpd  xmm1, xmm1, XMMWORD PTR .LC0[rip]
vcomisd xmm0, xmm1
seta    al
ret
.L6:
xor eax, eax
ret
test2(double, double):
vandpd  xmm0, xmm0, XMMWORD PTR .LC1[rip]
vcomisd xmm1, xmm0
seta    al
ret
[/asm]

For integer types stored in two's complement format a similar change gives
slower code. However, on platforms which use a different integer format with a
dedicated sign bit this optimization may be beneficial.
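
The equivalence of the two forms for doubles can be spot-checked over a set of edge cases (signed zeros, infinities, NaN, negative max). Below is a minimal sketch; the function names are hypothetical copies of test1() and test2() under different names, and the sample list is only an illustration, not a proof.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Same expression as test1(): two comparisons.
inline bool in_range_two_compares(double d, double max)
{
    return (d < max) && (d > -max);
}

// Same expression as test2(): one fabs and one comparison.
inline bool in_range_fabs(double d, double max)
{
    return std::fabs(d) < max;
}

// Compare both forms for every (d, max) pair taken from a list of edge cases.
inline bool range_checks_agree()
{
    const double inf = std::numeric_limits<double>::infinity();
    const double nan = std::numeric_limits<double>::quiet_NaN();
    const double samples[] = { -inf, -2.0, -1.0, -0.0, 0.0, 0.5, 1.0, 2.0, inf, nan };

    for (double d : samples)
        for (double max : samples)
            if (in_range_two_compares(d, max) != in_range_fabs(d, max))
                return false;
    return true;
}

inline void range_check_demo()
{
    assert(range_checks_agree());
}
```

Any comparison involving NaN is false in both forms, and a negative max makes both forms false as well, so the two functions agree on those inputs too.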

[Bug middle-end/88542] Optimize symmetric range check

2018-12-18 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88542

--- Comment #2 from Daniel Fruzynski  ---
No, code with -ffast-math is the same.

BTW, fabs(NaN) is NaN, so result is the same as before (false).

[Bug tree-optimization/88540] Issues with vectorization of min/max operations

2018-12-19 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88540

--- Comment #3 from Daniel Fruzynski  ---
It looks like AArch64 is also affected. This is output from gcc 8.2 for SIZE=2:

[asm]
test(double*, double*, double*):
ldp d1, d0, [x0]
ldp d3, d2, [x1]
fcmpe   d1, d3
fcsel   d1, d1, d3, mi
fcmpe   d0, d2
fcsel   d0, d0, d2, mi
stp d1, d0, [x2]
ret
[/asm]

And this is for SIZE=4:

[asm]
test(double*, double*, double*):
ldr q5, [x0]
ldr q3, [x1]
ldr q4, [x0, 16]
ldr q2, [x1, 16]
fcmgt   v1.2d, v3.2d, v5.2d
fcmgt   v0.2d, v2.2d, v4.2d
bsl v1.16b, v5.16b, v3.16b
bsl v0.16b, v4.16b, v2.16b
str q1, [x2]
str q0, [x2, 16]
ret
[/asm]

[Bug middle-end/88487] union prevents autovectorization

2018-12-20 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88487

--- Comment #6 from Daniel Fruzynski  ---
Not good. Fortunately I found a workaround. This is probably the best one can
get:

[code]
#include <stddef.h>
#include <stdint.h>

template<typename Type>
struct TypeHelper
{
    constexpr unsigned offset();

    operator Type&()
    {
        uint8_t*__restrict p = (uint8_t*__restrict)this - offset();
        Type*__restrict pt =  (Type*__restrict)p;
        return *pt;
    }
};

struct S
{
    struct Union
    {
        void*__restrict*__restrict ptr;
        TypeHelper<double*__restrict*__restrict> d;
    } u;
};

template<>
constexpr unsigned TypeHelper<double*__restrict*__restrict>::offset()
{
    return offsetof(S::Union, d) - offsetof(S::Union, ptr);
}

void test(S* __restrict s1, S* __restrict s2)
{
    for (int n = 0; n < 2; ++n)
    {
        s1->u.d[n][0] = s2->u.d[n][0];
        s1->u.d[n][1] = s2->u.d[n][1];
    }
}
[/code]

[Bug middle-end/88569] New: Track relations between variable values

2018-12-21 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88569

Bug ID: 88569
   Summary: Track relations between variable values
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

This example comes from code which could be compiled for various CPUs, and had
dedicated sections for AVX and SSE2. I left the original ifdefs in comments.
When the 1st loop (for AVX) ends, the following relation is true:
(cnt - n <= 3). Similarly, after the 2nd loop this is true: (cnt - n <= 1).
With such knowledge it is possible to optimize the code of bar() to baz(). This
eliminates two condition checks (after the 2nd and 3rd loops), and one
increment (for the 3rd loop). It would be nice if gcc could perform such a
transformation automatically.

[code]
void foo(int n);

void bar(int cnt)
{
    int n = 0;
//#ifdef __AVX__
    for (; n < cnt - 3; n += 4)
        foo(n);
//#endif
//#ifdef __SSE2__
    for (; n < cnt - 1; n += 2)
        foo(n);
//#endif
    for (; n < cnt; n += 1)
        foo(n);
}

void baz(int cnt)
{
    int n = 0;
    for (; n < cnt - 3; n += 4)
        foo(n);
    if (n < cnt - 1)
    {
        foo(n);
        n += 2;
    }
    if (n < cnt)
        foo(n);
}
[/code]

[asm]
bar(int):
push r13
push r12
mov r12d, edi
push rbp
lea ebp, [rdi-3]
push rbx
xor ebx, ebx
sub rsp, 8
test ebp, ebp
jle .L5
.L2:
mov edi, ebx
add ebx, 4
call foo(int)
cmp ebx, ebp
jl  .L2
lea eax, [r12-4]
shr eax, 2
lea ebx, [4+rax*4]
.L5:
lea ebp, [r12-1]
cmp ebp, ebx
jle .L3
mov edi, ebx
lea r13d, [rbx+2]
call foo(int)
cmp ebp, r13d
jle .L8
mov edi, r13d
call foo(int)
.L8:
lea edi, [r12-2]
sub edi, ebx
mov ebx, edi
and ebx, -2
add ebx, r13d
.L3:
cmp r12d, ebx
jle .L14
mov edi, ebx
call foo(int)
lea edi, [rbx+1]
cmp r12d, edi
jg  .L17
.L14:
add rsp, 8
pop rbx
pop rbp
pop r12
pop r13
ret
.L17:
add rsp, 8
pop rbx
pop rbp
pop r12
pop r13
jmp foo(int)
baz(int):
push r12
mov r12d, edi
push rbp
lea ebp, [rdi-3]
push rbx
xor ebx, ebx
test ebp, ebp
jle .L19
.L20:
mov edi, ebx
add ebx, 4
call foo(int)
cmp ebx, ebp
jl  .L20
lea eax, [r12-4]
shr eax, 2
lea ebx, [4+rax*4]
.L19:
lea eax, [r12-1]
cmp eax, ebx
jg  .L27
cmp ebx, r12d
jl  .L28
.L25:
pop rbx
pop rbp
pop r12
ret
.L27:
mov edi, ebx
add ebx, 2
call foo(int)
cmp ebx, r12d
jge .L25
.L28:
mov edi, ebx
pop rbx
pop rbp
pop r12
jmp foo(int)
[/asm]

[Bug middle-end/88570] New: Missing or ineffective vectorization of scatter load

2018-12-21 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88570

Bug ID: 88570
   Summary: Missing or ineffective vectorization of scatter load
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

[code]
void test1(int*__restrict n1, int*__restrict n2,
int*__restrict n3, int*__restrict n4)
{
for (int n = 0; n < 8; ++n)
{
if (n1[n] > 0)
n2[n] = n3[n];
else
n2[n] = n4[n];
}
}

void test2(double*__restrict d1, double*__restrict d2,
double*__restrict d3, double*__restrict d4)
{
for (int n = 0; n < 4; ++n)
{
if (d1[n] > 0.0)
d2[n] = d3[n];
else
d2[n] = d4[n];
}
}
[/code]

Code like the above is vectorized properly when global variables are used.
However, when the code has to work on pointers passed as function arguments,
vectorization is either not performed or performed inefficiently.
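
The global-variable variant is not shown in this report. A guess at what it could look like, assuming plain global arrays (the names g1..g4 and the check function are hypothetical):

```cpp
// Hypothetical global arrays standing in for the pointer arguments of test1().
int g1[8], g2[8], g3[8], g4[8];

// Same loop body as test1(), but reading and writing the globals directly.
void test1_globals()
{
    for (int n = 0; n < 8; ++n)
    {
        if (g1[n] > 0)
            g2[n] = g3[n];
        else
            g2[n] = g4[n];
    }
}

// Fill the inputs with a simple pattern and verify the selected values.
bool check_test1_globals()
{
    for (int n = 0; n < 8; ++n)
    {
        g1[n] = (n % 2 == 0) ? 1 : -1;   // even n selects g3, odd n selects g4
        g3[n] = 100 + n;
        g4[n] = 200 + n;
    }
    test1_globals();
    for (int n = 0; n < 8; ++n)
    {
        int expected = (n % 2 == 0) ? 100 + n : 200 + n;
        if (g2[n] != expected)
            return false;
    }
    return true;
}
```

One plausible explanation for the difference, stated here only as an assumption: with global arrays the compiler knows that every element of g3 and g4 may be read safely, so it can load both unconditionally and blend. With pointers it cannot assume that n3[n] is readable when the condition is false, which would explain the masked loads.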

1. Compilation with -O3 -msse2: no vectorization at all, scalar code is
generated. It is long, so I do not paste it here.

2. Compilation with -O3 -msse4.1: no vectorization at all.

3. Compilation with -O3 -mavx or -march=sandybridge: code for test1() is still
not vectorized (somewhat expected, as integer operations are in AVX2). Output
for test2() is below. As you can see, the generated code performs masked loads
for d3 and d4, and then uses a blend to create the final result. When global
vars are used, masked loads are not used, only the blend. Additionally, the xor
mask is loaded from memory instead of being generated with a cmpeq instruction.

[asm]
test2(double*, double*, double*, double*):
vmovupd xmm3, XMMWORD PTR [rdi]
vinsertf128 ymm1, ymm3, XMMWORD PTR [rdi+16], 0x1
vxorpd  xmm0, xmm0, xmm0
vcmpltpd  ymm1, ymm0, ymm1
vmaskmovpd  ymm2, ymm1, YMMWORD PTR [rdx]
vxorps  ymm0, ymm1, YMMWORD PTR .LC0[rip]
vmaskmovpd  ymm0, ymm0, YMMWORD PTR [rcx]
vblendvpd   ymm0, ymm0, ymm2, ymm1
vmovups XMMWORD PTR [rsi], xmm0
vextractf128  XMMWORD PTR [rsi+16], ymm0, 0x1
vzeroupper
ret
.LC0:
.quad   -1
.quad   -1
.quad   -1
.quad   -1
[/asm]

4. Compilation with -O3 -march=haswell: code similar to the above, with both
masked loads and a blend. This time the compiler generated vpcmpeqd to produce
the xor mask. This also happens when -mavx2 is used instead of -march=haswell.

[asm]
test1(int*, int*, int*, int*):
vmovdqu ymm1, YMMWORD PTR [rdi]
vpxor   xmm0, xmm0, xmm0
vpcmpgtd  ymm1, ymm1, ymm0
vpmaskmovd  ymm2, ymm1, YMMWORD PTR [rdx]
vpcmpeqd  ymm0, ymm1, ymm0
vpmaskmovd  ymm0, ymm0, YMMWORD PTR [rcx]
vpblendvb   ymm0, ymm0, ymm2, ymm1
vmovdqu YMMWORD PTR [rsi], ymm0
vzeroupper
ret
test2(double*, double*, double*, double*):
vxorpd  xmm0, xmm0, xmm0
vcmpltpd  ymm1, ymm0, YMMWORD PTR [rdi]
vpcmpeqd  ymm0, ymm0, ymm0
vmaskmovpd  ymm2, ymm1, YMMWORD PTR [rdx]
vpxor   ymm0, ymm0, ymm1
vmaskmovpd  ymm0, ymm0, YMMWORD PTR [rcx]
vblendvpd   ymm0, ymm0, ymm2, ymm1
vmovupd YMMWORD PTR [rsi], ymm0
vzeroupper
ret
[/asm]

5. Compilation with -O3 -march=skylake-avx512: masked loads and blend are used
again. This time the masked loads use kN registers to store the mask. test1()
performs the comparison twice to get the negated value. test2() uses a single
comparison, but to negate it, it moves the value to eax and then back (I will
log a separate bug for this part, as it has other implications). Code which
uses global variables only uses a blend with the mask in a ymm register.

[asm]
test1(int*, int*, int*, int*):
vmovdqu32   ymm0, YMMWORD PTR [rdi]
vpxor   xmm2, xmm2, xmm2
vpcmpd  k1, ymm0, ymm2, 6
vpcmpgtd  ymm3, ymm0, ymm2
vmovdqu32   ymm1{k1}{z}, YMMWORD PTR [rdx]
vpcmpd  k1, ymm0, ymm2, 2
vmovdqu32   ymm0{k1}{z}, YMMWORD PTR [rcx]
vpblendvb   ymm0, ymm0, ymm1, ymm3
vmovdqu32   YMMWORD PTR [rsi], ymm0
vzeroupper
ret
test2(double*, double*, double*, double*):
vmovupd ymm0, YMMWORD PTR [rdi]
vxorpd  xmm1, xmm1, xmm1
vcmppd  k1, ymm0, ymm1, 14
vcmpltpd  ymm1, ymm1, ymm0
kmovb   eax, k1
not eax
vmovupd ymm2{k1}{z}, YMMWORD PTR [rdx]
kmovb   k2, eax
vmovupd ymm0{k2}{z}, YMMWORD PTR [rcx]
vblendvpd   ymm0, ymm0, ymm2, ymm1
vmovupd YMMWORD PTR [rsi], ymm0
vzeroupper
ret
[/asm]

6. I tried to compile this code using icc, and got this. As you can see, it
uses masked move instead of blend. I did not check if it o

[Bug target/88571] New: AVX512: when calculating logical expression with all values in kN registers, do not use GPRs

2018-12-21 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88571

Bug ID: 88571
   Summary: AVX512: when calculating logical expression with all
values in kN registers, do not use GPRs
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

This is a side effect of finding Bug 88570. I have noticed that when gcc has to
generate code for a logical expression with all values already stored in kN
registers, it moves them to GPRs, performs the calculation there and moves the
result back. Such a situation may arise as a side effect of optimizations in
gcc. It is also more convenient to write expressions with C/C++ operators
instead of intrinsics, so some people may prefer to use them. It can probably
also happen as a side effect of interaction between code optimized by gcc and
user code.

When the logical expression is written using intrinsics, values stay in kN
registers as expected.

Code below was compiled with -O3 -march=skylake-avx512. test1 and test2 are
examples of code with C/C++ operators. test3 is an example of a "not"
introduced by gcc during optimization. This last example is also in Bug 88570,
which I logged to fix inefficient optimizations.

[code]
#include <immintrin.h>

void test1(int*__restrict n1, int*__restrict n2,
int*__restrict n3, int*__restrict n4)
{
__m256i v = _mm256_loadu_si256((__m256i*)n1);
__mmask8 m = _mm256_cmpgt_epi32_mask(v, _mm256_set1_epi32(1));
m = ~m;
_mm256_mask_storeu_epi32((__m256i*)n2, m, v);
}

void test2(int*__restrict n1, int*__restrict n2,
int*__restrict n3, int*__restrict n4)
{
__m256i v1 = _mm256_loadu_si256((__m256i*)n1);
__m256i v2 = _mm256_loadu_si256((__m256i*)n1);
__m256i v0 = _mm256_set1_epi32(2);
__mmask8 m1 = _mm256_cmpgt_epi32_mask(v1, _mm256_set1_epi32(1));
__mmask8 m2 = _mm256_cmpgt_epi32_mask(v2, _mm256_set1_epi32(2));
__mmask8 m = ~(m1 | m2);
_mm256_mask_storeu_epi32((__m256i*)n2, m, v1);
}

void test3(double*__restrict d1, double*__restrict d2,
double*__restrict d3, double*__restrict d4)
{
for (int n = 0; n < 4; ++n)
{
if (d1[n] > 0.0)
d2[n] = d3[n];
else
d2[n] = d4[n];
}
}
[/code]

[asm]
test1(int*, int*, int*, int*):
vmovdqu64   ymm0, YMMWORD PTR [rdi]
vpcmpgtd  k1, ymm0, YMMWORD PTR .LC0[rip]
kmovb   eax, k1
not eax
kmovb   k2, eax
vmovdqu32   YMMWORD PTR [rsi]{k2}, ymm0
vzeroupper
ret
test2(int*, int*, int*, int*):
vmovdqu64   ymm1, YMMWORD PTR [rdi]
vpcmpgtd  k1, ymm1, YMMWORD PTR .LC0[rip]
vpcmpgtd  k2, ymm1, YMMWORD PTR .LC1[rip]
kmovb   edx, k1
kmovb   eax, k2
or  eax, edx
not eax
kmovb   k3, eax
vmovdqu32   YMMWORD PTR [rsi]{k3}, ymm1
vzeroupper
ret
test3(double*, double*, double*, double*):
vmovupd ymm0, YMMWORD PTR [rdi]
vxorpd  xmm1, xmm1, xmm1
vcmppd  k1, ymm0, ymm1, 14
vcmpltpd  ymm1, ymm1, ymm0
kmovb   eax, k1
not eax
vmovupd ymm2{k1}{z}, YMMWORD PTR [rdx]
kmovb   k2, eax
vmovupd ymm0{k2}{z}, YMMWORD PTR [rcx]
vblendvpd   ymm0, ymm0, ymm2, ymm1
vmovupd YMMWORD PTR [rsi], ymm0
vzeroupper
ret
.LC0:
.long   1
.long   1
.long   1
.long   1
.long   1
.long   1
.long   1
.long   1
.LC1:
.long   2
.long   2
.long   2
.long   2
.long   2
.long   2
.long   2
.long   2
[/asm]

[Bug target/88571] AVX512: when calculating logical expression with all values in kN registers, do not use GPRs

2018-12-21 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88571

--- Comment #2 from Daniel Fruzynski  ---
Yes. Issue still exists in g++ (GCC-Explorer-Build) 9.0.0 20181219
(experimental).

[Bug target/88570] Missing or ineffective vectorization of scatter load

2018-12-21 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88570

--- Comment #2 from Daniel Fruzynski  ---
In g++ (GCC-Explorer-Build) 9.0.0 20181219 (experimental) this still exists.

[Bug target/88571] AVX512: when calculating logical expression with all values in kN registers, do not use GPRs

2018-12-21 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88571

--- Comment #3 from Daniel Fruzynski  ---
I have checked svn head version (20181221), issue is still there.

[Bug target/88570] Missing or ineffective vectorization of scatter load

2018-12-21 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88570

--- Comment #3 from Daniel Fruzynski  ---
I have checked svn head version (20181221), issue is still there.

[Bug middle-end/88575] New: gcc got confused by different comparison operators

2018-12-22 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88575

Bug ID: 88575
   Summary: gcc got confused by different comparison operators
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

In test() gcc is not able to determine that for a==b it does not have to
evaluate the 2nd comparison and can use the value of a if the 1st comparison is
true. When the operators are swapped as in test2(), or are the same, the code
is optimized.

[code]
double test(double a, double b)
{
if (a <= b)
return a < b ? a : b;
return 0.0;
}

double test2(double a, double b)
{
if (a < b)
return a <= b ? a : b;
return 0.0;
}
[/code]

[asm]
test(double, double):
  vcomisd xmm1, xmm0
  jnb .L10
  vxorpd xmm0, xmm0, xmm0
  ret
.L10:
  vminsd xmm0, xmm0, xmm1
  ret

test2(double, double):
  vcmpnltsd xmm1, xmm0, xmm1
  vxorpd xmm2, xmm2, xmm2
  vblendvpd xmm0, xmm0, xmm2, xmm1
  ret
[/asm]

[Bug middle-end/88575] gcc got confused by different comparison operators

2019-01-01 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88575

--- Comment #2 from Daniel Fruzynski  ---
Code was compiled with -O3 -march=skylake.

I have tried to add -fno-signed-zeros and -fsigned-zeros, and got the same
output for both cases.
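
Signed zeros are relevant here because returning a instead of b for a==b is only observable when a and b are zeros of different sign. A small sketch of that corner case, where `test_original` is a copy of test() and `test_simplified` and `same_value_and_sign` are hypothetical helpers:

```cpp
#include <cmath>

// Copy of test() from the report.
double test_original(double a, double b)
{
    if (a <= b)
        return a < b ? a : b;
    return 0.0;
}

// Hypothetical simplified form: return a whenever a <= b holds.
double test_simplified(double a, double b)
{
    return a <= b ? a : 0.0;
}

// True when both values compare equal and also carry the same sign bit.
bool same_value_and_sign(double x, double y)
{
    return x == y && std::signbit(x) == std::signbit(y);
}
```

For a = -0.0 and b = +0.0 the original returns +0.0 and the simplified form returns -0.0. The two results compare equal but differ in sign, so the simplification is exact only when the sign of zero can be ignored.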

[Bug middle-end/88575] gcc got confused by different comparison operators

2019-01-01 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88575

--- Comment #3 from Daniel Fruzynski  ---
I have tried to compile with -O3 -march=skylake -ffast-math and got this:

[asm]
test(double, double):
vminsd  xmm2, xmm0, xmm1
vcmplesd  xmm0, xmm0, xmm1
vxorpd  xmm1, xmm1, xmm1
vblendvpd   xmm0, xmm1, xmm2, xmm0
ret
test2(double, double):
vminsd  xmm2, xmm0, xmm1
vcmpltsd  xmm0, xmm0, xmm1
vxorpd  xmm1, xmm1, xmm1
vblendvpd   xmm0, xmm1, xmm2, xmm0
ret
[/asm]

And this is for -O3 -march=skylake -funsafe-math-optimizations. As you can see,
one instruction was eliminated from test2(). For some reason it was not
eliminated from test() function. I checked that -ffinite-math-only present in
-ffast-math prevented elimination of this extra instruction.

[asm]
test(double, double):
vminsd  xmm2, xmm0, xmm1
vcmplesd  xmm0, xmm0, xmm1
vxorpd  xmm1, xmm1, xmm1
vblendvpd   xmm0, xmm1, xmm2, xmm0
ret
test2(double, double):
vcmpnltsd   xmm1, xmm0, xmm1
vxorpd  xmm2, xmm2, xmm2
vblendvpd   xmm0, xmm0, xmm2, xmm1
ret
[/asm]

[Bug target/65782] Assembly failure (invalid register for .seh_savexmm) with -O3 -mavx512f on mingw-w64

2019-01-01 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782

Daniel Fruzynski  changed:

   What|Removed |Added

 CC||bugzilla@poradnik-webmastera.com

--- Comment #3 from Daniel Fruzynski  ---
Cygwin (x86_64-pc-cygwin) is also affected. I have encountered this bug with
gcc 7.4.0.

Could you add a new option which would remove the XMM16+ registers from the
pool of available registers? It could serve as an easy workaround until this is
fixed properly.

[Bug target/65782] Assembly failure (invalid register for .seh_savexmm) with -O3 -mavx512f on mingw-w64

2019-01-01 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782

--- Comment #4 from Daniel Fruzynski  ---
I have found that I can use the -ffixed-reg option for this. It eliminates one
register per use, so I have to pass it 16 times to eliminate all of the
xmm16..31 registers. It would be handy to have another option which disables
all registers from this group at once.

[Bug target/65782] Assembly failure (invalid register for .seh_savexmm) with -O3 -mavx512f on mingw-w64

2019-01-01 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782

--- Comment #5 from Daniel Fruzynski  ---
I got following link:
https://stackoverflow.com/questions/53733624/is-xmm8-register-value-preserved-across-calls/53733767#53733767

Quote from it: "Any additional registers for newer instruction sets are
volatile by default. This includes the upper parts of YMM0-15 and ZMM0-15 as
well as ?MM16-31 if present.".

So it looks like gcc should not generate .seh_savexmm for xmm16..31 at all.

[Bug c++/87729] Please include -Woverloaded-virtual in -Wall

2019-01-02 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87729

--- Comment #2 from Daniel Fruzynski  ---
Here you are:

[code]
class Foo
{
public:
virtual void f(int);
};

class Bar : public Foo
{
public:
virtual void f(short);
};
[/code]
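
A variant of the classes above can show why this matters. Here the functions return a label instead of void so the selected function is visible, which is a change made only for illustration; `call_via_base` and `call_via_derived` are hypothetical helpers:

```cpp
#include <string>

class Foo
{
public:
    virtual ~Foo() = default;
    virtual std::string f(int) { return "Foo::f(int)"; }
};

class Bar : public Foo
{
public:
    // Different parameter type: this hides Foo::f(int), it does not override it.
    virtual std::string f(short) { return "Bar::f(short)"; }
};

// Virtual dispatch through the base still reaches Foo::f(int),
// because Bar never overrides it.
std::string call_via_base(Foo& obj) { return obj.f(1); }

// Name lookup in Bar finds only f(short), so the int argument is converted.
std::string call_via_derived(Bar& obj) { return obj.f(1); }
```

With a `Bar` object, `call_via_base` returns "Foo::f(int)" while `call_via_derived` returns "Bar::f(short)". The author of Bar most likely intended one of the two behaviours, not both.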

[Bug c/88679] New: SSE2 intrinsics are available by default on x86

2019-01-03 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88679

Bug ID: 88679
   Summary: SSE2 intrinsics are available by default on x86
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

SSE2 intrinsics are available by default when compiling code for 32-bit x86.
Code below compiles fine with options -m32 -O3. I had to add -mno-sse2 to get
an error. 

Fortunately __SSE2__ is not defined by default, so code can rely on it.

[code]
#include <emmintrin.h>

void test(__m128i const* m)
{
__m128i v = _mm_load_si128(m);
}
[/code]
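
Relying on __SSE2__ could look like the sketch below, which assumes a scalar fallback is acceptable when the macro is not defined. The function `sum4` is a hypothetical example and does not come from any real code base:

```cpp
#ifdef __SSE2__
#include <emmintrin.h>
#endif

// Sum four ints, using SSE2 only when the compiler advertises it.
int sum4(const int* p)
{
#ifdef __SSE2__
    // Unaligned load of p[0..3].
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
    // Add the upper pair onto the lower pair, then the odd element onto the even one.
    v = _mm_add_epi32(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2)));
    v = _mm_add_epi32(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(v);
#else
    // Scalar fallback for targets without SSE2.
    return p[0] + p[1] + p[2] + p[3];
#endif
}
```

Both paths return the same value, so the guard only decides which instructions are used.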

[Bug target/88679] SSE2 intrinsics are available by default on x86

2019-01-03 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88679

--- Comment #2 from Daniel Fruzynski  ---
I used compiler at https://godbolt.org/. Here are outputs for both commands:

$ gcc -v
Using built-in specs.

COLLECT_GCC=/opt/compiler-explorer/gcc-snapshot/bin/g++

Target: x86_64-linux-gnu

Configured with: ../gcc-trunk-20190103/configure
--prefix=/opt/compiler-explorer/gcc-build/staging --build=x86_64-linux-gnu
--host=x86_64-linux-gnu --target=x86_64-linux-gnu --disable-bootstrap
--enable-multiarch --with-abi=m64 --with-multilib-list=m32,m64,mx32
--enable-multilib --enable-clocale=gnu --enable-languages=c,c++,fortran
--enable-ld=yes --enable-gold=yes --enable-libstdcxx-debug
--enable-libstdcxx-time=yes --enable-linker-build-id --enable-lto
--enable-plugins --enable-threads=posix --with-pkgversion=GCC-Explorer-Build

Thread model: posix

gcc version 9.0.0 20190102 (experimental) (GCC-Explorer-Build) 

COLLECT_GCC_OPTIONS='-fdiagnostics-color=always' '-g' '-o'
'/tmp/compiler-explorer-compiler11903-60-1nshruf.qczq/output.s' '-masm=intel'
'-S' '-v' '-shared-libgcc' '-mtune=generic' '-march=x86-64'


/opt/compiler-explorer/gcc-trunk-20190103/bin/../libexec/gcc/x86_64-linux-gnu/9.0.0/cc1plus
-quiet -v -imultiarch x86_64-linux-gnu -iprefix
/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/
-D_GNU_SOURCE  -quiet -dumpbase example.cpp -masm=intel -mtune=generic
-march=x86-64 -auxbase-strip
/tmp/compiler-explorer-compiler11903-60-1nshruf.qczq/output.s -g -version
-fdiagnostics-color=always -o
/tmp/compiler-explorer-compiler11903-60-1nshruf.qczq/output.s

GNU C++14 (GCC-Explorer-Build) version 9.0.0 20190102 (experimental)
(x86_64-linux-gnu)

compiled by GNU C version 7.3.0, GMP version 6.1.0, MPFR version 3.1.4,
MPC version 1.0.3, isl version isl-0.18-GMP

GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096

ignoring nonexistent directory
"/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../x86_64-linux-gnu/include"

ignoring duplicate directory
"/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/../../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../include/c++/9.0.0"

ignoring duplicate directory
"/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/../../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../include/c++/9.0.0/x86_64-linux-gnu"

ignoring duplicate directory
"/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/../../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../include/c++/9.0.0/backward"

ignoring duplicate directory
"/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/../../lib/gcc/x86_64-linux-gnu/9.0.0/include"

ignoring nonexistent directory "/usr/local/include/x86_64-linux-gnu"

ignoring duplicate directory
"/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/../../lib/gcc/x86_64-linux-gnu/9.0.0/include-fixed"

ignoring nonexistent directory
"/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/../../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../x86_64-linux-gnu/include"

#include "..." search starts here:

#include <...> search starts here:


/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../include/c++/9.0.0


/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../include/c++/9.0.0/x86_64-linux-gnu


/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../include/c++/9.0.0/backward


/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/include


/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/include-fixed

 /usr/local/include

 /opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/../../include

 /usr/include/x86_64-linux-gnu

 /usr/include

End of search list.

GNU C++14 (GCC-Explorer-Build) version 9.0.0 20190102 (experimental)
(x86_64-linux-gnu)

compiled by GNU C version 7.3.0, GMP version 6.1.0, MPFR version 3.1.4,
MPC version 1.0.3, isl version isl-0.18-GMP

GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096

Compiler executable checksum: f724e483fb841047a948ffa41ca3218a

COMPILER_PATH=/opt/compiler-explorer/gcc-trunk-20190103/bin/../libexec/gcc/x86_64-linux-gnu/9.0.0/:/opt/compiler-explorer/gcc-trunk-20190103/bin/../libexec/gcc/x86_64-linux-gnu/:/opt/compiler-explorer/gcc-trunk-20190103/bin/../libexec/gcc/:/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../x86_64-linux-gnu/bin/

LIBRARY_PATH=/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/:/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/:/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/:/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../lib64/:/lib/x86_64-linux-gnu/:/lib/../lib64/:/usr/lib/x86_64-linux-gnu/:/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/..

[Bug target/71659] _xgetbv intrinsic missing

2019-01-17 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71659

Daniel Fruzynski  changed:

   What|Removed |Added

 CC||bugzilla@poradnik-webmastera.com

--- Comment #4 from Daniel Fruzynski  ---
This intrinsic was added in gcc 8. The initial implementation was buggy (see
r85684) and was fixed in 8.2. However, there is one more issue here: Intel
Intrinsics Guide says that it should be available by including ,
however in gcc you need to include .

Additionally there are no defines for XFEATURE_ENABLED_MASK and possible output
values.

[Bug target/71659] _xgetbv intrinsic missing

2019-01-17 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71659

--- Comment #5 from Daniel Fruzynski  ---
I meant pr85684

[Bug c/88959] New: Unnecessary xor before bsf/tzcnt

2019-01-21 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88959

Bug ID: 88959
   Summary: Unnecessary xor before bsf/tzcnt
   Product: gcc
   Version: 4.9.2
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

[code]
int test(int x)
{
return __builtin_ctz(x);
}
[/code]

gcc 4.9.1 with -O3 produces this:

[asm]
test(int):
  rep bsf eax, edi
  ret
[/asm]

And this with -O3 -mbmi:

[asm]
test(int):
  tzcnt eax, edi
  ret
[/asm]

gcc 4.9.2 and newer (including gcc 9) produces this for both cases:

[asm]
test(int):
  xor eax, eax
  rep bsf eax, edi
  ret
[/asm]

[asm]
test(int):
  xor eax, eax
  tzcnt eax, edi
  ret
[/asm]

This extra xor instruction is not needed here.

[Bug c/88959] Unnecessary xor before bsf/tzcnt

2019-01-21 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88959

--- Comment #1 from Daniel Fruzynski  ---
I have found that this extra xor is not added when compiling with -O3
-march=sandybridge or -O3 -march=ivybridge. However, with -O3
-march=sandybridge/ivybridge -mbmi it is added.

[Bug c/88963] New: gcc generates terrible code for vectors of 64+ length which are not natively supported

2019-01-21 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88963

Bug ID: 88963
   Summary: gcc generates terrible code for vectors of 64+ length
which are not natively supported
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

[code]
typedef int VInt __attribute__((vector_size(64)));

void test(VInt*__restrict a, VInt*__restrict b, 
VInt*__restrict c)
{
*a = *b + *c;
}
[/code]

This code is compiled with -O3 -march=skylake in the following way:

[asm]
test(int __vector(16)*, int __vector(16)*, int __vector(16)*):
  push rbp
  mov rbp, rsp
  and rsp, -64
  sub rsp, 136
  vmovdqa xmm3, XMMWORD PTR [rsi]
  vmovdqa xmm4, XMMWORD PTR [rsi+16]
  vmovdqa xmm5, XMMWORD PTR [rsi+32]
  vmovdqa xmm6, XMMWORD PTR [rsi+48]
  vmovdqa xmm7, XMMWORD PTR [rdx]
  vmovaps XMMWORD PTR [rsp-56], xmm3
  vmovdqa xmm1, XMMWORD PTR [rdx+16]
  vmovaps XMMWORD PTR [rsp-40], xmm4
  vmovdqa ymm4, YMMWORD PTR [rsp-56]
  vmovdqa xmm2, XMMWORD PTR [rdx+32]
  vmovaps XMMWORD PTR [rsp-8], xmm6
  vmovaps XMMWORD PTR [rsp+8], xmm7
  vmovdqa xmm3, XMMWORD PTR [rdx+48]
  vmovaps XMMWORD PTR [rsp-24], xmm5
  vmovaps XMMWORD PTR [rsp+24], xmm1
  vpaddd ymm0, ymm4, YMMWORD PTR [rsp+8]
  vmovdqa ymm5, YMMWORD PTR [rsp-24]
  vmovaps XMMWORD PTR [rsp+40], xmm2
  vmovaps XMMWORD PTR [rsp+56], xmm3
  vmovdqa xmm2, xmm0
  vmovdqa YMMWORD PTR [rsp-120], ymm0
  vpaddd ymm0, ymm5, YMMWORD PTR [rsp+40]
  vmovdqa xmm6, XMMWORD PTR [rsp-104]
  vmovdqa YMMWORD PTR [rsp-88], ymm0
  vmovdqa xmm7, XMMWORD PTR [rsp-72]
  vmovaps XMMWORD PTR [rdi], xmm2
  vmovaps XMMWORD PTR [rdi+16], xmm6
  vmovaps XMMWORD PTR [rdi+32], xmm0
  vmovaps XMMWORD PTR [rdi+48], xmm7
  vzeroupper
  leave
  ret
[/asm]

Other compilers (clang, icc) produce nice code. This is from clang:

[asm]
test(int __vector(16)*, int __vector(16)*, int __vector(16)*): # @test(int
__vector(16)*, int __vector(16)*, int __vector(16)*)
  vmovdqa ymm0, ymmword ptr [rdx]
  vmovdqa ymm1, ymmword ptr [rdx + 32]
  vpaddd ymm0, ymm0, ymmword ptr [rsi]
  vpaddd ymm1, ymm1, ymmword ptr [rsi + 32]
  vmovdqa ymmword ptr [rdi + 32], ymm1
  vmovdqa ymmword ptr [rdi], ymm0
  vzeroupper
  ret
[/asm]

gcc produces nice code for -O3 -march=skylake-avx512. The code is also nice for
vector size 32 with AVX disabled. However, for vector size 128 and -O3
-march=skylake-avx512 the code is ugly again.

[Bug c++/91235] New: Array size expression is implicitly casted to unsigned long type

2019-07-23 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91235

Bug ID: 91235
   Summary: Array size expression is implicitly casted to unsigned
long type
   Product: gcc
   Version: 9.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

[code]
void foo(char*);

inline void bar(int n)
{
if (__builtin_constant_p(n))
{
char a[(int)(n == 2 ? -1 : 0)];
foo(a);
}
}

void baz()
{
bar(2);
}
[/code]

When this is compiled with -O3 -Wall -Wextra -std=c++11 (tested via
godbolt.org), it produces following code:

[asm]
baz():
  push rbp
  mov rbp, rsp
  mov rdi, rsp
  call foo(char*)
  leave
  ret
[/asm]

During compilation gcc reported following warning:
[out]
: In function 'void baz()':

:7:14: warning: argument to variable-length array is too large
[-Wvla-larger-than=]

7 | char a[(int)(n == 2 ? -1 : 0)];

  |  ^

:7:14: note: limit is 9223372036854775807 bytes, but argument is
18446744073709551615

Compiler returned: 0
[/out]

This means that gcc saw that n is constant, and then the expression specified
as the array size was evaluated and implicitly cast to an unsigned type.
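
The two numbers in the note are consistent with this reading: 18446744073709551615 is what -1 becomes as a 64-bit unsigned value, and 9223372036854775807 is the largest 64-bit signed value. A minimal sketch of the arithmetic, where the conversion path through a 64-bit type is an assumption about what gcc does internally and `as_unsigned_size` is a hypothetical helper:

```cpp
#include <cstdint>

// Assumed conversion: sign-extend the int to 64 bits, then reinterpret it
// as an unsigned 64-bit size.
constexpr std::uint64_t as_unsigned_size(int n)
{
    return static_cast<std::uint64_t>(static_cast<std::int64_t>(n));
}

// -1 wraps to 2^64 - 1, the "argument" value quoted in the note.
static_assert(as_unsigned_size(-1) == 18446744073709551615ULL,
              "-1 wraps to 2^64 - 1");

// The "limit" quoted in the note is 2^63 - 1.
static_assert(INT64_MAX == 9223372036854775807LL,
              "limit in the note is 2^63 - 1");
```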

When I removed the "foo(a);" line, this warning was gone, and gcc warned about
an unused variable.

When -1 is specified directly as the array size, gcc correctly reports an error
that the array size is negative. It looks like only expressions cause this
issue.

[Bug c++/91235] Array size expression is implicitly casted to unsigned long type

2019-08-29 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91235

--- Comment #1 from Daniel Fruzynski  ---
I checked that trunk gcc also accepts this code, both with -std=c++11 and
-std=c++1z. Clang also compiles this without error. Could someone take a look
at this and add a comment here?

[Bug c/83369] New: Missing diagnostics during inlining

2017-12-11 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83369

Bug ID: 83369
   Summary: Missing diagnostics during inlining
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

When the code below is compiled, gcc prints warnings that null is passed to a
function with the nonnull attribute. However, gcc does not point out that the
error is caused by inlining of my_strcpy at line 32 of test.cc.

Code was compiled using gcc (GCC) 8.0.0 20171210 (experimental).

[code]
#include <string.h>

char buf[100];

struct Test
{
const char* s1;
const char* s2;
};

__attribute((nonnull(1, 2)))
inline char* my_strcpy(char* __restrict__ dst, const char* __restrict__ src,
size_t size)
{
size_t len = strlen(src);
if (len < size)
memcpy(dst, src, len + 1);
else
{
memcpy(dst, src, size - 1);
dst[size - 1] = '\0';
}
return dst;
}

void test(Test* test)
{
if (test->s1)
my_strcpy(buf, test->s1, sizeof(buf));
else if (test->s2)
my_strcpy(buf, test->s2, sizeof(buf));
else
my_strcpy(buf, test->s2, sizeof(buf)); // error, line 32
}
[/code]

[out]
$ g++ -c -o test.o test.cc -O2 -Wall
test.cc: In function ‘void test(Test*)’:
test.cc:14:24: warning: argument 1 null where non-null expected [-Wnonnull]
 size_t len = strlen(src);
  ~~^
In file included from test.cc:1:
/usr/include/string.h:395:15: note: in a call to function ‘size_t strlen(const
char*)’ declared here
 extern size_t strlen (const char *__s)
   ^~
test.cc:16:15: warning: argument 2 null where non-null expected [-Wnonnull]
 memcpy(dst, src, len + 1);
 ~~^~~
In file included from test.cc:1:
/usr/include/string.h:42:14: note: in a call to function ‘void* memcpy(void*,
const void*, size_t)’ declared here
 extern void *memcpy (void *__restrict __dest, const void *__restrict __src,
  ^~
test.cc:19:15: warning: argument 2 null where non-null expected [-Wnonnull]
 memcpy(dst, src, size - 1);
 ~~^~~~
In file included from test.cc:1:
/usr/include/string.h:42:14: note: in a call to function ‘void* memcpy(void*,
const void*, size_t)’ declared here
 extern void *memcpy (void *__restrict __dest, const void *__restrict __src,
  ^~
[/out]

[Bug c/83373] New: False positive reported by -Wstringop-overflow

2017-12-11 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83373

Bug ID: 83373
   Summary: False positive reported by -Wstringop-overflow
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

When the code below is compiled, gcc incorrectly complains that memcpy will
read data past the end of the buffer in the line marked with a star. It looks
like gcc does not take into account that the 'if' above protects against this.

Code was compiled using gcc (GCC) 8.0.0 20171210 (experimental).

[code]
#include <string.h>

char buf[100];

void get_data(char* ptr);

__attribute((nonnull(1, 2)))
inline char* my_strcpy(char* __restrict__ dst, const char* __restrict__ src,
size_t size)
{
size_t len = strlen(src);
if (len < size)
memcpy(dst, src, len + 1);
else
{
memcpy(dst, src, size - 1); //*
dst[size - 1] = '\0';
}
return dst;
}

void test()
{
char data[20];
get_data(data);
my_strcpy(buf, data, sizeof(buf));
}
[/code]

[out]
$ g++ -c -o test.o test.cc -O2 -Wall
In function ‘char* my_strcpy(char*, const char*, size_t)’,
inlined from ‘void test()’ at test.cc:25:14:
test.cc:15:15: warning: ‘void* memcpy(void*, const void*, size_t)’ reading 99
bytes from a region of size 20 [-Wstringop-overflow=]
 memcpy(dst, src, size - 1); //*
 ~~^~~~
[/out]
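
For a 20-byte source array, strlen() can return at most 19, so with size equal to 100 the truncating branch cannot be reached. A small sketch that exercises both branches of the same logic, assuming a hypothetical helper `copied_length` and dropping the attributes from my_strcpy:

```cpp
#include <cstddef>
#include <cstring>

// Same logic as my_strcpy() in the report, without the attributes.
inline char* my_strcpy(char* dst, const char* src, std::size_t size)
{
    std::size_t len = std::strlen(src);
    if (len < size)
        std::memcpy(dst, src, len + 1);   // whole string plus terminator fits
    else
    {
        std::memcpy(dst, src, size - 1);  // truncate; reached only if len >= size
        dst[size - 1] = '\0';
    }
    return dst;
}

// Hypothetical helper: pass `text` through a 20-byte source buffer into a
// destination limited to `dst_size` bytes (1..100) and return the final length.
std::size_t copied_length(const char* text, std::size_t dst_size)
{
    char data[20] = {};                          // zero-filled, so always terminated
    std::strncpy(data, text, sizeof(data) - 1);  // keeps at most 19 characters
    char dst[100];
    my_strcpy(dst, data, dst_size);
    return std::strlen(dst);
}
```

With dst_size equal to 100 only the first branch runs, whatever the input. The truncating branch is taken only for a destination size of 19 or less.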

[Bug middle-end/81914] [7/8 Regression] gcc 7.1 generates branch for code which was branchless in earlier gcc version

2017-12-12 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81914

--- Comment #9 from Daniel Fruzynski  ---
In the meantime I found another case where gcc 7 inserts lots of jumps. I am
not sure if your extra test cases cover it too:

#include <stdint.h>

int test(int data1[9][9], int data2[9][9])
{
  uint64_t b1 = 0, b2 = 0;
  for (int n = 0; n < 9; ++n)
  {
for (int k = 0; k < 9; ++k)
{
  int a = data1[n][k] * 9 + data2[n][k];
  (a < 64 ? b1 : b2) |= 1 << (a & 63);
}
  }
  return __builtin_popcount(b1) + __builtin_popcount(b2);
}
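
Separately from the jumps discussed here, the test case shifts an int by up to 63 bits, which is undefined, and __builtin_popcount() takes an unsigned int, so the upper halves of b1 and b2 are not counted. A variant with 64-bit operations is sketched below. The names `test64` and `count_for_grid` are hypothetical, and it was not checked whether this variant still triggers the same code generation:

```cpp
#include <cstdint>

// Variant of test() that uses a 64-bit shift and a 64-bit popcount.
int test64(int data1[9][9], int data2[9][9])
{
    std::uint64_t b1 = 0, b2 = 0;
    for (int n = 0; n < 9; ++n)
    {
        for (int k = 0; k < 9; ++k)
        {
            int a = data1[n][k] * 9 + data2[n][k];
            // Values below 64 go to b1, the rest to b2 at bit (a & 63).
            (a < 64 ? b1 : b2) |= std::uint64_t{1} << (a & 63);
        }
    }
    return __builtin_popcountll(b1) + __builtin_popcountll(b2);
}

// Hypothetical driver: data1[n][k] = n and data2[n][k] = k give 81 distinct
// values of a (0..80), so 64 bits are set in b1 and 17 bits in b2.
int count_for_grid()
{
    int data1[9][9], data2[9][9];
    for (int n = 0; n < 9; ++n)
    {
        for (int k = 0; k < 9; ++k)
        {
            data1[n][k] = n;
            data2[n][k] = k;
        }
    }
    return test64(data1, data2);
}
```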

[Bug middle-end/83373] False positive reported by -Wstringop-overflow

2017-12-12 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83373

--- Comment #4 from Daniel Fruzynski  ---
> Bug 83373 - False positive reported by -Wstringop-overflow, is
> another example of warning triggered by a missed optimization
> opportunity, this time in the strlen pass.  The optimization
> is discussed in pr78450 - strlen(s) return value can be assumed
> to be less than the size of s.  The gist of it is that the result
> of strlen(array) can be assumed to be less than the size of
> the array (except in the corner case of last struct members).

This approach is not good from my perspective. I have structs used for IPC
purposes, and I cannot reorder fields or add a new one at the end to silence
this warning. A better approach would be to explicitly mark a structure as
flexible-width with a special attribute, and to apply this exception only to
structures marked this way. As you wrote, this is a corner case, so requiring
the attribute there is acceptable. It could also improve diagnostics in other
cases, if you use a similar approach for other warnings too.

[Bug middle-end/83373] False positive reported by -Wstringop-overflow

2017-12-13 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83373

--- Comment #6 from Daniel Fruzynski  ---
My understanding is that after this patch is applied, gcc will still emit the
warning for the last field in a struct, e.g. as in the code below. Is my
understanding correct, or did I miss something?

struct Msg
{
  int op;
  char str1[100];
  char str2[100];
};

...

void func()
{
  Msg msg;
  msg.op = 5;

  char data1[20], data2[20];
  get_data(data1);
  get_data(data2);

  my_strcpy(msg.str1, data1, sizeof(msg.str1)); // OK, no warning
  my_strcpy(msg.str2, data2, sizeof(msg.str2)); // Warning still present

  send_msg(&msg, sizeof(msg));
}

[Bug middle-end/83373] False positive reported by -Wstringop-overflow

2017-12-13 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83373

--- Comment #7 from Daniel Fruzynski  ---
In my case, structures like Msg above are generated from IDL files together with
code for serialization and deserialization. Because of this I cannot freely
move or add fields there, as this may break compatibility.

[Bug middle-end/83373] False positive reported by -Wstringop-overflow

2017-12-13 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83373

--- Comment #9 from Daniel Fruzynski  ---
Thanks for the explanation. In addition to allocation on the stack, my app also
uses a custom allocator function like the one below, so it should work as
expected in this case too.

void* msg_alloc(int msg_id);
...

Msg* msg = (Msg*)msg_alloc(ID_OF_MSG);
...

Anyway, this new attribute looks useful to me; it could probably allow better
diagnostics and optimization. However, treating all [sub]objects without this
attribute as fixed-size may break some existing code, so an extra command-line
switch to enable the old (current) behavior would also be needed. All of this
probably needs a separate issue here to track it.

[Bug c++/83429] New: Incorrect line number reported by -Wformat-truncation

2017-12-14 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83429

Bug ID: 83429
   Summary: Incorrect line number reported by -Wformat-truncation
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

[code]
#include <stdio.h>

struct S
{
    char str1[10];
    char str2[10];
    char out[15];
};

void test(S* s) // line 10
{
    snprintf(s->out, sizeof(s->out), "%s.%s", s->str1, s->str2); // line 12
}
[/code]

When the above code is compiled using "g++ -c -o test.o test.cc -O2 -Wall", it
produces the following output:

[out]
test.cc: In function ‘void test(S*)’:
test.cc:10:6: warning: ‘%s’ directive output may be truncated writing up to 9
bytes into a region of size between 5 and 14 [-Wformat-truncation=]
 void test(S* s) // line 10
  ^~~~
test.cc:12:13: note: ‘snprintf’ output between 2 and 20 bytes into a
destination of size 15
 snprintf(s->out, sizeof(s->out), "%s.%s", s->str1, s->str2); // line 12
 ^~~
[/out]

As you can see, the line number in the "warning:" line is incorrect: it points
to the line with the function name. Fortunately the correct number is in the
"note:" line. However, when the code is compiled with -D_FORTIFY_SOURCE=1 added,
you lose this important piece of information:

[out]
test.cc: In function ‘void test(S*)’:
test.cc:10:6: warning: ‘%s’ directive output may be truncated writing up to 9
bytes into a region of size between 5 and 14 [-Wformat-truncation=]
 void test(S* s) // line 10
  ^~~~
In file included from /usr/include/stdio.h:937,
 from test.cc:1:
/usr/include/bits/stdio2.h:64:35: note: ‘__builtin_snprintf’ output between 2
and 20 bytes into a destination of size 15
   return __builtin___snprintf_chk (__s, __n, __USE_FORTIFY_LEVEL - 1,
  ~^~~
__bos (__s), __fmt, __va_arg_pack ());
~
[/out]

g++ --version
g++ (GCC) 8.0.0 20171210 (experimental)
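
The byte counts quoted in the first output above can be reproduced by hand. The sketch below does the arithmetic with compile-time constants. The constant names and the bytes_needed helper are assumptions for illustration, and the derivation only appears to match what gcc reports; it is not taken from the gcc sources.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

struct S
{
    char str1[10];
    char str2[10];
    char out[15];
};

// A NUL-terminated string stored in char[N] holds at most N - 1 characters.
constexpr std::size_t max1 = sizeof(S::str1) - 1;  // 9
constexpr std::size_t max2 = sizeof(S::str2) - 1;  // 9

// "%s.%s": first string, one '.', second string, terminating NUL.
constexpr std::size_t min_total = 0 + 1 + 0 + 1;
constexpr std::size_t max_total = max1 + 1 + max2 + 1;
static_assert(min_total == 2 && max_total == 20,
              "matches 'output between 2 and 20 bytes'");

// Space left in out when the second %s starts: 15 minus first string and '.'.
constexpr std::size_t room_max = sizeof(S::out) - (0 + 1);
constexpr std::size_t room_min = sizeof(S::out) - (max1 + 1);
static_assert(room_min == 5 && room_max == 14,
              "matches 'region of size between 5 and 14'");

// Hypothetical run-time check: bytes snprintf needs, including the NUL.
int bytes_needed(const S& s)
{
    int n = std::snprintf(nullptr, 0, "%s.%s", s.str1, s.str2);
    assert(n >= 0);
    return n + 1;
}
```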

[Bug c++/83430] New: buffer overflow diagnostics for snprintf is broken

2017-12-14 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83430

Bug ID: 83430
   Summary: buffer overflow diagnostics for snprintf is broken
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

[code]
#include <stdio.h>

struct S
{
    char str[20];
    char out[15];
};

void test(S* s)
{
    snprintf(s->out, sizeof(s->str), "[%s]", s->str);
}
[/code]

[out]
$ g++ -c -o test.o test.cc -O2 -Wall
test.cc: In function ‘void test(S*)’:
test.cc:9:6: warning: ‘]’ directive output may be truncated writing 1 byte into
a region of size between 0 and 19 [-Wformat-truncation=]
 void test(S* s)
  ^~~~
test.cc:11:13: note: ‘snprintf’ output between 3 and 22 bytes into a
destination of size 20
 snprintf(s->out, sizeof(s->str), "[%s]", s->str);
 ^~~~
[/out]

There are two problems here:
- the snprintf diagnostic does not detect that the actual size of out is 15
bytes, not 20;
- the code passes the size of one of the input arguments, which will be part of
the output string, instead of the output buffer size.

Output for compilation with -D_FORTIFY_SOURCE=2 has the same problems.

g++ --version
g++ (GCC) 8.0.0 20171210 (experimental)
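
Until the compiler catches the mismatched size, one way to rule out this class of mistake in C++ is to let a template deduce the destination size from the array itself, so no sizeof has to be written at the call site. This is only a sketch; format_bracketed is a hypothetical helper, not something from the report or from any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical helper: N is deduced from the destination array, so the
// size passed to snprintf cannot come from a different member by accident.
template <std::size_t N>
int format_bracketed(char (&out)[N], const char* text)
{
    static_assert(N > 0, "destination must have room for the NUL");
    assert(text != nullptr);
    // Returns what snprintf returns: the length the full output would have.
    return std::snprintf(out, N, "[%s]", text);
}
```

With the struct from the report, the call would read format_bracketed(s->out, s->str). This only works where the destination is a real array, not a pointer.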

[Bug c++/83431] New: -Wformat-truncation may incorrectly report truncation

2017-12-14 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83431

Bug ID: 83431
   Summary: -Wformat-truncation may incorrectly report truncation
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

This looks like another missed optimization: -Wformat-truncation does not
take into account that there is an "if" which checks that truncation will not
happen.

[code]
#include <stdio.h>
#include <string.h>

struct S
{
    char str[20];
    char out[10];
};

void test(S* s)
{
    if (strlen(s->str) < sizeof(s->out) - 2)
        snprintf(s->out, sizeof(s->out), "[%s]", s->str);
}
[/code]

[out]
$ g++ -c -o test.o test.cc -O2 -Wall
test.cc: In function ‘void test(S*)’:
test.cc:10:6: warning: ‘%s’ directive output may be truncated writing up to 19
bytes into a region of size 9 [-Wformat-truncation=]
 void test(S* s)
  ^~~~
test.cc:13:17: note: ‘snprintf’ output between 3 and 22 bytes into a
destination of size 10
 snprintf(s->out, sizeof(s->out), "[%s]", s->str);
 ^~~~
[/out]

g++ --version
g++ (GCC) 8.0.0 20171210 (experimental)
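
Independently of the diagnostic, truncation can be detected at run time from the value snprintf returns, which is the length the complete output would have had. A minimal sketch follows; the name format_checked is an assumption. As far as I know the gcc manual describes level 1 of -Wformat-truncation as limited to calls whose return value is unused, so this form may also avoid the warning, but that was not verified on the snapshot above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

struct S
{
    char str[20];
    char out[10];
};

// Hypothetical variant of test(): returns true only if "[str]" fit entirely.
bool format_checked(S* s)
{
    assert(s != nullptr);
    int n = std::snprintf(s->out, sizeof(s->out), "[%s]", s->str);
    // n is the untruncated length without the NUL; n >= size means truncation.
    return n >= 0 && static_cast<std::size_t>(n) < sizeof(s->out);
}
```

The result agrees with the guard in the report: strings of up to 7 characters fit, 8 or more are truncated.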

[Bug middle-end/59521] __builtin_expect not effective in switch

2017-12-14 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59521

Daniel Fruzynski  changed:

   What|Removed |Added

 CC||bugzilla@poradnik-webmaster
   ||a.com

--- Comment #15 from Daniel Fruzynski  ---
+1 for this, I wanted to request this today too. I see that some patch is
ready, how is review going?

[Bug c++/83429] Incorrect line number reported by -Wformat-truncation

2017-12-15 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83429

--- Comment #1 from Daniel Fruzynski  ---
Another test case; this time the "note:" with the argument range also points to
the incorrect line:

[code]
#include <stdio.h>

struct S
{
    unsigned char n;
    char out[2];
};

void test(S* s) // line 9
{
    snprintf(s->out, sizeof(s->out), "%d", s->n); // line 11
}
[/code]

[out]
test.cc: In function ‘void test(S*)’:
test.cc:9:6: warning: ‘%d’ directive output may be truncated writing between 1
and 3 bytes into a region of size 2 [-Wformat-truncation=]
 void test(S* s) // line 9
  ^~~~
test.cc:9:6: note: directive argument in the range [0, 255]
test.cc:11:13: note: ‘snprintf’ output between 2 and 4 bytes into a destination
of size 2
 snprintf(s->out, sizeof(s->out), "%d", s->n); // line 11
 ^~~~
[/out]

[Bug c++/83429] Incorrect line number reported by -Wformat-truncation

2017-12-15 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83429

--- Comment #2 from Daniel Fruzynski  ---
Sometimes the actual location is not reported at all:

[code]
#include 
#include 

struct S
{
    char* str;
    int n;
    char out[10];
};

void test(S* s)
{
    if (s->str)
        snprintf(s->out, sizeof(s->out), "%d", s->n);
    else
        snprintf(s->out, sizeof(s->out), ".%s", s->str);
}
[/code]

[out]
test.cc: In function ‘void test(S*)’:
test.cc:11:6: warning: ‘%s’ directive argument is null [-Wformat-truncation=]
 void test(S* s)
  ^~~~
[/out]

[Bug c/83479] New: Register spilling in AVX code

2017-12-19 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83479

Bug ID: 83479
   Summary: Register spilling in AVX code
   Product: gcc
   Version: 7.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

Here is a snippet of code which performs some calculations on a matrix. It
repeatedly transforms an (N * N) matrix into an (N-1 * N-1) one, and returns the
final scalar value. For some reason gcc is not able to detect that intermediate
values are not needed anymore, and starts spilling. The code below is from gcc
7.2; the trunk version also generates similar code. It was compiled with "-O3
-march=haswell".
BTW, clang 5 handles this properly and does not spill.

[code]
#include "immintrin.h"

double test(const double data[9][8])
{
  __m256d vLastRow, vLastCol, vSqrtRow, vSqrtCol;

  __m256d v1 = _mm256_load_pd (&data[0][0]);
  __m256d v2 = _mm256_load_pd (&data[1][0]);
  __m256d v3 = _mm256_load_pd (&data[2][0]);
  __m256d v4 = _mm256_load_pd (&data[3][0]);
  __m256d v5 = _mm256_load_pd (&data[4][0]);
  __m256d v6 = _mm256_load_pd (&data[5][0]);
  __m256d v7 = _mm256_load_pd (&data[6][0]);
  __m256d v8 = _mm256_load_pd (&data[7][0]);

  // 8
  vLastRow = _mm256_load_pd (&data[9][0]);
  vSqrtRow = _mm256_sqrt_pd(vLastRow);

  vLastCol = _mm256_set1_pd(vLastRow[0]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v1 = (v1 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm256_set1_pd(vLastRow[1]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v2 = (v2 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm256_set1_pd(vLastRow[2]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v3 = (v3 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm256_set1_pd(vLastRow[3]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v4 = (v4 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm256_set1_pd(vLastRow[4]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v5 = (v5 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm256_set1_pd(vLastRow[5]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v6 = (v6 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm256_set1_pd(vLastRow[6]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v7 = (v7 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm256_set1_pd(vLastRow[7]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v8 = (v8 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;

  // 7
  vLastRow = v8;
  vSqrtRow = _mm256_sqrt_pd(vLastRow);

  vLastCol = _mm256_set1_pd(vLastRow[0]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v1 = (v1 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm256_set1_pd(vLastRow[1]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v2 = (v2 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm256_set1_pd(vLastRow[2]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v3 = (v3 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm256_set1_pd(vLastRow[3]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v4 = (v4 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm256_set1_pd(vLastRow[4]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v5 = (v5 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm256_set1_pd(vLastRow[5]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v6 = (v6 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm256_set1_pd(vLastRow[6]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v7 = (v7 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;

  // 6
  vLastRow = v7;
  vSqrtRow = _mm256_sqrt_pd(vLastRow);

  vLastCol = _mm256_set1_pd(vLastRow[0]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v1 = (v1 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm256_set1_pd(vLastRow[1]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v2 = (v2 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm256_set1_pd(vLastRow[2]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v3 = (v3 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm256_set1_pd(vLastRow[3]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v4 = (v4 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm256_set1_pd(vLastRow[4]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v5 = (v5 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm256_set1_pd(vLastRow[5]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v6 = (v6 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;

  // 5
  vLastRow = v6;
  vSqrtRow = _mm256_sqrt_pd(vLastRow);

  vLastCol = _mm256_set1_pd(vLastRow[0]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v1 = (v1 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm256_set1_pd(vLastRow[1]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v2 = (v2 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm256_set1_pd(vLastRow[2]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v3 = (v3 - vLastRow * vLastCol) * vSqrtRow * vSqrtC

[Bug c/83479] Register spilling in AVX code

2017-12-19 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83479

--- Comment #1 from Daniel Fruzynski  ---
Here is the clang 5.0 output; it is also shorter than the gcc one (213 lines,
gcc produced 247).

test(double const (*) [8]): # @test(double const (*) [8])
  vmovapd ymm3, ymmword ptr [rdi + 64]
  vmovapd ymm4, ymmword ptr [rdi + 128]
  vmovapd ymm5, ymmword ptr [rdi + 192]
  vmovapd ymm6, ymmword ptr [rdi + 256]
  vmovapd ymm8, ymmword ptr [rdi + 320]
  vmovapd ymm2, ymmword ptr [rdi + 384]
  vmovapd ymm1, ymmword ptr [rdi + 576]
  vsqrtpd ymm0, ymm1
  vmovupd ymmword ptr [rsp - 56], ymm0 # 32-byte Spill
  vpermpd ymm7, ymm1, 85 # ymm7 = ymm1[1,1,1,1]
  vsqrtpd ymm9, ymm7
  vmovapd ymm10, ymmword ptr [rdi + 448]
  vmulpd ymm7, ymm1, ymm7
  vsubpd ymm3, ymm3, ymm7
  vmulpd ymm3, ymm0, ymm3
  vpermpd ymm7, ymm1, 170 # ymm7 = ymm1[2,2,2,2]
  vsqrtpd ymm11, ymm7
  vmulpd ymm9, ymm9, ymm3
  vmulpd ymm3, ymm1, ymm7
  vsubpd ymm3, ymm4, ymm3
  vmulpd ymm3, ymm0, ymm3
  vpermpd ymm4, ymm1, 255 # ymm4 = ymm1[3,3,3,3]
  vsqrtpd ymm7, ymm4
  vmulpd ymm11, ymm3, ymm11
  vmulpd ymm3, ymm1, ymm4
  vsubpd ymm3, ymm5, ymm3
  vmulpd ymm3, ymm0, ymm3
  vmulpd ymm4, ymm3, ymm7
  vsqrtpd ymm7, ymm0
  vmulpd ymm3, ymm1, ymm0
  vsubpd ymm5, ymm6, ymm3
  vmulpd ymm5, ymm0, ymm5
  vmulpd ymm5, ymm5, ymm7
  vsubpd ymm6, ymm8, ymm3
  vmulpd ymm6, ymm0, ymm6
  vmulpd ymm6, ymm6, ymm7
  vsubpd ymm2, ymm2, ymm3
  vmulpd ymm2, ymm0, ymm2
  vmulpd ymm8, ymm2, ymm7
  vsubpd ymm2, ymm10, ymm3
  vmulpd ymm2, ymm0, ymm2
  vmulpd ymm3, ymm2, ymm7
  vsqrtpd ymm2, ymm3
  vpermpd ymm10, ymm3, 85 # ymm10 = ymm3[1,1,1,1]
  vsqrtpd ymm12, ymm10
  vmulpd ymm10, ymm3, ymm10
  vsubpd ymm9, ymm9, ymm10
  vmulpd ymm9, ymm2, ymm9
  vmulpd ymm9, ymm12, ymm9
  vpermpd ymm10, ymm3, 170 # ymm10 = ymm3[2,2,2,2]
  vsqrtpd ymm12, ymm10
  vmulpd ymm10, ymm3, ymm10
  vsubpd ymm10, ymm11, ymm10
  vmulpd ymm10, ymm2, ymm10
  vmulpd ymm10, ymm12, ymm10
  vpermpd ymm11, ymm3, 255 # ymm11 = ymm3[3,3,3,3]
  vsqrtpd ymm12, ymm11
  vmulpd ymm11, ymm3, ymm11
  vsubpd ymm4, ymm4, ymm11
  vmulpd ymm4, ymm2, ymm4
  vmulpd ymm11, ymm4, ymm12
  vmulpd ymm4, ymm3, ymm0
  vsubpd ymm5, ymm5, ymm4
  vmulpd ymm5, ymm2, ymm5
  vmulpd ymm12, ymm7, ymm5
  vsubpd ymm5, ymm6, ymm4
  vmulpd ymm6, ymm2, ymm5
  vsubpd ymm4, ymm8, ymm4
  vmulpd ymm4, ymm2, ymm4
  vmulpd ymm5, ymm7, ymm4
  vsqrtpd ymm4, ymm5
  vpermpd ymm8, ymm5, 85 # ymm8 = ymm5[1,1,1,1]
  vsqrtpd ymm13, ymm8
  vmulpd ymm6, ymm7, ymm6
  vmulpd ymm8, ymm5, ymm8
  vsubpd ymm8, ymm9, ymm8
  vmulpd ymm8, ymm4, ymm8
  vpermpd ymm9, ymm5, 170 # ymm9 = ymm5[2,2,2,2]
  vsqrtpd ymm14, ymm9
  vmulpd ymm13, ymm13, ymm8
  vmulpd ymm8, ymm5, ymm9
  vsubpd ymm8, ymm10, ymm8
  vmulpd ymm8, ymm4, ymm8
  vpermpd ymm9, ymm5, 255 # ymm9 = ymm5[3,3,3,3]
  vsqrtpd ymm10, ymm9
  vmulpd ymm14, ymm8, ymm14
  vmulpd ymm8, ymm5, ymm9
  vsubpd ymm8, ymm11, ymm8
  vmulpd ymm8, ymm4, ymm8
  vmulpd ymm9, ymm8, ymm10
  vmulpd ymm8, ymm5, ymm0
  vsubpd ymm10, ymm12, ymm8
  vmulpd ymm10, ymm4, ymm10
  vmulpd ymm10, ymm7, ymm10
  vsubpd ymm6, ymm6, ymm8
  vmulpd ymm6, ymm4, ymm6
  vmulpd ymm8, ymm7, ymm6
  vsqrtpd ymm6, ymm8
  vpermpd ymm11, ymm8, 85 # ymm11 = ymm8[1,1,1,1]
  vsqrtpd ymm12, ymm11
  vmulpd ymm11, ymm8, ymm11
  vsubpd ymm11, ymm13, ymm11
  vmulpd ymm11, ymm6, ymm11
  vmulpd ymm11, ymm11, ymm12
  vpermpd ymm12, ymm8, 170 # ymm12 = ymm8[2,2,2,2]
  vsqrtpd ymm13, ymm12
  vmulpd ymm12, ymm8, ymm12
  vsubpd ymm12, ymm14, ymm12
  vmulpd ymm12, ymm6, ymm12
  vmulpd ymm12, ymm12, ymm13
  vpermpd ymm13, ymm8, 255 # ymm13 = ymm8[3,3,3,3]
  vsqrtpd ymm14, ymm13
  vmulpd ymm13, ymm8, ymm13
  vsubpd ymm13, ymm9, ymm13
  vmulpd ymm9, ymm8, ymm0
  vsubpd ymm9, ymm10, ymm9
  vmulpd ymm9, ymm9, ymm6
  vmulpd ymm9, ymm7, ymm9
  vsqrtpd ymm7, ymm9
  vpermpd ymm10, ymm9, 85 # ymm10 = ymm9[1,1,1,1]
  vsqrtpd ymm15, ymm10
  vmulpd ymm13, ymm6, ymm13
  vmulpd ymm13, ymm13, ymm14
  vmulpd ymm10, ymm9, ymm10
  vsubpd ymm10, ymm11, ymm10
  vpermpd ymm11, ymm9, 170 # ymm11 = ymm9[2,2,2,2]
  vsqrtpd ymm14, ymm11
  vmulpd ymm10, ymm10, ymm7
  vmulpd ymm15, ymm10, ymm15
  vmulpd ymm10, ymm9, ymm11
  vsubpd ymm10, ymm12, ymm10
  vpermpd ymm11, ymm9, 255 # ymm11 = ymm9[3,3,3,3]
  vsqrtpd ymm12, ymm11
  vmulpd ymm0, ymm10, ymm7
  vmulpd ymm10, ymm9, ymm11
  vsubpd ymm10, ymm13, ymm10
  vmulpd ymm10, ymm7, ymm10
  vmulpd ymm11, ymm10, ymm12
  vsqrtpd ymm10, ymm11
  vpermpd ymm12, ymm11, 85 # ymm12 = ymm11[1,1,1,1]
  vsqrtpd ymm13, ymm12
  vmulpd ymm0, ymm0, ymm14
  vmulpd ymm12, ymm11, ymm12
  vsubpd ymm12, ymm15, ymm12
  vmulpd ymm12, ymm10, ymm12
  vpermpd ymm14, ymm11, 170 # ymm14 = ymm11[2,2,2,2]
  vsqrtpd ymm15, ymm14
  vmulpd ymm12, ymm13, ymm12
  vmulpd ymm13, ymm11, ymm14
  vsubpd ymm0, ymm0, ymm13
  vmulpd ymm0, ymm10, ymm0
  vmulpd ymm13, ymm15, ymm0
  vsqrtpd ymm0, ymm13
  vmovupd ymmword ptr [rsp - 88], ymm0 # 32-byte Spill
  vpermpd ymm14, ymm13, 85 # ymm14 = ymm13[1,1,1,1]
  vsqrtpd ymm15, ymm14
  vmulpd ymm14, ymm13, ymm14
  vsubpd ymm12, ymm12, ymm14
  vmulpd ymm12, ymm0, y

[Bug target/83479] Register spilling in AVX code

2017-12-19 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83479

--- Comment #4 from Daniel Fruzynski  ---
Rule No.1: never log bugs before morning coffee ;)

This does not produce warnings, compiled with "-O3 -march=haswell -mavx512f
-mavx512vl -mavx512bw -mavx512dq -mavx512cd -Wall -Werror".
[code]
#include "immintrin.h"

double test(const double data[9][8])
{
  __m512d vLastRow, vLastCol, vSqrtRow, vSqrtCol;

  __m512d v1 = _mm512_load_pd (&data[0][0]);
  __m512d v2 = _mm512_load_pd (&data[1][0]);
  __m512d v3 = _mm512_load_pd (&data[2][0]);
  __m512d v4 = _mm512_load_pd (&data[3][0]);
  __m512d v5 = _mm512_load_pd (&data[4][0]);
  __m512d v6 = _mm512_load_pd (&data[5][0]);
  __m512d v7 = _mm512_load_pd (&data[6][0]);
  __m512d v8 = _mm512_load_pd (&data[7][0]);

  // 8
  vLastRow = _mm512_load_pd (&data[9][0]);
  vSqrtRow = _mm512_sqrt_pd(vLastRow);

  vLastCol = _mm512_set1_pd(vLastRow[0]);
  vSqrtCol = _mm512_sqrt_pd(vLastCol);
  v1 = (v1 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm512_set1_pd(vLastRow[1]);
  vSqrtCol = _mm512_sqrt_pd(vLastCol);
  v2 = (v2 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm512_set1_pd(vLastRow[2]);
  vSqrtCol = _mm512_sqrt_pd(vLastCol);
  v3 = (v3 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm512_set1_pd(vLastRow[3]);
  vSqrtCol = _mm512_sqrt_pd(vLastCol);
  v4 = (v4 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm512_set1_pd(vLastRow[4]);
  vSqrtCol = _mm512_sqrt_pd(vLastCol);
  v5 = (v5 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm512_set1_pd(vLastRow[5]);
  vSqrtCol = _mm512_sqrt_pd(vLastCol);
  v6 = (v6 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm512_set1_pd(vLastRow[6]);
  vSqrtCol = _mm512_sqrt_pd(vLastCol);
  v7 = (v7 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm512_set1_pd(vLastRow[7]);
  vSqrtCol = _mm512_sqrt_pd(vLastCol);
  v8 = (v8 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;

  // 7
  vLastRow = v8;
  vSqrtRow = _mm512_sqrt_pd(vLastRow);

  vLastCol = _mm512_set1_pd(vLastRow[0]);
  vSqrtCol = _mm512_sqrt_pd(vLastCol);
  v1 = (v1 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm512_set1_pd(vLastRow[1]);
  vSqrtCol = _mm512_sqrt_pd(vLastCol);
  v2 = (v2 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm512_set1_pd(vLastRow[2]);
  vSqrtCol = _mm512_sqrt_pd(vLastCol);
  v3 = (v3 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm512_set1_pd(vLastRow[3]);
  vSqrtCol = _mm512_sqrt_pd(vLastCol);
  v4 = (v4 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm512_set1_pd(vLastRow[4]);
  vSqrtCol = _mm512_sqrt_pd(vLastCol);
  v5 = (v5 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm512_set1_pd(vLastRow[5]);
  vSqrtCol = _mm512_sqrt_pd(vLastCol);
  v6 = (v6 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm512_set1_pd(vLastRow[6]);
  vSqrtCol = _mm512_sqrt_pd(vLastCol);
  v7 = (v7 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;

  // 6
  vLastRow = v7;
  vSqrtRow = _mm512_sqrt_pd(vLastRow);

  vLastCol = _mm512_set1_pd(vLastRow[0]);
  vSqrtCol = _mm512_sqrt_pd(vLastCol);
  v1 = (v1 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm512_set1_pd(vLastRow[1]);
  vSqrtCol = _mm512_sqrt_pd(vLastCol);
  v2 = (v2 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm512_set1_pd(vLastRow[2]);
  vSqrtCol = _mm512_sqrt_pd(vLastCol);
  v3 = (v3 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm512_set1_pd(vLastRow[3]);
  vSqrtCol = _mm512_sqrt_pd(vLastCol);
  v4 = (v4 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm512_set1_pd(vLastRow[4]);
  vSqrtCol = _mm512_sqrt_pd(vLastCol);
  v5 = (v5 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm512_set1_pd(vLastRow[5]);
  vSqrtCol = _mm512_sqrt_pd(vLastCol);
  v6 = (v6 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;

  // 5
  vLastRow = v6;
  vSqrtRow = _mm512_sqrt_pd(vLastRow);

  vLastCol = _mm512_set1_pd(vLastRow[0]);
  vSqrtCol = _mm512_sqrt_pd(vLastCol);
  v1 = (v1 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm512_set1_pd(vLastRow[1]);
  vSqrtCol = _mm512_sqrt_pd(vLastCol);
  v2 = (v2 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm512_set1_pd(vLastRow[2]);
  vSqrtCol = _mm512_sqrt_pd(vLastCol);
  v3 = (v3 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm512_set1_pd(vLastRow[3]);
  vSqrtCol = _mm512_sqrt_pd(vLastCol);
  v4 = (v4 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm512_set1_pd(vLastRow[4]);
  vSqrtCol = _mm512_sqrt_pd(vLastCol);
  v5 = (v5 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;

  // 4
  vLastRow = v5;
  vSqrtRow = _mm512_sqrt_pd(vLastRow);

  vLastCol = _mm512_set1_pd(vLastRow[0]);
  vSqrtCol = _mm512_sqrt_pd(vLastCol);
  v1 = (v1 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm512_set1_pd(vLastRow[1]);
  vSqrtCol = _mm512_sqrt_pd(vLastCol);
  v2 = (v2 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm512_s

[Bug target/83479] Register spilling in AVX code

2017-12-19 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83479

--- Comment #5 from Daniel Fruzynski  ---
Here is a valid AVX version as well; it also spills a bit. Compiled with "-O3
-march=haswell -Wall -Werror".

[code]
#include "immintrin.h"

double test(const double data[5][4])
{
  __m256d vLastRow, vLastCol, vSqrtRow, vSqrtCol;

  __m256d v1 = _mm256_load_pd (&data[0][0]);
  __m256d v2 = _mm256_load_pd (&data[1][0]);
  __m256d v3 = _mm256_load_pd (&data[2][0]);
  __m256d v4 = _mm256_load_pd (&data[3][0]);

  // 4
  vLastRow = _mm256_load_pd (&data[4][0]);
  vSqrtRow = _mm256_sqrt_pd(vLastRow);

  vLastCol = _mm256_set1_pd(vLastRow[0]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v1 = (v1 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm256_set1_pd(vLastRow[1]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v2 = (v2 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm256_set1_pd(vLastRow[2]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v3 = (v3 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm256_set1_pd(vLastRow[3]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v4 = (v4 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;

  // 3
  vLastRow = v4;
  vSqrtRow = _mm256_sqrt_pd(vLastRow);

  vLastCol = _mm256_set1_pd(vLastRow[0]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v1 = (v1 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm256_set1_pd(vLastRow[1]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v2 = (v2 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm256_set1_pd(vLastRow[2]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v3 = (v3 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;

  // 2
  vLastRow = v3;
  vSqrtRow = _mm256_sqrt_pd(vLastRow);

  vLastCol = _mm256_set1_pd(vLastRow[0]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v1 = (v1 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;
  vLastCol = _mm256_set1_pd(vLastRow[1]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v2 = (v2 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;

  // 1
  vLastRow = v2;
  vSqrtRow = _mm256_sqrt_pd(vLastRow);

  vLastCol = _mm256_set1_pd(vLastRow[0]);
  vSqrtCol = _mm256_sqrt_pd(vLastCol);
  v1 = (v1 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol;

  return v1[0];
}
[/code]
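
For checking results, the unrolled AVX code above can be restated as plain loops. The sketch below is a scalar reference written from the intrinsics version, keeping the same order of operations per element. The names test_scalar and W are assumptions, and it says nothing about register allocation; it only mirrors the arithmetic.

```cpp
#include <cassert>
#include <cmath>

constexpr int W = 4;  // doubles per __m256d

// Hypothetical scalar reference for the 5x4 AVX version (not from the report).
double test_scalar(const double data[5][W])
{
    double v[4][W];   // v[0]..v[3] play the role of v1..v4
    double last[W];   // vLastRow

    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < W; ++j)
            v[i][j] = data[i][j];
    for (int j = 0; j < W; ++j)
        last[j] = data[4][j];

    // Each pass updates the remaining rows, then the last updated row
    // becomes vLastRow for the next, smaller pass.
    for (int rows = 4; rows >= 1; --rows)
    {
        for (int i = 0; i < rows; ++i)
        {
            assert(i < W);                // last[i] is the broadcast element
            const double col = last[i];   // _mm256_set1_pd(vLastRow[i])
            for (int j = 0; j < W; ++j)
                v[i][j] = (v[i][j] - last[j] * col)
                          * std::sqrt(last[j]) * std::sqrt(col);
        }
        for (int j = 0; j < W; ++j)
            last[j] = v[rows - 1][j];
    }
    return v[0][0];
}
```

Exact bit-for-bit agreement with the vector code is not guaranteed, for example if the compiler contracts the multiply and subtract into an FMA in one version only.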

[Bug target/83479] Register spilling in AVX code

2017-12-19 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83479

--- Comment #6 from Daniel Fruzynski  ---
One correction: in c#4, line 17 has an incorrect index; it should be 8 instead
of 9. For some reason gcc did not complain here.

vLastRow = _mm512_load_pd (&data[8][0]);

[Bug middle-end/81914] [7 Regression] gcc 7.1 generates branch for code which was branchless in earlier gcc version

2017-12-21 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81914

--- Comment #12 from Daniel Fruzynski  ---
One more test case. Code compiled with TEST defined is branchless; without it,
the code has a branch.

[code]
#include <stdint.h>

#define TEST

void test(uint64_t* a)
{
  uint64_t n = *a / 8;
  if (0 == n)
    n = 1;
#ifdef TEST
  *a += n;
#else
  *a += 1 << n;
#endif
}
[/code]

[Bug c/83610] New: __builtin_expect sometimes is ignored

2017-12-28 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83610

Bug ID: 83610
   Summary: __builtin_expect sometimes is ignored
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

[code]
void f1();
void f2();

void test(int a, int b, int c, int d, int n, int k)
{
  int val = a & b;
  if (__builtin_expect(!!(n == k), 0))
    val &= c;
  if (__builtin_expect(!!(n == 10 - k), 0))
    val &= d;
  if (val)
    f1();
  else
    f2();
}
[/code]

This code compiled with gcc 4.8.5 generates branches as expected:

[asm]
test(int, int, int, int, int, int):
  and edi, esi
  cmp r8d, r9d
  je .L6
.L2:
  mov eax, 10
  sub eax, r9d
  cmp r8d, eax
  je .L7
.L3:
  test edi, edi
  jne .L8
  jmp f2()
.L8:
  jmp f1()
.L7:
  and edi, ecx
  jmp .L3
.L6:
  and edi, edx
  jmp .L2
[/asm]

When this code is compiled with gcc 4.9.0 or higher, it generates branchless
code like that below. In my case it is slower than the version with branches. I
wanted to convince the compiler to generate the branching version by using
__builtin_expect, but for some reason it does not work.

[asm]
test(int, int, int, int, int, int):
  and esi, edi
  mov eax, 10
  and edx, esi
  cmp r8d, r9d
  cmove esi, edx
  sub eax, r9d
  and ecx, esi
  cmp r8d, eax
  cmove esi, ecx
  test esi, esi
  jne .L6
  jmp f2()
.L6:
  jmp f1()
[/asm]

[Bug c/83610] __builtin_expect sometimes is ignored

2017-12-28 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83610

--- Comment #1 from Daniel Fruzynski  ---
Code was compiled with "-O3 -march=core2 -mtune=generic"

[Bug middle-end/83610] __builtin_expect sometimes is ignored

2017-12-29 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83610

--- Comment #3 from Daniel Fruzynski  ---
Created attachment 42980
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42980&action=edit
Benchmark

Here is a benchmark for this case. With unlikely(), execution time decreases
from 20.5sec to 20.3sec, about 1%. For my real app the change was a bit more
than 2%.

Thanks for the information about this parameter, I will give it a try. So far I
noticed that gcc uses CMOV when values are stored in registers. When they are
in memory as class fields, it generates code with branches. I am still
playing with this code, so maybe I will need it later.

BTW, what do you think about adding a 3rd param to __builtin_expect, which would
specify the probability? It may be helpful in cases like mine.
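
The attachment is not reproduced here, so the exact definition of unlikely() in the benchmark is unknown. A conventional definition wraps __builtin_expect the same way as the test case in the original report, as in this sketch. The uppercase macro names and the combine function, which returns the value instead of calling f1()/f2(), are assumptions for illustration.

```cpp
// Conventional hint macros; the !! collapses any scalar to 0 or 1 first.
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

// Hypothetical variant of test() from the report that returns the mask.
int combine(int a, int b, int c, int d, int n, int k)
{
    int val = a & b;
    if (UNLIKELY(n == k))       // hinted as rarely taken
        val &= c;
    if (UNLIKELY(n == 10 - k))  // hinted as rarely taken
        val &= d;
    return val;
}
```

The hint does not change the result of the function, only the compiler's guess about which path is hot.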

[Bug target/81759] Improve data tracking for _pext_u64 and __builtin_ffsll

2017-12-29 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81759

--- Comment #2 from Daniel Fruzynski  ---
It looks like handling of __builtin_ffs does not check at all whether the input
value is known to be nonzero. The assembly for the following code also has
unnecessary instructions:

[code]
unsigned int test(unsigned int n)
{
  if (n == 0)
    __builtin_unreachable();
  return __builtin_ffs(n) - 1;
}
[/code]
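
For a nonzero argument, __builtin_ffs(n) - 1 is the same value as __builtin_ctz(n), the count of trailing zero bits. Code that already knows n is nonzero can therefore ask for that directly, as in the sketch below. The function name is an assumption, and whether this removes the extra instructions on a given gcc version was not checked.

```cpp
#include <cassert>

// Hypothetical helper: zero-based index of the lowest set bit of n.
unsigned int lowest_set_bit_index(unsigned int n)
{
    assert(n != 0);  // the result of __builtin_ctz is undefined for 0
    return static_cast<unsigned int>(__builtin_ctz(n));
}
```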

[Bug c/83634] New: ICE in useless_type_conversion_p, at gimple-expr.c:86

2017-12-30 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83634

Bug ID: 83634
   Summary: ICE in useless_type_conversion_p, at gimple-expr.c:86
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

[code]
void test(unsigned int* ptr)
{
  const int foo = Foo();
  const int bar = Bar();
  unsigned short n;
  for (n = foo; n < 100; n += bar) {}
}
[/code]

I was playing with Compiler Explorer (https://godbolt.org/) and got this ICE
when compiling code above. g++ version reported by Compiler Explorer is g++
8.0.0 20171230.

[x86-64 gcc (trunk) #1] internal compiler error: tree check: expected class
'type', have 'exceptional' (error_mark) in useless_type_conversion_p, at
gimple-expr.c:86

[Bug c/83634] ICE in useless_type_conversion_p, at gimple-expr.c:86

2017-12-30 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83634

--- Comment #1 from Daniel Fruzynski  ---
A bit simpler test case which triggers this ICE:
[code]
void test()
{
  const int foo = Foo();
  short n;
  for (n = foo; n < 100; ++n) {}
}
[/code]

[Bug c/83671] New: Fix for false positive reported by -Wstringop-overflow does not work with inlining

2018-01-03 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83671

Bug ID: 83671
   Summary: Fix for false positive reported by -Wstringop-overflow
does not work with inlining
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

The fix for bug 83373 does not work well with inlining:

[code]
#include <stddef.h>
#include <string.h>

char dest[20];
char src[10];

__attribute((nonnull(1, 2)))
inline char* my_strcpy(char* __restrict__ dst, const char* __restrict__ src, size_t size)
{
    size_t len = strlen(src);
    if (len < size)
        memcpy(dst, src, len + 1);
    else
    {
        memcpy(dst, src, size - 1);
        dst[size - 1] = '\0';
    }
    return dst;
}

inline void func1()
{
    my_strcpy(dest, src, sizeof(dest));
}

void func2()
{
    func1();
}
[/code]

[out]
$ g++ -c -o test.o test.cc -Wall -Wstringop-overflow=2 -O1
In function ‘char* my_strcpy(char*, const char*, size_t)’,
inlined from ‘void func2()’ at test.cc:23:14:
test.cc:15:15: warning: ‘void* memcpy(void*, const void*, size_t)’ forming
offset [11, 19] is out of the bounds [0, 10] of object ‘src’ with type ‘char
[10]’ [-Warray-bounds]
 memcpy(dst, src, size - 1);
 ~~^~~~
test.cc: In function ‘void func2()’:
test.cc:5:6: note: ‘src’ declared here
 char src[10];
  ^~~
In function ‘char* my_strcpy(char*, const char*, size_t)’,
inlined from ‘void func2()’ at test.cc:23:14:
test.cc:15:15: warning: ‘void* memcpy(void*, const void*, size_t)’ reading 19
bytes from a region of size 10 [-Wstringop-overflow=]
 memcpy(dst, src, size - 1);
 ~~^~~~

$ gcc --version
gcc (GCC) 8.0.0 20171231 (experimental)
[/out]

[Bug target/82915] Please mark intrinsics as constexpr

2018-01-03 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82915

--- Comment #2 from Daniel Fruzynski  ---
SIMD ISAs for other CPU types (e.g. ARM/AARCH64 NEON) can also benefit from
this.

[Bug target/82915] Please mark intrinsics as constexpr

2018-01-03 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82915

--- Comment #4 from Daniel Fruzynski  ---
For tracking purposes it would probably be better to have separate issues for
every CPU type which could benefit from this. So this one could be for x86, and
you could open other requests for other CPUs which support SIMD instructions.

[Bug c/83688] New: Please check if buffers may overlap when copying strings

2018-01-04 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83688

Bug ID: 83688
   Summary: Please check if buffers may overlap when copying
strings
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

Functions like strcpy internally use memcpy to copy data. This may cause
problems when someone tries to use them to move a string within a buffer, e.g.
to strip a prefix. gcc is able to detect when overlapping buffers are used with
memcpy. Please add similar diagnostics for the strcpy/sprintf functions too.

[code]
#include <stdio.h>
#include <string.h>

char buf[20];

void test()
{
    strcpy(buf, buf+5);
    memcpy(buf, buf+5, strlen(buf+5)+1);

    snprintf(buf, sizeof(buf), "%s", buf+5);

    memcpy(buf, buf+5, 10);
}
[/code]

[out]
$ g++ -c -o test.o test.cc -O3 -Wall -Wextra -Wformat-overflow
-Wformat-truncation -Wstringop-overflow=2 -Wstringop-truncation
test.cc: In function ‘void test()’:
test.cc:13:11: warning: ‘void* memcpy(void*, const void*, size_t)’ accessing 10
bytes at offsets 0 and 5 overlaps 5 bytes at offset 5 [-Wrestrict]
 memcpy(buf, buf+5, 10);
 ~~^~~~

$ g++ --version
g++ (GCC) 8.0.0 20171231 (experimental)
[/out]
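
For completeness, the overlap-safe way to strip a prefix in place is memmove,
which is specified to handle overlapping regions. A minimal sketch; strip_prefix
is a hypothetical helper and not part of the test case above:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: drops the first `count` characters of the
// NUL-terminated string in `buf`, in place.
void strip_prefix(char* buf, std::size_t count)
{
    std::size_t len = std::strlen(buf);
    if (count > len)
        count = len;
    // Source and destination overlap, so memmove (not memcpy/strcpy) is needed.
    std::memmove(buf, buf + count, len - count + 1);   // +1 copies the terminator
}
```

The requested diagnostics would ideally point users of strcpy(buf, buf+5)
towards code like this.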

[Bug c/83688] Please check if buffers may overlap when copying strings

2018-01-04 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83688

--- Comment #1 from Daniel Fruzynski  ---
This would also allow catching code which uses sprintf to concatenate strings,
which is undefined behavior (snippet from
https://linux.die.net/man/3/snprintf):

sprintf(buf, "%s some further text", buf);
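
A well-defined way to append to the same buffer is to write at the current end
of the string and pass the remaining space to snprintf. This is only a sketch,
assuming buf already holds a NUL-terminated string; append_text is an
illustrative name:

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical helper: appends `text` to the string already stored in `buf`
// (total capacity `size`), truncating if it does not fit.
// The destination does not overlap the source as long as `text` itself
// does not point into `buf`.
void append_text(char* buf, std::size_t size, const char* text)
{
    std::size_t len = std::strlen(buf);
    if (len + 1 < size)
        std::snprintf(buf + len, size - len, "%s", text);
}
```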

[Bug c/83688] Please check if buffers may overlap when copying strings using sprintf

2018-01-04 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83688

--- Comment #3 from Daniel Fruzynski  ---
It looks like something is not working properly. I pasted the output from
compiling the function in the 1st post, and -Wrestrict complained only about
the last memcpy call. Please take a look at this.

BTW, string concatenation using sprintf causes a -Wformat-overflow warning, so
some protection against this is present. However, this message does not say
that this is undefined behavior per the C standard.

[Bug c/83688] Please check if buffers may overlap when copying strings using sprintf

2018-01-04 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83688

--- Comment #5 from Daniel Fruzynski  ---
> There is nothing to indicate that the first call to memcpy() in comment #0
> overlaps so -Wrestrict doesn't warn for it.

I thought that the fix for bug 83373 would somehow help here. gcc could guess
that memcpy will copy from 1 to 15 bytes, which may overlap the destination. In
fact this could help in all cases here except the last memcpy.

[Bug c/83688] Please check if buffers may overlap when copying strings using sprintf

2018-01-04 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83688

--- Comment #7 from Daniel Fruzynski  ---
In the general case yes, this can produce a lot of false positives. I wanted to
use this only for strings stored in fixed-size buffers. Existing string-related
warnings already use this information, and this request is to extend the
diagnostics to other related cases where strings in fixed-size buffers are
processed.

[Bug preprocessor/83773] New: Warning for redefined macro does not have its own -Wsomething switch

2018-01-10 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83773

Bug ID: 83773
   Summary: Warning for redefined macro does not have its own
-Wsomething switch
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: preprocessor
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

The warning for a redefined macro does not have its own -Wsomething switch;
please add one. I also tried -fdiagnostics-show-option, but it did not help.

[code]
#define AAA 1
#define AAA 2
[/code]

[out]
test.c:2: warning: "AAA" redefined
 #define AAA 2

test.c:1: note: this is the location of the previous definition
 #define AAA 1
[/out]
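
Without a dedicated switch, the only way I know to silence this for an
intentional redefinition is to drop the old definition first. A minimal sketch
(current_aaa is just an illustrative function):

```cpp
#define AAA 1

// Explicitly removing the old definition avoids the "redefined" warning.
#undef AAA
#define AAA 2

int current_aaa()
{
    return AAA;
}
```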

[Bug c/83859] New: Please add new attribute which will establish relation between parameters for buffer and its size

2018-01-15 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83859

Bug ID: 83859
   Summary: Please add new attribute which will establish relation
between parameters for buffer and its size
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

gcc can detect if the buffer size passed to a function like strncpy is
incorrect, e.g. it is the sizeof of a pointer. It would be good to have similar
diagnostics enabled for custom functions which also accept a buffer and its
size. Please add a new function attribute which would allow this, and
appropriate diagnostics which will use it. I propose adding the following
attribute with two parameters: the indices of the buffer and its size
arguments. Note that a function may accept multiple such pairs, so it should be
possible to use this attribute multiple times.

__attribute__((buffer_size(1, 2)))
void foo(char* dst, size_t dstsize);
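
For reference, as far as I know gcc 10 and newer provide an "access" attribute
which covers at least part of this request. The sketch below is an assumption
about how it could be used for the case above; BUFFER_SIZE and fill_zero are
hypothetical names, and the guard simply disables the attribute on other
compilers:

```cpp
#include <cstddef>
#include <cstring>

// The access attribute is only applied on gcc 10 and newer (assumption:
// other compilers should just see an empty macro).
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 10
#  define BUFFER_SIZE(buf_index, size_index) \
       __attribute__((access(write_only, buf_index, size_index)))
#else
#  define BUFFER_SIZE(buf_index, size_index)
#endif

// Hypothetical function: argument 1 is the buffer, argument 2 its size.
BUFFER_SIZE(1, 2)
void fill_zero(char* dst, std::size_t dstsize);

void fill_zero(char* dst, std::size_t dstsize)
{
    std::memset(dst, 0, dstsize);
}
```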

[Bug c/84085] New: Array element is unnecessary loaded twice

2018-01-28 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84085

Bug ID: 84085
   Summary: Array element is unnecessary loaded twice
   Product: gcc
   Version: 7.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

[code]
#define N 9

struct S1
{
    int a1[N][N];
};
struct S2
{
    int a2[N][N];
    int a3[N][N];
};

void test1(S1* s1, S2* s2)
{
    s2->a2[N-1][N-1] = s1->a1[N-1][N-1];
    s2->a3[N-1][N-1] = 1u << s1->a1[N-1][N-1];
}

void test2(S1* s1, S2* s2)
{
    const int n = N*N-1;
    *((&s2->a2[0][0] + n)) = *(&s1->a1[0][0] + n);
    *((&s2->a3[0][0] + n)) = 1u << *(&s1->a1[0][0] + n);
}

void test3(S1* s1, S2* s2)
{
    const int n = N*N-1;
    int x = *(&s1->a1[0][0] + n);
    *((&s2->a2[0][0] + n)) = x;
    *((&s2->a3[0][0] + n)) = 1u << x;
}
[/code]

[out]
test1(S1*, S2*):
  mov ecx, DWORD PTR [rdi+320]
  mov eax, 1
  sal eax, cl
  mov DWORD PTR [rsi+320], ecx
  mov DWORD PTR [rsi+644], eax
  ret
test2(S1*, S2*):
  mov eax, DWORD PTR [rdi+320]
  mov DWORD PTR [rsi+320], eax
  mov ecx, DWORD PTR [rdi+320]
  mov eax, 1
  sal eax, cl
  mov DWORD PTR [rsi+644], eax
  ret
test3(S1*, S2*):
  mov ecx, DWORD PTR [rdi+320]
  mov eax, 1
  sal eax, cl
  mov DWORD PTR [rsi+320], ecx
  mov DWORD PTR [rsi+644], eax
  ret
[/out]

All 3 functions are equivalent. However, when the 2D array is treated as a 1D
one, gcc for some reason loads the array element twice (function test2). The
local variable added in test3 gives the same code as for test1. I found this
while writing code for AARCH64, but x86_64 is also affected.

gcc 8 (trunk) does not have this problem.

[Bug c/84106] New: gcc is not able to vectorize code for 1D array, but does so for 2D array of the same size

2018-01-29 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84106

Bug ID: 84106
   Summary: gcc is not able to vectorize code for 1D array, but
does so for 2D array of the same size
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

[code]
#define N 9

int a1[N][N];
int a2[N][N];

int b1[N*N];
int b2[N*N];

void test1()
{
    for (int i = 0; i < N; ++i)
    {
        for (int j = 0; j < N; ++j)
        {
            a2[i][j] = a1[i][j];
        }
    }
}

void test2()
{
    for (int i = 0; i < N*N; ++i)
    {
        b2[i] = b1[i];
    }
}
[/code]

This code, compiled using gcc 8.0 (trunk) with "-O3 -mavx2", produces the
following result. For some reason gcc is not able to vectorize the code for the
test2 function. I also tried adding "__attribute__((aligned(32)))" to all
arrays, but it did not help.

Similar code is also generated when compiling with "-O3 -mavx512f -mavx512vl
-mavx512bw -mavx512dq -mavx512cd": gcc still generates code which uses YMM
registers instead of ZMM ones.

[out]
test1():
  vmovdqa ymm0, YMMWORD PTR a1[rip]
  vmovdqa ymm1, YMMWORD PTR a1[rip+32]
  vmovdqa ymm2, YMMWORD PTR a1[rip+64]
  vmovdqa ymm3, YMMWORD PTR a1[rip+96]
  vmovdqa YMMWORD PTR a2[rip], ymm0
  vmovdqa ymm4, YMMWORD PTR a1[rip+128]
  vmovdqa ymm5, YMMWORD PTR a1[rip+160]
  vmovdqa YMMWORD PTR a2[rip+32], ymm1
  vmovdqa ymm6, YMMWORD PTR a1[rip+192]
  vmovdqa ymm7, YMMWORD PTR a1[rip+224]
  vmovdqa ymm0, YMMWORD PTR a1[rip+256]
  vmovdqa ymm1, YMMWORD PTR a1[rip+288]
  vmovdqa YMMWORD PTR a2[rip+64], ymm2
  mov eax, DWORD PTR a1[rip+320]
  vmovdqa YMMWORD PTR a2[rip+96], ymm3
  vmovdqa YMMWORD PTR a2[rip+128], ymm4
  vmovdqa YMMWORD PTR a2[rip+160], ymm5
  vmovdqa YMMWORD PTR a2[rip+192], ymm6
  vmovdqa YMMWORD PTR a2[rip+224], ymm7
  vmovdqa YMMWORD PTR a2[rip+256], ymm0
  vmovdqa YMMWORD PTR a2[rip+288], ymm1
  mov DWORD PTR a2[rip+320], eax
  vzeroupper
  ret
test2():
  mov esi, OFFSET FLAT:b1
  mov edi, OFFSET FLAT:b2
  mov ecx, 40
  rep movsq
  mov eax, DWORD PTR [rsi]
  mov DWORD PTR [rdi], eax
  ret
b2:
  .zero 324
b1:
  .zero 324
a2:
  .zero 324
a1:
  .zero 324
[/out]

[Bug tree-optimization/84106] gcc is not able to vectorize code for 1D array, but does so for 2D array of the same size

2018-01-30 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84106

--- Comment #2 from Daniel Fruzynski  ---
The test included in comment 0 is part of a bigger test which I performed. In
the full version the code was also computing a bitmask and storing it in a 3rd
array. For test1 gcc was able to vectorize the inner loop into a series of
load-shift-store-store operations. In test2 it split the loop into two: the 1st
one performing memcpy using "rep movsq", the 2nd one calculating bitmasks using
vector instructions. Here is the full code and output:

[code]
#include <stdint.h>

#define N 9

int a1[N][N];
int a2[N][N];
int a3[N][N];

int b1[N*N];
int b2[N*N];
int b3[N*N];

void test1()
{
    for (int i = 0; i < N; ++i)
    {
        for (int j = 0; j < N; ++j)
        {
            a2[i][j] = a1[i][j];
            a3[i][j] = 1u << (uint8_t)a1[i][j];
        }
    }
}

void test2()
{
    for (int i = 0; i < N*N; ++i)
    {
        b2[i] = b1[i];
        b3[i] = 1u << b1[i];
    }
}
[/code]

[out]
test1():
  vmovdqa ymm0, YMMWORD PTR .LC0[rip]
  vpsllvd ymm1, ymm0, YMMWORD PTR a1[rip]
  mov eax, 1
  vmovdqa ymm5, YMMWORD PTR a1[rip+96]
  vmovdqa ymm6, YMMWORD PTR a1[rip+128]
  vmovdqa ymm7, YMMWORD PTR a1[rip+160]
  vmovdqa ymm2, YMMWORD PTR a1[rip]
  vmovdqa YMMWORD PTR a3[rip], ymm1
  vpsllvd ymm1, ymm0, YMMWORD PTR a1[rip+32]
  vmovdqa ymm3, YMMWORD PTR a1[rip+32]
  vmovdqa YMMWORD PTR a2[rip], ymm2
  vmovdqa ymm2, YMMWORD PTR a1[rip+192]
  vmovdqa ymm4, YMMWORD PTR a1[rip+64]
  vmovdqa YMMWORD PTR a2[rip+32], ymm3
  vmovdqa ymm3, YMMWORD PTR a1[rip+224]
  vmovdqa YMMWORD PTR a3[rip+32], ymm1
  vpsllvd ymm1, ymm0, YMMWORD PTR a1[rip+64]
  vmovdqa YMMWORD PTR a2[rip+64], ymm4
  vmovdqa ymm4, YMMWORD PTR a1[rip+256]
  vmovdqa YMMWORD PTR a2[rip+96], ymm5
  vmovdqa YMMWORD PTR a3[rip+64], ymm1
  vpsllvd ymm1, ymm0, ymm5
  vmovdqa ymm5, YMMWORD PTR a1[rip+288]
  vmovdqa YMMWORD PTR a2[rip+128], ymm6
  vmovdqa YMMWORD PTR a3[rip+96], ymm1
  vpsllvd ymm1, ymm0, ymm6
  vmovdqa YMMWORD PTR a2[rip+160], ymm7
  vmovdqa YMMWORD PTR a3[rip+128], ymm1
  vpsllvd ymm1, ymm0, ymm7
  vmovdqa YMMWORD PTR a2[rip+192], ymm2
  vmovdqa YMMWORD PTR a3[rip+160], ymm1
  vpsllvd ymm1, ymm0, ymm2
  vmovdqa YMMWORD PTR a2[rip+224], ymm3
  vmovdqa YMMWORD PTR a3[rip+192], ymm1
  vpsllvd ymm1, ymm0, ymm3
  vmovdqa YMMWORD PTR a2[rip+256], ymm4
  vmovdqa YMMWORD PTR a3[rip+224], ymm1
  vpsllvd ymm1, ymm0, ymm4
  vpsllvd ymm0, ymm0, ymm5
  vmovdqa YMMWORD PTR a3[rip+256], ymm1
  vmovdqa YMMWORD PTR a2[rip+288], ymm5
  mov ecx, DWORD PTR a1[rip+320]
  vmovdqa YMMWORD PTR a3[rip+288], ymm0
  sal eax, cl
  mov DWORD PTR a2[rip+320], ecx
  mov DWORD PTR a3[rip+320], eax
  vzeroupper
  ret
test2():
  mov esi, OFFSET FLAT:b1
  mov edi, OFFSET FLAT:b2
  mov ecx, 40
  vmovdqa ymm0, YMMWORD PTR .LC0[rip]
  rep movsq
  vpsllvd ymm1, ymm0, YMMWORD PTR b1[rip]
  mov ecx, DWORD PTR b1[rip+320]
  vmovdqa YMMWORD PTR b3[rip], ymm1
  vpsllvd ymm1, ymm0, YMMWORD PTR b1[rip+32]
  vmovdqa YMMWORD PTR b3[rip+32], ymm1
  vpsllvd ymm1, ymm0, YMMWORD PTR b1[rip+64]
  mov eax, DWORD PTR [rsi]
  mov DWORD PTR [rdi], eax
  mov eax, 1
  vmovdqa YMMWORD PTR b3[rip+64], ymm1
  vpsllvd ymm1, ymm0, YMMWORD PTR b1[rip+96]
  sal eax, cl
  mov DWORD PTR b3[rip+320], eax
  vmovdqa YMMWORD PTR b3[rip+96], ymm1
  vpsllvd ymm1, ymm0, YMMWORD PTR b1[rip+128]
  vmovdqa YMMWORD PTR b3[rip+128], ymm1
  vpsllvd ymm1, ymm0, YMMWORD PTR b1[rip+160]
  vmovdqa YMMWORD PTR b3[rip+160], ymm1
  vpsllvd ymm1, ymm0, YMMWORD PTR b1[rip+192]
  vmovdqa YMMWORD PTR b3[rip+192], ymm1
  vpsllvd ymm1, ymm0, YMMWORD PTR b1[rip+224]
  vmovdqa YMMWORD PTR b3[rip+224], ymm1
  vpsllvd ymm1, ymm0, YMMWORD PTR b1[rip+256]
  vpsllvd ymm0, ymm0, YMMWORD PTR b1[rip+288]
  vmovdqa YMMWORD PTR b3[rip+256], ymm1
  vmovdqa YMMWORD PTR b3[rip+288], ymm0
  vzeroupper
  ret
b3:
  .zero 324
b2:
  .zero 324
b1:
  .zero 324
a3:
  .zero 324
a2:
  .zero 324
a1:
  .zero 324
.LC0:
  .long 1
  .long 1
  .long 1
  .long 1
  .long 1
  .long 1
  .long 1
  .long 1
[/out]

[Bug tree-optimization/84106] gcc is not able to vectorize code for 1D array, but does so for 2D array of the same size

2018-01-30 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84106

--- Comment #4 from Daniel Fruzynski  ---
Here are the results of a small benchmark executed on a Xeon E5-2683 v3. The
code was compiled using gcc 4.8.5. This gcc version also splits loops. The
manually vectorized code is 3.5 times faster:

[out]
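-----------------------------------------------------
Benchmark           Time           CPU Iterations
-----------------------------------------------------
BM_test1           25 ns         25 ns   26989634
BM_test2            7 ns          7 ns   94495591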
[/out]

Benchmark code:

[code]
#include <benchmark/benchmark.h>
#include "immintrin.h"

#define N 81

int a1[N] __attribute__((aligned(32)));
int a2[N] __attribute__((aligned(32)));
int a3[N] __attribute__((aligned(32)));

class Init
{
public:
    Init()
    {
        for (int n = 0; n < N; n++)
        {
            a1[n] = n % 32;
        }
    }
} init;


static void BM_test1(benchmark::State& state)
{
    for (auto _ : state)
    {
        for (int n = 0; n < N; n++)
        {
            a2[n] = a1[n];
            a3[n] = 1 << a1[n];
        }
        benchmark::ClobberMemory();
    }
}
BENCHMARK(BM_test1);

static void BM_test2(benchmark::State& state)
{
    for (auto _ : state)
    {
        int n = 0;
        for (; n < N - 7; n += 8)
        {
            __m256i v = _mm256_load_si256((__m256i*)(&a1[0] + n));
            _mm256_store_si256((__m256i*)(&a2[0] + n), v);

            v = _mm256_sllv_epi32(_mm256_set1_epi32(1), v);
            _mm256_store_si256((__m256i*)(&a3[0] + n), v);
        }
        for (; n < N; n++)
        {
            a2[n] = a1[n];
            a3[n] = 1 << a1[n];
        }
        benchmark::ClobberMemory();
    }
}
BENCHMARK(BM_test2);

BENCHMARK_MAIN();
[/code]

[Bug bootstrap/84199] New: Error building gcc 7.3.0 on Odroid XU4 (ARM, Ubuntu): cannot load liblto_plugin.so

2018-02-04 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84199

Bug ID: 84199
   Summary: Error building gcc 7.3.0 on Odroid XU4 (ARM, Ubuntu):
cannot load liblto_plugin.so
   Product: gcc
   Version: 7.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: bootstrap
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

Created attachment 43337
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43337&action=edit
Full build log

I was trying to build gcc 7.3.0 on Odroid XU4 (ARM, Ubuntu), but the build
failed with the following error:

/gcc/build/./gcc/xgcc -B/gcc/build/./gcc/
-B/gcc-7.3.0/armv7l-unknown-linux-gnueabihf/bin/
-B/gcc-7.3.0/armv7l-unknown-linux-gnueabihf/lib/ -isystem
/gcc-7.3.0/armv7l-unknown-linux-gnueabihf/include -isystem
/gcc-7.3.0/armv7l-unknown-linux-gnueabihf/sys-include-O2  -g -O2 -DIN_GCC  
 -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual -Wstrict-prototypes
-Wmissing-prototypes -Wold-style-definition  -isystem ./include   -fPIC
-fno-inline -g -DIN_LIBGCC2 -fbuilding-libgcc -fno-stack-protector  -shared
-nodefaultlibs -Wl,--soname=libgcc_s.so.1 -Wl,--version-script=libgcc.map -o
./libgcc_s.so.1.tmp -g -O2 -B./ _thumb1_case_sqi_s.o _thumb1_case_uqi_s.o
_thumb1_case_shi_s.o _thumb1_case_uhi_s.o _thumb1_case_si_s.o _udivsi3_s.o
_divsi3_s.o _umodsi3_s.o _modsi3_s.o _bb_init_func_s.o _call_via_rX_s.o
_interwork_call_via_rX_s.o _lshrdi3_s.o _ashrdi3_s.o _ashldi3_s.o
_arm_negdf2_s.o _arm_addsubdf3_s.o 
[cut cut cut]
eqdf2_s.o gedf2_s.o ledf2_s.o muldf3_s.o negdf2_s.o subdf3_s.o unorddf2_s.o
fixdfsi_s.o floatsidf_s.o floatunsidf_s.o extendsfdf2_s.o truncdfsf2_s.o
enable-execute-stack_s.o unwind-arm_s.o libunwind_s.o pr-support_s.o
unwind-c_s.o emutls_s.o libgcc.a -lc && rm -f ./libgcc_s.so && if [ -f
./libgcc_s.so.1 ]; then mv -f ./libgcc_s.so.1 ./libgcc_s.so.1.backup; else
true; fi && mv ./libgcc_s.so.1.tmp ./libgcc_s.so.1 && (echo "/* GNU ld script";
echo "   Use the shared library, but some functions are only in"; echo "   the
static library.  */"; echo "GROUP ( libgcc_s.so.1 -lgcc )" ) > ./libgcc_s.so
/usr/bin/ld: /gcc/build/./gcc/liblto_plugin.so: error loading plugin:
/gcc/build/./gcc/liblto_plugin.so: cannot open shared object file: No such file
or directory
collect2: error: ld returned 1 exit status
Makefile:977: recipe for target 'libgcc_s.so' failed
make[3]: *** [libgcc_s.so] Error 1
make[3]: Leaving directory '/gcc/build/armv7l-unknown-linux-gnueabihf/libgcc'
Makefile:21293: recipe for target 'all-stage2-target-libgcc' failed
make[2]: *** [all-stage2-target-libgcc] Error 2
make[2]: Leaving directory '/gcc/build'
Makefile:26191: recipe for target 'stage2-bubble' failed
make[1]: *** [stage2-bubble] Error 2
make[1]: Leaving directory '/gcc/build'
Makefile:939: recipe for target 'all' failed
make: *** [all] Error 2


odroid@odroid-linux-1:~$ gcc --version
gcc (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.6) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

odroid@odroid-linux-1:~$ uname -a
Linux odroid-linux-1 3.10.105-138 #1 SMP PREEMPT Fri Apr 7 12:40:29 UTC 2017
armv7l armv7l armv7l GNU/Linux

odroid@odroid-linux-1:~$ cat /etc/*release*
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.3 LTS"
NAME="Ubuntu"
VERSION="16.04.3 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.3 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
odroid@odroid-linux-1:~$

[Bug tree-optimization/84106] loop distribution cost-model needs work

2018-02-05 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84106

--- Comment #6 from Daniel Fruzynski  ---
When you revisit your cost model for loops, please also take a look at this
code. test2 has one assignment moved to a separate loop nest, and it is about
twice as fast as the test1 function (for gcc 4.8.5).

[code]
#include <stdint.h>

#define N 9

int a1[N][N];
int a2[N][N];
int a3[N][N];
uint16_t a4[N][N-1];

void test1()
{
    for (int i = 0; i < N; ++i)
    {
        for (int j = 0; j < N; ++j)
        {
            a2[i][j] = a1[i][j];
            a3[i][j] = 1u << a1[i][j];
            if (i > 0)
                a4[j][i-1] = a3[i][j];
        }
    }
}

void test2()
{
    for (int i = 0; i < N; ++i)
    {
        for (int j = 0; j < N; ++j)
        {
            a2[i][j] = a1[i][j];
            a3[i][j] = 1u << a1[i][j];
        }
    }
    for (int i = 1; i < N; ++i)
    {
        for (int j = 0; j < N; ++j)
        {
            a4[j][i-1] = a3[i][j];
        }
    }
}
[/code]

[Bug c++/89317] New: Ineffective code from std::copy

2019-02-12 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89317

Bug ID: 89317
   Summary: Ineffective code from std::copy
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

gcc produces ineffective code when std::copy is used to copy data. For testing
I created my own version of std::copy, and this version is optimized properly.

Compiled using g++ (GCC-Explorer-Build) 9.0.1 20190211 (experimental)
Options: -O3 -std=c++11 -march=skylake

[code]
#include <algorithm>
#include <stdint.h>

#define Size 8

class Test
{
public:
    void test1(void*__restrict ptr);
    void test2(void*__restrict ptr);

private:
    int16_t data1[Size];
    int16_t data2[Size];
};

template <typename T1, typename T2>
void mycopy(T1 begin, T1 end, T2 dest)
{
    while (begin != end)
    {
        *dest = *begin;
        ++dest;
        ++begin;
    }
}

void Test::test1(void*__restrict ptr)
{
    uint16_t* p = (uint16_t*)ptr;

    std::copy(data1, data1 + Size, p);
    p += Size;
    std::copy(data2, data2 + Size, p);
}

void Test::test2(void*__restrict ptr)
{
    int16_t* p = (int16_t*)ptr;

    mycopy(data1, data1 + Size, p);
    p += Size;
    mycopy(data2, data2 + Size, p);
}
[/code]

[asm]
Test::test1(void*):
movzx   eax, WORD PTR [rdi]
mov edx, 16
mov WORD PTR [rsi], ax
movzx   eax, WORD PTR [rdi+2]
add rsi, 16
mov WORD PTR [rsi-14], ax
movzx   eax, WORD PTR [rdi+4]
mov WORD PTR [rsi-12], ax
movzx   eax, WORD PTR [rdi+6]
mov WORD PTR [rsi-10], ax
movzx   eax, WORD PTR [rdi+8]
mov WORD PTR [rsi-8], ax
movzx   eax, WORD PTR [rdi+10]
mov WORD PTR [rsi-6], ax
movzx   eax, WORD PTR [rdi+12]
mov WORD PTR [rsi-4], ax
movzx   eax, WORD PTR [rdi+14]
mov WORD PTR [rsi-2], ax
mov rax, rdx
sar rax
test    rdx, rdx
jle .L69
movzx   edx, WORD PTR [rdi+16]
mov WORD PTR [rsi], dx
cmp rax, 1
je  .L69
movzx   edx, WORD PTR [rdi+18]
mov WORD PTR [rsi+2], dx
cmp rax, 2
je  .L69
movzx   edx, WORD PTR [rdi+20]
mov WORD PTR [rsi+4], dx
cmp rax, 3
je  .L69
movzx   edx, WORD PTR [rdi+22]
mov WORD PTR [rsi+6], dx
cmp rax, 4
je  .L69
movzx   edx, WORD PTR [rdi+24]
mov WORD PTR [rsi+8], dx
cmp rax, 5
je  .L69
movzx   edx, WORD PTR [rdi+26]
mov WORD PTR [rsi+10], dx
cmp rax, 6
je  .L69
movzx   edx, WORD PTR [rdi+28]
mov WORD PTR [rsi+12], dx
cmp rax, 7
je  .L69
movzx   edx, WORD PTR [rdi+30]
mov WORD PTR [rsi+14], dx
cmp rax, 8
je  .L69
movzx   edx, WORD PTR [rdi+32]
mov WORD PTR [rsi+16], dx
cmp rax, 9
je  .L69
movzx   edx, WORD PTR [rdi+34]
mov WORD PTR [rsi+18], dx
cmp rax, 10
je  .L69
movzx   edx, WORD PTR [rdi+36]
mov WORD PTR [rsi+20], dx
cmp rax, 11
je  .L69
movzx   edx, WORD PTR [rdi+38]
mov WORD PTR [rsi+22], dx
cmp rax, 12
je  .L69
movzx   edx, WORD PTR [rdi+40]
mov WORD PTR [rsi+24], dx
cmp rax, 13
je  .L69
movzx   edx, WORD PTR [rdi+42]
mov WORD PTR [rsi+26], dx
cmp rax, 14
je  .L69
movzx   eax, WORD PTR [rdi+44]
mov WORD PTR [rsi+28], ax
.L69:
ret
Test::test2(void*):
vmovdqu xmm0, XMMWORD PTR [rdi]
vmovups XMMWORD PTR [rsi], xmm0
vmovdqu xmm1, XMMWORD PTR [rdi+16]
vmovups XMMWORD PTR [rsi+16], xmm1
ret
[/asm]

[Bug tree-optimization/89317] Ineffective code from std::copy

2019-02-12 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89317

--- Comment #2 from Daniel Fruzynski  ---
Yes, I mean inefficient.

[Bug c/90293] New: New function attribute: expect_return

2019-04-30 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90293

Bug ID: 90293
   Summary: New function attribute: expect_return
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

I have an idea for a new function attribute: expect_return. It would allow
specifying the value usually returned from a function, so it could help with
optimization in a similar way to __builtin_expect().

Example use:

__attribute__((expect_return(false)))
bool DebugModeEnabled();

__attribute__((expect_return(false)))
bool IsErrorCode(int code);
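
Today something similar can be approximated with a small wrapper around
__builtin_expect, at the cost of having to call the wrapper everywhere. A
sketch only: the function body is a stand-in, and DebugModeEnabledHinted and
run_step are hypothetical names:

```cpp
// Stand-in body for the function from the example above.
bool DebugModeEnabled()
{
    return false;
}

// Wrapper that carries the "usually false" hint to every call site
// once it is inlined.
inline bool DebugModeEnabledHinted()
{
    return __builtin_expect(DebugModeEnabled(), false);
}

int run_step(int value)
{
    if (DebugModeEnabledHinted())
        return -value;   // cold path
    return value + 1;    // expected path
}
```

The proposed attribute would make such wrappers unnecessary, because the hint
would be attached to the declaration itself.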

[Bug c/90293] New function attribute: expect_return

2019-04-30 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90293

--- Comment #1 from Daniel Fruzynski  ---
One more case: sometimes it may be handier to specify what will usually *not*
be returned, e.g. a special invalid value. For such cases another attribute
would be needed:

__attribute__((expect_not_return(-1)))
int CreateSocket();
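
At the moment the same intent has to be written at every call site by hinting
the comparison itself. A sketch under the assumption that -1 is the only invalid
value; the CreateSocket body and open_connection are made up for illustration:

```cpp
// Stand-in body: returns a fake descriptor so that the sketch is self-contained.
int CreateSocket()
{
    return 3;
}

bool open_connection()
{
    int fd = CreateSocket();
    if (__builtin_expect(fd == -1, 0))   // -1 is hinted as the unlikely result
        return false;                    // error path, expected to be rare
    return true;
}
```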

[Bug c/90471] New: ICE Segmentation fault when compiling with debug info

2019-05-14 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90471

Bug ID: 90471
   Summary: ICE Segmentation fault when compiling with debug info
   Product: gcc
   Version: 7.4.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

Created attachment 46353
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46353&action=edit
Preprocessed code

I got an ICE (segmentation fault) when trying to build an OpenCL BOINC app
which I am developing. This happens only when I use the -g option; without it
the code compiles fine.

I compiled the code using the MinGW cross-compiler shipped with Cygwin. The
exact versions of all mingw packages are in the attached screenshot. I also
attached the preprocessed source. I use 64-bit Cygwin on 64-bit Win 10 Pro with
the latest patches.

When I was trying to remove unimportant parts of the source code, I found an
interesting thing: I was able to comment out the boinc_opencl.h include and the
crash still happened. However, when I removed this line completely, gcc did not
crash. This part of the code looks as follows:

[code]
#define __CL_ENABLE_EXCEPTIONS
#define CL_TARGET_OPENCL_VERSION 120
#define CL_USE_DEPRECATED_OPENCL_1_1_APIS
#include "CL/cl.hpp"
//#include "boinc_opencl.h"

class OclException : public std::exception
[/code]

I can attach original files and all relevant headers if you need them too.

$ x86_64-w64-mingw32-g++ -O3 -ftree-vectorize -std=c++11 -Wall -pthread
-I/cygdrive/c/rakesearch/_boinc -I/cygdrive/c/rakesearch/_boinc/lib
-I/cygdrive/c/rakesearch/_boinc/include/boinc -I. -D_BSD_SOURCE -g -c
RakeSearchOpenCL2.cpp -o RakeSearchOpenCL.o

RakeSearchOpenCL2.cpp: In member function ‘bool RakeSearchOpenCL::init(int,
char**)’:
RakeSearchOpenCL2.cpp:99:1: internal compiler error: Segmentation fault
 }
 ^
Please submit a full bug report,
with preprocessed source if appropriate.
See <https://gcc.gnu.org/bugs/> for instructions.



$ x86_64-w64-mingw32-g++ --version
x86_64-w64-mingw32-g++ (GCC) 7.4.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ x86_64-w64-mingw32-g++ -v
Using built-in specs.
COLLECT_GCC=x86_64-w64-mingw32-g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-w64-mingw32/7.4.0/lto-wrapper.exe
Target: x86_64-w64-mingw32
Configured with:
/cygdrive/i/szsz/tmpp/cygwin64/mingw64-x86_64/mingw64-x86_64-gcc-7.4.0-1.x86_64/src/gcc-7.4.0/configure
--srcdir=/cygdrive/i/szsz/tmpp/cygwin64/mingw64-x86_64/mingw64-x86_64-gcc-7.4.0-1.x86_64/src/gcc-7.4.0
--prefix=/usr --exec-prefix=/usr --localstatedir=/var --sysconfdir=/etc
--docdir=/usr/share/doc/mingw64-x86_64-gcc
--htmldir=/usr/share/doc/mingw64-x86_64-gcc/html -C --build=x86_64-pc-cygwin
--host=x86_64-pc-cygwin --target=x86_64-w64-mingw32 --without-libiconv-prefix
--without-libintl-prefix --with-sysroot=/usr/x86_64-w64-mingw32/sys-root
--with-build-sysroot=/usr/x86_64-w64-mingw32/sys-root --disable-multilib
--disable-win32-registry --enable-languages=c,c++,fortran,lto,objc,obj-c++
--enable-fully-dynamic-string --enable-graphite --enable-libgomp
--enable-libquadmath --enable-libquadmath-support --enable-libssp
--enable-version-specific-runtime-libs --enable-libgomp --enable-libada
--with-dwarf2 --with-gnu-ld --with-gnu-as --with-tune=generic
--with-cloog-include=/usr/include/cloog-isl --with-system-zlib
--enable-threads=posix --libexecdir=/usr/lib
Thread model: posix
gcc version 7.4.0 (GCC)
