[Bug tree-optimization/96512] New: wrong code generated with avx512 intrinsics in some cases

2020-08-06 Thread nathanael.schaeffer at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96512

Bug ID: 96512
   Summary: wrong code generated with avx512 intrinsics in some
cases
   Product: gcc
   Version: 8.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: nathanael.schaeffer at gmail dot com
  Target Milestone: ---

Created attachment 49013
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49013&action=edit
bug demonstrator

With gcc 8.3.0 (and possibly some other versions as well), I found a nasty bug
that messes up the results of my calculations.
I worked hard to produce a simple reproducer, which is attached.

With gcc 8.3.0, compiling with
   gcc -O1 -g -D_GCC_VEC_=1 -march=skylake-avx512 bug_gcc_avx512.c

running ./a.out leads to a wrong result, displayed like so:
   ERROR :: 0.874347 == 0

Examining the generated assembly, I suspect this instruction to be wrong:
0x00401186 <+100>:   vbroadcastsd zmm0,QWORD PTR [r8*8+0x1]
because r8 is aligned, and the 0x1 offset does not seem right...

When compiling with -march=skylake the problem goes away.
When using "alloca" instead of a variable-length array, the problem goes away.
When inserting a printf, the problem goes away.

See the attached source file. The lines commented with "NO BUG" make the bug
go away.

This has been a nightmare to spot, as it does not happen on all compiler
versions. I hope somebody can reproduce it and fix it...
Note that the assembly generated on godbolt does not seem to have this issue...

[Bug tree-optimization/96512] wrong code generated with avx512 intrinsics in some cases

2020-08-06 Thread nathanael.schaeffer at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96512

--- Comment #1 from N Schaeffer  ---
Created attachment 49014
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49014&action=edit
even simpler bug demonstrator

[Bug tree-optimization/96512] wrong code generated with avx512 intrinsics in some cases

2020-08-07 Thread nathanael.schaeffer at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96512

--- Comment #3 from N Schaeffer  ---
(In reply to Richard Biener from comment #2)
> With trunk and GCC 10 I see
> 
> vbroadcastsdzmm0, QWORD PTR [8+r8*8]
> 
> can you check newer GCC?  GCC 8.4 is out since some time already and I do
> remember some fixes to intrinsics.

I've tested with GCC 9.1 and 10.1 which do not seem affected.
However, it is a very sneaky bug. On the larger original function, the
workaround was to compile with -fno-tree-pre
On the bug demonstrator, the bug shows up already at -O1 and hence
-fno-tree-pre has no effect.
I fear the issue may still be around, waiting for the right conditions to show
up.
So if someone can understand where it comes from in this bug demonstrator with
gcc 8.3.0, it may be possible to ensure it is fixed "forever".

It may also be an issue on particular installations, mixing several compilers,
as godbolt with gcc 8.3 does not produce wrong assembly.
Is there a possibility that a wrong immintrin.h is used? How can I see which
path is used for a #include?

[Bug tree-optimization/96512] wrong code generated with avx512 intrinsics in some cases

2020-09-04 Thread nathanael.schaeffer at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96512

--- Comment #6 from N Schaeffer  ---
Hello,

Working further on this, it seems to be a problem in the assembler step, but
only on some installations.

I have one system where gcc 8.3, 9 and 10 are good (no bug), and another
system where gcc 8.3, 9.1 and 10.1 are NOT good (bug!).

On the buggy system, when doing:
   gcc -O1 -D_GCC_VEC_=1 -march=skylake-avx512 -c bug_gcc_avx512.c
and disassembling with gdb, one can see the offending instruction has been
generated:
   vbroadcastsd 0x1(,%r8,8),%zmm0

but when outputting assembly code like so:
   gcc -O1 -D_GCC_VEC_=1 -march=skylake-avx512 -S bug_gcc_avx512.c
the instruction in the bug_gcc_avx512.s file reads:
   vbroadcastsd 8(,%r8,8), %zmm0
Invoking now the assembler with:
   as bug_gcc_avx512.s
the offending instruction is indeed generated:
   vbroadcastsd 0x1(,%r8,8),%zmm0


So here are the "as --version" outputs on various systems:

GNU assembler version 2.27-41.base.el7_7.3  ==> NO BUG
GNU assembler (GNU Binutils for Debian) 2.28 ==> NO BUG

GNU assembler version 2.30-58.el8_1.2 ==> BUG!

GNU assembler (GNU Binutils) 2.34  ==> NO BUG
GNU assembler (GNU Binutils) 2.35  ==> NO BUG


Maybe I should post this bug report somewhere else?

[Bug c/93334] New: -O3 generates useless code checking for overlapping memset ?

2020-01-20 Thread nathanael.schaeffer at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93334

Bug ID: 93334
   Summary: -O3 generates useless code checking for overlapping
memset ?
   Product: gcc
   Version: 9.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: nathanael.schaeffer at gmail dot com
  Target Milestone: ---

It seems that trying to zero out two arrays in the same loop results in poor
code being generated by -O3.
If I understand it correctly, the generated code tries to identify if the
arrays overlap. If it is the case the code then falls back to simple loops
instead of calls to memset.

I wonder why overlapping memset is an issue?
Is this some inherited behaviour from dealing with memcpy?

When 4 arrays are zeroed together, about 40 instructions are generated to
check for mutual overlap... This does not seem necessary.
Other compilers (clang, icc) don't do that.

See issue here, with assembly generated:
https://godbolt.org/z/SSWVhm

And I copy the code below for reference too:

void test_simple_code(long l, double* mem, long ofs2) {
    for (long k=0; k<l; k++) {
        mem[k] = 0.0;
        mem[k+ofs2] = 0.0;
    }
}

[Bug tree-optimization/93342] New: wrong AVX mask generation with -funsafe-math-optimizations

2020-01-20 Thread nathanael.schaeffer at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93342

Bug ID: 93342
   Summary: wrong AVX mask generation with
-funsafe-math-optimizations
   Product: gcc
   Version: 9.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: nathanael.schaeffer at gmail dot com
  Target Milestone: ---

When trying to produce a xor mask to negate even elements in an AVX vector, gcc
produces wrong code with -funsafe-math-optimizations.

I've tried several ways, all giving the same wrong answer: a mask negating ALL
elements instead of just the even ones.
Since the mask is generated using INTEGER arithmetic, I don't understand the
issue here.

The only correct way with avx is to define a variable with the mask already
set.
With avx2, one can use integer intrinsics, which produce a correct mask.

The code showing the bug can be seen here.
https://godbolt.org/z/q9eamc

For the record, I also copy the code below.
When compiling the following with -O -mavx2 -funsafe-math-optimizations -S, the
mask is wrong. Without -funsafe-math-optimizations it is correct.
Since the mask is generated using integer arithmetic, I don't understand the
issue here, as -funsafe-math-optimizations only affects floating point
(according to man page).
Even stranger, the same mask, but now xor-ed using integer avx2 intrinsics,
gives the correct results...

#include <immintrin.h>
typedef __m128d v2d;
typedef __m256d v4d;

// generates: vxorpd  ymm0, ymm0, YMMWORD PTR wrong_mask
v4d negate_even_fail(v4d v) {
__m256i mask = _mm256_setr_epi32(0,-2147483648, 0,0, 0,-2147483648, 0,0);
return _mm256_xor_pd(v, _mm256_castsi256_pd(mask));
}

// generates: vxorpd  ymm0, ymm0, YMMWORD PTR correct_mask
v4d negate_even_does_not_fail(v4d v) {
__m256i mask = _mm256_setr_epi32(0,-2147483648, 0,0, 0,-2147483648, 0,0);
return _mm256_castsi256_pd(_mm256_xor_si256(_mm256_castpd_si256(v), mask));
}

[Bug tree-optimization/93334] -O3 generates useless code checking for overlapping memset ?

2020-01-21 Thread nathanael.schaeffer at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93334

--- Comment #3 from N Schaeffer  ---
Hi,

Thanks for pointing out the issue about writing different values. This makes
sense.
However, since memset deals with bytes, whenever the array holds floating-point
data (or any type wider than a byte), it will generally not be possible to use
memset to set arbitrary values.
Indeed, the code snippet you propose is not compiled to memset for 1.0.

So I think only zeros and certain NaNs can be optimized to memset anyway (plus
some other very special cases, probably not worth considering).

[Bug tree-optimization/93334] -O3 generates useless code checking for overlapping memset ?

2020-01-21 Thread nathanael.schaeffer at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93334

--- Comment #5 from N Schaeffer  ---
Elaborating a bit on this:

I can eliminate this problem by using:
   -O3 -fno-tree-loop-distribute-patterns -fno-tree-loop-vectorize

I wonder why -fno-tree-loop-distribute-patterns is not enough?
In that case, I get no calls to memset, but still the write-after-write
dependency check.

Also, decorating the loop with
   #pragma omp simd
AND compiling with
   -O3  -march=core-avx2  -fopenmp-simd  -fno-tree-loop-distribute-patterns
finally generates sensible code.

Note that with -fno-tree-loop-distribute-patterns, I still get calls to memset
instead of a simd-vectorized loop...

[Bug target/93395] New: AVX2 missed optimization : _mm256_permute_pd() is unfortunately translated into the more expensive VPERMPD instead of the cheap VPERMILPD

2020-01-22 Thread nathanael.schaeffer at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93395

Bug ID: 93395
   Summary: AVX2 missed optimization : _mm256_permute_pd() is
unfortunately translated into the more expensive
VPERMPD instead of the cheap VPERMILPD
   Product: gcc
   Version: 9.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: nathanael.schaeffer at gmail dot com
  Target Milestone: ---

According to Agner Fog's instruction timing tables
and to my own measurements, VPERMPD has a 3 cycle latency, while VPERMILPD has
a 1 cycle latency on most CPUs.

Yet, the intrinsic _mm256_permute_pd() is always translated to VPERMPD, even
though this intrinsic maps directly to the VPERMILPD instruction.
This makes the code SLOWER.

It should be the opposite: the _mm256_permute4x64_pd() intrinsic, which maps
to the VPERMPD instruction, should, when possible, be translated into VPERMILPD.

Note that clang does the right thing here.

The same problem arises for AVX-512.

See assembly generated here: https://godbolt.org/z/VZe8qk

I replicate the code here for completeness:

#include <immintrin.h>


// translated into   "vpermpd ymm0, ymm0, 177"
// which is OK, but  "vpermilpd ymm0, ymm0, 5"   does the same thing faster.
__m256d perm_missed_optimization(__m256d a) {
return _mm256_permute4x64_pd(a,0xB1);
}

// translated into   "vpermpd ymm0, ymm0, 177"
// which is 3 times slower than the original intent of   "vpermilpd ymm0, ymm0, 5"
__m256d perm_pessimization(__m256d a) {
return _mm256_permute_pd(a,0x5);
}

// adequately translated into  "vshufpd ymm0, ymm0, ymm0, 5"
// which does the same as   "vpermilpd ymm0, ymm0, 5"   at the same speed.
__m256d perm_workaround(__m256d a) {
return _mm256_shuffle_pd(a, a, 5);
}

[Bug c++/60237] New: isnan fails with -ffast-math

2014-02-17 Thread nathanael.schaeffer at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60237

Bug ID: 60237
   Summary: isnan fails with -ffast-math
   Product: gcc
   Version: 4.8.1
Status: UNCONFIRMED
  Severity: major
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: nathanael.schaeffer at gmail dot com

With -ffast-math, isnan should return true if passed a NaN value.
Otherwise, how is isnan different from (x != x)?

isnan worked as expected with gcc 4.7, but does not with 4.8.1 and 4.8.2

How can I check if x is a NaN in a portable way (not presuming any compilation
option)?


[Bug c++/60237] isnan fails with -ffast-math

2014-02-17 Thread nathanael.schaeffer at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60237

--- Comment #2 from N Schaeffer  ---
Thank you for your answer.

My program (which is a computational fluid dynamics solver) is not supposed to
produce NaNs. However, when it does (which means something went wrong), I would
like to abort the program and return an error instead of continuing crunching
NaNs.
I also want it to run as fast as possible (hence the -ffast-math option).

I would argue that: if printf("%f",x) outputs "NaN", isnan(x) should also be
returning true.

Do you have a suggestion concerning my last question:
How can I check if x is NaN in a portable way (not presuming any compilation
option)?


[Bug c++/60237] isnan fails with -ffast-math

2014-02-17 Thread nathanael.schaeffer at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60237

--- Comment #4 from N Schaeffer  ---
int my_isnan(double x){
  volatile double y=x;
  return y!=y;
}

is translated to:
   0x00406cf0 <+0>:  movsd  QWORD PTR [rsp-0x8],xmm0
   0x00406cf6 <+6>:  xor    eax,eax
   0x00406cf8 <+8>:  movsd  xmm1,QWORD PTR [rsp-0x8]
   0x00406cfe <+14>: movsd  xmm0,QWORD PTR [rsp-0x8]
   0x00406d04 <+20>: comisd xmm1,xmm0
   0x00406d08 <+24>: setne  al
   0x00406d0b <+27>: ret

which also fails to detect NaN, which is consistent with the documented
behaviour of comisd:
http://www.jaist.ac.jp/iscenter-new/mpc/altix/altixdata/opt/intel/vtune/doc/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc44.htm


[Bug c++/60237] isnan fails with -ffast-math

2014-02-17 Thread nathanael.schaeffer at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60237

--- Comment #6 from N Schaeffer  ---
-fno-builtin-isnan is also interesting, thanks.

Is there a rationale somewhere for not making isnan() detect NaNs with
-ffinite-math-only?


[Bug tree-optimization/98563] New: regression: vectorization fails while it worked on gcc 9 and earlier

2021-01-06 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98563

Bug ID: 98563
   Summary: regression: vectorization fails while it worked on gcc
9 and earlier
   Product: gcc
   Version: 10.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: nathanael.schaeffer at gmail dot com
  Target Milestone: ---

I have found what seems to be a regression.

The following code is not compiled to 256-bit AVX when built with
-fopenmp-simd, while it is fully vectorized without it!

Here is the resulting vector width with different options, with gcc 10.1:
-O3 -fopenmp-simd  => xmm
-O3  => ymm
-O3 -fopenmp-simd -fno-signed-zeros  => ymm

gcc 9 and earlier always vectorizes to full width (ymm).

#include <complex>
typedef std::complex<double> cplx;

void test(cplx* __restrict__ a, const cplx* b, double c, int N)
{
#pragma omp simd
for (int i=0; i<8*N; i++) {
a[i] = c*(a[i]-b[i]);
}
}

See the result on godbolt: https://godbolt.org/z/9ThqKE

Also, I discovered that no avx512 code is generated for this loop. Is this
intended? Is there an option to enable avx512 vectorization?

[Bug tree-optimization/98563] regression: vectorization fails while it worked on gcc 9 and earlier

2021-01-06 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98563

--- Comment #1 from N Schaeffer  ---
I just found the -mprefer-vector-width=512 to force to use zmm.
The reported regression however remains.

[Bug tree-optimization/98563] [10/11 Regression] vectorization fails while it worked on gcc 9 and earlier

2021-01-07 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98563

--- Comment #3 from N Schaeffer  ---
I'd like to add that when you say "vectorization of the basic block", the code
generated is actually worse than non-vectorized naive code: it handles all
loads and arithmetic operations in scalar mode (v*sd instructions) and packs
two values into xmm before storing...

[Bug tree-optimization/114107] New: poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-25 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

Bug ID: 114107
   Summary: poor vectorization at -O3 when dealing with arrays of
different multiplicity, good with -O2
   Product: gcc
   Version: 13.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: nathanael.schaeffer at gmail dot com
  Target Milestone: ---

A simple loop multiplying two arrays, with different multiplicity fails to
vectorize efficiently with -O3.
Target is AVX x86_64.
The loop is the following, where 4 consecutive values in data are multiplied by
the same factor:

for (int i=0; i<N; i++) {
    data[4*i]   *= factor[i];
    data[4*i+1] *= factor[i];
    data[4*i+2] *= factor[i];
    data[4*i+3] *= factor[i];
}

See assembly generated here: https://godbolt.org/z/fWj34bbhq

[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-25 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #3 from N Schaeffer  ---
I have not benchmarked.
For 4 vmulpd doing the actual work, there are more than 40 permute/mov
instructions, among them 24 vpermd instructions, each with a 3-cycle latency.
That is 6 vpermd per vmulpd.
There is no way this can be faster than vbroadcastsd. I would bet it is 4 to 10
times slower than the vbroadcastsd loop.
If you want, I can benchmark it tomorrow.

If this is a cost model problem, it is a bad one. Even ignoring the decoding of
all these instructions, how can adding 6 vpermd to each vmulpd be faster?
I would rather think (hope?) the optimizer does not consider the vbroadcastsd
solution at all.

[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-25 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #4 from N Schaeffer  ---
... and thank you for your quick reply!

[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-25 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #6 from N Schaeffer  ---
Indeed, the aarch64 assembly looks very good.

[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-25 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #9 from N Schaeffer  ---
In addition, optimizing for size with -Os leads to a non-vectorized double loop
(51 bytes), while the vectorized loop with vbroadcastsd (produced by clang -Os)
comes to 40 bytes.
It is thus also a missed optimization for -Os.

[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-25 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #10 from N Schaeffer  ---
Interestingly (and maybe surprisingly), I can get gcc to produce nearly optimal
code using vbroadcastsd with the following options:

-O2 -march=skylake -ftree-vectorize

[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-26 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #12 from N Schaeffer  ---
I found the "offending" option, and it seems to be indeed a cost-model problem
as Andrew Pinski said:

good code is generated by:

   gcc -O2 -ftree-vectorize -march=skylake   (since gcc 6.1)
   gcc -O1 -ftree-vectorize -march=skylake   (since gcc 8.1)
   gcc -O3 -fvect-cost-model=very-cheap -march=skylake   (with gcc 13.1+)

bad code is generated otherwise, and in particular:

   gcc -O2 -march=skylake  (does not vectorize)
   gcc -O3 -march=skylake  (bad vectorization with so many permutations)