[Bug tree-optimization/96512] New: wrong code generated with avx512 intrinsics in some cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96512

            Bug ID: 96512
           Summary: wrong code generated with avx512 intrinsics in some
                    cases
           Product: gcc
           Version: 8.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nathanael.schaeffer at gmail dot com
  Target Milestone: ---

Created attachment 49013
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49013&action=edit
bug demonstrator

With gcc 8.3.0 (and possibly some other versions too), I found a nasty bug that messes up the results of my calculations. I worked hard to produce a simple reproducer, which is attached.

With gcc 8.3.0, compiling with

    gcc -O1 -g -D_GCC_VEC_=1 -march=skylake-avx512 bug_gcc_avx512.c

and running ./a.out leads to a wrong result, displayed like so:

    ERROR :: 0.874347 == 0

Examining the generated assembly, I suspect this instruction to be wrong:

    0x00401186 <+100>: vbroadcastsd zmm0,QWORD PTR [r8*8+0x1]

because r8 is aligned, and the 0x1 offset does not seem right...

When compiling with -march=skylake, the problem goes away.
When using "alloca" instead of a variable-length array, the problem goes away.
When inserting a printf, the problem goes away.

See the attached source file: the lines commented with "NO BUG" make the bug go away.

This has been a nightmare to spot, as it does not happen with all compiler versions. I hope somebody can reproduce it and fix it... Note that the assembly generated on godbolt does not seem to have this issue...
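The attached reproducer is what matters; purely as an illustration of the construct involved, here is a hypothetical sketch (NOT the attached code; all names are invented) combining a variable-length array with a broadcast from an indexed element, the pattern the report says triggers the bad addressing:

    #include <immintrin.h>

    /* Hypothetical sketch: broadcast one element of a VLA into a zmm
       register.  Per the report, replacing the VLA with alloca makes
       the bug disappear. */
    void sketch(double *out, const double *in, long n, long k)
    {
        double tmp[n];                       /* variable-length array */
        for (long i = 0; i < n; i++)
            tmp[i] = in[i];
        __m512d b = _mm512_set1_pd(tmp[k]);  /* emits vbroadcastsd */
        _mm512_storeu_pd(out, b);
    }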
[Bug tree-optimization/96512] wrong code generated with avx512 intrinsics in some cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96512

--- Comment #1 from N Schaeffer ---
Created attachment 49014
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49014&action=edit
even simpler bug demonstrator
[Bug tree-optimization/96512] wrong code generated with avx512 intrinsics in some cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96512

--- Comment #3 from N Schaeffer ---
(In reply to Richard Biener from comment #2)
> With trunk and GCC 10 I see
>
>         vbroadcastsd    zmm0, QWORD PTR [8+r8*8]
>
> can you check newer GCC?  GCC 8.4 is out since some time already and I do
> remember some fixes to intrinsics.

I've tested with GCC 9.1 and 10.1, which do not seem affected. However, it is a very sneaky bug. On the larger original function, the workaround was to compile with -fno-tree-pre. On the bug demonstrator, the bug shows up already at -O1, where -fno-tree-pre has no effect.

I fear the issue may still be around, waiting for the right conditions to show up. So if someone can understand where it comes from in this bug demonstrator with gcc 8.3.0, it may be possible to ensure it is fixed "forever".

It may also be an issue on particular installations mixing several compilers, as godbolt with gcc 8.3 does not produce the wrong assembly. Is there a possibility that a wrong immintrin.h is used? How can I see what path is used for a #include?
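For reference, two stock GCC invocations can answer the include-path question (standard flags, nothing specific to this bug):

    # Print every header actually opened, with its full path:
    gcc -H -O1 -D_GCC_VEC_=1 -march=skylake-avx512 -c bug_gcc_avx512.c 2>&1 | grep immintrin.h

    # Print the #include search path itself:
    gcc -E -Wp,-v -xc /dev/null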
[Bug tree-optimization/96512] wrong code generated with avx512 intrinsics in some cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96512

--- Comment #6 from N Schaeffer ---
Hello,
Working further on this, it seems to be a problem in the assembler step, but only on some installations. I have one system where gcc 8.3, 9 and 10 are good (no bug), and another system where gcc 8.3, 9.1 and 10.1 are NOT good (bug!).

On the buggy system, when compiling with

    gcc -O1 -D_GCC_VEC_=1 -march=skylake-avx512 -c bug_gcc_avx512.c

and disassembling with gdb, one can see the offending instruction has been generated:

    vbroadcastsd 0x1(,%r8,8),%zmm0

but when outputting assembly code like so:

    gcc -O1 -D_GCC_VEC_=1 -march=skylake-avx512 -S bug_gcc_avx512.c

the instruction in the bug_gcc_avx512.s file reads:

    vbroadcastsd    8(,%r8,8), %zmm0

Now invoking the assembler with "as bug_gcc_avx512.s", the offending instruction is indeed generated:

    vbroadcastsd 0x1(,%r8,8),%zmm0

So here is "as --version" on various systems:

    GNU assembler version 2.27-41.base.el7_7.3    ==> NO BUG
    GNU assembler (GNU Binutils for Debian) 2.28  ==> NO BUG
    GNU assembler version 2.30-58.el8_1.2         ==> BUG!
    Assembleur GNU (GNU Binutils) 2.34            ==> NO BUG
    Assembleur GNU (GNU Binutils) 2.35            ==> NO BUG

Maybe I should post this bug report somewhere else?
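A minimal way to test the assembler step in isolation, using only stock binutils tools (file names are hypothetical):

    # Assemble just the suspect instruction and disassemble the result;
    # a buggy "as" shows a 0x1 displacement instead of 8.
    printf 'vbroadcastsd 8(,%%r8,8), %%zmm0\n' > check.s
    as check.s -o check.o
    objdump -d -Mintel check.o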
[Bug c/93334] New: -O3 generates useless code checking for overlapping memset ?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93334

            Bug ID: 93334
           Summary: -O3 generates useless code checking for overlapping
                    memset ?
           Product: gcc
           Version: 9.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nathanael.schaeffer at gmail dot com
  Target Milestone: ---

It seems that trying to zero out two arrays in the same loop results in poor code being generated by -O3. If I understand it correctly, the generated code tries to identify whether the arrays overlap; if they do, the code falls back to simple loops instead of calls to memset. I wonder why overlapping memset is an issue? Is this some inherited behaviour from dealing with memcpy?

In case 4 arrays are zeroed together, about 40 instructions are generated to check for mutual overlap... This does not seem to be necessary. Other compilers (clang, icc) don't do that.

See the issue here, with the generated assembly: https://godbolt.org/z/SSWVhm

And I copy the code below for reference too:

void test_simple_code(long l, double* mem, long ofs2) {
    for (long k=0; k<l; k++) {
        mem[k] = 0.0;
        mem[k+ofs2] = 0.0;
    }
}
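For reference, the lowering argued for here would be two unconditional memset calls: since both stores write the same value (zero), overlap cannot change the final contents. A hand-written sketch of that expected code (not compiler output):

    #include <string.h>

    /* Zeroing is idempotent: even if the two regions overlap, the
       union of both ranges ends up all-zero either way, so no overlap
       check is needed. */
    void test_simple_code_expected(long l, double* mem, long ofs2)
    {
        memset(mem,        0, l * sizeof(double));
        memset(mem + ofs2, 0, l * sizeof(double));
    }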
[Bug tree-optimization/93342] New: wrong AVX mask generation with -funsafe-math-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93342

            Bug ID: 93342
           Summary: wrong AVX mask generation with
                    -funsafe-math-optimizations
           Product: gcc
           Version: 9.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nathanael.schaeffer at gmail dot com
  Target Milestone: ---

When trying to produce a xor mask to negate the even elements of an AVX vector, gcc produces wrong code with -funsafe-math-optimizations. I've tried several ways, all giving the same wrong answer: a mask negating ALL elements instead of just the even ones. Since the mask is generated using INTEGER arithmetic, I don't understand the issue here. The only correct way with avx is to define a variable with the mask already set. With avx2, one can use integer intrinsics, which produce a correct mask.

The code showing the bug can be seen here: https://godbolt.org/z/q9eamc

For the record, I also copy the code below. When compiling the following with -O -mavx2 -funsafe-math-optimizations -S, the mask is wrong. Without -funsafe-math-optimizations it is correct. Since the mask is generated using integer arithmetic, I don't understand the issue, as -funsafe-math-optimizations only affects floating point (according to the man page). Even stranger, the same mask, xor-ed using integer avx2 intrinsics, gives the correct results...

#include <immintrin.h>

typedef __m128d v2d;
typedef __m256d v4d;

// generates: vxorpd ymm0, ymm0, YMMWORD PTR wrong_mask
v4d negate_even_fail(v4d v) {
    __m256i mask = _mm256_setr_epi32(0,-2147483648, 0,0, 0,-2147483648, 0,0);
    return _mm256_xor_pd(v, _mm256_castsi256_pd(mask));
}

// generates: vxorpd ymm0, ymm0, YMMWORD PTR correct_mask
v4d negate_even_does_not_fail(v4d v) {
    __m256i mask = _mm256_setr_epi32(0,-2147483648, 0,0, 0,-2147483648, 0,0);
    return _mm256_castsi256_pd(_mm256_xor_si256(_mm256_castpd_si256(v), mask));
}
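The "variable with the mask already set" workaround mentioned above can also be written directly in doubles; a sketch carrying the same sign-bit pattern as the integer mask (untested against this particular miscompilation):

    // -0.0 carries only the sign bit, so xor-ing with it flips the sign;
    // elements 0 and 2 get the flip, matching the epi32 mask above.
    v4d negate_even_workaround(v4d v) {
        v4d mask = _mm256_setr_pd(-0.0, 0.0, -0.0, 0.0);
        return _mm256_xor_pd(v, mask);
    }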
[Bug tree-optimization/93334] -O3 generates useless code checking for overlapping memset ?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93334

--- Comment #3 from N Schaeffer ---
Hi,
Thanks for pointing out the issue with writing different values; this makes sense. However, since memset deals with bytes, whenever the array type is floating-point data (or anything wider than a byte), it will not be possible to use memset to set most values. Indeed, the code snippet you propose is not compiled to memset for 1.0. So I think only zeros and NaNs can be optimized to memset anyway (plus some other very special cases that are probably not worth considering).
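To make the byte argument concrete, a small self-contained check (illustrative only; byte order assumes little-endian): memset can only replicate a single byte, and the bytes of 1.0 are not all identical.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        double x = 1.0;
        unsigned char b[sizeof x];
        memcpy(b, &x, sizeof x);
        for (size_t i = 0; i < sizeof x; i++)
            printf("%02x ", b[i]);   /* prints: 00 00 00 00 00 00 f0 3f */
        printf("\n");                /* bytes differ -> no memset possible */
        return 0;
    }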
[Bug tree-optimization/93334] -O3 generates useless code checking for overlapping memset ?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93334

--- Comment #5 from N Schaeffer ---
Elaborating a bit on this: I can eliminate this problem by using

    -O3 -fno-tree-loop-distribute-patterns -fno-tree-loop-vectorize

I wonder why -fno-tree-loop-distribute-patterns alone is not enough? In that case, I get no calls to memset, but still the write-after-write dependency check.

Also, decorating the loop with #pragma omp simd (see the sketch below) AND compiling with

    -O3 -march=core-avx2 -fopenmp-simd -fno-tree-loop-distribute-patterns

finally generates sensible code. Note that with -fno-tree-loop-distribute-patterns, I still get calls to memset instead of a simd-vectorized loop...
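For clarity, the decorated loop looks like this (the body is the two-array zeroing from the original report):

    void test_simple_code(long l, double* mem, long ofs2)
    {
        #pragma omp simd
        for (long k=0; k<l; k++) {
            mem[k]      = 0.0;
            mem[k+ofs2] = 0.0;
        }
    }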
[Bug target/93395] New: AVX2 missed optimization : _mm256_permute_pd() is unfortunately translated into the more expensive VPERMPD instead of the cheap VPERMILPD
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93395

            Bug ID: 93395
           Summary: AVX2 missed optimization : _mm256_permute_pd() is
                    unfortunately translated into the more expensive
                    VPERMPD instead of the cheap VPERMILPD
           Product: gcc
           Version: 9.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nathanael.schaeffer at gmail dot com
  Target Milestone: ---

According to Agner Fog's instruction timing tables and to my own measurements, VPERMPD has a 3-cycle latency, while VPERMILPD has a 1-cycle latency on most CPUs. Yet the intrinsic _mm256_permute_pd() is always translated into VPERMPD, even though this intrinsic maps directly to the VPERMILPD instruction. This makes the code SLOWER. It should be the opposite: the _mm256_permute4x64_pd() intrinsic, which maps to the VPERMPD instruction, should, when possible, be translated into VPERMILPD. Note that clang does the right thing here. The same problem arises for AVX-512.

See the assembly generated here: https://godbolt.org/z/VZe8qk

I replicate the code here for completeness:

#include <immintrin.h>

// translated into "vpermpd ymm0, ymm0, 177"
// which is OK, but "vpermilpd ymm0, ymm0, 5" does the same thing faster.
__m256d perm_missed_optimization(__m256d a) {
    return _mm256_permute4x64_pd(a,0xB1);
}

// translated into "vpermpd ymm0, ymm0, 177"
// which is 3 times slower than the original intent of "vpermilpd ymm0, ymm0, 5"
__m256d perm_pessimization(__m256d a) {
    return _mm256_permute_pd(a,0x5);
}

// adequately translated into "vshufpd ymm0, ymm0, ymm0, 5"
// which does the same as "vpermilpd ymm0, ymm0, 5" at the same speed.
__m256d perm_workaround(__m256d a) {
    return _mm256_shuffle_pd(a, a, 5);
}
[Bug c++/60237] New: isnan fails with -ffast-math
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60237

            Bug ID: 60237
           Summary: isnan fails with -ffast-math
           Product: gcc
           Version: 4.8.1
            Status: UNCONFIRMED
          Severity: major
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nathanael.schaeffer at gmail dot com

With -ffast-math, isnan should return true if passed a NaN value. Otherwise, how is isnan different from (x != x)?

isnan worked as expected with gcc 4.7, but does not with 4.8.1 and 4.8.2.

How can I check whether x is a NaN in a portable way (not presuming any compilation option)?
[Bug c++/60237] isnan fails with -ffast-math
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60237

--- Comment #2 from N Schaeffer ---
Thank you for your answer.
My program (a computational fluid dynamics solver) is not supposed to produce NaNs. However, when it does (which means something went wrong), I would like to abort the program and return an error instead of continuing to crunch NaNs. I also want it to run as fast as possible (hence the -ffast-math option).

I would argue that if printf("%f",x) outputs "NaN", then isnan(x) should also return true.

Do you have a suggestion concerning my last question: how can I check whether x is NaN in a portable way (not presuming any compilation option)?
[Bug c++/60237] isnan fails with -ffast-math
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60237

--- Comment #4 from N Schaeffer ---

int my_isnan(double x) {
    volatile double y = x;
    return y != y;
}

is translated to:

    0x00406cf0 <+0>:  movsd  QWORD PTR [rsp-0x8],xmm0
    0x00406cf6 <+6>:  xor    eax,eax
    0x00406cf8 <+8>:  movsd  xmm1,QWORD PTR [rsp-0x8]
    0x00406cfe <+14>: movsd  xmm0,QWORD PTR [rsp-0x8]
    0x00406d04 <+20>: comisd xmm1,xmm0
    0x00406d08 <+24>: setne  al
    0x00406d0b <+27>: ret

which also fails to detect NaN; this is consistent with the documented behaviour of comisd:
http://www.jaist.ac.jp/iscenter-new/mpc/altix/altixdata/opt/intel/vtune/doc/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc44.htm
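One portable option that no floating-point optimization can fold away is to test the IEEE-754 bit pattern directly; a sketch (my_isnan_bits is a made-up name):

    #include <stdint.h>
    #include <string.h>

    /* NaN <=> exponent field all ones AND mantissa non-zero.
       memcpy avoids strict-aliasing trouble, and the integer compares
       are untouched by -ffast-math. */
    int my_isnan_bits(double x)
    {
        uint64_t u;
        memcpy(&u, &x, sizeof u);
        return (u & UINT64_C(0x7ff0000000000000)) == UINT64_C(0x7ff0000000000000)
            && (u & UINT64_C(0x000fffffffffffff)) != 0;
    }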
[Bug c++/60237] isnan fails with -ffast-math
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60237

--- Comment #6 from N Schaeffer ---
-fno-builtin-isnan is also interesting, thanks.
Is there a rationale somewhere for not making isnan() detect NaNs with -ffinite-math-only?
[Bug tree-optimization/98563] New: regression: vectorization fails while it worked on gcc 9 and earlier
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98563

            Bug ID: 98563
           Summary: regression: vectorization fails while it worked on gcc
                    9 and earlier
           Product: gcc
           Version: 10.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nathanael.schaeffer at gmail dot com
  Target Milestone: ---

I have found what seems to be a regression. The following code is not compiled to 256-bit AVX when compiled with -fopenmp-simd, while it is fully vectorized without it! Here is the resulting code with different options, with gcc 10.1:

    -O3 -fopenmp-simd                    => xmm
    -O3                                  => ymm
    -O3 -fopenmp-simd -fno-signed-zeros  => ymm

gcc 9 and earlier always vectorize to full width (ymm).

#include <complex>
typedef std::complex<double> cplx;

void test(cplx* __restrict__ a, const cplx* b, double c, int N) {
    #pragma omp simd
    for (int i=0; i<8*N; i++) {
        a[i] = c*(a[i]-b[i]);
    }
}

See the result on godbolt: https://godbolt.org/z/9ThqKE

Also, I discovered that no avx512 code is generated for this loop. Is this intended? Is there an option to enable avx512 vectorization?
[Bug tree-optimization/98563] regression: vectorization fails while it worked on gcc 9 and earlier
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98563

--- Comment #1 from N Schaeffer ---
I just found the -mprefer-vector-width=512 option, which forces the use of zmm. The reported regression remains, however.
[Bug tree-optimization/98563] [10/11 Regression] vectorization fails while it worked on gcc 9 and earlier
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98563

--- Comment #3 from N Schaeffer ---
I'd like to add that when you say "vectorization of the basic block", the generated code is actually worse than naive non-vectorized code: it performs all loads and arithmetic operations in scalar mode (v*sd instructions) and merely packs two values into xmm before storing...
[Bug tree-optimization/114107] New: poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

            Bug ID: 114107
           Summary: poor vectorization at -O3 when dealing with arrays of
                    different multiplicity, good with -O2
           Product: gcc
           Version: 13.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nathanael.schaeffer at gmail dot com
  Target Milestone: ---

A simple loop multiplying two arrays with different multiplicities fails to vectorize efficiently with -O3. Target is AVX x86_64. The loop is the following, where 4 consecutive values in data are multiplied by the same factor:

    for (int i=0; i<n; i++) {
        for (int j=0; j<4; j++)
            data[4*i+j] *= factor[i];
    }

See the generated assembly here: https://godbolt.org/z/fWj34bbhq
[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #3 from N Schaeffer ---
I have not benchmarked. For 4 vmulpd doing the actual work, there are more than 40 permute/mov instructions, 24 of which are vpermd instructions with a 3-cycle latency. That is 6 vpermd per vmulpd. There is no way this can be faster than vbroadcastsd. I would bet it is 4 to 10 times slower than the vbroadcastsd loop. If you want, I can benchmark it tomorrow.
If this is a cost-model problem, it is a bad one. Even ignoring the decoding of all these instructions, how can adding 6 vpermd to each vmulpd be faster? I would rather think (hope?) that the optimizer does not consider the vbroadcastsd solution at all.
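For comparison, the codegen pattern argued for here (one vbroadcastsd feeding one 4-wide multiply) corresponds to this hand-written intrinsics sketch (names hypothetical):

    #include <immintrin.h>

    void scale4(double *data, const double *factor, long n)
    {
        for (long i = 0; i < n; i++) {
            __m256d f = _mm256_set1_pd(factor[i]);    /* vbroadcastsd */
            __m256d d = _mm256_loadu_pd(data + 4*i);  /* vmovupd      */
            _mm256_storeu_pd(data + 4*i, _mm256_mul_pd(d, f));
        }
    }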
[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #4 from N Schaeffer ---
... and thank you for your quick reply!
[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #6 from N Schaeffer ---
Indeed, the aarch64 assembly looks very good.
[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #9 from N Schaeffer ---
In addition, optimizing for size with -Os leads to a non-vectorized double loop (51 bytes), while the vectorized loop with vbroadcastsd (produced by clang -Os) takes 40 bytes. It is thus also a missed optimization for -Os.
[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #10 from N Schaeffer ---
Interestingly (and maybe surprisingly), I can get gcc to produce nearly optimal code using vbroadcastsd with the following options:

    -O2 -march=skylake -ftree-vectorize
[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #12 from N Schaeffer ---
I found the "offending" option, and it does indeed seem to be a cost-model problem, as Andrew Pinski said.

Good code is generated by:

    gcc -O2 -ftree-vectorize -march=skylake              (since gcc 6.1)
    gcc -O1 -ftree-vectorize -march=skylake              (since gcc 8.1)
    gcc -O3 -fvect-cost-model=very-cheap -march=skylake  (with gcc 13.1+)

Bad code is generated otherwise, in particular by:

    gcc -O2 -march=skylake   (does not vectorize)
    gcc -O3 -march=skylake   (bad vectorization with many permutations)