[Bug bootstrap/58828] New: Problem compiling gcc 4.8.2 using gcc 4.4.6
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58828 Bug ID: 58828 Summary: Problem compiling gcc 4.8.2 using gcc 4.4.6 Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: bootstrap Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com I am trying to build gcc 4.8.2, and I got following compilation error: make[3]: Entering directory `[path]/gcc/obj/gcc' g++ -g -fkeep-inline-functions -DIN_GCC -fno-exceptions -fno-rtti -fasynchronous-unwind-tables -W -Wall -Wwrite-strings -Wcast-qual -Wmissing-format-attribute -pedantic -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings -fno-common -DHAVE_CONFIG_H -DGENERATOR_FILE -o build/genconstants \ build/genconstants.o build/read-md.o build/errors.o ../build-x86_64-unknown-linux-gnu/libiberty/libiberty.a build/genconstants ../../gcc-4.8.2/gcc/config/i386/i386.md \ > tmp-constants.h /bin/sh ../../gcc-4.8.2/gcc/../move-if-change tmp-constants.h insn-constants.h echo timestamp > s-constants g++ -g -fkeep-inline-functions -DIN_GCC -fno-exceptions -fno-rtti -fasynchronous-unwind-tables -W -Wall -Wwrite-strings -Wcast-qual -Wmissing-format-attribute -pedantic -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings -fno-common -DHAVE_CONFIG_H -DGENERATOR_FILE -o build/gengtype \ build/gengtype.o build/errors.o build/gengtype-lex.o build/gengtype-parse.o build/gengtype-state.o build/version.o ../build-x86_64-unknown-linux-gnu/libiberty/libiberty.a build/gengtype.o: In function `double_int::operator*=(double_int)': [path]/gcc/obj/gcc/../../gcc-4.8.2/gcc/double-int.h:263: undefined reference to `double_int::operator*(double_int) const' build/gengtype.o: In function `double_int::operator+=(double_int)': [path]/gcc/obj/gcc/../../gcc-4.8.2/gcc/double-int.h:270: undefined reference to `double_int::operator+(double_int) const' build/gengtype.o: In function `double_int::operator-=(double_int)': [path]/gcc/obj/gcc/../../gcc-4.8.2/gcc/double-int.h:277: undefined 
reference to `double_int::operator-(double_int) const' collect2: ld returned 1 exit status make[3]: *** [build/gengtype] Error 1 make[3]: Leaving directory `[path]/gcc/obj/gcc' gcc is configured this way: ../gcc-4.8.2/configure --prefix=[myprefix] --enable-languages=c,c++ --disable-nls I compile with sources for all needed tools and libs unpacked into gcc dir. Here are versions: binutils-2.23.2.tar.bz2 cloog-0.18.0.tar.gz gcc-4.8.2.tar.bz2 gmp-5.1.3.tar.bz2 isl-0.11.1.tar.bz2 mpc-1.0.1.tar.gz mpfr-3.1.2.tar.bz2 gcc --version gcc (GCC) 4.4.6 20110731 (Red Hat 4.4.6-3) Copyright (C) 2010 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[Bug bootstrap/58840] New: Problem compiling gcc 4.7.3 using gcc 4.4.6
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58840 Bug ID: 58840 Summary: Problem compiling gcc 4.7.3 using gcc 4.4.6 Product: gcc Version: 4.7.3 Status: UNCONFIRMED Severity: normal Priority: P3 Component: bootstrap Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com make[3]: Entering directory `[path]/gcc/obj/gcc' build/gengtype \ -S ../../gcc-4.7.3/gcc -I gtyp-input.list -w tmp-gtype.state ../../gcc-4.7.3/gcc/../include/splay-tree.h:55: unidentified type `uintptr_t' ../../gcc-4.7.3/gcc/../include/splay-tree.h:56: unidentified type `uintptr_t' make[3]: *** [s-gtype] Error 1 make[3]: Leaving directory `[path]/gcc/obj/gcc' make[2]: *** [all-stage1-gcc] Error 2 GCC is configured in this way: ../gcc-4.7.3/configure --prefix=[myprefix] --enable-languages=c,c++ --disable-nls Installed compiler: gcc --version gcc (GCC) 4.4.6 20110731 (Red Hat 4.4.6-3) Copyright (C) 2010 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[Bug bootstrap/58828] Problem compiling gcc 4.8.2 using gcc 4.4.6
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58828 --- Comment #2 from Daniel Fruzynski --- Thanks for reply. As I checked, this also happens when compiling using gcc 4.7.3, so looks that this is more general problem. File [path]/gcc/obj/gcc/config.status contains following entry: configured by [path]/gcc/gcc-4.8.2/gcc/configure, generated by GNU Autoconf 2.64, with options \"'--cache-file=./config.cache' '--with-gnu-as' '--with-gnu-ld' '--prefix=[path]/gcc-4.8.2-linux/' '--disable-nls' '--enable-threads=posix' '--enable-checking=release' '--enable-__cxa_atexit' '--with-tune=generic' '--with-arch_32=i686' '--enable-languages=c,c,c++,lto' '--program-transform-name=s,y,y,' '--disable-option-checking' '--build=x86_64-redhat-linux' '--host=x86_64-redhat-linux' '--target=x86_64-redhat-linux' '--srcdir=../../gcc-4.8.2/gcc' '--disable-intermodule' '--enable-checking=release,types' '--disable-coverage' '--enable-languages=c,c++,lto' 'build_alias=x86_64-redhat-linux' 'host_alias=x86_64-redhat-linux' 'target_alias=x86_64-redhat-linux' 'CC=x86_64-redhat-linux-gcc' 'CFLAGS=-g -fkeep-inline-functions' 'LDFLAGS= ' 'CXX=x86_64-redhat-linux-g++' 'CXXFLAGS=-g -fkeep-inline-functions' 'GMPLIBS=-L[path]/gcc/obj/./gmp/.libs -L[path]/gcc/obj/./mpfr/src/.libs -L[path]/gcc/obj/./mpc/src/.libs -lmpc -lmpfr -lgmp' 'GMPINC=-I[path]/gcc/obj/./gmp -I[path]/gcc/gcc-4.8.2/gmp -I[path]/gcc/obj/./mpfr/src -I[path]/gcc/gcc-4.8.2/mpfr/src -I[path]/gcc/gcc-4.8.2/mpc/src ' 'CLOOGLIBS=' 'CLOOGINC='\" So -fkeep-inline-functions was passed from outside. I checked [path]/gcc/obj/config.status and found this: S["stage1_cflags"]="-g -fkeep-inline-functions" Looks that there is some issue with top-level configure script.
[Bug bootstrap/58828] Problem compiling gcc 4.8.2 using gcc 4.4.6
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58828 --- Comment #3 from Daniel Fruzynski --- OK, I found it. I used the symlink-tree script (distributed with binutils) to create symlinks to binutils in the gcc source dir. The script removed some gcc source files and replaced them with symlinks to the corresponding files in the binutils dir. I assumed this would help me, but it created more problems. I am now building gcc without binutils symlinked, and the build has reached stage 2. It looks like it will complete successfully. I think a dedicated script to symlink all of binutils into the gcc dir would be useful. Could you create one?
[Bug bootstrap/58840] Problem compiling gcc 4.7.3 using gcc 4.4.6
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58840 --- Comment #2 from Daniel Fruzynski --- OK, I found this. I used the symlink-tree script to create symlinks to binutils in the gcc source dir. The script replaced some files with symlinks to their counterparts in the binutils dir, which caused this problem. gcc without these symlinks compiles fine. So this is not an issue.
[Bug c/58988] New: -Werror=missing-include-dirs does not work
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58988 Bug ID: 58988 Summary: -Werror=missing-include-dirs does not work Product: gcc Version: 4.7.3 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com

I tried to pass the -Werror=missing-include-dirs option to gcc in order to find all non-existent include dirs, and found that this option is broken. In gcc 4.5.2 the option is ignored: gcc does not print any message when a non-existent include dir is specified. gcc 4.7.3 prints only a warning. Both tested versions turn this warning into an error when -Wmissing-include-dirs and -Werror are used together, but that is not an option for me because of other warnings in my code.

I tested this using the following command:

g++ -c test.cc -o test.o -I/a -Werror=missing-include-dirs
[Bug c/58988] -Werror=missing-include-dirs does not work
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58988 --- Comment #1 from Daniel Fruzynski --- gcc 4.8.2 is also affected by this bug; it works in the same way as gcc 4.7.3.
[Bug target/88271] Omit test instruction after add
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88271 --- Comment #9 from Daniel Fruzynski --- I have an idea for an alternative approach to this. gcc could look for relations between the loop control statement and other statements which modify variables used in that control statement. With such knowledge it could try to reorganize the code to optimize it better. This approach would eliminate the randomness here.
[Bug target/88271] Omit test instruction after add
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88271 --- Comment #10 from Daniel Fruzynski --- Here is a possible transformation of the code to an equivalent form, where this optimization can be applied easily. This change also has a slightly surprising side effect: the second nested while loop is unrolled.

[code]
void test2()
{
    int level = 0;
    int val = 1;
    while (1)
    {
        while(1)
        {
            val = data[level] << 1;
            ++level;
            if (val)
                continue;
            else
                break;
        }

        while(1)
        {
            --level;
            val = data[level];
            if (!val)
                continue;
            else
                break;
        }
    }
}
[/code]
[Bug c/88461] New: AVX512: gcc should keep value in kN registers if possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88461 Bug ID: 88461 Summary: AVX512: gcc should keep value in kN registers if possible Product: gcc Version: 8.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: ---

I tried to write a piece of code which uses the new AVX512 logic instructions which work on kN registers. It turned out that gcc was moving intermediate values back and forth between kN and eax, which resulted in very poor code. The example was compiled using gcc 8.2 with -O3 -march=skylake-avx512

[code]
#include <immintrin.h>
#include <stdint.h>

int test(uint16_t* data, int a)
{
    __m128i v = _mm_load_si128((const __m128i*)data);
    __mmask8 m = _mm_testn_epi16_mask(v, v);
    m = _kshiftli_mask16(m, 1);
    m = _kandn_mask16(m, a);
    return m;
}
[/code]

[asm]
test(unsigned short*, int):
        vmovdqa64 xmm0, XMMWORD PTR [rdi]
        kmovw k5, esi
        vptestnmw k1, xmm0, xmm0
        kmovb eax, k1
        kmovw k2, eax
        kshiftlw k0, k2, 1
        kmovw eax, k0
        movzx eax, al
        kmovw k4, eax
        kandnw k3, k4, k5
        kmovw eax, k3
        movzx eax, al
        ret
[/asm]
[Bug target/88461] AVX512: gcc should keep value in kN registers if possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88461 --- Comment #1 from Daniel Fruzynski --- For comparison, this is code generated by icc 19.0.1:

[asm]
test(unsigned short*, int):
        vmovdqu xmm0, XMMWORD PTR [rdi]    #6.48
        vptestnmw k0, xmm0, xmm0           #7.18
        kmovw k2, esi                      #11.9
        kshiftlw k1, k0, 1                 #9.9
        kandnw k3, k1, k2                  #11.9
        kmovb k4, k3                       #13.12
        kmovw eax, k4                      #13.12
        ret                                #13.12
[/asm]
[Bug c/81665] Please introduce flags attribute for enums which will mimic one from C#
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81665 --- Comment #4 from Daniel Fruzynski --- @Jonathan Wakely: constexpr requires C++11. When I reported this bug, we were still at C++98 with most of our codebase.
[Bug target/88461] AVX512: gcc should keep value in kN registers if possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88461 --- Comment #3 from Daniel Fruzynski --- Good catch, the mask should be 16-bit. Here is the fixed version:

[code]
#include <immintrin.h>
#include <stdint.h>

int test(uint16_t* data, int a)
{
    __m128i v = _mm_load_si128((const __m128i*)data);
    __mmask16 m = _mm_testn_epi16_mask(v, v);
    m = _kshiftli_mask16(m, 1);
    m = _kandn_mask16(m, a);
    return m;
}
[/code]

[asm]
test(unsigned short*, int):
        vmovdqa64 xmm0, XMMWORD PTR [rdi]
        kmovw k4, esi
        vptestnmw k1, xmm0, xmm0
        kmovb eax, k1
        kmovw k2, eax
        kshiftlw k0, k2, 1
        kandnw k3, k0, k4
        kmovw eax, k3
        ret
[/asm]

This can still be optimized: there is no need to move the value from k1 to eax and then to k2, because vptestnmw zeroes the upper bits of the k register.
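The effect of the mask width can be checked without AVX512 hardware. What follows is only a scalar model, not the intrinsics themselves: kshiftl16, kandn16, with_mask8, with_mask16 and check_mask_width_model are hypothetical helper names, and the model assumes that the 16-bit mask shift discards bits above bit 15 and that the and-not computes ~a & b. It shows that keeping intermediate results in an 8-bit mask variable changes the returned value, which is consistent with the extra movzx eax, al steps in the 8-bit version.

```cpp
#include <cassert>
#include <cstdint>

// Scalar stand-in for a 16-bit mask shift left (hypothetical helper).
constexpr std::uint16_t kshiftl16(std::uint16_t m, unsigned count) {
    return count > 15 ? std::uint16_t{0}
                      : static_cast<std::uint16_t>(m << count);
}

// Scalar stand-in for a 16-bit mask and-not: (~a) & b (hypothetical helper).
constexpr std::uint16_t kandn16(std::uint16_t a, std::uint16_t b) {
    return static_cast<std::uint16_t>(~a & b);
}

// 8-bit mask variable: every assignment truncates the result to 8 bits.
constexpr int with_mask8(std::uint8_t testn_bits, int a) {
    std::uint8_t m = testn_bits;
    m = static_cast<std::uint8_t>(kshiftl16(m, 1));
    m = static_cast<std::uint8_t>(kandn16(m, static_cast<std::uint16_t>(a)));
    return m;
}

// 16-bit mask variable: no truncation between the operations.
constexpr int with_mask16(std::uint8_t testn_bits, int a) {
    std::uint16_t m = testn_bits;
    m = kshiftl16(m, 1);
    m = kandn16(m, static_cast<std::uint16_t>(a));
    return m;
}

// The two variants disagree as soon as bit 7 of the test result is set.
void check_mask_width_model() {
    assert(with_mask8(0x80, 0xFFFF) == 0x00FF);
    assert(with_mask16(0x80, 0xFFFF) == 0xFEFF);
}
```

The model only covers the integer arithmetic on the mask; it says nothing about which registers a compiler picks.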
[Bug c/88465] New: AVX512: optimize loading of constant values to kN registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88465 Bug ID: 88465 Summary: AVX512: optimize loading of constant values to kN registers Product: gcc Version: 8.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: ---

When a constant value is loaded into a kN register, gcc puts it into eax first and then moves it to the kN register:

[code]
#include <immintrin.h>
#include <stdint.h>

__mmask8 test(__mmask8 m)
{
    __mmask8 m2 = _kand_mask8(m, 3);
    return m2;
}
[/code]

[asm]
test(unsigned char):
        mov eax, 3
        kmovb k1, eax
        kmovb k2, edi
        kandb k0, k1, k2
        kmovb eax, k0
        ret
[/asm]

icc uses one instruction for this. https://godbolt.org/ displayed it as "null", but most probably this is the wrong name:

[asm]
test(unsigned char):
        vkmovb k0, edi      #6.19
        null k1, 3          #6.19
        kandb k2, k0, k1    #6.19
        vkmovb eax, k2      #6.19
        ret                 #7.12
[/asm]

You can also use the kxor and kxnor instructions to load 0 and -1.
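The kxor/kxnor suggestion rests on two bitwise identities: x XOR x is 0 and NOT (x XOR x) is all ones, whatever the register held before. A minimal scalar sketch of that reasoning follows; kxor8, kxnor8 and constant_idioms_hold are hypothetical names used only for illustration, and the code models 8-bit masks with plain integers instead of using the real instructions.

```cpp
#include <cassert>
#include <cstdint>

// Scalar stand-in for an 8-bit mask XOR (hypothetical helper).
constexpr std::uint8_t kxor8(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>(a ^ b);
}

// Scalar stand-in for an 8-bit mask XNOR: NOT (a XOR b) (hypothetical helper).
constexpr std::uint8_t kxnor8(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>(~(a ^ b));
}

// Returns true when combining a mask with itself yields 0 (xor) and
// all ones (xnor) for every possible 8-bit starting value.
bool constant_idioms_hold() {
    for (unsigned k = 0; k <= 0xFF; ++k) {
        const auto m = static_cast<std::uint8_t>(k);
        if (kxor8(m, m) != 0x00 || kxnor8(m, m) != 0xFF) {
            return false;
        }
    }
    return true;
}
```

As an 8-bit mask, -1 is 0xFF, which is the value the xnor form produces.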
[Bug target/88465] AVX512: optimize loading of constant values to kN registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88465 --- Comment #2 from Daniel Fruzynski --- I have logged an issue with Compiler Explorer to clarify this null instruction: https://github.com/mattgodbolt/compiler-explorer/issues/1220
[Bug target/88465] AVX512: optimize loading of constant values to kN registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88465 --- Comment #3 from Daniel Fruzynski --- This "null" is an icc bug. Matt Godbolt from Compiler Explorer filed a bug with Intel: ref 03997020
[Bug target/88473] New: AVX512: constant folding on mask does not remove unnecessary instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88473 Bug ID: 88473 Summary: AVX512: constant folding on mask does not remove unnecessary instructions Product: gcc Version: 8.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: ---

[code]
#include <immintrin.h>

void test(void* data, void* data2)
{
    __m128i v = _mm_load_si128((__m128i const*)data);
    __mmask8 m = _mm_testn_epi16_mask(v, v);
    m = _kor_mask8(m, 0x0f);
    m = _kor_mask8(m, 0xf0);
    v = _mm_maskz_add_epi16(m, v, v);
    _mm_store_si128((__m128i*)data2, v);
}
[/code]

The code was compiled using gcc 8.2 with -O3 -march=skylake-avx512. gcc was able to fold the constant expressions and simplify the masked vector add to a non-masked one. However, the original version of the folded expression is still present in the output:

[asm]
test(void*, void*):
        vmovdqa64 xmm0, XMMWORD PTR [rdi]
        mov eax, 15
        vptestnmw k1, xmm0, xmm0
        kmovb k2, eax
        vpaddw xmm0, xmm0, xmm0
        mov eax, -16
        kmovb k3, eax
        vmovaps XMMWORD PTR [rsi], xmm0
        korb k0, k1, k2
        korb k0, k0, k3
        ret
[/asm]

clang properly cleaned it up:

[asm]
test(void*, void*): # @test(void*, void*)
        vmovdqa xmm0, xmmword ptr [rdi]
        vpaddw xmm0, xmm0, xmm0
        vmovdqa xmmword ptr [rsi], xmm0
        ret
[/asm]
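The folding that both compilers perform can be reproduced with ordinary integers: OR-ing any 8-bit mask with 0x0f and then 0xf0 always gives 0xff, so the zero-masked add keeps every lane. The sketch below is a scalar model under that reading, not the intrinsics; Lanes, maskz_add16, add16 and folded_mask_is_all_ones are hypothetical names, and maskz_add16 assumes that a cleared mask bit zeroes the lane.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Lanes = std::array<std::uint16_t, 8>;

// Model of a zero-masked 16-bit add: lane i is a[i] + b[i] when mask bit i
// is set, and 0 otherwise (hypothetical helper).
Lanes maskz_add16(std::uint8_t mask, const Lanes& a, const Lanes& b) {
    Lanes out{};
    for (unsigned i = 0; i < out.size(); ++i) {
        if (mask & (1u << i)) {
            out[i] = static_cast<std::uint16_t>(a[i] + b[i]);
        }
    }
    return out;
}

// Plain lane-wise 16-bit add with wraparound (hypothetical helper).
Lanes add16(const Lanes& a, const Lanes& b) {
    Lanes out{};
    for (unsigned i = 0; i < out.size(); ++i) {
        out[i] = static_cast<std::uint16_t>(a[i] + b[i]);
    }
    return out;
}

// For every starting mask, (m | 0x0f) | 0xf0 is 0xff, so the masked add
// and the unmasked add give the same lanes.
bool folded_mask_is_all_ones() {
    const Lanes v = {{1, 2, 3, 4, 5, 6, 7, 0xFFFF}};
    for (unsigned k = 0; k <= 0xFF; ++k) {
        const auto folded = static_cast<std::uint8_t>((k | 0x0f) | 0xf0);
        if (folded != 0xFF) return false;
        if (maskz_add16(folded, v, v) != add16(v, v)) return false;
    }
    return true;
}
```

Since the final mask does not depend on the test result, the instructions that compute it have no remaining use, which is why they count as leftovers in the gcc output.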
[Bug middle-end/88476] New: Optimize expressions which uses vector, mask and general purpose registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88476 Bug ID: 88476 Summary: Optimize expressions which uses vector, mask and general purpose registers Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: --- I was playing with Compiler Explorer to see how compilers can optimize various pieces of code. I found next version of clang (version 8.0.0 (trunk 348905)) can optimize expressions which uses vector, mask and general purpose registers. Such approach opens new optimization possibilities. Here are two example functions which demonstrates this: [code] #include void test1(void* data1, void* data2) { __m128i v1 = _mm_load_si128((__m128i const*)data1); __m128i v2 = _mm_load_si128((__m128i const*)data2); __mmask8 m1 = _mm_testn_epi16_mask(v1, v1); __mmask8 m2 = _mm_testn_epi16_mask(v2, v2); __mmask8 m = (m1 | 3) & (m2 | 3); v1 = _mm_maskz_add_epi16(m, v1, v2); _mm_store_si128((__m128i*)data2, v1); } void test2(void* data1, void* data2) { __m128i v1 = _mm_load_si128((__m128i const*)data1); __m128i v2 = _mm_load_si128((__m128i const*)data2); __mmask8 m1 = _mm_testn_epi16_mask(v1, v1); __mmask8 m2 = _mm_testn_epi16_mask(v2, v2); m1 = _kor_mask8(m1, 3); m2 = _kor_mask8(m2, 3); __mmask8 m = _kand_mask8(m1, m2); v1 = _mm_maskz_add_epi16(m, v1, v2); _mm_store_si128((__m128i*)data2, v1); } [/code] When compiled using clang with -O3 -march=skylake-avx512, both are optimized to the same code: [asm] test(void*, void*): # @test(void*, void*) vmovdqa xmm0, xmmword ptr [rdi] vmovdqa xmm1, xmmword ptr [rsi] vpor xmm2, xmm1, xmm0 vptestnmw k0, xmm2, xmm2 mov al, 3 kmovd k1, eax korb k1, k0, k1 vpaddw xmm0 {k1} {z}, xmm1, xmm0 vmovdqa xmmword ptr [rsi], xmm0 ret [/asm] gcc 9.0.0 20181211 (experimental) produces this: [asm] test1(void*, void*): vmovdqa64 xmm1, XMMWORD PTR [rsi] vmovdqa64 xmm0, XMMWORD PTR [rdi] vptestnmw k1, xmm1, xmm1 
vptestnmw k2{k1}, xmm0, xmm0 kmovb eax, k2 or eax, 3 kmovb k3, eax vpaddw xmm0{k3}{z}, xmm0, xmm1 vmovaps XMMWORD PTR [rsi], xmm0 ret test2(void*, void*): vmovdqa64 xmm0, XMMWORD PTR [rdi] vmovdqa64 xmm1, XMMWORD PTR [rsi] vptestnmw k1, xmm0, xmm0 vptestnmw k3, xmm1, xmm1 mov eax, 3 kmovb k2, eax korb k1, k1, k2 korb k0, k3, k2 kandb k1, k1, k0 vpaddw xmm0{k1}{z}, xmm0, xmm1 vmovaps XMMWORD PTR [rsi], xmm0 ret [/asm]
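The clang output appears to rely on two identities: (m1 | 3) & (m2 | 3) equals (m1 & m2) | 3, and "lane of v1 is zero AND lane of v2 is zero" equals "lane of (v1 | v2) is zero", which would explain the single vpor followed by one vptestnmw. Here is a scalar sketch of both identities; Lanes, testn_self, lane_or, mask_identity_holds and testn_identity_holds are hypothetical names, and testn_self assumes the test sets mask bit i exactly when lane i is zero.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Lanes = std::array<std::uint16_t, 8>;

// Model of testn(v, v): mask bit i is set when lane i is zero
// (hypothetical helper).
std::uint8_t testn_self(const Lanes& v) {
    std::uint8_t mask = 0;
    for (unsigned i = 0; i < v.size(); ++i) {
        if (v[i] == 0) {
            mask = static_cast<std::uint8_t>(mask | (1u << i));
        }
    }
    return mask;
}

// Lane-wise OR of two vectors (hypothetical helper).
Lanes lane_or(const Lanes& a, const Lanes& b) {
    Lanes out{};
    for (unsigned i = 0; i < out.size(); ++i) {
        out[i] = static_cast<std::uint16_t>(a[i] | b[i]);
    }
    return out;
}

// (m1 | 3) & (m2 | 3) == (m1 & m2) | 3 for every pair of 8-bit masks.
bool mask_identity_holds() {
    for (unsigned m1 = 0; m1 <= 0xFF; ++m1) {
        for (unsigned m2 = 0; m2 <= 0xFF; ++m2) {
            if (((m1 | 3u) & (m2 | 3u)) != ((m1 & m2) | 3u)) return false;
        }
    }
    return true;
}

// testn(v1) & testn(v2) == testn(v1 | v2) for the given vectors.
bool testn_identity_holds(const Lanes& v1, const Lanes& v2) {
    return (testn_self(v1) & testn_self(v2)) == testn_self(lane_or(v1, v2));
}
```

The first identity is checked exhaustively; the second is checked only for the vectors passed in, so it is a spot check and not a proof.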
[Bug target/88473] AVX512: constant folding on mask does not remove unnecessary instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88473 --- Comment #2 from Daniel Fruzynski --- I was playing with Compiler Explorer to see how compilers optimize various pieces of code. I found that the next clang version (currently trunk) will be able to analyze expressions which span vectors, masks and GPRs. I logged Bug 88476 to do something similar in gcc, please take a look. I think an approach like the one in clang would be more beneficial. In the past I also thought about a template-based library which would wrap vector operations. One of its unique concepts was to create separate types to hold a vector with bool values, and another one for int masks. With lazy instantiation this should lead to faster resulting code. I have not tried to write it yet, but overall this approach looks promising to me. With it, cases like the one in this bug can appear as a side effect of inlining.
[Bug middle-end/88487] New: union prevents autovectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88487 Bug ID: 88487 Summary: union prevents autovectorization Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: --- When pointer to data is inside union, loops are not autovectorized. This also happen when I removed "i" field from union, so it had only one field. Code compiled with -O3 -mavx [code] struct S1 { union { double* __restrict__ * __restrict__ d; int* __restrict__ * __restrict__ i; } u; }; struct S2 { double* __restrict__ * __restrict__ d; }; void test1(S1* __restrict__ s1, S1* __restrict__ s2) { for (int n = 0; n < 2; ++n) { s1->u.d[n][0] = s2->u.d[n][0]; s1->u.d[n][1] = s2->u.d[n][1]; } } void test2(S2* __restrict__ s1, S2* __restrict__ s2) { for (int n = 0; n < 2; ++n) { s1->d[n][0] = s2->d[n][0]; s1->d[n][1] = s2->d[n][1]; } } [/code] [asm] test1(S1*, S1*): mov rdx, QWORD PTR [rsi] mov rax, QWORD PTR [rdi] mov rsi, QWORD PTR [rdx] mov rcx, QWORD PTR [rax] mov rdx, QWORD PTR [rdx+8] mov rax, QWORD PTR [rax+8] vmovsd xmm0, QWORD PTR [rsi] vmovsd QWORD PTR [rcx], xmm0 vmovsd xmm0, QWORD PTR [rsi+8] vmovsd QWORD PTR [rcx+8], xmm0 vmovsd xmm0, QWORD PTR [rdx] vmovsd QWORD PTR [rax], xmm0 vmovsd xmm0, QWORD PTR [rdx+8] vmovsd QWORD PTR [rax+8], xmm0 ret test2(S2*, S2*): mov rdx, QWORD PTR [rsi] mov rax, QWORD PTR [rdi] mov rcx, QWORD PTR [rdx] mov rdx, QWORD PTR [rdx+8] vmovupd xmm0, XMMWORD PTR [rcx] mov rcx, QWORD PTR [rax] mov rax, QWORD PTR [rax+8] vmovups XMMWORD PTR [rcx], xmm0 vmovupd xmm0, XMMWORD PTR [rdx] vmovups XMMWORD PTR [rax], xmm0 ret [/asm]
[Bug middle-end/88487] union prevents autovectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88487 --- Comment #1 from Daniel Fruzynski --- Update: when the pointers to data are copied to local variables as below, autovectorization starts working again.

[code]
void test3(S2* __restrict__ s1, S2* __restrict__ s2)
{
    double* __restrict__ * __restrict__ d1 = s1->d;
    double* __restrict__ * __restrict__ d2 = s2->d;
    for (int n = 0; n < 2; ++n)
    {
        d1[n][0] = d2[n][0];
        d1[n][1] = d2[n][1];
    }
}
[/code]
[Bug middle-end/88487] union prevents autovectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88487 --- Comment #2 from Daniel Fruzynski --- I spotted that test3 in the previous comment uses structure S2, which does not have a union inside. When I changed it to use S1, I got non-vectorized code. So this workaround does not work.
[Bug middle-end/88490] New: Missed autovectorization when indices are different
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88490 Bug ID: 88490 Summary: Missed autovectorization when indices are different Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: --- Code below reads and writes data using different indices what is checked by "if" above loop. This can be autovectorized, as both memory areas do not overlap. Code compiled with -O3 -march=skylake-avx512 [code] struct S { double* __restrict__ * __restrict__ d; }; void test(S* __restrict__ s, int n, int k) { if (n > k) { for (int n = 0; n < 2; ++n) { s->d[n][0] = s->d[k][0]; s->d[n][1] = s->d[k][1]; } } } [/code] [asm] test(S*, int, int): cmp esi, edx jle .L3 mov rcx, QWORD PTR [rdi] movsx rdx, edx mov rax, QWORD PTR [rcx+rdx*8] mov rdx, QWORD PTR [rcx] vmovsd xmm0, QWORD PTR [rax] vmovsd QWORD PTR [rdx], xmm0 vmovsd xmm0, QWORD PTR [rax+8] vmovsd QWORD PTR [rdx+8], xmm0 vmovsd xmm0, QWORD PTR [rax] mov rdx, QWORD PTR [rcx+8] vmovsd QWORD PTR [rdx], xmm0 vmovsd xmm0, QWORD PTR [rax+8] vmovsd QWORD PTR [rdx+8], xmm0 .L3: ret [/asm]
[Bug middle-end/88490] Missed autovectorization when indices are different
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88490 --- Comment #1 from Daniel Fruzynski --- Ehh, small typo. This is correct version, also not vectorized: [code] struct S { double* __restrict__ * __restrict__ d; }; void test(S* __restrict__ s, int n, int k) { if (n > k) { for (int i = 0; i < 2; ++i) { s->d[n][0] = s->d[k][0]; s->d[n][1] = s->d[k][1]; } } } [/code] [asm] test(S*, int, int): cmp esi, edx jle .L3 mov rax, QWORD PTR [rdi] movsx rdx, edx mov rdx, QWORD PTR [rax+rdx*8] movsx rsi, esi vmovsd xmm0, QWORD PTR [rdx] mov rax, QWORD PTR [rax+rsi*8] vmovsd QWORD PTR [rax], xmm0 vmovsd xmm0, QWORD PTR [rdx+8] vmovsd QWORD PTR [rax+8], xmm0 vmovsd xmm0, QWORD PTR [rdx] vmovsd QWORD PTR [rax], xmm0 vmovsd xmm0, QWORD PTR [rdx+8] vmovsd QWORD PTR [rax+8], xmm0 .L3: ret [/asm]
[Bug middle-end/88490] Missed autovectorization when indices are different
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88490 --- Comment #3 from Daniel Fruzynski --- In this case s->d is a pointer to pointer to double, and both pointer levels have the restrict qualifier. I wonder if you could add some tag saying that s->d[n] and s->d[k] point to separate memory areas. This tag could later be used to determine that s->d[n][0] and s->d[k][0] also do not overlap.
[Bug middle-end/88487] union prevents autovectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88487 --- Comment #4 from Daniel Fruzynski --- OK, I see. Is there any workaround for this? I tried assigning the pointer to a local variable, both directly and with an intermediate cast via void*, but it did not help. Casting S1* to S2* does not work either.
[Bug c/88540] New: Issues with vectorization of min/max operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88540 Bug ID: 88540 Summary: Issues with vectorization of min/max operations Product: gcc Version: 8.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: ---

1st issue:

[code]
#define SIZE 2

void test(double* __restrict d1, double* __restrict d2, double* __restrict d3)
{
    for (int n = 0; n < SIZE; ++n)
    {
        d3[n] = d1[n] < d2[n] ? d1[n] : d2[n];
    }
}
[/code]

When this is compiled for SSE2, gcc produces non-vectorized code:

[asm]
test(double*, double*, double*):
        vmovsd xmm0, QWORD PTR [rdi]
        vminsd xmm0, xmm0, QWORD PTR [rsi]
        vmovsd QWORD PTR [rdx], xmm0
        vmovsd xmm0, QWORD PTR [rdi+8]
        vminsd xmm0, xmm0, QWORD PTR [rsi+8]
        vmovsd QWORD PTR [rdx+8], xmm0
        ret
[/asm]

When SIZE is changed to 3 or greater, the code gets vectorized properly. I thought that this might be a workaround for some old CPU on which the vectorized code was slower, but this also happens when compiling with "-O3 -march=skylake". I also checked with SIZE 6, and got 1 AVX op and 2 scalar SSE ones. It looks like this is an off-by-one bug. The same happens for code with other relational operators (>, <=, >=).

2nd issue: when compiling for AVX512, gcc does not use the new instructions which use ZMM registers; it still generates code for YMM ones.
[Bug c/88542] New: Optimize symmetric range check
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88542 Bug ID: 88542 Summary: Optimize symmetric range check Product: gcc Version: 8.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: ---

[code]
#include <math.h>

bool test1(double d, double max)
{
    return (d < max) && (d > -max);
}

bool test2(double d, double max)
{
    return fabs(d) < max;
}
[/code]

When code checks whether some number d is inside (or outside of) a symmetric range like (-max, max), the code from test1() can be replaced with the one from test2(). This of course assumes that the expression does not produce any side effects. This can be done nicely for floating point numbers stored in IEEE format, which leads to faster code:

[asm]
test1(double, double):
        vcomisd xmm1, xmm0
        jbe .L6
        vxorpd xmm1, xmm1, XMMWORD PTR .LC0[rip]
        vcomisd xmm0, xmm1
        seta al
        ret
.L6:
        xor eax, eax
        ret
test2(double, double):
        vandpd xmm0, xmm0, XMMWORD PTR .LC1[rip]
        vcomisd xmm1, xmm0
        seta al
        ret
[/asm]

For integer types stored in two's complement format a similar change gives slower code. However, on platforms which use a different integer format with a dedicated sign bit this optimization may be beneficial.
[Bug middle-end/88542] Optimize symmetric range check
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88542 --- Comment #2 from Daniel Fruzynski --- No, the code with -ffast-math is the same. BTW, fabs(NaN) is NaN, so the result is the same as before (false).
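The claimed equivalence, including the NaN case, can be spot-checked with a small program. This is a sketch assuming IEEE doubles and default floating point settings (no -ffast-math); in_range_two_compares and in_range_fabs mirror test1() and test2() from the report, while forms_agree_on_samples and the sample values are illustrative choices, not part of the original report.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Same expression as test1() in the report.
bool in_range_two_compares(double d, double max) {
    return (d < max) && (d > -max);
}

// Same expression as test2() in the report.
bool in_range_fabs(double d, double max) {
    return std::fabs(d) < max;
}

// Compares both forms on every pair drawn from a small sample set that
// includes signed zeros, infinities, NaN and negative limits.
bool forms_agree_on_samples() {
    const double inf = std::numeric_limits<double>::infinity();
    const double not_a_number = std::numeric_limits<double>::quiet_NaN();
    const double samples[] = {0.0, -0.0, 0.5, -0.5, 1.0, -1.0,
                              2.0, -2.0, inf, -inf, not_a_number};
    for (double d : samples) {
        for (double max : samples) {
            if (in_range_two_compares(d, max) != in_range_fabs(d, max)) {
                return false;
            }
        }
    }
    return true;
}
```

Agreement on these samples is evidence, not a proof, but it covers the usual trouble spots: a NaN on either side makes both forms false, and a zero or negative max makes the range empty in both.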
[Bug tree-optimization/88540] Issues with vectorization of min/max operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88540 --- Comment #3 from Daniel Fruzynski --- Looks that AARCH64 is also affected. This is output from gcc 8.2 for SIZE=2: [asm] test(double*, double*, double*): ldp d1, d0, [x0] ldp d3, d2, [x1] fcmpe d1, d3 fcsel d1, d1, d3, mi fcmpe d0, d2 fcsel d0, d0, d2, mi stp d1, d0, [x2] ret [/asm] And this is for SIZE=4: [asm] test(double*, double*, double*): ldr q5, [x0] ldr q3, [x1] ldr q4, [x0, 16] ldr q2, [x1, 16] fcmgt v1.2d, v3.2d, v5.2d fcmgt v0.2d, v2.2d, v4.2d bsl v1.16b, v5.16b, v3.16b bsl v0.16b, v4.16b, v2.16b str q1, [x2] str q0, [x2, 16] ret [/asm]
[Bug middle-end/88487] union prevents autovectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88487 --- Comment #6 from Daniel Fruzynski --- Not good. Fortunately I found workaround. This is probably the best what one can get: [code] #include #include template struct TypeHelper { constexpr unsigned offset(); operator Type&() { uint8_t*__restrict p = (uint8_t*__restrict)this - offset(); Type*__restrict pt = (Type*__restrict)p; return *pt; } }; struct S { struct Union { void*__restrict*__restrict ptr; TypeHelper d; } u; }; template<> constexpr unsigned TypeHelper::offset() { return offsetof(S::Union, d) - offsetof(S::Union, ptr); } void test(S* __restrict s1, S* __restrict s2) { for (int n = 0; n < 2; ++n) { s1->u.d[n][0] = s2->u.d[n][0]; s1->u.d[n][1] = s2->u.d[n][1]; } } [/code]
[Bug middle-end/88569] New: Track relations between variable values
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88569 Bug ID: 88569 Summary: Track relations between variable values Product: gcc Version: 8.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: --- This example comes from code which could be compiled for various CPUs, and had dedicated sections for AVX and SSE2. I left original ifdefs in comments. When 1st loop (for AVX) ends, following relations is true: (cnt - n <= 3). Similarly after 2nd loop this is true: (cnt - n <= 1). With such knowledge it is possible to optimize code of bar() to baz(). This eliminates two condition checks (after 2nd and 3rd loop), and one increment (for 3rd loop). It would be nice if gcc could perform such transformation automatically. [code] void foo(int n); void bar(int cnt) { int n = 0; //#ifdef __AVX__ for (; n < cnt - 3; n += 4) foo(n); //#endif //#ifdef __SSE2__ for (; n < cnt - 1; n += 2) foo(n); //#endif for (; n < cnt; n += 1) foo(n); } void baz(int cnt) { int n = 0; for (; n < cnt - 3; n += 4) foo(n); if (n < cnt - 1) { foo(n); n += 2; } if (n < cnt) foo(n); } [/code] [asm] bar(int): pushr13 pushr12 mov r12d, edi pushrbp lea ebp, [rdi-3] pushrbx xor ebx, ebx sub rsp, 8 testebp, ebp jle .L5 .L2: mov edi, ebx add ebx, 4 callfoo(int) cmp ebx, ebp jl .L2 lea eax, [r12-4] shr eax, 2 lea ebx, [4+rax*4] .L5: lea ebp, [r12-1] cmp ebp, ebx jle .L3 mov edi, ebx lea r13d, [rbx+2] callfoo(int) cmp ebp, r13d jle .L8 mov edi, r13d callfoo(int) .L8: lea edi, [r12-2] sub edi, ebx mov ebx, edi and ebx, -2 add ebx, r13d .L3: cmp r12d, ebx jle .L14 mov edi, ebx callfoo(int) lea edi, [rbx+1] cmp r12d, edi jg .L17 .L14: add rsp, 8 pop rbx pop rbp pop r12 pop r13 ret .L17: add rsp, 8 pop rbx pop rbp pop r12 pop r13 jmp foo(int) baz(int): pushr12 mov r12d, edi pushrbp lea ebp, [rdi-3] pushrbx xor ebx, ebx testebp, ebp jle .L19 .L20: mov edi, ebx add ebx, 4 callfoo(int) cmp ebx, ebp 
jl .L20 lea eax, [r12-4] shr eax, 2 lea ebx, [4+rax*4] .L19: lea eax, [r12-1] cmp eax, ebx jg .L27 cmp ebx, r12d jl .L28 .L25: pop rbx pop rbp pop r12 ret .L27: mov edi, ebx add ebx, 2 callfoo(int) cmp ebx, r12d jge .L25 .L28: mov edi, ebx pop rbx pop rbp pop r12 jmp foo(int) [/asm]
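That bar() and baz() make the same sequence of foo() calls can be checked mechanically. The sketch below reuses bar() and baz() from the report and replaces foo() with a recording stub; g_calls, record_calls and same_call_sequence are hypothetical helpers added for the comparison, and the tested range of cnt is an arbitrary choice that stays far away from integer overflow in cnt - 3.

```cpp
#include <cassert>
#include <vector>

// Recording stub standing in for the external foo() of the report.
static std::vector<int>* g_calls = nullptr;
void foo(int n) {
    if (g_calls) g_calls->push_back(n);
}

// bar() as in the report: three consecutive loops.
void bar(int cnt) {
    int n = 0;
    for (; n < cnt - 3; n += 4) foo(n);
    for (; n < cnt - 1; n += 2) foo(n);
    for (; n < cnt; n += 1) foo(n);
}

// baz() as in the report: the 2nd and 3rd loops reduced to single checks,
// valid because cnt - n <= 3 after the first loop.
void baz(int cnt) {
    int n = 0;
    for (; n < cnt - 3; n += 4) foo(n);
    if (n < cnt - 1) {
        foo(n);
        n += 2;
    }
    if (n < cnt) foo(n);
}

// Runs fn(cnt) and returns the arguments foo() was called with, in order.
std::vector<int> record_calls(void (*fn)(int), int cnt) {
    std::vector<int> calls;
    g_calls = &calls;
    fn(cnt);
    g_calls = nullptr;
    return calls;
}

// True when bar and baz call foo identically for every cnt in [first, last].
bool same_call_sequence(int first, int last) {
    for (int cnt = first; cnt <= last; ++cnt) {
        if (record_calls(bar, cnt) != record_calls(baz, cnt)) return false;
    }
    return true;
}
```

For example, with cnt = 7 both functions call foo(0), foo(4) and foo(6): one step of 4, one step of 2 and one step of 1.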
[Bug middle-end/88570] New: Missing or ineffective vectorization of scatter load
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88570 Bug ID: 88570 Summary: Missing or ineffective vectorization of scatter load Product: gcc Version: 8.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: ---
[code]
void test1(int*__restrict n1, int*__restrict n2, int*__restrict n3, int*__restrict n4)
{
    for (int n = 0; n < 8; ++n)
    {
        if (n1[n] > 0)
            n2[n] = n3[n];
        else
            n2[n] = n4[n];
    }
}

void test2(double*__restrict d1, double*__restrict d2, double*__restrict d3, double*__restrict d4)
{
    for (int n = 0; n < 4; ++n)
    {
        if (d1[n] > 0.0)
            d2[n] = d3[n];
        else
            d2[n] = d4[n];
    }
}
[/code]
Code like the above is vectorized properly when global variables are used. However, when the code has to work on pointers passed as function arguments, vectorization is either not performed or performed ineffectively.
1. Compilation with -O3 -msse2: no vectorization at all; scalar code is generated. It is long, so I do not paste it here.
2. Compilation with -O3 -msse4.1: no vectorization at all.
3. Compilation with -O3 -mavx or -march=sandybridge: code for test1() is still not vectorized (somewhat expected, as int operations are in AVX2). Output for test2() is below. As you can see, the generated code performs masked loads for d3 and d4, and then uses a blend to create the final result. When global vars are used, masked loads are not used, only the blend. Additionally, the xor mask is loaded from memory instead of being generated with a cmpeq instruction.
[asm]
test2(double*, double*, double*, double*):
    vmovupd xmm3, XMMWORD PTR [rdi]
    vinsertf128 ymm1, ymm3, XMMWORD PTR [rdi+16], 0x1
    vxorpd xmm0, xmm0, xmm0
    vcmpltpd ymm1, ymm0, ymm1
    vmaskmovpd ymm2, ymm1, YMMWORD PTR [rdx]
    vxorps ymm0, ymm1, YMMWORD PTR .LC0[rip]
    vmaskmovpd ymm0, ymm0, YMMWORD PTR [rcx]
    vblendvpd ymm0, ymm0, ymm2, ymm1
    vmovups XMMWORD PTR [rsi], xmm0
    vextractf128 XMMWORD PTR [rsi+16], ymm0, 0x1
    vzeroupper
    ret
.LC0:
    .quad -1
    .quad -1
    .quad -1
    .quad -1
[/asm]
4. Compilation with -O3 -march=haswell: code similar to the above, with both masked loads and a blend. This time the compiler generated vpcmpeqd to create the xor mask. The same happens when -mavx2 is used instead of -march=haswell.
[asm]
test1(int*, int*, int*, int*):
    vmovdqu ymm1, YMMWORD PTR [rdi]
    vpxor xmm0, xmm0, xmm0
    vpcmpgtd ymm1, ymm1, ymm0
    vpmaskmovd ymm2, ymm1, YMMWORD PTR [rdx]
    vpcmpeqd ymm0, ymm1, ymm0
    vpmaskmovd ymm0, ymm0, YMMWORD PTR [rcx]
    vpblendvb ymm0, ymm0, ymm2, ymm1
    vmovdqu YMMWORD PTR [rsi], ymm0
    vzeroupper
    ret
test2(double*, double*, double*, double*):
    vxorpd xmm0, xmm0, xmm0
    vcmpltpd ymm1, ymm0, YMMWORD PTR [rdi]
    vpcmpeqd ymm0, ymm0, ymm0
    vmaskmovpd ymm2, ymm1, YMMWORD PTR [rdx]
    vpxor ymm0, ymm0, ymm1
    vmaskmovpd ymm0, ymm0, YMMWORD PTR [rcx]
    vblendvpd ymm0, ymm0, ymm2, ymm1
    vmovupd YMMWORD PTR [rsi], ymm0
    vzeroupper
    ret
[/asm]
5. Compilation with -O3 -march=skylake-avx512: masked loads and a blend are used again. This time the masked loads use kN registers to store the mask. test1() performs the comparison twice to get the negated value. test2() uses a single comparison, but to negate it, it moves the value to eax and then back (I will log a separate bug for this part, as it has other implications). Code which uses global variables only uses a blend with the mask in a ymm register.
[asm]
test1(int*, int*, int*, int*):
    vmovdqu32 ymm0, YMMWORD PTR [rdi]
    vpxor xmm2, xmm2, xmm2
    vpcmpd k1, ymm0, ymm2, 6
    vpcmpgtd ymm3, ymm0, ymm2
    vmovdqu32 ymm1{k1}{z}, YMMWORD PTR [rdx]
    vpcmpd k1, ymm0, ymm2, 2
    vmovdqu32 ymm0{k1}{z}, YMMWORD PTR [rcx]
    vpblendvb ymm0, ymm0, ymm1, ymm3
    vmovdqu32 YMMWORD PTR [rsi], ymm0
    vzeroupper
    ret
test2(double*, double*, double*, double*):
    vmovupd ymm0, YMMWORD PTR [rdi]
    vxorpd xmm1, xmm1, xmm1
    vcmppd k1, ymm0, ymm1, 14
    vcmpltpd ymm1, ymm1, ymm0
    kmovb eax, k1
    not eax
    vmovupd ymm2{k1}{z}, YMMWORD PTR [rdx]
    kmovb k2, eax
    vmovupd ymm0{k2}{z}, YMMWORD PTR [rcx]
    vblendvpd ymm0, ymm0, ymm2, ymm1
    vmovupd YMMWORD PTR [rsi], ymm0
    vzeroupper
    ret
[/asm]
6.
I tried to compile this code using icc, and got this. As you can see, it uses masked move instead of blend. I did not check if it o
[Bug target/88571] New: AVX512: when calculating logical expression with all values in kN registers, do not use GPRs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88571 Bug ID: 88571 Summary: AVX512: when calculating logical expression with all values in kN registers, do not use GPRs Product: gcc Version: 8.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: --- This is a side effect of finding Bug 88570. I have noticed that when gcc has to generate code for a logical expression with all values already stored in kN registers, it moves them to GPRs, performs the calculation there, and moves the result back. Such a situation may happen as a side effect of optimizations in gcc. It is also more convenient to write expressions with C/C++ operators instead of intrinsics, so some people may prefer to use them. It can probably also happen as a side effect of interaction between code optimized by gcc and user code. When the logical expression is written using intrinsics, values stay in kN registers as expected. The code below was compiled with -O3 -march=skylake-avx512. test1 and test2 are examples of code with C/C++ operators. test3 is an example of a 'not' introduced by gcc during optimization. This last example is also in Bug 88570, which I logged to fix inefficient optimizations.
[code]
#include <immintrin.h>

void test1(int*__restrict n1, int*__restrict n2, int*__restrict n3, int*__restrict n4)
{
    __m256i v = _mm256_loadu_si256((__m256i*)n1);
    __mmask8 m = _mm256_cmpgt_epi32_mask(v, _mm256_set1_epi32(1));
    m = ~m;
    _mm256_mask_storeu_epi32((__m256i*)n2, m, v);
}

void test2(int*__restrict n1, int*__restrict n2, int*__restrict n3, int*__restrict n4)
{
    __m256i v1 = _mm256_loadu_si256((__m256i*)n1);
    __m256i v2 = _mm256_loadu_si256((__m256i*)n1);
    __m256i v0 = _mm256_set1_epi32(2);
    __mmask8 m1 = _mm256_cmpgt_epi32_mask(v1, _mm256_set1_epi32(1));
    __mmask8 m2 = _mm256_cmpgt_epi32_mask(v2, _mm256_set1_epi32(2));
    __mmask8 m = ~(m1 | m2);
    _mm256_mask_storeu_epi32((__m256i*)n2, m, v1);
}

void test3(double*__restrict d1, double*__restrict d2, double*__restrict d3, double*__restrict d4)
{
    for (int n = 0; n < 4; ++n)
    {
        if (d1[n] > 0.0)
            d2[n] = d3[n];
        else
            d2[n] = d4[n];
    }
}
[/code]
[asm]
test1(int*, int*, int*, int*):
    vmovdqu64 ymm0, YMMWORD PTR [rdi]
    vpcmpgtd k1, ymm0, YMMWORD PTR .LC0[rip]
    kmovb eax, k1
    not eax
    kmovb k2, eax
    vmovdqu32 YMMWORD PTR [rsi]{k2}, ymm0
    vzeroupper
    ret
test2(int*, int*, int*, int*):
    vmovdqu64 ymm1, YMMWORD PTR [rdi]
    vpcmpgtd k1, ymm1, YMMWORD PTR .LC0[rip]
    vpcmpgtd k2, ymm1, YMMWORD PTR .LC1[rip]
    kmovb edx, k1
    kmovb eax, k2
    or eax, edx
    not eax
    kmovb k3, eax
    vmovdqu32 YMMWORD PTR [rsi]{k3}, ymm1
    vzeroupper
    ret
test3(double*, double*, double*, double*):
    vmovupd ymm0, YMMWORD PTR [rdi]
    vxorpd xmm1, xmm1, xmm1
    vcmppd k1, ymm0, ymm1, 14
    vcmpltpd ymm1, ymm1, ymm0
    kmovb eax, k1
    not eax
    vmovupd ymm2{k1}{z}, YMMWORD PTR [rdx]
    kmovb k2, eax
    vmovupd ymm0{k2}{z}, YMMWORD PTR [rcx]
    vblendvpd ymm0, ymm0, ymm2, ymm1
    vmovupd YMMWORD PTR [rsi], ymm0
    vzeroupper
    ret
.LC0:
    .long 1
    .long 1
    .long 1
    .long 1
    .long 1
    .long 1
    .long 1
    .long 1
.LC1:
    .long 2
    .long 2
    .long 2
    .long 2
    .long 2
    .long 2
    .long 2
    .long 2
[/asm]
[Bug target/88571] AVX512: when calculating logical expression with all values in kN registers, do not use GPRs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88571 --- Comment #2 from Daniel Fruzynski --- Yes. Issue still exists in g++ (GCC-Explorer-Build) 9.0.0 20181219 (experimental).
[Bug target/88570] Missing or ineffective vectorization of scatter load
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88570 --- Comment #2 from Daniel Fruzynski --- In g++ (GCC-Explorer-Build) 9.0.0 20181219 (experimental) this still exists.
[Bug target/88571] AVX512: when calculating logical expression with all values in kN registers, do not use GPRs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88571 --- Comment #3 from Daniel Fruzynski --- I have checked svn head version (20181221), issue is still there.
[Bug target/88570] Missing or ineffective vectorization of scatter load
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88570 --- Comment #3 from Daniel Fruzynski --- I have checked svn head version (20181221), issue is still there.
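A possible source-level experiment for the loop from Bug 88570, offered only as a sketch and not as a fix confirmed anywhere in this thread: read both candidate values unconditionally and select afterwards. The function name select_unconditional is made up for illustration. The sketch assumes that all eight elements of n3 and n4 are readable, which the original conditional form does not require. Whether gcc then emits a plain blend without masked loads would have to be checked on the compiler version in question.

```cpp
// Hypothetical variant of test1() from the report: both inputs are loaded
// before the selection, so no load depends on the condition.
// Assumption: n3[0..7] and n4[0..7] are all valid to read.
void select_unconditional(const int* __restrict n1, int* __restrict n2,
                          const int* __restrict n3, const int* __restrict n4)
{
    for (int n = 0; n < 8; ++n)
    {
        const int if_positive = n3[n]; // always loaded
        const int otherwise = n4[n];   // always loaded
        n2[n] = n1[n] > 0 ? if_positive : otherwise;
    }
}
```

The results are the same as for the original loop; only the set of memory locations that are read differs.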
[Bug middle-end/88575] New: gcc got confused by different comparison operators
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88575 Bug ID: 88575 Summary: gcc got confused by different comparison operators Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: --- In test() gcc is not able to determine that for a==b it does not have to evaluate 2nd comparison and can use value of a if 1st comparison is true. When operators are swapped like in test2() or are the same, code is optimized. [code] double test(double a, double b) { if (a <= b) return a < b ? a : b; return 0.0; } double test2(double a, double b) { if (a < b) return a <= b ? a : b; return 0.0; } [/code] [asm] test(double, double): vcomisd xmm1, xmm0 jnb .L10 vxorpd xmm0, xmm0, xmm0 ret .L10: vminsd xmm0, xmm0, xmm1 ret test2(double, double): vcmpnltsd xmm1, xmm0, xmm1 vxorpd xmm2, xmm2, xmm2 vblendvpd xmm0, xmm0, xmm2, xmm1 ret [/asm]
[Bug middle-end/88575] gcc got confused by different comparison operators
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88575 --- Comment #2 from Daniel Fruzynski --- Code was compiled with -O3 -march=skylake. I have tried to add -fno-signed-zeros and -fsigned-zeros, and got the same output for both cases.
[Bug middle-end/88575] gcc got confused by different comparison operators
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88575 --- Comment #3 from Daniel Fruzynski --- I have tried to compile with -O3 -march=skylake -ffast-math and got this:
[asm]
test(double, double):
    vminsd xmm2, xmm0, xmm1
    vcmplesd xmm0, xmm0, xmm1
    vxorpd xmm1, xmm1, xmm1
    vblendvpd xmm0, xmm1, xmm2, xmm0
    ret
test2(double, double):
    vminsd xmm2, xmm0, xmm1
    vcmpltsd xmm0, xmm0, xmm1
    vxorpd xmm1, xmm1, xmm1
    vblendvpd xmm0, xmm1, xmm2, xmm0
    ret
[/asm]
And this is for -O3 -march=skylake -funsafe-math-optimizations. As you can see, one instruction was eliminated from test2(). For some reason it was not eliminated from test(). I checked that -ffinite-math-only, which is part of -ffast-math, is what prevents elimination of this extra instruction.
[asm]
test(double, double):
    vminsd xmm2, xmm0, xmm1
    vcmplesd xmm0, xmm0, xmm1
    vxorpd xmm1, xmm1, xmm1
    vblendvpd xmm0, xmm1, xmm2, xmm0
    ret
test2(double, double):
    vcmpnltsd xmm1, xmm0, xmm1
    vxorpd xmm2, xmm2, xmm2
    vblendvpd xmm0, xmm0, xmm2, xmm1
    ret
[/asm]
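One detail that may be relevant to test() from Bug 88575, shown here as a small sketch and not as an explanation confirmed in this thread: under default IEEE semantics, returning a whenever a <= b is not exactly equivalent to the nested form, because +0.0 and -0.0 compare equal but differ in sign. The function names below are invented for illustration. Note that the reporter saw the same output with -fno-signed-zeros, so this corner case alone does not explain the missed optimization.

```cpp
#include <cmath>

// Same shape as test() from the report.
double nested_le_lt(double a, double b)
{
    if (a <= b)
        return a < b ? a : b;
    return 0.0;
}

// Candidate simplification: return a whenever a <= b holds.
double simplified(double a, double b)
{
    return a <= b ? a : 0.0;
}

// Compares both the numeric value and the sign bit of two results.
bool same_value_and_sign(double x, double y)
{
    return x == y && std::signbit(x) == std::signbit(y);
}
```

For a = +0.0 and b = -0.0 the nested form returns b (negative zero), while the simplified form returns a (positive zero). For all other ordered inputs the two agree.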
[Bug target/65782] Assembly failure (invalid register for .seh_savexmm) with -O3 -mavx512f on mingw-w64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782 Daniel Fruzynski changed:
What|Removed|Added
CC| |bugzilla@poradnik-webmastera.com
--- Comment #3 from Daniel Fruzynski --- Cygwin (x86_64-pc-cygwin) is also affected. I have encountered this bug on gcc 7.4.0. Could you add a new option which would remove the XMM16+ registers from the pool of available registers? It could serve as an easy workaround until you fix it properly.
[Bug target/65782] Assembly failure (invalid register for .seh_savexmm) with -O3 -mavx512f on mingw-w64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782 --- Comment #4 from Daniel Fruzynski --- I have found that I can use the -ffixed-reg option for this. Each use removes one register, so I have to pass it 16 times to eliminate all of xmm16..31. It would be handy to have another option which disables all registers from this group at once.
[Bug target/65782] Assembly failure (invalid register for .seh_savexmm) with -O3 -mavx512f on mingw-w64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782 --- Comment #5 from Daniel Fruzynski --- I got following link: https://stackoverflow.com/questions/53733624/is-xmm8-register-value-preserved-across-calls/53733767#53733767 Quote from it: "Any additional registers for newer instruction sets are volatile by default. This includes the upper parts of YMM0-15 and ZMM0-15 as well as ?MM16-31 if present.". So it looks that gcc should not generate .seh_savexmm for xmm16..31 at all.
[Bug c++/87729] Please include -Woverloaded-virtual in -Wall
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87729 --- Comment #2 from Daniel Fruzynski --- Here you are: [code] class Foo { public: virtual void f(int); }; class Bar : public Foo { public: virtual void f(short); }; [/code]
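To make the effect behind -Woverloaded-virtual observable, here is a sketch based on the Foo/Bar example above. The function bodies, the int return values, and the helper names call_via_base and call_via_derived are additions for illustration; the original example only declares the functions.

```cpp
struct Foo
{
    virtual ~Foo() = default;
    virtual int f(int) { return 1; }
};

struct Bar : Foo
{
    // Does not override Foo::f(int); it declares a new virtual function
    // and hides the base-class one for calls made through Bar.
    virtual int f(short) { return 2; }
};

// Dispatches through the base interface: Foo::f(int) is called,
// because Bar never overrides it.
int call_via_base(Foo& foo)
{
    return foo.f(42);
}

// Name lookup stops in Bar, so 42 is converted to short and
// Bar::f(short) is called.
int call_via_derived(Bar& bar)
{
    return bar.f(42);
}
```

The same object therefore answers differently depending on the static type used for the call, which is the kind of mistake the warning is meant to flag.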
[Bug c/88679] New: SSE2 intrinsics are available by default on x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88679 Bug ID: 88679 Summary: SSE2 intrinsics are available by default on x86 Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: --- SSE2 intrinsics are available by default when compiling code for 32-bit x86. The code below compiles fine with options -m32 -O3. I had to add -mno-sse2 to get an error. Fortunately __SSE2__ is not defined by default, so code can rely on it.
[code]
#include <emmintrin.h>

void test(__m128i const* m)
{
    __m128i v = _mm_load_si128(m);
}
[/code]
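As the report notes, code can rely on __SSE2__ to decide whether to use the intrinsics. A minimal sketch of that pattern follows; the function add4 and its scalar fallback are invented for illustration, and unaligned loads are used so the example makes no alignment assumptions.

```cpp
#include <cstdint>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Adds four 32-bit integers element-wise: out[i] = a[i] + b[i].
void add4(const std::int32_t* a, const std::int32_t* b, std::int32_t* out)
{
#if defined(__SSE2__)
    // SSE2 path, taken only when the compiler targets SSE2.
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi32(va, vb));
#else
    // Portable fallback for targets without SSE2.
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
#endif
}
```

Both paths produce the same result for inputs that do not overflow, so callers do not need to know which one was compiled in.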
[Bug target/88679] SSE2 intrinsics are available by default on x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88679 --- Comment #2 from Daniel Fruzynski --- I used compiler at https://godbolt.org/. Here are outputs for both commands: $ gcc -v Using built-in specs. COLLECT_GCC=/opt/compiler-explorer/gcc-snapshot/bin/g++ Target: x86_64-linux-gnu Configured with: ../gcc-trunk-20190103/configure --prefix=/opt/compiler-explorer/gcc-build/staging --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --disable-bootstrap --enable-multiarch --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --enable-clocale=gnu --enable-languages=c,c++,fortran --enable-ld=yes --enable-gold=yes --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-linker-build-id --enable-lto --enable-plugins --enable-threads=posix --with-pkgversion=GCC-Explorer-Build Thread model: posix gcc version 9.0.0 20190102 (experimental) (GCC-Explorer-Build) COLLECT_GCC_OPTIONS='-fdiagnostics-color=always' '-g' '-o' '/tmp/compiler-explorer-compiler11903-60-1nshruf.qczq/output.s' '-masm=intel' '-S' '-v' '-shared-libgcc' '-mtune=generic' '-march=x86-64' /opt/compiler-explorer/gcc-trunk-20190103/bin/../libexec/gcc/x86_64-linux-gnu/9.0.0/cc1plus -quiet -v -imultiarch x86_64-linux-gnu -iprefix /opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/ -D_GNU_SOURCE -quiet -dumpbase example.cpp -masm=intel -mtune=generic -march=x86-64 -auxbase-strip /tmp/compiler-explorer-compiler11903-60-1nshruf.qczq/output.s -g -version -fdiagnostics-color=always -o /tmp/compiler-explorer-compiler11903-60-1nshruf.qczq/output.s GNU C++14 (GCC-Explorer-Build) version 9.0.0 20190102 (experimental) (x86_64-linux-gnu) compiled by GNU C version 7.3.0, GMP version 6.1.0, MPFR version 3.1.4, MPC version 1.0.3, isl version isl-0.18-GMP GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096 ignoring nonexistent directory 
"/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../x86_64-linux-gnu/include" ignoring duplicate directory "/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/../../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../include/c++/9.0.0" ignoring duplicate directory "/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/../../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../include/c++/9.0.0/x86_64-linux-gnu" ignoring duplicate directory "/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/../../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../include/c++/9.0.0/backward" ignoring duplicate directory "/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/../../lib/gcc/x86_64-linux-gnu/9.0.0/include" ignoring nonexistent directory "/usr/local/include/x86_64-linux-gnu" ignoring duplicate directory "/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/../../lib/gcc/x86_64-linux-gnu/9.0.0/include-fixed" ignoring nonexistent directory "/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/../../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../x86_64-linux-gnu/include" #include "..." search starts here: #include <...> search starts here: /opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../include/c++/9.0.0 /opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../include/c++/9.0.0/x86_64-linux-gnu /opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../include/c++/9.0.0/backward /opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/include /opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/include-fixed /usr/local/include /opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/../../include /usr/include/x86_64-linux-gnu /usr/include End of search list. 
GNU C++14 (GCC-Explorer-Build) version 9.0.0 20190102 (experimental) (x86_64-linux-gnu) compiled by GNU C version 7.3.0, GMP version 6.1.0, MPFR version 3.1.4, MPC version 1.0.3, isl version isl-0.18-GMP GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096 Compiler executable checksum: f724e483fb841047a948ffa41ca3218a COMPILER_PATH=/opt/compiler-explorer/gcc-trunk-20190103/bin/../libexec/gcc/x86_64-linux-gnu/9.0.0/:/opt/compiler-explorer/gcc-trunk-20190103/bin/../libexec/gcc/x86_64-linux-gnu/:/opt/compiler-explorer/gcc-trunk-20190103/bin/../libexec/gcc/:/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../x86_64-linux-gnu/bin/ LIBRARY_PATH=/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/:/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/:/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/:/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../lib64/:/lib/x86_64-linux-gnu/:/lib/../lib64/:/usr/lib/x86_64-linux-gnu/:/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/..
[Bug target/71659] _xgetbv intrinsic missing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71659 Daniel Fruzynski changed:
What|Removed|Added
CC| |bugzilla@poradnik-webmastera.com
--- Comment #4 from Daniel Fruzynski --- This intrinsic was added in gcc 8. The initial implementation was buggy (see r85684) and was fixed in 8.2. However, there is one more issue here: Intel Intrinsics Guide says that it should be available by including , however in gcc you need to include . Additionally, there are no defines for XFEATURE_ENABLED_MASK and the possible output values.
[Bug target/71659] _xgetbv intrinsic missing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71659 --- Comment #5 from Daniel Fruzynski --- I meant pr85684
[Bug c/88959] New: Unnecessary xor before bsf/tzcnt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88959 Bug ID: 88959 Summary: Unnecessary xor before bsf/tzcnt Product: gcc Version: 4.9.2 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: --- [code] int test(int x) { return __builtin_ctz(x); } [/code] gcc 4.9.1 with -O3 produces this: [asm] test(int): rep bsf eax, edi ret [/asm] And this with -O3 -mbmi: [asm] test(int): tzcnt eax, edi ret [/asm] gcc 4.9.2 and newer (including gcc 9) produces this for both cases: [asm] test(int): xor eax, eax rep bsf eax, edi ret [/asm] [asm] test(int): xor eax, eax tzcnt eax, edi ret [/asm] This extra xor instruction is not needed here.
[Bug c/88959] Unnecessary xor before bsf/tzcnt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88959 --- Comment #1 from Daniel Fruzynski --- I have found that this extra xor is not added when compiling with -O3 -march=sandybridge or -O3 -march=ivybridge. However, with -O3 -march=sandybridge/ivybridge -mbmi it is added.
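For anyone experimenting with the test case from Bug 88959: __builtin_ctz takes an unsigned int, and its result is undefined when the argument is zero. A small wrapper that makes the zero case explicit is sketched below; the name count_trailing_zeros and the if_zero parameter are invented for illustration.

```cpp
// Returns the number of trailing zero bits in x, or if_zero when x == 0,
// because __builtin_ctz(0) has an undefined result.
int count_trailing_zeros(unsigned int x, int if_zero)
{
    return x != 0 ? __builtin_ctz(x) : if_zero;
}
```

Usage: count_trailing_zeros(8u, 32) yields 3, and count_trailing_zeros(0u, 32) yields the supplied fallback of 32.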
[Bug c/88963] New: gcc generates terrible code for vectors of 64+ length which are not natively supported
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88963 Bug ID: 88963 Summary: gcc generates terrible code for vectors of 64+ length which are not natively supported Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: --- [code] typedef int VInt __attribute__((vector_size(64))); void test(VInt*__restrict a, VInt*__restrict b, VInt*__restrict c) { *a = *b + *c; } [/code] This code compiled with -O3 -march=skylake in following way: [asm] test(int __vector(16)*, int __vector(16)*, int __vector(16)*): push rbp mov rbp, rsp and rsp, -64 sub rsp, 136 vmovdqa xmm3, XMMWORD PTR [rsi] vmovdqa xmm4, XMMWORD PTR [rsi+16] vmovdqa xmm5, XMMWORD PTR [rsi+32] vmovdqa xmm6, XMMWORD PTR [rsi+48] vmovdqa xmm7, XMMWORD PTR [rdx] vmovaps XMMWORD PTR [rsp-56], xmm3 vmovdqa xmm1, XMMWORD PTR [rdx+16] vmovaps XMMWORD PTR [rsp-40], xmm4 vmovdqa ymm4, YMMWORD PTR [rsp-56] vmovdqa xmm2, XMMWORD PTR [rdx+32] vmovaps XMMWORD PTR [rsp-8], xmm6 vmovaps XMMWORD PTR [rsp+8], xmm7 vmovdqa xmm3, XMMWORD PTR [rdx+48] vmovaps XMMWORD PTR [rsp-24], xmm5 vmovaps XMMWORD PTR [rsp+24], xmm1 vpaddd ymm0, ymm4, YMMWORD PTR [rsp+8] vmovdqa ymm5, YMMWORD PTR [rsp-24] vmovaps XMMWORD PTR [rsp+40], xmm2 vmovaps XMMWORD PTR [rsp+56], xmm3 vmovdqa xmm2, xmm0 vmovdqa YMMWORD PTR [rsp-120], ymm0 vpaddd ymm0, ymm5, YMMWORD PTR [rsp+40] vmovdqa xmm6, XMMWORD PTR [rsp-104] vmovdqa YMMWORD PTR [rsp-88], ymm0 vmovdqa xmm7, XMMWORD PTR [rsp-72] vmovaps XMMWORD PTR [rdi], xmm2 vmovaps XMMWORD PTR [rdi+16], xmm6 vmovaps XMMWORD PTR [rdi+32], xmm0 vmovaps XMMWORD PTR [rdi+48], xmm7 vzeroupper leave ret [/asm] Other compilers (clang, icc) produces nice code. 
This is from clang:
[asm]
test(int __vector(16)*, int __vector(16)*, int __vector(16)*): # @test(int __vector(16)*, int __vector(16)*, int __vector(16)*)
    vmovdqa ymm0, ymmword ptr [rdx]
    vmovdqa ymm1, ymmword ptr [rdx + 32]
    vpaddd ymm0, ymm0, ymmword ptr [rsi]
    vpaddd ymm1, ymm1, ymmword ptr [rsi + 32]
    vmovdqa ymmword ptr [rdi + 32], ymm1
    vmovdqa ymmword ptr [rdi], ymm0
    vzeroupper
    ret
[/asm]
gcc produces pretty code for -O3 -march=skylake-avx512. The code is also pretty for vector size 32 with AVX disabled. However, for vector size 128 and -O3 -march=skylake-avx512 the code is ugly again.
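A possible source-level variation for Bug 88963, given as a sketch only: express the 64-byte addition as two 32-byte halves using the same GCC vector extension. The type VInt8 and the function add16_by_halves are made-up names, and plain int pointers with memcpy are used so the sketch makes no alignment assumptions. Whether this avoids the stack round-trips shown in the report has not been verified here and would need to be checked per compiler version.

```cpp
#include <cstring>

// 32-byte vector of eight ints (GCC/clang vector extension).
typedef int VInt8 __attribute__((vector_size(32)));

// Computes a[i] = b[i] + c[i] for 16 ints, one 32-byte half at a time.
void add16_by_halves(int* a, const int* b, const int* c)
{
    for (int half = 0; half < 2; ++half)
    {
        VInt8 vb, vc;
        std::memcpy(&vb, b + 8 * half, sizeof(vb));
        std::memcpy(&vc, c + 8 * half, sizeof(vc));
        const VInt8 va = vb + vc; // element-wise addition
        std::memcpy(a + 8 * half, &va, sizeof(va));
    }
}
```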
[Bug c++/91235] New: Array size expression is implicitly casted to unsigned long type
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91235 Bug ID: 91235 Summary: Array size expression is implicitly casted to unsigned long type Product: gcc Version: 9.1.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: ---
[code]
void foo(char*);

inline void bar(int n)
{
    if (__builtin_constant_p(n))
    {
        char a[(int)(n == 2 ? -1 : 0)];
        foo(a);
    }
}

void baz()
{
    bar(2);
}
[/code]
When this is compiled with -O3 -Wall -Wextra -std=c++11 (tested via godbolt.org), it produces the following code:
[asm]
baz():
    push rbp
    mov rbp, rsp
    mov rdi, rsp
    call foo(char*)
    leave
    ret
[/asm]
During compilation gcc reported the following warning:
[out]
<source>: In function 'void baz()':
<source>:7:14: warning: argument to variable-length array is too large [-Wvla-larger-than=]
7 | char a[(int)(n == 2 ? -1 : 0)];
| ^
<source>:7:14: note: limit is 9223372036854775807 bytes, but argument is 18446744073709551615
Compiler returned: 0
[/out]
This means that gcc saw that n is constant, and then the expression specified as the array size was evaluated and implicitly cast to an unsigned type. When I removed the "foo(a);" line, this warning was gone and gcc warned about an unused variable instead. When -1 is specified directly as the array size, gcc correctly reports an error that the array size is negative. It looks like only expressions cause this issue.
[Bug c++/91235] Array size expression is implicitly casted to unsigned long type
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91235 --- Comment #1 from Daniel Fruzynski --- I checked that trunk gcc also accepts this code, both with -std=c++11 and -std=c++1z. Clang also compiles this without error. Could someone take a look on this and add some comment here?
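The two numbers in the warning from Bug 91235 can be reproduced with an ordinary conversion. This sketch only illustrates the arithmetic and is not a statement about gcc internals; the helper name as_array_bound is made up. On a typical 64-bit target, converting -1 to std::size_t gives 18446744073709551615 (SIZE_MAX), and the reported limit 9223372036854775807 equals PTRDIFF_MAX.

```cpp
#include <cstddef>
#include <cstdint>

// Converts a signed bound to the unsigned type used for object sizes.
// A negative value wraps around modulo 2^N, where N is the width of size_t.
std::size_t as_array_bound(int n)
{
    return static_cast<std::size_t>(n);
}
```

So as_array_bound(-1) equals SIZE_MAX, which is larger than PTRDIFF_MAX and therefore exceeds the limit quoted in the note.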
[Bug c/83369] New: Missing diagnostics during inlining
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83369 Bug ID: 83369 Summary: Missing diagnostics during inlining Product: gcc Version: 8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: --- When the code below is compiled, gcc prints warnings that null is passed to a function with the nonnull attribute. However, gcc does not point out that the warning is caused by inlining of my_strcpy at line 32 of test.cc. The code was compiled using gcc (GCC) 8.0.0 20171210 (experimental).
[code]
#include <string.h>

char buf[100];

struct Test
{
    const char* s1;
    const char* s2;
};

__attribute((nonnull(1, 2)))
inline char* my_strcpy(char* __restrict__ dst, const char* __restrict__ src, size_t size)
{
    size_t len = strlen(src);
    if (len < size)
        memcpy(dst, src, len + 1);
    else
    {
        memcpy(dst, src, size - 1);
        dst[size - 1] = '\0';
    }
    return dst;
}

void test(Test* test)
{
    if (test->s1)
        my_strcpy(buf, test->s1, sizeof(buf));
    else if (test->s2)
        my_strcpy(buf, test->s2, sizeof(buf));
    else
        my_strcpy(buf, test->s2, sizeof(buf)); // error, line 32
}
[/code]
[out]
$ g++ -c -o test.o test.cc -O2 -Wall
test.cc: In function ‘void test(Test*)’:
test.cc:14:24: warning: argument 1 null where non-null expected [-Wnonnull]
size_t len = strlen(src);
~~^
In file included from test.cc:1:
/usr/include/string.h:395:15: note: in a call to function ‘size_t strlen(const char*)’ declared here
extern size_t strlen (const char *__s)
^~
test.cc:16:15: warning: argument 2 null where non-null expected [-Wnonnull]
memcpy(dst, src, len + 1);
~~^~~
In file included from test.cc:1:
/usr/include/string.h:42:14: note: in a call to function ‘void* memcpy(void*, const void*, size_t)’ declared here
extern void *memcpy (void *__restrict __dest, const void *__restrict __src,
^~
test.cc:19:15: warning: argument 2 null where non-null expected [-Wnonnull]
memcpy(dst, src, size - 1);
~~^~~~
In file included from test.cc:1:
/usr/include/string.h:42:14: note: in a call to function ‘void* memcpy(void*, const void*, size_t)’ declared here
extern void *memcpy (void *__restrict __dest, const void *__restrict __src,
^~
[/out]
[Bug c/83373] New: False positive reported by -Wstringop-overflow
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83373 Bug ID: 83373 Summary: False positive reported by -Wstringop-overflow Product: gcc Version: 8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: --- When the code below is compiled, gcc incorrectly complains that memcpy will read data past the end of the buffer in the line marked with a star. It looks like gcc does not take into account that the 'if' above protects against this. The code was compiled using gcc (GCC) 8.0.0 20171210 (experimental).
[code]
#include <string.h>

char buf[100];

void get_data(char* ptr);

__attribute((nonnull(1, 2)))
inline char* my_strcpy(char* __restrict__ dst, const char* __restrict__ src, size_t size)
{
    size_t len = strlen(src);
    if (len < size)
        memcpy(dst, src, len + 1);
    else
    {
        memcpy(dst, src, size - 1); //*
        dst[size - 1] = '\0';
    }
    return dst;
}

void test()
{
    char data[20];
    get_data(data);
    my_strcpy(buf, data, sizeof(buf));
}
[/code]
[out]
$ g++ -c -o test.o test.cc -O2 -Wall
In function ‘char* my_strcpy(char*, const char*, size_t)’,
inlined from ‘void test()’ at test.cc:25:14:
test.cc:15:15: warning: ‘void* memcpy(void*, const void*, size_t)’ reading 99 bytes from a region of size 20 [-Wstringop-overflow=]
memcpy(dst, src, size - 1); //*
~~^~~~
[/out]
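A possible alternative formulation of the copy helper from Bug 83373, given as a sketch under assumptions: it uses strnlen (a POSIX function, assumed to be available via <string.h> on the target) so that a single memcpy copies exactly the measured length. The name bounded_copy is made up, and it has not been verified here whether this form avoids the -Wstringop-overflow false positive.

```cpp
#include <string.h>

// Copies at most size - 1 characters of src into dst and always
// NUL-terminates when size > 0. Hypothetical variant of my_strcpy.
char* bounded_copy(char* dst, const char* src, size_t size)
{
    if (size == 0)
        return dst;
    // strnlen stops scanning after size - 1 bytes, so len <= size - 1.
    const size_t len = strnlen(src, size - 1);
    memcpy(dst, src, len);
    dst[len] = '\0';
    return dst;
}
```

With an 8-byte destination, copying "hello world" leaves "hello w" in the buffer, while shorter strings are copied whole.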
[Bug middle-end/81914] [7/8 Regression] gcc 7.1 generates branch for code which was branchless in earlier gcc version
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81914 --- Comment #9 from Daniel Fruzynski --- In the meantime I found another case where gcc 7 inserts lots of jumps. I am not sure if your extra test cases cover it too:
#include <stdint.h>

int test(int data1[9][9], int data2[9][9])
{
    uint64_t b1 = 0, b2 = 0;
    for (int n = 0; n < 9; ++n)
    {
        for (int k = 0; k < 9; ++k)
        {
            int a = data1[n][k] * 9 + data2[n][k];
            (a < 64 ? b1 : b2) |= 1 << (a & 63);
        }
    }
    return __builtin_popcount(b1) + __builtin_popcount(b2);
}
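A side note on the test case in Comment #9, with a sketch of a variant: the original shifts the int constant 1, which is undefined for shift counts of 32 or more when int is 32 bits wide, and __builtin_popcount takes an unsigned int, so the upper half of each 64-bit word is not counted. The variant below uses a 64-bit shift, __builtin_popcountll, and an array index instead of the conditional lvalue. The function name count_distinct_codes is made up, the sketch assumes every combined value a lies in [0, 127], and no claim is made about how many branches gcc generates for it.

```cpp
#include <cstdint>

// Counts how many distinct values a = data1[n][k] * 9 + data2[n][k] occur.
// Assumption: 0 <= a <= 127 for every cell.
int count_distinct_codes(int data1[9][9], int data2[9][9])
{
    std::uint64_t bits[2] = {0, 0};
    for (int n = 0; n < 9; ++n)
    {
        for (int k = 0; k < 9; ++k)
        {
            const int a = data1[n][k] * 9 + data2[n][k];
            // Word 0 holds values 0..63, word 1 holds values 64..127.
            bits[a >= 64] |= std::uint64_t{1} << (a & 63);
        }
    }
    return __builtin_popcountll(bits[0]) + __builtin_popcountll(bits[1]);
}
```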
[Bug middle-end/83373] False positive reported by -Wstringop-overflow
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83373 --- Comment #4 from Daniel Fruzynski --- > Bug 83373 - False positive reported by -Wstringop-overflow, is > another example of warning triggered by a missed optimization > opportunity, this time in the strlen pass. The optimization > is discussed in pr78450 - strlen(s) return value can be assumed > to be less than the size of s. The gist of it is that the result > of strlen(array) can be assumed to be less than the size of > the array (except in the corner case of last struct members). This approach is not good from my perspective. I have structs used for IPC purposes, and I cannot reorder fields or add a new one at the end to silence this warning. Better approach would be to explicitly mark structure as flexible width with special attribute, and use this approach only for structures marked this way. As you wrote, this is a corner case, so requiring this attribute there can be accepted. This can also improve diagnostics in other cases, if you use similar approach for other warnings too.
[Bug middle-end/83373] False positive reported by -Wstringop-overflow
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83373 --- Comment #6 from Daniel Fruzynski --- My understanding is that after this patch is applied, gcc will still emit the warning for the last field in a struct, e.g. like in the code below. Is my understanding correct, or did I miss something?
struct Msg
{
    int op;
    char str1[100];
    char str2[100];
};
...
void func()
{
    Msg msg;
    msg.op = 5;
    char data1[20], data2[20];
    get_data(data1);
    get_data(data2);
    my_strcpy(msg.str1, data1, sizeof(msg.str1)); // OK, no warning
    my_strcpy(msg.str2, data2, sizeof(msg.str2)); // Warning still present
    send_msg(&msg, sizeof(msg));
}
[Bug middle-end/83373] False positive reported by -Wstringop-overflow
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83373 --- Comment #7 from Daniel Fruzynski --- In my case structures like Msg above are generated from IDL files together with code for serialization and deserialization. Because of this I cannot freely move or add new fields there, this may break compatibility.
[Bug middle-end/83373] False positive reported by -Wstringop-overflow
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83373 --- Comment #9 from Daniel Fruzynski --- Thanks for explanation. In addition to allocation on stack, my app also uses custom allocator function like below. So in this case it also should work as expected. void* msg_alloc(int msg_id); ... Msg* msg = (Msg*)msg_alloc(ID_OF_MSG); ... Anyway, this new attribute looks useful for me, it probably could allow better diagnostics and optimization. However treating all [sub]objects without this attribute as a fixed size may break some existing code, so extra command line switch to enable old (current) behavior also would be needed. All of this probably needs separate issue here to track it.
[Bug c++/83429] New: Incorrect line number reported by -Wformat-truncation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83429 Bug ID: 83429 Summary: Incorrect line number reported by -Wformat-truncation Product: gcc Version: 8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: ---
[code]
#include <stdio.h>

struct S
{
    char str1[10];
    char str2[10];
    char out[15];
};

void test(S* s) // line 10
{
    snprintf(s->out, sizeof(s->out), "%s.%s", s->str1, s->str2); // line 12
}
[/code]
When the above code is compiled using "g++ -c -o test.o test.cc -O2 -Wall", it produces the following output:
[out]
test.cc: In function ‘void test(S*)’:
test.cc:10:6: warning: ‘%s’ directive output may be truncated writing up to 9 bytes into a region of size between 5 and 14 [-Wformat-truncation=]
void test(S* s) // line 10
^~~~
test.cc:12:13: note: ‘snprintf’ output between 2 and 20 bytes into a destination of size 15
snprintf(s->out, sizeof(s->out), "%s.%s", s->str1, s->str2); // line 12
^~~
[/out]
As you can see, the line number in the "warning:" line is incorrect: it points to the line with the function name. Fortunately the correct number is in the "note:" line. However, when the code is compiled with -D_FORTIFY_SOURCE=1 added, you lose this important piece of information:
[out]
test.cc: In function ‘void test(S*)’:
test.cc:10:6: warning: ‘%s’ directive output may be truncated writing up to 9 bytes into a region of size between 5 and 14 [-Wformat-truncation=]
void test(S* s) // line 10
^~~~
In file included from /usr/include/stdio.h:937,
from test.cc:1:
/usr/include/bits/stdio2.h:64:35: note: ‘__builtin_snprintf’ output between 2 and 20 bytes into a destination of size 15
return __builtin___snprintf_chk (__s, __n, __USE_FORTIFY_LEVEL - 1,
~^~~
__bos (__s), __fmt, __va_arg_pack ());
~
[/out]
g++ --version
g++ (GCC) 8.0.0 20171210 (experimental)
[Bug c++/83430] New: buffer overflow diagnostics for snprintf is broken
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83430 Bug ID: 83430 Summary: buffer overflow diagnostics for snprintf is broken Product: gcc Version: 8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: ---
[code]
#include <stdio.h>

struct S
{
    char str[20];
    char out[15];
};

void test(S* s)
{
    snprintf(s->out, sizeof(s->str), "[%s]", s->str);
}
[/code]
[out]
$ g++ -c -o test.o test.cc -O2 -Wall
test.cc: In function ‘void test(S*)’:
test.cc:9:6: warning: ‘]’ directive output may be truncated writing 1 byte into a region of size between 0 and 19 [-Wformat-truncation=]
void test(S* s)
^~~~
test.cc:11:13: note: ‘snprintf’ output between 3 and 22 bytes into a destination of size 20
snprintf(s->out, sizeof(s->str), "[%s]", s->str);
^~~~
[/out]
There are two problems here:
- gcc does not detect that the actual size of out is 15 bytes, not 20;
- the code passes the size of one of the input arguments, which will be part of the output string, instead of the output buffer size.
Output for compilation with -D_FORTIFY_SOURCE=2 has the same problems.
g++ --version
g++ (GCC) 8.0.0 20171210 (experimental)
[Bug c++/83431] New: -Wformat-truncation may incorrectly report truncation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83431 Bug ID: 83431 Summary: -Wformat-truncation may incorrectly report truncation Product: gcc Version: 8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: --- This looks like another missed optimization: -Wformat-truncation does not take into account the "if" which checks that truncation will not happen. [code] #include <stdio.h> #include <string.h> struct S { char str[20]; char out[10]; }; void test(S* s) { if (strlen(s->str) < sizeof(s->out) - 2) snprintf(s->out, sizeof(s->out), "[%s]", s->str); } [/code] [out] $ g++ -c -o test.o test.cc -O2 -Wall test.cc: In function ‘void test(S*)’: test.cc:10:6: warning: ‘%s’ directive output may be truncated writing up to 19 bytes into a region of size 9 [-Wformat-truncation=] void test(S* s) ^~~~ test.cc:13:17: note: ‘snprintf’ output between 3 and 22 bytes into a destination of size 10 snprintf(s->out, sizeof(s->out), "[%s]", s->str); ^~~~ [/out] g++ --version g++ (GCC) 8.0.0 20171210 (experimental)
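One possible way to write the same check so that it does not depend on the strlen range being propagated: the GCC manual describes level 1 of -Wformat-truncation as warning only about bounded calls whose return value is unused. The sketch below checks the snprintf return value instead of calling strlen first. Whether the warning is actually silenced on a particular GCC version was not verified here, and the bool return type is an addition for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

struct S
{
    char str[20];
    char out[10];
};

// Returns true if "[str]" fit into s->out, false if it was truncated.
// snprintf returns the length the complete output would have had.
bool test(S* s)
{
    int n = std::snprintf(s->out, sizeof(s->out), "[%s]", s->str);
    return n >= 0 && static_cast<std::size_t>(n) < sizeof(s->out);
}
```

Unlike the original test case, this version still writes a truncated string into out when the input is too long; the caller has to look at the result.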
[Bug middle-end/59521] __builtin_expect not effective in switch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59521 Daniel Fruzynski changed: What |Removed |Added CC | |bugzilla@poradnik-webmastera.com --- Comment #15 from Daniel Fruzynski --- +1 for this, I wanted to request this today too. I see that a patch is ready; how is the review going?
[Bug c++/83429] Incorrect line number reported by -Wformat-truncation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83429 --- Comment #1 from Daniel Fruzynski --- Another test case; this time the "note:" with the argument range also points to the incorrect line: [code] #include <stdio.h> struct S { unsigned char n; char out[2]; }; void test(S* s) // line 9 { snprintf(s->out, sizeof(s->out), "%d", s->n); // line 11 } [/code] [out] test.cc: In function ‘void test(S*)’: test.cc:9:6: warning: ‘%d’ directive output may be truncated writing between 1 and 3 bytes into a region of size 2 [-Wformat-truncation=] void test(S* s) // line 9 ^~~~ test.cc:9:6: note: directive argument in the range [0, 255] test.cc:11:13: note: ‘snprintf’ output between 2 and 4 bytes into a destination of size 2 snprintf(s->out, sizeof(s->out), "%d", s->n); // line 11 ^~~~ [/out]
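The numbers in this warning can be reproduced by hand: an unsigned char in the range [0, 255] prints as 1 to 3 digits, and with the terminating NUL that is 2 to 4 bytes, so out would need 4 bytes to be safe. A small sketch, using a hypothetical helper name, that measures the same thing with snprintf and a null destination:

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper (not from the report): number of characters "%d"
// produces for n, not counting the terminating NUL. With a null buffer and
// size 0, snprintf writes nothing and only reports the required length.
int decimal_length(unsigned char n)
{
    return std::snprintf(nullptr, 0, "%d", n);
}
```

Calling it for 0 and 255 gives the lower and upper bounds of the "between 1 and 3 bytes" range quoted in the warning.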
[Bug c++/83429] Incorrect line number reported by -Wformat-truncation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83429 --- Comment #2 from Daniel Fruzynski --- Sometimes actual location is not reported at all: [code] #include #include struct S { char* str; int n; char out[10]; }; void test(S* s) { if (s->str) snprintf(s->out, sizeof(s->out), "%d", s->n); else snprintf(s->out, sizeof(s->out), ".%s", s->str); } [/code] [out] test.cc: In function ‘void test(S*)’: test.cc:11:6: warning: ‘%s’ directive argument is null [-Wformat-truncation=] void test(S* s) ^~~~ [/out]
[Bug c/83479] New: Register spilling in AVX code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83479 Bug ID: 83479 Summary: Register spilling in AVX code Product: gcc Version: 7.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: --- Here is snipped of code which performs some calculations on matrix. It repeatedly transforms some (N * N) matrix into (N-1 * N-1) one, and returns final scalar value. gcc for some reason is not able to detect that intermediate values are not needed anymore, and starts spilling. Code below is from gcc 7.2, trunk version also generates similar code. Code was compiled with "-O3 -march=haswell". BTW, clang 5 properly handles this and does not spill. [code] #include "immintrin.h" double test(const double data[9][8]) { __m256d vLastRow, vLastCol, vSqrtRow, vSqrtCol; __m256d v1 = _mm256_load_pd (&data[0][0]); __m256d v2 = _mm256_load_pd (&data[1][0]); __m256d v3 = _mm256_load_pd (&data[2][0]); __m256d v4 = _mm256_load_pd (&data[3][0]); __m256d v5 = _mm256_load_pd (&data[4][0]); __m256d v6 = _mm256_load_pd (&data[5][0]); __m256d v7 = _mm256_load_pd (&data[6][0]); __m256d v8 = _mm256_load_pd (&data[7][0]); // 8 vLastRow = _mm256_load_pd (&data[9][0]); vSqrtRow = _mm256_sqrt_pd(vLastRow); vLastCol = _mm256_set1_pd(vLastRow[0]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v1 = (v1 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm256_set1_pd(vLastRow[1]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v2 = (v2 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm256_set1_pd(vLastRow[2]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v3 = (v3 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm256_set1_pd(vLastRow[3]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v4 = (v4 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm256_set1_pd(vLastRow[4]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v5 = (v5 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm256_set1_pd(vLastRow[5]); 
vSqrtCol = _mm256_sqrt_pd(vLastCol); v6 = (v6 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm256_set1_pd(vLastRow[6]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v7 = (v7 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm256_set1_pd(vLastRow[7]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v8 = (v8 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; // 7 vLastRow = v8; vSqrtRow = _mm256_sqrt_pd(vLastRow); vLastCol = _mm256_set1_pd(vLastRow[0]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v1 = (v1 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm256_set1_pd(vLastRow[1]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v2 = (v2 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm256_set1_pd(vLastRow[2]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v3 = (v3 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm256_set1_pd(vLastRow[3]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v4 = (v4 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm256_set1_pd(vLastRow[4]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v5 = (v5 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm256_set1_pd(vLastRow[5]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v6 = (v6 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm256_set1_pd(vLastRow[6]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v7 = (v7 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; // 6 vLastRow = v7; vSqrtRow = _mm256_sqrt_pd(vLastRow); vLastCol = _mm256_set1_pd(vLastRow[0]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v1 = (v1 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm256_set1_pd(vLastRow[1]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v2 = (v2 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm256_set1_pd(vLastRow[2]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v3 = (v3 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm256_set1_pd(vLastRow[3]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v4 = (v4 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm256_set1_pd(vLastRow[4]); vSqrtCol = _mm256_sqrt_pd(vLastCol); 
v5 = (v5 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm256_set1_pd(vLastRow[5]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v6 = (v6 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; // 5 vLastRow = v6; vSqrtRow = _mm256_sqrt_pd(vLastRow); vLastCol = _mm256_set1_pd(vLastRow[0]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v1 = (v1 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm256_set1_pd(vLastRow[1]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v2 = (v2 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm256_set1_pd(vLastRow[2]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v3 = (v3 - vLastRow * vLastCol) * vSqrtRow * vSqrtC
[Bug c/83479] Register spilling in AVX code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83479 --- Comment #1 from Daniel Fruzynski --- Here is clang 5.0 output, it is also shorted than gcc one (213 lines, gcc produced 247). test(double const (*) [8]): # @test(double const (*) [8]) vmovapd ymm3, ymmword ptr [rdi + 64] vmovapd ymm4, ymmword ptr [rdi + 128] vmovapd ymm5, ymmword ptr [rdi + 192] vmovapd ymm6, ymmword ptr [rdi + 256] vmovapd ymm8, ymmword ptr [rdi + 320] vmovapd ymm2, ymmword ptr [rdi + 384] vmovapd ymm1, ymmword ptr [rdi + 576] vsqrtpd ymm0, ymm1 vmovupd ymmword ptr [rsp - 56], ymm0 # 32-byte Spill vpermpd ymm7, ymm1, 85 # ymm7 = ymm1[1,1,1,1] vsqrtpd ymm9, ymm7 vmovapd ymm10, ymmword ptr [rdi + 448] vmulpd ymm7, ymm1, ymm7 vsubpd ymm3, ymm3, ymm7 vmulpd ymm3, ymm0, ymm3 vpermpd ymm7, ymm1, 170 # ymm7 = ymm1[2,2,2,2] vsqrtpd ymm11, ymm7 vmulpd ymm9, ymm9, ymm3 vmulpd ymm3, ymm1, ymm7 vsubpd ymm3, ymm4, ymm3 vmulpd ymm3, ymm0, ymm3 vpermpd ymm4, ymm1, 255 # ymm4 = ymm1[3,3,3,3] vsqrtpd ymm7, ymm4 vmulpd ymm11, ymm3, ymm11 vmulpd ymm3, ymm1, ymm4 vsubpd ymm3, ymm5, ymm3 vmulpd ymm3, ymm0, ymm3 vmulpd ymm4, ymm3, ymm7 vsqrtpd ymm7, ymm0 vmulpd ymm3, ymm1, ymm0 vsubpd ymm5, ymm6, ymm3 vmulpd ymm5, ymm0, ymm5 vmulpd ymm5, ymm5, ymm7 vsubpd ymm6, ymm8, ymm3 vmulpd ymm6, ymm0, ymm6 vmulpd ymm6, ymm6, ymm7 vsubpd ymm2, ymm2, ymm3 vmulpd ymm2, ymm0, ymm2 vmulpd ymm8, ymm2, ymm7 vsubpd ymm2, ymm10, ymm3 vmulpd ymm2, ymm0, ymm2 vmulpd ymm3, ymm2, ymm7 vsqrtpd ymm2, ymm3 vpermpd ymm10, ymm3, 85 # ymm10 = ymm3[1,1,1,1] vsqrtpd ymm12, ymm10 vmulpd ymm10, ymm3, ymm10 vsubpd ymm9, ymm9, ymm10 vmulpd ymm9, ymm2, ymm9 vmulpd ymm9, ymm12, ymm9 vpermpd ymm10, ymm3, 170 # ymm10 = ymm3[2,2,2,2] vsqrtpd ymm12, ymm10 vmulpd ymm10, ymm3, ymm10 vsubpd ymm10, ymm11, ymm10 vmulpd ymm10, ymm2, ymm10 vmulpd ymm10, ymm12, ymm10 vpermpd ymm11, ymm3, 255 # ymm11 = ymm3[3,3,3,3] vsqrtpd ymm12, ymm11 vmulpd ymm11, ymm3, ymm11 vsubpd ymm4, ymm4, ymm11 vmulpd ymm4, ymm2, ymm4 vmulpd ymm11, ymm4, ymm12 vmulpd ymm4, ymm3, ymm0 
vsubpd ymm5, ymm5, ymm4 vmulpd ymm5, ymm2, ymm5 vmulpd ymm12, ymm7, ymm5 vsubpd ymm5, ymm6, ymm4 vmulpd ymm6, ymm2, ymm5 vsubpd ymm4, ymm8, ymm4 vmulpd ymm4, ymm2, ymm4 vmulpd ymm5, ymm7, ymm4 vsqrtpd ymm4, ymm5 vpermpd ymm8, ymm5, 85 # ymm8 = ymm5[1,1,1,1] vsqrtpd ymm13, ymm8 vmulpd ymm6, ymm7, ymm6 vmulpd ymm8, ymm5, ymm8 vsubpd ymm8, ymm9, ymm8 vmulpd ymm8, ymm4, ymm8 vpermpd ymm9, ymm5, 170 # ymm9 = ymm5[2,2,2,2] vsqrtpd ymm14, ymm9 vmulpd ymm13, ymm13, ymm8 vmulpd ymm8, ymm5, ymm9 vsubpd ymm8, ymm10, ymm8 vmulpd ymm8, ymm4, ymm8 vpermpd ymm9, ymm5, 255 # ymm9 = ymm5[3,3,3,3] vsqrtpd ymm10, ymm9 vmulpd ymm14, ymm8, ymm14 vmulpd ymm8, ymm5, ymm9 vsubpd ymm8, ymm11, ymm8 vmulpd ymm8, ymm4, ymm8 vmulpd ymm9, ymm8, ymm10 vmulpd ymm8, ymm5, ymm0 vsubpd ymm10, ymm12, ymm8 vmulpd ymm10, ymm4, ymm10 vmulpd ymm10, ymm7, ymm10 vsubpd ymm6, ymm6, ymm8 vmulpd ymm6, ymm4, ymm6 vmulpd ymm8, ymm7, ymm6 vsqrtpd ymm6, ymm8 vpermpd ymm11, ymm8, 85 # ymm11 = ymm8[1,1,1,1] vsqrtpd ymm12, ymm11 vmulpd ymm11, ymm8, ymm11 vsubpd ymm11, ymm13, ymm11 vmulpd ymm11, ymm6, ymm11 vmulpd ymm11, ymm11, ymm12 vpermpd ymm12, ymm8, 170 # ymm12 = ymm8[2,2,2,2] vsqrtpd ymm13, ymm12 vmulpd ymm12, ymm8, ymm12 vsubpd ymm12, ymm14, ymm12 vmulpd ymm12, ymm6, ymm12 vmulpd ymm12, ymm12, ymm13 vpermpd ymm13, ymm8, 255 # ymm13 = ymm8[3,3,3,3] vsqrtpd ymm14, ymm13 vmulpd ymm13, ymm8, ymm13 vsubpd ymm13, ymm9, ymm13 vmulpd ymm9, ymm8, ymm0 vsubpd ymm9, ymm10, ymm9 vmulpd ymm9, ymm9, ymm6 vmulpd ymm9, ymm7, ymm9 vsqrtpd ymm7, ymm9 vpermpd ymm10, ymm9, 85 # ymm10 = ymm9[1,1,1,1] vsqrtpd ymm15, ymm10 vmulpd ymm13, ymm6, ymm13 vmulpd ymm13, ymm13, ymm14 vmulpd ymm10, ymm9, ymm10 vsubpd ymm10, ymm11, ymm10 vpermpd ymm11, ymm9, 170 # ymm11 = ymm9[2,2,2,2] vsqrtpd ymm14, ymm11 vmulpd ymm10, ymm10, ymm7 vmulpd ymm15, ymm10, ymm15 vmulpd ymm10, ymm9, ymm11 vsubpd ymm10, ymm12, ymm10 vpermpd ymm11, ymm9, 255 # ymm11 = ymm9[3,3,3,3] vsqrtpd ymm12, ymm11 vmulpd ymm0, ymm10, ymm7 vmulpd ymm10, ymm9, ymm11 vsubpd ymm10, 
ymm13, ymm10 vmulpd ymm10, ymm7, ymm10 vmulpd ymm11, ymm10, ymm12 vsqrtpd ymm10, ymm11 vpermpd ymm12, ymm11, 85 # ymm12 = ymm11[1,1,1,1] vsqrtpd ymm13, ymm12 vmulpd ymm0, ymm0, ymm14 vmulpd ymm12, ymm11, ymm12 vsubpd ymm12, ymm15, ymm12 vmulpd ymm12, ymm10, ymm12 vpermpd ymm14, ymm11, 170 # ymm14 = ymm11[2,2,2,2] vsqrtpd ymm15, ymm14 vmulpd ymm12, ymm13, ymm12 vmulpd ymm13, ymm11, ymm14 vsubpd ymm0, ymm0, ymm13 vmulpd ymm0, ymm10, ymm0 vmulpd ymm13, ymm15, ymm0 vsqrtpd ymm0, ymm13 vmovupd ymmword ptr [rsp - 88], ymm0 # 32-byte Spill vpermpd ymm14, ymm13, 85 # ymm14 = ymm13[1,1,1,1] vsqrtpd ymm15, ymm14 vmulpd ymm14, ymm13, ymm14 vsubpd ymm12, ymm12, ymm14 vmulpd ymm12, ymm0, y
[Bug target/83479] Register spilling in AVX code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83479 --- Comment #4 from Daniel Fruzynski --- Rule No.1: never log bugs before morning coffee ;) This does not produce warnings, compiled with "-O3 -march=haswell -mavx512f -mavx512vl -mavx512bw -mavx512dq -mavx512cd -Wall -Werror". [code] #include "immintrin.h" double test(const double data[9][8]) { __m512d vLastRow, vLastCol, vSqrtRow, vSqrtCol; __m512d v1 = _mm512_load_pd (&data[0][0]); __m512d v2 = _mm512_load_pd (&data[1][0]); __m512d v3 = _mm512_load_pd (&data[2][0]); __m512d v4 = _mm512_load_pd (&data[3][0]); __m512d v5 = _mm512_load_pd (&data[4][0]); __m512d v6 = _mm512_load_pd (&data[5][0]); __m512d v7 = _mm512_load_pd (&data[6][0]); __m512d v8 = _mm512_load_pd (&data[7][0]); // 8 vLastRow = _mm512_load_pd (&data[9][0]); vSqrtRow = _mm512_sqrt_pd(vLastRow); vLastCol = _mm512_set1_pd(vLastRow[0]); vSqrtCol = _mm512_sqrt_pd(vLastCol); v1 = (v1 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm512_set1_pd(vLastRow[1]); vSqrtCol = _mm512_sqrt_pd(vLastCol); v2 = (v2 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm512_set1_pd(vLastRow[2]); vSqrtCol = _mm512_sqrt_pd(vLastCol); v3 = (v3 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm512_set1_pd(vLastRow[3]); vSqrtCol = _mm512_sqrt_pd(vLastCol); v4 = (v4 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm512_set1_pd(vLastRow[4]); vSqrtCol = _mm512_sqrt_pd(vLastCol); v5 = (v5 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm512_set1_pd(vLastRow[5]); vSqrtCol = _mm512_sqrt_pd(vLastCol); v6 = (v6 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm512_set1_pd(vLastRow[6]); vSqrtCol = _mm512_sqrt_pd(vLastCol); v7 = (v7 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm512_set1_pd(vLastRow[7]); vSqrtCol = _mm512_sqrt_pd(vLastCol); v8 = (v8 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; // 7 vLastRow = v8; vSqrtRow = _mm512_sqrt_pd(vLastRow); vLastCol = _mm512_set1_pd(vLastRow[0]); vSqrtCol = 
_mm512_sqrt_pd(vLastCol); v1 = (v1 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm512_set1_pd(vLastRow[1]); vSqrtCol = _mm512_sqrt_pd(vLastCol); v2 = (v2 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm512_set1_pd(vLastRow[2]); vSqrtCol = _mm512_sqrt_pd(vLastCol); v3 = (v3 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm512_set1_pd(vLastRow[3]); vSqrtCol = _mm512_sqrt_pd(vLastCol); v4 = (v4 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm512_set1_pd(vLastRow[4]); vSqrtCol = _mm512_sqrt_pd(vLastCol); v5 = (v5 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm512_set1_pd(vLastRow[5]); vSqrtCol = _mm512_sqrt_pd(vLastCol); v6 = (v6 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm512_set1_pd(vLastRow[6]); vSqrtCol = _mm512_sqrt_pd(vLastCol); v7 = (v7 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; // 6 vLastRow = v7; vSqrtRow = _mm512_sqrt_pd(vLastRow); vLastCol = _mm512_set1_pd(vLastRow[0]); vSqrtCol = _mm512_sqrt_pd(vLastCol); v1 = (v1 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm512_set1_pd(vLastRow[1]); vSqrtCol = _mm512_sqrt_pd(vLastCol); v2 = (v2 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm512_set1_pd(vLastRow[2]); vSqrtCol = _mm512_sqrt_pd(vLastCol); v3 = (v3 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm512_set1_pd(vLastRow[3]); vSqrtCol = _mm512_sqrt_pd(vLastCol); v4 = (v4 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm512_set1_pd(vLastRow[4]); vSqrtCol = _mm512_sqrt_pd(vLastCol); v5 = (v5 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm512_set1_pd(vLastRow[5]); vSqrtCol = _mm512_sqrt_pd(vLastCol); v6 = (v6 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; // 5 vLastRow = v6; vSqrtRow = _mm512_sqrt_pd(vLastRow); vLastCol = _mm512_set1_pd(vLastRow[0]); vSqrtCol = _mm512_sqrt_pd(vLastCol); v1 = (v1 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm512_set1_pd(vLastRow[1]); vSqrtCol = _mm512_sqrt_pd(vLastCol); v2 = (v2 - 
vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm512_set1_pd(vLastRow[2]); vSqrtCol = _mm512_sqrt_pd(vLastCol); v3 = (v3 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm512_set1_pd(vLastRow[3]); vSqrtCol = _mm512_sqrt_pd(vLastCol); v4 = (v4 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm512_set1_pd(vLastRow[4]); vSqrtCol = _mm512_sqrt_pd(vLastCol); v5 = (v5 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; // 4 vLastRow = v5; vSqrtRow = _mm512_sqrt_pd(vLastRow); vLastCol = _mm512_set1_pd(vLastRow[0]); vSqrtCol = _mm512_sqrt_pd(vLastCol); v1 = (v1 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm512_set1_pd(vLastRow[1]); vSqrtCol = _mm512_sqrt_pd(vLastCol); v2 = (v2 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm512_s
[Bug target/83479] Register spilling in AVX code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83479 --- Comment #5 from Daniel Fruzynski --- Here is also valid AVX version, it also spills a bit. Compiled with "-O3 -march=haswell -Wall -Werror". [code] #include "immintrin.h" double test(const double data[5][4]) { __m256d vLastRow, vLastCol, vSqrtRow, vSqrtCol; __m256d v1 = _mm256_load_pd (&data[0][0]); __m256d v2 = _mm256_load_pd (&data[1][0]); __m256d v3 = _mm256_load_pd (&data[2][0]); __m256d v4 = _mm256_load_pd (&data[3][0]); // 4 vLastRow = _mm256_load_pd (&data[4][0]); vSqrtRow = _mm256_sqrt_pd(vLastRow); vLastCol = _mm256_set1_pd(vLastRow[0]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v1 = (v1 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm256_set1_pd(vLastRow[1]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v2 = (v2 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm256_set1_pd(vLastRow[2]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v3 = (v3 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm256_set1_pd(vLastRow[3]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v4 = (v4 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; // 3 vLastRow = v4; vSqrtRow = _mm256_sqrt_pd(vLastRow); vLastCol = _mm256_set1_pd(vLastRow[0]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v1 = (v1 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm256_set1_pd(vLastRow[1]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v2 = (v2 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm256_set1_pd(vLastRow[2]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v3 = (v3 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; // 2 vLastRow = v3; vSqrtRow = _mm256_sqrt_pd(vLastRow); vLastCol = _mm256_set1_pd(vLastRow[0]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v1 = (v1 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; vLastCol = _mm256_set1_pd(vLastRow[1]); vSqrtCol = _mm256_sqrt_pd(vLastCol); v2 = (v2 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; // 1 vLastRow = v2; vSqrtRow = _mm256_sqrt_pd(vLastRow); vLastCol = _mm256_set1_pd(vLastRow[0]); vSqrtCol = 
_mm256_sqrt_pd(vLastCol); v1 = (v1 - vLastRow * vLastCol) * vSqrtRow * vSqrtCol; return v1[0]; } [/code]
[Bug target/83479] Register spilling in AVX code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83479 --- Comment #6 from Daniel Fruzynski --- One correction: in comment #4, line 17 has an incorrect index; it should be 8 instead of 9. For some reason gcc did not complain here. vLastRow = _mm512_load_pd (&data[8][0]);
[Bug middle-end/81914] [7 Regression] gcc 7.1 generates branch for code which was branchless in earlier gcc version
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81914 --- Comment #12 from Daniel Fruzynski --- One more test case. Code compiled with TEST defined is branchless; without it, the code has a branch. [code] #include <stdint.h> #define TEST void test(uint64_t* a) { uint64_t n = *a / 8; if (0 == n) n = 1; #ifdef TEST *a += n; #else *a += 1 << n; #endif } [/code]
[Bug c/83610] New: __builtin_expect sometimes is ignored
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83610 Bug ID: 83610 Summary: __builtin_expect sometimes is ignored Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: --- [code] void f1(); void f2(); void test(int a, int b, int c, int d, int n, int k) { int val = a & b; if (__builtin_expect(!!(n == k), 0)) val &= c; if (__builtin_expect(!!(n == 10 - k), 0)) val &= d; if (val) f1(); else f2(); } [/code] Compiled with gcc 4.8.5, this code generates branches as expected: [asm] test(int, int, int, int, int, int): and edi, esi cmp r8d, r9d je .L6 .L2: mov eax, 10 sub eax, r9d cmp r8d, eax je .L7 .L3: test edi, edi jne .L8 jmp f2() .L8: jmp f1() .L7: and edi, ecx jmp .L3 .L6: and edi, edx jmp .L2 [/asm] When this code is compiled with gcc 4.9.0 or higher, it generates branchless code like the one below. In my case it is slower than the version with branches. I wanted to convince the compiler to generate the branching version by using __builtin_expect, but for some reason it does not work. [asm] test(int, int, int, int, int, int): and esi, edi mov eax, 10 and edx, esi cmp r8d, r9d cmove esi, edx sub eax, r9d and ecx, esi cmp r8d, eax cmove esi, ecx test esi, esi jne .L6 jmp f2() .L6: jmp f1() [/asm]
[Bug c/83610] __builtin_expect sometimes is ignored
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83610 --- Comment #1 from Daniel Fruzynski --- The code was compiled with "-O3 -march=core2 -mtune=generic".
[Bug middle-end/83610] __builtin_expect sometimes is ignored
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83610 --- Comment #3 from Daniel Fruzynski --- Created attachment 42980 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42980&action=edit Benchmark Here is a benchmark for this case. With unlikely(), execution time decreases from 20.5 sec to 20.3 sec, about 1%. For the same change in my real app, the gain was a bit more than 2%. Thanks for the information about this parameter, I will give it a try. So far I have noticed that gcc uses CMOV when values are stored in registers. When they are in memory as class fields, it generates code with branches. I am still playing with this code, so maybe I will need it later. BTW, what do you think about adding a 3rd parameter to __builtin_expect, which would specify the probability? It may be helpful in cases like mine.
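For reference, later GCC releases (GCC 9 and newer, as far as the GCC manual documents it) provide __builtin_expect_with_probability(exp, c, probability), which is close to the three-parameter form suggested above. A minimal sketch under that assumption; the macro EXPECT_FALSE_P and the function combine are hypothetical names, and the 0.99 probability is only an illustrative value, not a measurement from the benchmark.

```cpp
#include <cassert>

// Assumption: the compiler either supports __has_builtin and the
// probability builtin, or at least plain __builtin_expect (GCC/clang).
#if defined(__has_builtin)
#  if __has_builtin(__builtin_expect_with_probability)
#    define HAVE_EXPECT_WITH_PROBABILITY 1
#  endif
#endif

// p is the probability (compile-time constant in [0.0, 1.0]) that cond is false.
#ifdef HAVE_EXPECT_WITH_PROBABILITY
#  define EXPECT_FALSE_P(cond, p) __builtin_expect_with_probability(!!(cond), 0, (p))
#else
#  define EXPECT_FALSE_P(cond, p) __builtin_expect(!!(cond), 0)
#endif

// Same logic as the test case in this bug, returning val instead of
// calling f1()/f2() so that it can be checked in isolation.
int combine(int a, int b, int c, int d, int n, int k)
{
    int val = a & b;
    if (EXPECT_FALSE_P(n == k, 0.99))
        val &= c;
    if (EXPECT_FALSE_P(n == 10 - k, 0.99))
        val &= d;
    return val;
}
```

The hint only influences the optimizer's branch probabilities; it does not guarantee that a branch is emitted instead of CMOV.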
[Bug target/81759] Improve data tracking for _pext_u64 and __builtin_ffsll
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81759 --- Comment #2 from Daniel Fruzynski --- It looks like __builtin_ffs does not check at all whether the input value is known to be nonzero. The assembly generated for the following code also has unnecessary instructions: [code] unsigned int test(unsigned int n) { if (n == 0) __builtin_unreachable(); return __builtin_ffs(n) - 1; } [/code]
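When the input is known to be nonzero, a possible source-level workaround is __builtin_ctz, which returns the same value as __builtin_ffs(n) - 1 for every nonzero n and has no zero case to handle. A sketch with a hypothetical function name; whether it produces better code on a given target was not checked here.

```cpp
#include <cassert>

// Assumes n != 0: __builtin_ctz has an undefined result for a zero argument.
// For nonzero n it equals __builtin_ffs(n) - 1 (index of the lowest set bit).
unsigned int lowest_set_bit_index(unsigned int n)
{
    return static_cast<unsigned int>(__builtin_ctz(n));
}
```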
[Bug c/83634] New: ICE in useless_type_conversion_p, at gimple-expr.c:86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83634 Bug ID: 83634 Summary: ICE in useless_type_conversion_p, at gimple-expr.c:86 Product: gcc Version: 8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: --- [code] void test(unsigned int* ptr) { const int foo = Foo(); const int bar = Bar(); unsigned short n; for (n = foo; n < 100; n += bar) {} } [/code] I was playing with Compiler Explorer (https://godbolt.org/) and got this ICE when compiling code above. g++ version reported by Compiler Explorer is g++ 8.0.0 20171230. [x86-64 gcc (trunk) #1] internal compiler error: tree check: expected class 'type', have 'exceptional' (error_mark) in useless_type_conversion_p, at gimple-expr.c:86
[Bug c/83634] ICE in useless_type_conversion_p, at gimple-expr.c:86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83634 --- Comment #1 from Daniel Fruzynski --- A bit simpler test case which triggers this ICE: [code] void test() { const int foo = Foo(); short n; for (n = foo; n < 100; ++n) {} } [/code]
[Bug c/83671] New: Fix for false positive reported by -Wstringop-overflow does not work with inlining
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83671 Bug ID: 83671 Summary: Fix for false positive reported by -Wstringop-overflow does not work with inlining Product: gcc Version: 8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: --- Fix for bug 83373 does not work well with inlining: [code] #include #include char dest[20]; char src[10]; __attribute((nonnull(1, 2))) inline char* my_strcpy(char* __restrict__ dst, const char* __restrict__ src, size_t size) { size_t len = strlen(src); if (len < size) memcpy(dst, src, len + 1); else { memcpy(dst, src, size - 1); dst[size - 1] = '\0'; } return dst; } inline void func1() { my_strcpy(dest, src, sizeof(dest)); } void func2() { func1(); } [/code] [out] $ g++ -c -o test.o test.cc -Wall -Wstringop-overflow=2 -O1 In function ‘char* my_strcpy(char*, const char*, size_t)’, inlined from ‘void func2()’ at test.cc:23:14: test.cc:15:15: warning: ‘void* memcpy(void*, const void*, size_t)’ forming offset [11, 19] is out of the bounds [0, 10] of object ‘src’ with type ‘char [10]’ [-Warray-bounds] memcpy(dst, src, size - 1); ~~^~~~ test.cc: In function ‘void func2()’: test.cc:5:6: note: ‘src’ declared here char src[10]; ^~~ In function ‘char* my_strcpy(char*, const char*, size_t)’, inlined from ‘void func2()’ at test.cc:23:14: test.cc:15:15: warning: ‘void* memcpy(void*, const void*, size_t)’ reading 19 bytes from a region of size 10 [-Wstringop-overflow=] memcpy(dst, src, size - 1); ~~^~~~ $ gcc --version gcc (GCC) 8.0.0 20171231 (experimental) [/out]
[Bug target/82915] Please mark intrinsics as constexpr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82915 --- Comment #2 from Daniel Fruzynski --- SIMD ISAs for other CPU types (e.g. ARM/AARCH64 NEON) can also benefit from this.
[Bug target/82915] Please mark intrinsics as constexpr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82915 --- Comment #4 from Daniel Fruzynski --- For tracking purposes it would probably be better to have separate issues for every CPU type which could benefit from this. So this one could be for x86, and you could open other requests for other CPUs which support SIMD instructions.
[Bug c/83688] New: Please check if buffers may overlap when copying strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83688 Bug ID: 83688 Summary: Please check if buffers may overlap when copying strings Product: gcc Version: 8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: --- Functions like strcpy internally use memcpy to copy data. This may cause problems when someone tries to use them to move a string within a buffer, e.g. to strip a prefix. gcc is able to detect when overlapping buffers are used with memcpy. Please add similar diagnostics for the strcpy/sprintf functions too. [code] #include <stdio.h> #include <string.h> char buf[20]; void test() { strcpy(buf, buf+5); memcpy(buf, buf+5, strlen(buf+5)+1); snprintf(buf, sizeof(buf), "%s", buf+5); memcpy(buf, buf+5, 10); } [/code] [out] $ g++ -c -o test.o test.cc -O3 -Wall -Wextra -Wformat-overflow -Wformat-truncation -Wstringop-overflow=2 -Wstringop-truncation test.cc: In function ‘void test()’: test.cc:13:11: warning: ‘void* memcpy(void*, const void*, size_t)’ accessing 10 bytes at offsets 0 and 5 overlaps 5 bytes at offset 5 [-Wrestrict] memcpy(buf, buf+5, 10); ~~^~~~ $ g++ --version g++ (GCC) 8.0.0 20171231 (experimental) [/out]
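Independently of the requested diagnostic, the prefix-stripping use case can be written with memmove, which is specified to work for overlapping source and destination. A minimal sketch with a hypothetical helper name:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not from the report): removes the first `count`
// characters of the NUL-terminated string in buf, in place. memmove is
// defined for overlapping regions, unlike strcpy, memcpy and snprintf.
void strip_prefix(char* buf, std::size_t count)
{
    std::size_t len = std::strlen(buf);
    if (count > len)
        count = len;  // stripping more than the whole string leaves ""
    std::memmove(buf, buf + count, len - count + 1);  // +1 copies the NUL
}
```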
[Bug c/83688] Please check if buffers may overlap when copying strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83688 --- Comment #1 from Daniel Fruzynski --- This would also make it possible to catch code which uses sprintf to concatenate strings, which is undefined behavior (snippet from https://linux.die.net/man/3/snprintf): sprintf(buf, "%s some further text", buf);
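For comparison, a well-defined way to append to a buffer is to print at the current end of the string, so that the destination is never also an input argument. A sketch with a hypothetical helper name; it assumes buf holds a NUL-terminated string within its capacity and that text does not point into buf.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical helper (not from the report): appends `text` to the string
// in buf (capacity `size`), truncating if needed. Returns true if all of
// `text` fit.
bool append_text(char* buf, std::size_t size, const char* text)
{
    std::size_t used = std::strlen(buf);
    if (used >= size)
        return false;  // buf is not a valid string of this capacity
    int n = std::snprintf(buf + used, size - used, "%s", text);
    return n >= 0 && static_cast<std::size_t>(n) < size - used;
}
```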
[Bug c/83688] Please check if buffers may overlap when copying strings using sprintf
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83688 --- Comment #3 from Daniel Fruzynski --- It looks like something is not working properly. I pasted the output from compiling the function in the first post, and -Wrestrict complained only about the last memcpy call. Please take a look at this. BTW, string concatenation using sprintf triggers a -Wformat-overflow warning, so some protection against this is present. However, that message does not say that this is undefined behavior per the C standard.
[Bug c/83688] Please check if buffers may overlap when copying strings using sprintf
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83688 --- Comment #5 from Daniel Fruzynski --- > There is nothing to indicate that the first call to memcpy() in comment #0 > overlaps so -Wrestrict doesn't warn for it. I thought that the fix for bug 83373 would somehow help here. gcc could guess that memcpy will copy from 1 to 15 bytes, which may overlap the destination. In fact this could help in all cases here except the last memcpy.
[Bug c/83688] Please check if buffers may overlap when copying strings using sprintf
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83688 --- Comment #7 from Daniel Fruzynski --- In the general case yes, this can produce a lot of false positives. I wanted to use this only for strings stored in fixed-size buffers. Existing string-related warnings already use this information, and this request is to extend the diagnostics to other related cases where strings in fixed-size buffers are processed.
[Bug preprocessor/83773] New: Warning for redefined macro does not have its own -Wsomething switch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83773 Bug ID: 83773 Summary: Warning for redefined macro does not have its own -Wsomething switch Product: gcc Version: 8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: preprocessor Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: --- Warning for redefined macro does not have its own -Wsomething switch, please add one. I also tried to use -fdiagnostics-show-option but it did not help. [code] #define AAA 1 #define AAA 2 [/code] [out] test.c:2: warning: "AAA" redefined #define AAA 2 test.c:1: note: this is the location of the previous definition #define AAA 1 [/out]
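Until such a switch exists, code that redefines a macro on purpose can avoid the warning at the source level by removing the old definition first. A minimal sketch; the function current_aaa is only there to make the result observable.

```cpp
#include <cassert>

#define AAA 1

// Dropping the previous definition explicitly makes the redefinition
// intentional, so there is no conflicting definition left to warn about.
#undef AAA
#define AAA 2

int current_aaa()
{
    return AAA;
}
```

This does not help when the two definitions come from headers that cannot be edited, which is the case where a dedicated -W switch would matter.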
[Bug c/83859] New: Please add new attribute which will establish relation between parameters for buffer and its size
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83859 Bug ID: 83859 Summary: Please add new attribute which will establish relation between parameters for buffer and its size Product: gcc Version: 8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: --- gcc can detect if the buffer size passed to a function like strncpy is incorrect, e.g. when it is the sizeof of a pointer. It would be good to have similar diagnostics for custom functions that also accept a buffer and its size. Please add a new function attribute which would allow this, and appropriate diagnostics which use it. I propose adding the following attribute with two parameters: the indices of the buffer and its size arguments. Note that a function may accept multiple such pairs, so it should be possible to use this attribute multiple times. __attribute__((buffer_size(1, 2))) void foo(char* dst, size_t dstsize);
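For reference, later GCC releases (GCC 10 and newer, according to the GCC manual) document an access function attribute that ties a pointer parameter to a size parameter in a similar way. The sketch below wraps it in a hypothetical BUFFER_SIZE macro named after the proposal; copy_string is an illustrative function, and no claim is made here about which of the requested diagnostics the attribute actually enables.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Assumption: GCC 10+ provides __attribute__((access(mode, ptr, size))).
// Compilers without it get an empty macro, so the code still builds.
#if defined(__has_attribute)
#  if __has_attribute(access)
#    define BUFFER_SIZE(buf_index, size_index) \
         __attribute__((access(write_only, buf_index, size_index)))
#  endif
#endif
#ifndef BUFFER_SIZE
#  define BUFFER_SIZE(buf_index, size_index)
#endif

// Hypothetical function: copies src into dst (capacity dstsize), always
// NUL-terminating. Returns true if src fit without truncation.
BUFFER_SIZE(1, 2)
bool copy_string(char* dst, std::size_t dstsize, const char* src)
{
    if (dstsize == 0)
        return false;
    std::size_t len = std::strlen(src);
    std::size_t n = len < dstsize ? len : dstsize - 1;
    std::memcpy(dst, src, n);
    dst[n] = '\0';
    return len < dstsize;
}
```

As in the proposal, the two numbers are 1-based parameter indices: the buffer first, then its size.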
[Bug c/84085] New: Array element is unnecessary loaded twice
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84085 Bug ID: 84085 Summary: Array element is unnecessary loaded twice Product: gcc Version: 7.3.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: --- [code] #define N 9 struct S1 { int a1[N][N]; }; struct S2 { int a2[N][N]; int a3[N][N]; }; void test1(S1* s1, S2* s2) { s2->a2[N-1][N-1] = s1->a1[N-1][N-1]; s2->a3[N-1][N-1] = 1u << s1->a1[N-1][N-1]; } void test2(S1* s1, S2* s2) { const int n = N*N-1; *((&s2->a2[0][0] + n)) = *(&s1->a1[0][0] + n); *((&s2->a3[0][0] + n)) = 1u << *(&s1->a1[0][0] + n); } void test3(S1* s1, S2* s2) { const int n = N*N-1; int x = *(&s1->a1[0][0] + n); *((&s2->a2[0][0] + n)) = x; *((&s2->a3[0][0] + n)) = 1u << x; } [/code] [out] test1(S1*, S2*): mov ecx, DWORD PTR [rdi+320] mov eax, 1 sal eax, cl mov DWORD PTR [rsi+320], ecx mov DWORD PTR [rsi+644], eax ret test2(S1*, S2*): mov eax, DWORD PTR [rdi+320] mov DWORD PTR [rsi+320], eax mov ecx, DWORD PTR [rdi+320] mov eax, 1 sal eax, cl mov DWORD PTR [rsi+644], eax ret test3(S1*, S2*): mov ecx, DWORD PTR [rdi+320] mov eax, 1 sal eax, cl mov DWORD PTR [rsi+320], ecx mov DWORD PTR [rsi+644], eax ret [/out] All 3 functions are equivalent. However when 2D array is treated as a 1D one, gcc for some reason loads array element twice (function test2). Local variable added in test3 allows to get the same code as for test1. I have found this during writing code for AARCH64, but x86_64 is also affected. gcc 8 (trunk) does not have this problem.
[Bug c/84106] New: gcc is not able to vectorize code for 1D array, but does so for 2D array of the same size
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84106 Bug ID: 84106 Summary: gcc is not able to vectorize code for 1D array, but does so for 2D array of the same size Product: gcc Version: 8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: --- [code] #define N 9 int a1[N][N]; int a2[N][N]; int b1[N*N]; int b2[N*N]; void test1() { for (int i = 0; i < N; ++i) { for (int j = 0; j < N; ++j) { a2[i][j] = a1[i][j]; } } } void test2() { for (int i = 0; i < N*N; ++i) { b2[i] = b1[i]; } } [/code] Compiled using gcc 8.0 (trunk) with "-O3 -mavx2", this code produces the following result. For some reason gcc is not able to vectorize the code of the test2 function. I also tried adding "__attribute__((aligned(32)))" to all arrays, but it did not help. Similar code is also generated when compiling with "-O3 -mavx512f -mavx512vl -mavx512bw -mavx512dq -mavx512cd": gcc still generates code that uses YMM registers instead of ZMM ones.
[out] test1(): vmovdqa ymm0, YMMWORD PTR a1[rip] vmovdqa ymm1, YMMWORD PTR a1[rip+32] vmovdqa ymm2, YMMWORD PTR a1[rip+64] vmovdqa ymm3, YMMWORD PTR a1[rip+96] vmovdqa YMMWORD PTR a2[rip], ymm0 vmovdqa ymm4, YMMWORD PTR a1[rip+128] vmovdqa ymm5, YMMWORD PTR a1[rip+160] vmovdqa YMMWORD PTR a2[rip+32], ymm1 vmovdqa ymm6, YMMWORD PTR a1[rip+192] vmovdqa ymm7, YMMWORD PTR a1[rip+224] vmovdqa ymm0, YMMWORD PTR a1[rip+256] vmovdqa ymm1, YMMWORD PTR a1[rip+288] vmovdqa YMMWORD PTR a2[rip+64], ymm2 mov eax, DWORD PTR a1[rip+320] vmovdqa YMMWORD PTR a2[rip+96], ymm3 vmovdqa YMMWORD PTR a2[rip+128], ymm4 vmovdqa YMMWORD PTR a2[rip+160], ymm5 vmovdqa YMMWORD PTR a2[rip+192], ymm6 vmovdqa YMMWORD PTR a2[rip+224], ymm7 vmovdqa YMMWORD PTR a2[rip+256], ymm0 vmovdqa YMMWORD PTR a2[rip+288], ymm1 mov DWORD PTR a2[rip+320], eax vzeroupper ret test2(): mov esi, OFFSET FLAT:b1 mov edi, OFFSET FLAT:b2 mov ecx, 40 rep movsq mov eax, DWORD PTR [rsi] mov DWORD PTR [rdi], eax ret b2: .zero 324 b1: .zero 324 a2: .zero 324 a1: .zero 324 [/out]
[Bug tree-optimization/84106] gcc is not able to vectorize code for 1D array, but does so for 2D array of the same size
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84106 --- Comment #2 from Daniel Fruzynski --- Test included in comment 0 is part of bigger test which I performed. In full version code was also computing bitmask and stored in 3rd array. For test1 gcc was able to vectorize inner loop to series of load-shift-store-store operations. In test2 it separated loops into two - 1st one performing memcpy using "rep movsq", 2nd one calculating bitmasks using vector instructions. Here is full code and output: [code] #include #define N 9 int a1[N][N]; int a2[N][N]; int a3[N][N]; int b1[N*N]; int b2[N*N]; int b3[N*N]; void test1() { for (int i = 0; i < N; ++i) { for (int j = 0; j < N; ++j) { a2[i][j] = a1[i][j]; a3[i][j] = 1u << (uint8_t)a1[i][j]; } } } void test2() { for (int i = 0; i < N*N; ++i) { b2[i] = b1[i]; b3[i] = 1u << b1[i]; } } [/code] [out] test1(): vmovdqa ymm0, YMMWORD PTR .LC0[rip] vpsllvd ymm1, ymm0, YMMWORD PTR a1[rip] mov eax, 1 vmovdqa ymm5, YMMWORD PTR a1[rip+96] vmovdqa ymm6, YMMWORD PTR a1[rip+128] vmovdqa ymm7, YMMWORD PTR a1[rip+160] vmovdqa ymm2, YMMWORD PTR a1[rip] vmovdqa YMMWORD PTR a3[rip], ymm1 vpsllvd ymm1, ymm0, YMMWORD PTR a1[rip+32] vmovdqa ymm3, YMMWORD PTR a1[rip+32] vmovdqa YMMWORD PTR a2[rip], ymm2 vmovdqa ymm2, YMMWORD PTR a1[rip+192] vmovdqa ymm4, YMMWORD PTR a1[rip+64] vmovdqa YMMWORD PTR a2[rip+32], ymm3 vmovdqa ymm3, YMMWORD PTR a1[rip+224] vmovdqa YMMWORD PTR a3[rip+32], ymm1 vpsllvd ymm1, ymm0, YMMWORD PTR a1[rip+64] vmovdqa YMMWORD PTR a2[rip+64], ymm4 vmovdqa ymm4, YMMWORD PTR a1[rip+256] vmovdqa YMMWORD PTR a2[rip+96], ymm5 vmovdqa YMMWORD PTR a3[rip+64], ymm1 vpsllvd ymm1, ymm0, ymm5 vmovdqa ymm5, YMMWORD PTR a1[rip+288] vmovdqa YMMWORD PTR a2[rip+128], ymm6 vmovdqa YMMWORD PTR a3[rip+96], ymm1 vpsllvd ymm1, ymm0, ymm6 vmovdqa YMMWORD PTR a2[rip+160], ymm7 vmovdqa YMMWORD PTR a3[rip+128], ymm1 vpsllvd ymm1, ymm0, ymm7 vmovdqa YMMWORD PTR a2[rip+192], ymm2 vmovdqa YMMWORD PTR a3[rip+160], ymm1 vpsllvd ymm1, ymm0, ymm2 vmovdqa YMMWORD 
PTR a2[rip+224], ymm3 vmovdqa YMMWORD PTR a3[rip+192], ymm1 vpsllvd ymm1, ymm0, ymm3 vmovdqa YMMWORD PTR a2[rip+256], ymm4 vmovdqa YMMWORD PTR a3[rip+224], ymm1 vpsllvd ymm1, ymm0, ymm4 vpsllvd ymm0, ymm0, ymm5 vmovdqa YMMWORD PTR a3[rip+256], ymm1 vmovdqa YMMWORD PTR a2[rip+288], ymm5 mov ecx, DWORD PTR a1[rip+320] vmovdqa YMMWORD PTR a3[rip+288], ymm0 sal eax, cl mov DWORD PTR a2[rip+320], ecx mov DWORD PTR a3[rip+320], eax vzeroupper ret test2(): mov esi, OFFSET FLAT:b1 mov edi, OFFSET FLAT:b2 mov ecx, 40 vmovdqa ymm0, YMMWORD PTR .LC0[rip] rep movsq vpsllvd ymm1, ymm0, YMMWORD PTR b1[rip] mov ecx, DWORD PTR b1[rip+320] vmovdqa YMMWORD PTR b3[rip], ymm1 vpsllvd ymm1, ymm0, YMMWORD PTR b1[rip+32] vmovdqa YMMWORD PTR b3[rip+32], ymm1 vpsllvd ymm1, ymm0, YMMWORD PTR b1[rip+64] mov eax, DWORD PTR [rsi] mov DWORD PTR [rdi], eax mov eax, 1 vmovdqa YMMWORD PTR b3[rip+64], ymm1 vpsllvd ymm1, ymm0, YMMWORD PTR b1[rip+96] sal eax, cl mov DWORD PTR b3[rip+320], eax vmovdqa YMMWORD PTR b3[rip+96], ymm1 vpsllvd ymm1, ymm0, YMMWORD PTR b1[rip+128] vmovdqa YMMWORD PTR b3[rip+128], ymm1 vpsllvd ymm1, ymm0, YMMWORD PTR b1[rip+160] vmovdqa YMMWORD PTR b3[rip+160], ymm1 vpsllvd ymm1, ymm0, YMMWORD PTR b1[rip+192] vmovdqa YMMWORD PTR b3[rip+192], ymm1 vpsllvd ymm1, ymm0, YMMWORD PTR b1[rip+224] vmovdqa YMMWORD PTR b3[rip+224], ymm1 vpsllvd ymm1, ymm0, YMMWORD PTR b1[rip+256] vpsllvd ymm0, ymm0, YMMWORD PTR b1[rip+288] vmovdqa YMMWORD PTR b3[rip+256], ymm1 vmovdqa YMMWORD PTR b3[rip+288], ymm0 vzeroupper ret b3: .zero 324 b2: .zero 324 b1: .zero 324 a3: .zero 324 a2: .zero 324 a1: .zero 324 .LC0: .long 1 .long 1 .long 1 .long 1 .long 1 .long 1 .long 1 .long 1 [/out]
[Bug tree-optimization/84106] gcc is not able to vectorize code for 1D array, but does so for 2D array of the same size
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84106 --- Comment #4 from Daniel Fruzynski --- Here are the results of a small benchmark executed on a Xeon E5-2683 v3. The code was compiled using gcc 4.8.5. This gcc version also splits loops. Manually vectorized code is 3.5 times faster: [out] -- Benchmark Time CPU Iterations -- BM_test1 25 ns 25 ns 26989634 BM_test2 7 ns 7 ns 94495591 [/out] Benchmark code: [code] #include <benchmark/benchmark.h> #include "immintrin.h" #define N 81 int a1[N] __attribute__((aligned(32))); int a2[N] __attribute__((aligned(32))); int a3[N] __attribute__((aligned(32))); class Init { public: Init() { for (int n = 0; n < N; n++) { a1[n] = n % 32; } } } init; static void BM_test1(benchmark::State& state) { for (auto _ : state) { for (int n = 0; n < N; n++) { a2[n] = a1[n]; a3[n] = 1 << a1[n]; } benchmark::ClobberMemory(); } } BENCHMARK(BM_test1); static void BM_test2(benchmark::State& state) { for (auto _ : state) { int n = 0; for (; n < N - 7; n += 8) { __m256i v = _mm256_load_si256((__m256i*)(&a1[0] + n)); _mm256_store_si256((__m256i*)(&a2[0] + n), v); v = _mm256_sllv_epi32(_mm256_set1_epi32(1), v); _mm256_store_si256((__m256i*)(&a3[0] + n), v); } for (; n < N; n++) { a2[n] = a1[n]; a3[n] = 1 << a1[n]; } benchmark::ClobberMemory(); } } BENCHMARK(BM_test2); BENCHMARK_MAIN(); [/code]
[Bug bootstrap/84199] New: Error building gcc 7.3.0 on Odroid XU4 (ARM, Ubuntu): cannot load liblto_plugin.so
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84199 Bug ID: 84199 Summary: Error building gcc 7.3.0 on Odroid XU4 (ARM, Ubuntu): cannot load liblto_plugin.so Product: gcc Version: 7.3.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: bootstrap Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: --- Created attachment 43337 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43337&action=edit Full build log I was trying to build gcc 7.3.0 on Odroid XU4 (ARM, Ubuntu) but build failed with following error: /gcc/build/./gcc/xgcc -B/gcc/build/./gcc/ -B/gcc-7.3.0/armv7l-unknown-linux-gnueabihf/bin/ -B/gcc-7.3.0/armv7l-unknown-linux-gnueabihf/lib/ -isystem /gcc-7.3.0/armv7l-unknown-linux-gnueabihf/include -isystem /gcc-7.3.0/armv7l-unknown-linux-gnueabihf/sys-include-O2 -g -O2 -DIN_GCC -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual -Wstrict-prototypes -Wmissing-prototypes -Wold-style-definition -isystem ./include -fPIC -fno-inline -g -DIN_LIBGCC2 -fbuilding-libgcc -fno-stack-protector -shared -nodefaultlibs -Wl,--soname=libgcc_s.so.1 -Wl,--version-script=libgcc.map -o ./libgcc_s.so.1.tmp -g -O2 -B./ _thumb1_case_sqi_s.o _thumb1_case_uqi_s.o _thumb1_case_shi_s.o _thumb1_case_uhi_s.o _thumb1_case_si_s.o _udivsi3_s.o _divsi3_s.o _umodsi3_s.o _modsi3_s.o _bb_init_func_s.o _call_via_rX_s.o _interwork_call_via_rX_s.o _lshrdi3_s.o _ashrdi3_s.o _ashldi3_s.o _arm_negdf2_s.o _arm_addsubdf3_s.o [cut cut cut] eqdf2_s.o gedf2_s.o ledf2_s.o muldf3_s.o negdf2_s.o subdf3_s.o unorddf2_s.o fixdfsi_s.o floatsidf_s.o floatunsidf_s.o extendsfdf2_s.o truncdfsf2_s.o enable-execute-stack_s.o unwind-arm_s.o libunwind_s.o pr-support_s.o unwind-c_s.o emutls_s.o libgcc.a -lc && rm -f ./libgcc_s.so && if [ -f ./libgcc_s.so.1 ]; then mv -f ./libgcc_s.so.1 ./libgcc_s.so.1.backup; else true; fi && mv ./libgcc_s.so.1.tmp ./libgcc_s.so.1 && (echo "/* GNU ld script"; echo " Use the shared library, but some functions are only in"; echo " the 
static library. */"; echo "GROUP ( libgcc_s.so.1 -lgcc )" ) > ./libgcc_s.so /usr/bin/ld: /gcc/build/./gcc/liblto_plugin.so: error loading plugin: /gcc/build/./gcc/liblto_plugin.so: cannot open shared object file: No such file or directory collect2: error: ld returned 1 exit status Makefile:977: recipe for target 'libgcc_s.so' failed make[3]: *** [libgcc_s.so] Error 1 make[3]: Leaving directory '/gcc/build/armv7l-unknown-linux-gnueabihf/libgcc' Makefile:21293: recipe for target 'all-stage2-target-libgcc' failed make[2]: *** [all-stage2-target-libgcc] Error 2 make[2]: Leaving directory '/gcc/build' Makefile:26191: recipe for target 'stage2-bubble' failed make[1]: *** [stage2-bubble] Error 2 make[1]: Leaving directory '/gcc/build' Makefile:939: recipe for target 'all' failed make: *** [all] Error 2 odroid@odroid-linux-1:~$ gcc --version gcc (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.6) 5.4.0 20160609 Copyright (C) 2015 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. odroid@odroid-linux-1:~$ uname -a Linux odroid-linux-1 3.10.105-138 #1 SMP PREEMPT Fri Apr 7 12:40:29 UTC 2017 armv7l armv7l armv7l GNU/Linux odroid@odroid-linux-1:~$ cat /etc/*release* DISTRIB_ID=Ubuntu DISTRIB_RELEASE=16.04 DISTRIB_CODENAME=xenial DISTRIB_DESCRIPTION="Ubuntu 16.04.3 LTS" NAME="Ubuntu" VERSION="16.04.3 LTS (Xenial Xerus)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 16.04.3 LTS" VERSION_ID="16.04" HOME_URL="http://www.ubuntu.com/" SUPPORT_URL="http://help.ubuntu.com/" BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/" VERSION_CODENAME=xenial UBUNTU_CODENAME=xenial odroid@odroid-linux-1:~$
[Bug tree-optimization/84106] loop distribution cost-model needs work
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84106 --- Comment #6 from Daniel Fruzynski --- When you revisit your cost model for loops, please also take a look at this code. test2 has one assignment moved to a separate loop nest, and it is about twice as fast as the test1 function (with gcc 4.8.5). [code] #include <stdint.h> #define N 9 int a1[N][N]; int a2[N][N]; int a3[N][N]; uint16_t a4[N][N-1]; void test1() { for (int i = 0; i < N; ++i) { for (int j = 0; j < N; ++j) { a2[i][j] = a1[i][j]; a3[i][j] = 1u << a1[i][j]; if (i > 0) a4[j][i-1] = a3[i][j]; } } } void test2() { for (int i = 0; i < N; ++i) { for (int j = 0; j < N; ++j) { a2[i][j] = a1[i][j]; a3[i][j] = 1u << a1[i][j]; } } for (int i = 1; i < N; ++i) { for (int j = 0; j < N; ++j) { a4[j][i-1] = a3[i][j]; } } } [/code]
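Since the comparison above only makes sense if both variants compute the same thing, a quick equivalence check can be useful before benchmarking. This is a minimal sketch under stated assumptions: the function names `fused`, `distributed`, `same_results` and `run_check` are hypothetical, and the input pattern (shift counts 0..15, so every result fits in `uint16_t`) is chosen for this illustration and is not taken from the original report.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

constexpr int N = 9;

int a1[N][N];
int a2[N][N];
int a3[N][N];
std::uint16_t a4[N][N - 1];

// Variant with all three assignments in one loop nest (test1 above).
void fused() {
  for (int i = 0; i < N; ++i) {
    for (int j = 0; j < N; ++j) {
      a2[i][j] = a1[i][j];
      a3[i][j] = 1u << a1[i][j];
      if (i > 0) a4[j][i - 1] = static_cast<std::uint16_t>(a3[i][j]);
    }
  }
}

// Variant with the transposed store moved to its own loop nest (test2 above).
void distributed() {
  for (int i = 0; i < N; ++i) {
    for (int j = 0; j < N; ++j) {
      a2[i][j] = a1[i][j];
      a3[i][j] = 1u << a1[i][j];
    }
  }
  for (int i = 1; i < N; ++i) {
    for (int j = 0; j < N; ++j) {
      a4[j][i - 1] = static_cast<std::uint16_t>(a3[i][j]);
    }
  }
}

// Runs both variants on the same input and compares all output arrays.
bool same_results() {
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      a1[i][j] = (i * N + j) % 16;  // illustrative shift counts 0..15

  fused();
  int a2_ref[N][N];
  int a3_ref[N][N];
  std::uint16_t a4_ref[N][N - 1];
  std::memcpy(a2_ref, a2, sizeof a2);
  std::memcpy(a3_ref, a3, sizeof a3);
  std::memcpy(a4_ref, a4, sizeof a4);

  // Clear the outputs so stale values cannot mask a difference.
  std::memset(a2, 0, sizeof a2);
  std::memset(a3, 0, sizeof a3);
  std::memset(a4, 0, sizeof a4);

  distributed();
  return std::memcmp(a2, a2_ref, sizeof a2) == 0 &&
         std::memcmp(a3, a3_ref, sizeof a3) == 0 &&
         std::memcmp(a4, a4_ref, sizeof a4) == 0;
}

// Convenience wrapper that aborts if the two variants disagree.
void run_check() { assert(same_results()); }
```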
[Bug c++/89317] New: Ineffective code from std::copy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89317 Bug ID: 89317 Summary: Ineffective code from std::copy Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: --- gcc produces ineffective code when std::copy is used to copy data. As a test I created my own version of std::copy, and this version is optimized properly. Compiled using g++ (GCC-Explorer-Build) 9.0.1 20190211 (experimental) Options: -O3 -std=c++11 -march=skylake [code] #include <algorithm> #include <cstdint> #define Size 8 class Test { public: void test1(void*__restrict ptr); void test2(void*__restrict ptr); private: int16_t data1[Size]; int16_t data2[Size]; }; template <typename T1, typename T2> void mycopy(T1 begin, T1 end, T2 dest) { while (begin != end) { *dest = *begin; ++dest; ++begin; } } void Test::test1(void*__restrict ptr) { uint16_t* p = (uint16_t*)ptr; std::copy(data1, data1 + Size, p); p += Size; std::copy(data2, data2 + Size, p); } void Test::test2(void*__restrict ptr) { int16_t* p = (int16_t*)ptr; mycopy(data1, data1 + Size, p); p += Size; mycopy(data2, data2 + Size, p); } [/code] [asm] Test::test1(void*): movzx eax, WORD PTR [rdi] mov edx, 16 mov WORD PTR [rsi], ax movzx eax, WORD PTR [rdi+2] add rsi, 16 mov WORD PTR [rsi-14], ax movzx eax, WORD PTR [rdi+4] mov WORD PTR [rsi-12], ax movzx eax, WORD PTR [rdi+6] mov WORD PTR [rsi-10], ax movzx eax, WORD PTR [rdi+8] mov WORD PTR [rsi-8], ax movzx eax, WORD PTR [rdi+10] mov WORD PTR [rsi-6], ax movzx eax, WORD PTR [rdi+12] mov WORD PTR [rsi-4], ax movzx eax, WORD PTR [rdi+14] mov WORD PTR [rsi-2], ax mov rax, rdx sar rax test rdx, rdx jle .L69 movzx edx, WORD PTR [rdi+16] mov WORD PTR [rsi], dx cmp rax, 1 je .L69 movzx edx, WORD PTR [rdi+18] mov WORD PTR [rsi+2], dx cmp rax, 2 je .L69 movzx edx, WORD PTR [rdi+20] mov WORD PTR [rsi+4], dx cmp rax, 3 je .L69 movzx edx, WORD PTR [rdi+22] mov WORD PTR [rsi+6], dx cmp rax, 4 je .L69 movzx edx, WORD PTR [rdi+24] 
mov WORD PTR [rsi+8], dx cmp rax, 5 je .L69 movzx edx, WORD PTR [rdi+26] mov WORD PTR [rsi+10], dx cmp rax, 6 je .L69 movzx edx, WORD PTR [rdi+28] mov WORD PTR [rsi+12], dx cmp rax, 7 je .L69 movzx edx, WORD PTR [rdi+30] mov WORD PTR [rsi+14], dx cmp rax, 8 je .L69 movzx edx, WORD PTR [rdi+32] mov WORD PTR [rsi+16], dx cmp rax, 9 je .L69 movzx edx, WORD PTR [rdi+34] mov WORD PTR [rsi+18], dx cmp rax, 10 je .L69 movzx edx, WORD PTR [rdi+36] mov WORD PTR [rsi+20], dx cmp rax, 11 je .L69 movzx edx, WORD PTR [rdi+38] mov WORD PTR [rsi+22], dx cmp rax, 12 je .L69 movzx edx, WORD PTR [rdi+40] mov WORD PTR [rsi+24], dx cmp rax, 13 je .L69 movzx edx, WORD PTR [rdi+42] mov WORD PTR [rsi+26], dx cmp rax, 14 je .L69 movzx eax, WORD PTR [rdi+44] mov WORD PTR [rsi+28], ax .L69: ret Test::test2(void*): vmovdqu xmm0, XMMWORD PTR [rdi] vmovups XMMWORD PTR [rsi], xmm0 vmovdqu xmm1, XMMWORD PTR [rdi+16] vmovups XMMWORD PTR [rsi+16], xmm1 ret [/asm]
[Bug tree-optimization/89317] Ineffective code from std::copy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89317 --- Comment #2 from Daniel Fruzynski --- Yes, I mean inefficient.
[Bug c/90293] New: New function attribute: expect_return
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90293 Bug ID: 90293 Summary: New function attribute: expect_return Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: --- I have an idea for a new function attribute: expect_return. It would allow specifying the value usually returned from a function, so it could help with optimization in a similar way to __builtin_expect(). Example use: __attribute__((expect_return(false))) bool DebugModeEnabled(); __attribute__((expect_return(false))) bool IsErrorCode(int code);
[Bug c/90293] New function attribute: expect_return
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90293 --- Comment #1 from Daniel Fruzynski --- One more case: sometimes it may be handier to specify what will usually *not* be returned, e.g. a special invalid value. For such cases another attribute would be needed: __attribute__((expect_not_return(-1))) int CreateSocket();
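For context, the hint that the proposed `expect_return` and `expect_not_return` attributes would attach to a declaration can currently be expressed only at each call site, using the existing `__builtin_expect()` builtin. A minimal sketch of that call-site form follows; the `LIKELY`/`UNLIKELY` macros are a common convention rather than part of GCC, and `IsErrorCode`, `CreateSocketStub`, `count_errors` and `socket_or_zero` are hypothetical stand-ins written only for this illustration.

```cpp
#include <cassert>

// Branch-prediction hints; they fall back to plain expressions on compilers
// without __builtin_expect.
#if defined(__GNUC__)
#  define LIKELY(x)   __builtin_expect(!!(x), 1)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define LIKELY(x)   (x)
#  define UNLIKELY(x) (x)
#endif

// Hypothetical definition: negative codes are errors, which are rare.
bool IsErrorCode(int code) { return code < 0; }

// Hypothetical stand-in for a socket factory: -1 is the rare failure value.
int CreateSocketStub(bool succeed) { return succeed ? 3 : -1; }

// Call-site equivalent of expect_return(false) on IsErrorCode.
int count_errors(const int* codes, int n) {
  assert(codes != nullptr || n == 0);
  int errors = 0;
  for (int i = 0; i < n; ++i) {
    if (UNLIKELY(IsErrorCode(codes[i]))) ++errors;
  }
  return errors;
}

// Call-site equivalent of expect_not_return(-1) on the socket factory.
int socket_or_zero(bool succeed) {
  int fd = CreateSocketStub(succeed);
  if (LIKELY(fd != -1)) return fd;
  return 0;
}
```

The hint has to be repeated at every call, which is the repetition the attribute proposal aims to remove.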
[Bug c/90471] New: ICE Segmentation fault when compiling with debug info
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90471 Bug ID: 90471 Summary: ICE Segmentation fault when compiling with debug info Product: gcc Version: 7.4.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: bugzi...@poradnik-webmastera.com Target Milestone: --- Created attachment 46353 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46353&action=edit Preprocessed code I got an ICE (segmentation fault) when trying to build an OpenCL BOINC app which I am developing. This happens only when I use the -g option; without it the code compiles fine. I compiled the code using the MinGW cross-compiler shipped with Cygwin. Exact versions of all mingw packages are on the attached screenshot. I also attached the preprocessed source. I use 64-bit Cygwin on 64-bit Win 10 Pro with the latest patches. While trying to remove unimportant parts of the source code, I found an interesting thing: I was able to comment out the boinc_opencl.h include and the crash still happened. However, when I removed this line completely, gcc did not crash. This part of the code looks as follows: [code] #define __CL_ENABLE_EXCEPTIONS #define CL_TARGET_OPENCL_VERSION 120 #define CL_USE_DEPRECATED_OPENCL_1_1_APIS #include "CL/cl.hpp" //#include "boinc_opencl.h" class OclException : public std::exception [/code] I can attach the original files and all relevant headers if you need them too. $ x86_64-w64-mingw32-g++ -O3 -ftree-vectorize -std=c++11 -Wall -pthread -I/cygdrive/c/rakesearch/_boinc -I/cygdrive/c/rakesearch/_boinc/lib -I/cygdrive/c/rakesearch/_boinc/include/boinc -I. -D_BSD_SOURCE -g -c RakeSearchOpenCL2.cpp -o RakeSearchOpenCL.o RakeSearchOpenCL2.cpp: In member function ‘bool RakeSearchOpenCL::init(int, char**)’: RakeSearchOpenCL2.cpp:99:1: internal compiler error: Segmentation fault } ^ Please submit a full bug report, with preprocessed source if appropriate. See <https://gcc.gnu.org/bugs/> for instructions. 
$ x86_64-w64-mingw32-g++ --version x86_64-w64-mingw32-g++ (GCC) 7.4.0 Copyright (C) 2017 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. $ x86_64-w64-mingw32-g++ -v Using built-in specs. COLLECT_GCC=x86_64-w64-mingw32-g++ COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-w64-mingw32/7.4.0/lto-wrapper.exe Target: x86_64-w64-mingw32 Configured with: /cygdrive/i/szsz/tmpp/cygwin64/mingw64-x86_64/mingw64-x86_64-gcc-7.4.0-1.x86_64/src/gcc-7.4.0/configure --srcdir=/cygdrive/i/szsz/tmpp/cygwin64/mingw64-x86_64/mingw64-x86_64-gcc-7.4.0-1.x86_64/src/gcc-7.4.0 --prefix=/usr --exec-prefix=/usr --localstatedir=/var --sysconfdir=/etc --docdir=/usr/share/doc/mingw64-x86_64-gcc --htmldir=/usr/share/doc/mingw64-x86_64-gcc/html -C --build=x86_64-pc-cygwin --host=x86_64-pc-cygwin --target=x86_64-w64-mingw32 --without-libiconv-prefix --without-libintl-prefix --with-sysroot=/usr/x86_64-w64-mingw32/sys-root --with-build-sysroot=/usr/x86_64-w64-mingw32/sys-root --disable-multilib --disable-win32-registry --enable-languages=c,c++,fortran,lto,objc,obj-c++ --enable-fully-dynamic-string --enable-graphite --enable-libgomp --enable-libquadmath --enable-libquadmath-support --enable-libssp --enable-version-specific-runtime-libs --enable-libgomp --enable-libada --with-dwarf2 --with-gnu-ld --with-gnu-as --with-tune=generic --with-cloog-include=/usr/include/cloog-isl --with-system-zlib --enable-threads=posix --libexecdir=/usr/lib Thread model: posix gcc version 7.4.0 (GCC)