[Bug c++/99728] New: code pessimization when using wrapper classes around SIMD types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

Bug ID: 99728
Summary: code pessimization when using wrapper classes around SIMD types
Product: gcc
Version: 10.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: mar...@mpa-garching.mpg.de
Target Milestone: ---

Created attachment 50456 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50456&action=edit
test case

I originally reported this at https://gcc.gnu.org/pipermail/gcc-help/2021-March/139976.html, but I'm now fairly confident that this warrants a PR.

The test case needs to be processed on x86_64 with the command

    g++ -mfma -O3 -std=c++17 -ffast-math -S testcase.cc

Code for two functions will be generated, and I would expect that the generated assembler for both should be identical. However, for the version using the wrapper class around __m256d, g++ does not seem to recognize the dead stores at the end of the loop and leaves them inside the loop body instead of moving them after the final jump instruction of the loop, which reduces performance considerably.

clang++ generates nearly identical code for both functions and manages to remove the dead stores, so I think that g++ might be able to do better here and is not pessimizing the code due to some C++ intricacies.
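The attached test case is not reproduced in this archive. As a rough illustration of the kind of code being compared, here is a minimal sketch. It assumes a hypothetical vector typedef `vec4d` (GCC vector extensions with may_alias, mirroring how GCC's headers declare __m256d) and a hypothetical wrapper class `Wrapped`; the kernels and all names are illustrative and are not the reporter's code. Both kernels update accumulators through a reference in every iteration, which is the situation where one would hope the values stay in registers and are stored once after the loop.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for __m256d: four doubles, declared may_alias.
// Written with vector extensions so the sketch needs no x86 intrinsics.
typedef double vec4d __attribute__((vector_size(32), may_alias));

// Hypothetical wrapper class around the vector type (an assumption,
// not the class from the attached test case).
struct Wrapped {
  vec4d v;
  Wrapped() {
    const vec4d zero = {0.0, 0.0, 0.0, 0.0};
    v = zero;
  }
  explicit Wrapped(double x) {
    const vec4d t = {x, x, x, x};  // broadcast x to all four lanes
    v = t;
  }
  Wrapped &operator+=(const Wrapped &o) { v += o.v; return *this; }
  Wrapped &operator*=(const Wrapped &o) { v *= o.v; return *this; }
};

struct RawData { vec4d acc[2]; };
struct WrappedData { Wrapped acc[2]; };

// Version 1: accumulate directly on the raw vector type.
void kernel_raw(RawData &__restrict__ d, const double *__restrict__ c,
                std::size_t n) {
  assert(c != nullptr || n == 0);
  for (std::size_t i = 0; i < n; ++i) {
    const vec4d ci = {c[i], c[i], c[i], c[i]};
    d.acc[0] += ci;
    d.acc[1] += d.acc[0] * ci;
  }
}

// Version 2: the same arithmetic expressed through the wrapper class.
void kernel_wrapped(WrappedData &__restrict__ d, const double *__restrict__ c,
                    std::size_t n) {
  assert(c != nullptr || n == 0);
  for (std::size_t i = 0; i < n; ++i) {
    const Wrapped ci(c[i]);
    d.acc[0] += ci;
    Wrapped t = d.acc[0];  // temporary holding acc[0] * ci
    t *= ci;
    d.acc[1] += t;
  }
}
```

Compiling such a file with the command line from the report and comparing the assembler of the two kernels would show whether the stores to `d` remain inside the loop. This sketch has not been checked to reproduce the reported behaviour.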
[Bug c++/99728] code pessimization when using wrapper classes around SIMD types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728 --- Comment #1 from Martin Reinecke --- Created attachment 50457 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50457&action=edit generated assembler
[Bug c++/99728] code pessimization when using wrapper classes around SIMD types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728 --- Comment #2 from Martin Reinecke --- Created attachment 50458 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50458&action=edit additional test case by Alexander Monakov
[Bug c++/99728] code pessimization when using wrapper classes around SIMD types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

--- Comment #5 from Martin Reinecke ---
(In reply to Matthias Kretz (Vir) from comment #4)
> FWIW, using std::experimental::native_simd also does not hoist the
> stores out of the loop. However, if you pass d by value and return d, the
> issue goes away. So I guess this is an aliasing pessimization.

This is an interesting data point ... In my first test case (attached to https://gcc.gnu.org/pipermail/gcc-help/2021-March/139976.html), I explicitly make a local copy of d and copy back at the end of the function, and this didn't help. Strange.

> Even though
> you added __restrict__. In any case __m256 has the problem that it is
> declared with the may_alias attribute. I recommend to just never use __m256
> unless you have no other choice.

I guess I need it for unaligned loads/stores, correct? Otherwise __v4df should work everywhere.
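On the question of unaligned loads and stores: one commonly used, aliasing-safe option is to go through memcpy with a vector type that is not declared may_alias. The sketch below assumes a hypothetical typedef `v4d` (same shape as __v4df, built with GCC vector extensions) and hypothetical helpers `load_unaligned`, `store_unaligned` and `axpy4`. Whether this is what the reply in the thread had in mind is not visible in this excerpt.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical name: four doubles, NOT declared may_alias.
typedef double v4d __attribute__((vector_size(32)));

// Load four doubles from an address with no alignment guarantee beyond
// that of double. memcpy with a fixed size is typically turned into a
// single unaligned vector load by optimizing compilers.
inline void load_unaligned(v4d &dst, const double *src) {
  std::memcpy(&dst, src, sizeof(v4d));
}

// Store four doubles to a possibly unaligned address.
inline void store_unaligned(double *dst, const v4d &src) {
  std::memcpy(dst, &src, sizeof(v4d));
}

// y[i] += a * x[i] for i in [0, n); n must be a multiple of 4.
void axpy4(double a, const double *x, double *y, std::size_t n) {
  assert(n % 4 == 0);
  const v4d va = {a, a, a, a};
  for (std::size_t i = 0; i < n; i += 4) {
    v4d vx, vy;
    load_unaligned(vx, x + i);
    load_unaligned(vy, y + i);
    vy += va * vx;
    store_unaligned(y + i, vy);
  }
}
```

Calling `axpy4(2.0, x + 1, y + 1, 8)` on plain double arrays exercises addresses that are generally not 32-byte aligned.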
[Bug c++/99728] code pessimization when using wrapper classes around SIMD types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728 --- Comment #7 from Martin Reinecke --- Thanks! (BTW, I'm aware of your code and will immediately switch to it once it lands in gcc! But for the time being I try to make do with my poor man's version to avoid the external dependency.)
[Bug c++/97564] New: [11.0 regression] pybind11 compilation failure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97564

Bug ID: 97564
Summary: [11.0 regression] pybind11 compilation failure
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: mar...@mpa-garching.mpg.de
Target Milestone: ---

Created attachment 49439 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49439&action=edit
preprocessed test case

The attached testcase was distilled from glue code using Pybind11. It is compiled without complaint by older g++ versions, but the development version rejects it with:

    martin@debian:~/codes/ducc$ g++ -c test.i -std=c++17 -Wfatal-errors
    In file included from /usr/lib/python3/dist-packages/pybind11/include/pybind11/attr.h:13,
                     from /usr/lib/python3/dist-packages/pybind11/include/pybind11/pybind11.h:44,
                     from test.cc:1:
    /usr/lib/python3/dist-packages/pybind11/include/pybind11/cast.h: In instantiation of ‘typename pybind11::detail::make_caster::cast_op_type::type> pybind11::detail::cast_op(pybind11::detail::make_caster&&) [with T = std::__cxx11::basic_string; typename pybind11::detail::make_caster::cast_op_type::type> = std::__cxx11::basic_string&&; pybind11::detail::make_caster = pybind11::detail::type_caster, void>; typename std::add_rvalue_reference<_Tp>::type = std::__cxx11::basic_string&&]’:
    /usr/lib/python3/dist-packages/pybind11/include/pybind11/cast.h:1707:22: required from ‘T pybind11::cast(const pybind11::handle&) [with T = std::__cxx11::basic_string; typename std::enable_if<(! std::is_base_of::type>::value), int>::type = 0]’
    /usr/lib/python3/dist-packages/pybind11/include/pybind11/cast.h:1725:72: required from ‘T pybind11::handle::cast() const [with T = std::__cxx11::basic_string]’
    /usr/lib/python3/dist-packages/pybind11/include/pybind11/cast.h:446:77: required from here
    /usr/lib/python3/dist-packages/pybind11/include/pybind11/cast.h:950:43: error: no type named ‘make_caster’ in ‘std::remove_reference, void>&>::type’ {aka ‘class pybind11::detail::type_caster, void>’}
      949 | return std::move(caster).operator
          |~~
      950 | typename make_caster::template cast_op_type::type>();
          | ~~^
    compilation terminated due to -Wfatal-errors.
[Bug tree-optimization/98516] New: Wrong code generated by tree vectorizer
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98516

Bug ID: 98516
Summary: Wrong code generated by tree vectorizer
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: mar...@mpa-garching.mpg.de
Target Milestone: ---

Created attachment 49879 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49879&action=edit
Code to reproduce the problem

The attached code is a distilled test case from an FFT library, which works fine with current released GCC versions, but produces incorrect results with current trunk when tree optimization is switched on via -O3:

    martin@debian:~/codes/ducc$ g++ -I src/ -std=c++17 -O3 -march=native -ffast-math bug.cc
    martin@debian:~/codes/ducc$ ./a.out
    (0.362978,0.601326) (0.362155,0.18782) (1.63193,-0.0779749) (1.26662,0.327246) (-1.0024,1.03302)

The third complex number in the result line is wrong. When disabling tree vectorization (or when using a released GCC version), the correct answer is produced:

    martin@debian:~/codes/ducc$ g++ -I src/ -std=c++17 -O3 -march=native -ffast-math bug.cc -fno-tree-vectorize
    martin@debian:~/codes/ducc$ ./a.out
    (0.362978,0.601326) (0.362155,0.18782) (0.380433,0.228703) (1.26662,0.327246) (-1.0024,1.03302)

My gcc version is

    martin@debian:~/codes/ducc$ g++ -v
    Using built-in specs.
    COLLECT_GCC=g++
    COLLECT_LTO_WRAPPER=/home/martin/codes/umaster/libexec/gcc/x86_64-pc-linux-gnu/11.0.0/lto-wrapper
    Target: x86_64-pc-linux-gnu
    Configured with: /home/martin/codes/gccgit/configure --disable-bootstrap --disable-multilib --prefix=/home/martin/codes/umaster --enable-languages=c++,fortran --enable-target=all --enable-checking=release
    Thread model: posix
    Supported LTO compression algorithms: zlib
    gcc version 11.0.0 20210103 (experimental) [master revision 37d0bb1f5b5:d78f978936b:3335c9f954f8939403eabb5ad3d8739be9984f81] (GCC)

I have tried to narrow down this failure further, but without success so far. It's quite possible that the mistake is on my side, but using the sanitizers and valgrind I have not been able to find anything. Maybe a git bisect could locate the commit that introduced the change in behaviour.
[Bug tree-optimization/98516] Wrong code generated by tree vectorizer
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98516 --- Comment #1 from Martin Reinecke --- Minimal set of flags to trigger the problem seems to be g++ -std=c++17 -O1 -ftree-vectorize -fno-signed-zeros bug.cc
[Bug tree-optimization/98516] [11 Regression] Wrong code generated by tree vectorizer since r11-3823-g126ed72b9f48f853
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98516 --- Comment #9 from Martin Reinecke --- Thanks, this fixes the reduced test case for me as well! Unfortunately there seems to be more where this one came from, since my comprehensive test suite still fails ... I'll try to produce test cases and open another bug report.
[Bug c++/98544] New: [11 regression] Wrong code generated by tree vectorizer
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

Bug ID: 98544
Summary: [11 regression] Wrong code generated by tree vectorizer
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: mar...@mpa-garching.mpg.de
Target Milestone: ---

Created attachment 49893 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49893&action=edit
test case

The attached test case (sorry, I don't have the time to reduce this properly at the moment ...) tests real-valued FFTs of different lengths and should not output anything if compiled properly. However, with current trunk and the following compiler command line, some of the results are really wrong:

    martin@debian:~/codes/ducc/bug$ g++ -std=c++17 -O1 -mavx -ftree-vectorize bug2.cc
    martin@debian:~/codes/ducc/bug$ ./a.out
    problem at length 15; discrepancy is 4.40898
    problem at length 20; discrepancy is 4.54691
    problem at length 21; discrepancy is 3.70442
    problem at length 25; discrepancy is 8.40318
    problem at length 27; discrepancy is 8.44956
    problem at length 28; discrepancy is 3.12203
    problem at length 30; discrepancy is 8.81795

The discrepancies should be below 1e-15. The failure goes away when removing either "-ftree-vectorize" or "-mavx". On released versions of gcc, it runs fine.
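The report does not say how the "discrepancy" is computed. Purely as an illustration of a check that stays silent when the results agree, the sketch below assumes a relative L2 error between a reference vector and a computed vector. The function names `discrepancy` and `within_tolerance` and the definition itself are assumptions, not taken from the attached test case.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed definition: relative L2 error of `res` with respect to `ref`.
// Falls back to the absolute L2 error if `ref` is identically zero.
double discrepancy(const std::vector<double> &ref,
                   const std::vector<double> &res) {
  assert(ref.size() == res.size());
  double num = 0.0, den = 0.0;
  for (std::size_t i = 0; i < ref.size(); ++i) {
    const double diff = res[i] - ref[i];
    num += diff * diff;
    den += ref[i] * ref[i];
  }
  return den > 0.0 ? std::sqrt(num / den) : std::sqrt(num);
}

// True if the result is acceptable. The default tolerance is the bound
// quoted in the report; a real test suite may need a different value.
bool within_tolerance(const std::vector<double> &ref,
                      const std::vector<double> &res, double tol = 1e-15) {
  return discrepancy(ref, res) <= tol;
}
```

A driver would loop over the FFT lengths and print a "problem at length N" line only when `within_tolerance` returns false.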
[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544 --- Comment #1 from Martin Reinecke --- Problem seems to be related to the use of __restrict__. If I remove the DUCC0_RESTRICT from the function definitions of "radb3", "radb4" etc., the problem goes away. However I don't see where I'm violating the promise made by __restrict__ in these functions ...
[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

--- Comment #13 from Martin Reinecke ---
> What kind of shape (w/o too much guessing) is the function expecting for its
> input arrays?

For radb the size of the cc and ch arrays is l1*ido*x. Size of wa is (x-1)*ido.
[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

--- Comment #15 from Martin Reinecke ---
"Problem at length N" means that the FFT of length N is computed incorrectly. Also, N==l1*ido*x.

For an FFT of length N, the computation is broken down into several passes. Let's take N=15. First the prime factors of N are computed, in this case 3 and 5. The (simplified) procedure is then:

    l1=1;
    // first pass with x=3
    x=3;
    ido=N/(x*l1);
    radb3(ido, l1, p1, p2, );
    swap (p1,p2);
    l1*=x;
    x=5;
    ido=N/(x*l1);
    radb5(ido, l1, p1, p2, );
    swap (p1,p2);
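The bookkeeping in the procedure above can be sketched as follows. This is a minimal illustration, assuming plain trial division for the factorization and hypothetical names `Pass` and `plan_passes`; the actual library may order or group the factors differently. The `wa_size` field uses the (x-1)*ido size given for wa earlier in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Parameters of one pass (hypothetical struct, for illustration only).
struct Pass {
  std::size_t x;        // radix of this pass (a prime factor of N)
  std::size_t l1;       // product of the radices of all earlier passes
  std::size_t ido;      // N / (x * l1)
  std::size_t wa_size;  // (x - 1) * ido, size of the wa array
};

// Compute the sequence of passes for an FFT of length N.
std::vector<Pass> plan_passes(std::size_t N) {
  assert(N > 0);
  // Prime factors in ascending order via trial division (an assumption;
  // the real code may factorize differently).
  std::vector<std::size_t> factors;
  std::size_t rem = N;
  for (std::size_t p = 2; p * p <= rem; ++p) {
    while (rem % p == 0) {
      factors.push_back(p);
      rem /= p;
    }
  }
  if (rem > 1) factors.push_back(rem);

  std::vector<Pass> passes;
  std::size_t l1 = 1;
  for (std::size_t x : factors) {
    const std::size_t ido = N / (x * l1);
    passes.push_back({x, l1, ido, (x - 1) * ido});
    l1 *= x;  // the next pass sees this radix as already processed
  }
  return passes;
}
```

For N=15 this gives two passes, (x=3, l1=1, ido=5) and (x=5, l1=3, ido=1), and l1*ido*x equals N in each of them.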
[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544 --- Comment #22 from Martin Reinecke --- Brilliant, thank you very much for tracking this one down! My FFT library now works correctly again with all optimizations enabled, which is a great relief. The scipy maintainers will be happy that they won't need to fiddle with flags depending on gcc versions :)
[Bug libstdc++/103805] New: Inconsistent exception specifications
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103805

Bug ID: 103805
Summary: Inconsistent exception specifications
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libstdc++
Assignee: unassigned at gcc dot gnu.org
Reporter: mar...@mpa-garching.mpg.de
Target Milestone: ---

I'm not really sure how to report this one properly, so please let me know if crucial information is missing!

It seems that a few functions in the libstdc++ header files shipped with g++ 11.2.0 have inconsistent exception specifications. g++ itself doesn't seem to care, but clang++-13 is unhappy, providing the error message:

    clang-13 -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DPKGNAME=ducc0 -DPKGVERSION=0.22.0 -I. -I./src/ -I/home/martin/.local/lib/python3.9/site-packages/pybind11/include -I/home/martin/.local/lib/python3.9/site-packages/pybind11/include -I/usr/include/python3.9 -c python/ducc.cc -o build/temp.linux-x86_64-3.9/python/ducc.o -std=c++17 -fvisibility=hidden -g0 -ffast-math -O3 -march=native -Wfatal-errors -Wfloat-conversion -W -Wall -Wstrict-aliasing -Wwrite-strings -Wredundant-decls -Woverloaded-virtual -Wcast-qual -Wcast-align -Wpointer-arith -pthread
    In file included from python/ducc.cc:12:
    In file included from ./python/fft_pymod.cc:41:
    In file included from /home/martin/.local/lib/python3.9/site-packages/pybind11/include/pybind11/stl.h:21:
    /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/valarray:1215:5: fatal error: exception specification in declaration does not match previous declaration
        begin(valarray<_Tp>& __va) noexcept
        ^
    /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/range_access.h:107:31: note: previous declaration is here
        template _Tp* begin(valarray<_Tp>&);
        ^

At first glance, clang seems to be perfectly right in complaining about this, but I'm not sure how much libstdc++ is supposed to be interoperable with other compilers. Anyway, if the C++ standard mandates that all declarations have the same exception specification and g++ just doesn't enforce this at the moment, it might still be good to update the headers to be more future-proof.
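To make the pattern concrete, here is a small sketch of a forward declaration and a later definition of the same function template with matching exception specifications. The namespace `demo`, the type `Box` and the function `first` are hypothetical stand-ins for the valarray `begin` declarations quoted above, not libstdc++ code.

```cpp
#include <cassert>

namespace demo {  // all names in this namespace are hypothetical

template <typename T>
struct Box {
  T data[4];
};

// Forward declaration, similar in role to the begin(valarray<_Tp>&)
// declaration in bits/range_access.h.
template <typename T>
T *first(Box<T> &b) noexcept;

// Later definition, similar in role to the one in <valarray>. It repeats
// the noexcept-specifier of the forward declaration, so both
// declarations agree.
template <typename T>
T *first(Box<T> &b) noexcept {
  return b.data;
}

// Reports whether a call to first() is declared non-throwing.
inline bool first_is_noexcept() {
  Box<int> b = {{0, 0, 0, 0}};
  return noexcept(first(b));
}

}  // namespace demo
```

Removing `noexcept` from only one of the two declarations would recreate the kind of mismatch that clang reports in the message above.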
[Bug libstdc++/103805] Inconsistent exception specifications
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103805

--- Comment #4 from Martin Reinecke ---
Sorry if I specified the wrong version. My local (Debian unstable) g++ reports

    martin@marvin:~/codes/ducc$ g++ -v
    Using built-in specs.
    COLLECT_GCC=g++
    COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/11/lto-wrapper
    OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
    OFFLOAD_TARGET_DEFAULT=1
    Target: x86_64-linux-gnu
    Configured with: ../src/configure -v --with-pkgversion='Debian 11.2.0-12' --with-bugurl=file:///usr/share/doc/gcc-11/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-11 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --enable-cet --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-11-RMIFfM/gcc-11-11.2.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-11-RMIFfM/gcc-11-11.2.0/debian/tmp-gcn/usr --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --with-build-config=bootstrap-lto-lean --enable-link-serialization=2
    Thread model: posix
    Supported LTO compression algorithms: zlib zstd
    gcc version 11.2.0 (Debian 11.2.0-12)

Not sure how I got the libstdc++ 11.2.1 then, maybe some Debian packaging issue. Anyway, I'm very glad that this is already fixed!
[Bug libstdc++/103805] Inconsistent exception specifications
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103805 --- Comment #6 from Martin Reinecke --- Ouch. That reminds me of when Red Hat(?) did the same many years ago and caused no end of confusion. Anyway, sorry for the noise!
[Bug c/103850] New: missed optimization in AVX code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850

Bug ID: 103850
Summary: missed optimization in AVX code
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: mar...@mpa-garching.mpg.de
Target Milestone: ---

Created attachment 52076 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52076&action=edit
test case

(I'm reporting this under "C" because I don't know which optimizer is responsible for this, but I observe the same behaviour in C++ programs as well.)

This test case was distilled from a hot loop in a library computing spherical harmonic transforms. Apparently it can be compiled in a way that gives close to theoretical peak performance at least on my hardware (Zen 2), but this only happens if the statements in the inner loop are arranged in a specific way. Trivial rearrangements result in a performance which is about 30% lower.

I would have expected that gcc would be able to spot this kind of rearrangement and do it by itself, but this doesn't seem to be the case at the moment. If that could be fixed, that would obviously be great, but if not, I'd be grateful for any tips how the most "efficient" arrangements can be found for such critical loops without resorting to trial and error.

The loops in question start at lines 27 and 78 in the attached test case. On my machine the code reports

    slow kernel version: 45.317578 GFlops/s
    fast kernel version: 67.083952 GFlops/s

when compiled with "-O3 -march=znver2 -ffast-math -W -Wall"

Clang and Intel icx show the same discrepancy, so it seems that the required re-ordering is indeed hard to do.
[Bug target/103850] missed optimization in AVX code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850

--- Comment #2 from Martin Reinecke ---
Thanks! This flag indeed causes both kernels to have the same speed, but (at least for me) it's slower than both original versions...

    slow kernel version: 29.027915 GFlops/s
    fast kernel version: 29.008313 GFlops/s

Strange.
[Bug target/103850] missed optimization in AVX code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850

--- Comment #3 from Martin Reinecke ---
Just for completeness, this is the CPU I'm running on:

    vendor_id : AuthenticAMD
    cpu family : 23
    model : 96
    model name : AMD Ryzen 7 4800H with Radeon Graphics
    stepping: 1
    microcode : 0x8600103
[Bug target/103850] missed optimization in AVX code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850 --- Comment #6 from Martin Reinecke --- I would have expected that this does not make a significant difference, assuming that speculative execution works and the branch predictor takes the jump backwards at the loop's end. In that picture both versions of the loop should look exactly the same. But my knowledge about all this is admittedly really vague...
[Bug tree-optimization/99728] code pessimization when using wrapper classes around SIMD types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728 --- Comment #12 from Martin Reinecke --- Any hope of addressing this for gcc 12? I have a real-world test case where this effect causes roughly 15-20% slowdown, and I expect that with the wider availability of std::simd types more people will encounter this soon. (And the workaround pretty much defeats the purpose of having such convenient functionality as std::simd in the first place...) I'm happy to help out in any way I can, but unfortunately I'm more of a numerics guy than a compiler expert.