[Bug c++/99728] New: code pessimization when using wrapper classes around SIMD types

2021-03-23 Thread martin--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

Bug ID: 99728
   Summary: code pessimization when using wrapper classes around
SIMD types
   Product: gcc
   Version: 10.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: mar...@mpa-garching.mpg.de
  Target Milestone: ---

Created attachment 50456
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50456&action=edit
test case

I originally reported this at
https://gcc.gnu.org/pipermail/gcc-help/2021-March/139976.html, but I'm now
fairly confident that this warrants a PR.

The test case needs to be processed on x86_64 with the command

g++ -mfma -O3 -std=c++17 -ffast-math -S testcase.cc

Code for two functions will be generated, and I would expect that the generated
assembler for both should be identical. However, for the version using the
wrapper class around __m256d, g++ does not seem to recognize the dead stores at
the end of the loop and leaves them inside the loop body instead of moving them
after the final jump instruction of the loop, which reduces performance
considerably.

clang++ generates nearly identical code for both functions and manages to
remove the dead stores, so I think that g++ could do better here and that the
pessimization is not forced by some C++ intricacy.
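
The attached test case is not reproduced in this thread. As a rough
illustration of the pattern being described, here is a minimal sketch with two
kernels that do the same arithmetic, once on a bare vector type and once
through a thin wrapper class. Everything in it is an assumption made for
illustration: the names vec4d, Vec, kernel_raw and kernel_wrapped are
hypothetical, the recurrence is made up, and the GCC/Clang vector extension is
used as a stand-in for __m256d so that the code builds without -mavx.

```cpp
#include <cassert>
#include <cstddef>

// Four doubles via the GCC/Clang vector extension. This is a stand-in for
// __m256d (assumption: not the type used in the attachment).
typedef double vec4d __attribute__((vector_size(32)));

// Hypothetical thin wrapper class around the vector type.
struct Vec {
  vec4d v;
  // this += a * b, lane by lane.
  void add_product(const Vec &a, const Vec &b) { v += a.v * b.v; }
};

// Raw version: works directly on the vector type.
void kernel_raw(vec4d *__restrict__ d, const double *__restrict__ coef,
                std::size_t n) {
  assert(d != nullptr && coef != nullptr);
  for (std::size_t i = 0; i < n; ++i) {
    const vec4d c = {coef[i], coef[i], coef[i], coef[i]};
    // The accumulators live behind a pointer. Ideally the compiler keeps
    // them in registers and writes them back once, after the loop.
    d[0] += c * d[1];
    d[1] += c * d[0];
  }
}

// Wrapped version: identical arithmetic, expressed through the wrapper.
void kernel_wrapped(Vec *__restrict__ d, const double *__restrict__ coef,
                    std::size_t n) {
  assert(d != nullptr && coef != nullptr);
  for (std::size_t i = 0; i < n; ++i) {
    const vec4d cv = {coef[i], coef[i], coef[i], coef[i]};
    const Vec c = {cv};
    d[0].add_product(c, d[1]);
    d[1].add_product(c, d[0]);
  }
}
```

Compiling such a file with -O3 -S and comparing the two loop bodies is the
kind of check described above. Whether this particular sketch shows the same
difference as the attachment has not been verified.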

[Bug c++/99728] code pessimization when using wrapper classes around SIMD types

2021-03-23 Thread martin--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

--- Comment #1 from Martin Reinecke  ---
Created attachment 50457
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50457&action=edit
generated assembler

[Bug c++/99728] code pessimization when using wrapper classes around SIMD types

2021-03-23 Thread martin--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

--- Comment #2 from Martin Reinecke  ---
Created attachment 50458
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50458&action=edit
additional test case by Alexander Monakov

[Bug c++/99728] code pessimization when using wrapper classes around SIMD types

2021-03-23 Thread martin--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

--- Comment #5 from Martin Reinecke  ---
(In reply to Matthias Kretz (Vir) from comment #4)
> FWIW, using std::experimental::native_simd also does not hoist the
> stores out of the loop. However, if you pass d by value and return d, the
> issue goes away. So I guess this is an aliasing pessimization.

This is an interesting data point ... In my first test case (attached to
https://gcc.gnu.org/pipermail/gcc-help/2021-March/139976.html), I explicitly
make a local copy of d and copy back at the end of the function, and this
didn't help. Strange.

> Even though
> you added __restrict__. In any case __m256 has the problem that it is
> declared with the may_alias attribute. I recommend to just never use __m256
> unless you have no other choice.

I guess I need it for unaligned loads/stores, correct? Otherwise __v4df should
work everywhere.
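
For readers following the aliasing discussion, the "pass d by value and return
d" variant from comment #4 can be sketched as follows. This is only an
illustration under stated assumptions: vec4d_ma, Acc, accumulate_ref and
accumulate_val are hypothetical names, the recurrence is made up, and the
vector type is a GCC/Clang vector-extension typedef carrying the may_alias
attribute rather than the real __m256d.

```cpp
#include <cassert>
#include <cstddef>

// Four doubles, declared with may_alias as a stand-in for __m256d.
typedef double vec4d_ma __attribute__((vector_size(32), may_alias));

// Hypothetical pair of accumulators.
struct Acc {
  vec4d_ma lo, hi;
};

// By-reference form: the accumulators live in memory owned by the caller.
void accumulate_ref(Acc &d, const double *coef, std::size_t n) {
  assert(coef != nullptr);
  for (std::size_t i = 0; i < n; ++i) {
    const vec4d_ma c = {coef[i], coef[i], coef[i], coef[i]};
    d.lo += c * d.hi;
    d.hi += c * d.lo;
  }
}

// By-value form: the function works on its own copy and returns the result,
// so no store inside the loop can be visible to the caller.
Acc accumulate_val(Acc d, const double *coef, std::size_t n) {
  accumulate_ref(d, coef, n);  // d is a local object here
  return d;
}
```

A caller would write d = accumulate_val(d, coef, n) instead of
accumulate_ref(d, coef, n). Both forms compute the same values; whether the
generated loops differ depends on the compiler and was not checked for this
sketch.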

[Bug c++/99728] code pessimization when using wrapper classes around SIMD types

2021-03-23 Thread martin--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

--- Comment #7 from Martin Reinecke  ---
Thanks!

(BTW, I'm aware of your code and will immediately switch to it once it lands in
gcc! But for the time being I try to make do with my poor man's version to
avoid the external dependency.)

[Bug c++/97564] New: [11.0 regression] pybind11 compilation failure

2020-10-24 Thread martin--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97564

Bug ID: 97564
   Summary: [11.0 regression] pybind11 compilation failure
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: mar...@mpa-garching.mpg.de
  Target Milestone: ---

Created attachment 49439
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49439&action=edit
preprocessed test case

The attached testcase was distilled from glue code using Pybind11. It is
compiled without complaint by older g++ versions, but the development version
rejects it with:

martin@debian:~/codes/ducc$ g++ -c test.i -std=c++17 -Wfatal-errors
In file included from
/usr/lib/python3/dist-packages/pybind11/include/pybind11/attr.h:13,
 from
/usr/lib/python3/dist-packages/pybind11/include/pybind11/pybind11.h:44,
 from test.cc:1:
/usr/lib/python3/dist-packages/pybind11/include/pybind11/cast.h: In
instantiation of ‘typename
pybind11::detail::make_caster::cast_op_type::type>
pybind11::detail::cast_op(pybind11::detail::make_caster&&) [with T =
std::__cxx11::basic_string; typename
pybind11::detail::make_caster::cast_op_type::type> = std::__cxx11::basic_string&&;
pybind11::detail::make_caster =
pybind11::detail::type_caster, void>; typename
std::add_rvalue_reference<_Tp>::type = std::__cxx11::basic_string&&]’:
/usr/lib/python3/dist-packages/pybind11/include/pybind11/cast.h:1707:22:  
required from ‘T pybind11::cast(const pybind11::handle&) [with T =
std::__cxx11::basic_string; typename std::enable_if<(!
std::is_base_of::type>::value), int>::type  = 0]’
/usr/lib/python3/dist-packages/pybind11/include/pybind11/cast.h:1725:72:  
required from ‘T pybind11::handle::cast() const [with T =
std::__cxx11::basic_string]’
/usr/lib/python3/dist-packages/pybind11/include/pybind11/cast.h:446:77:  
required from here
/usr/lib/python3/dist-packages/pybind11/include/pybind11/cast.h:950:43: error:
no type named ‘make_caster’ in
‘std::remove_reference,
void>&>::type’ {aka ‘class
pybind11::detail::type_caster, void>’}
  949 | return std::move(caster).operator
  |~~  
  950 | typename make_caster::template cast_op_type::type>();
  |
~~^
compilation terminated due to -Wfatal-errors.

[Bug tree-optimization/98516] New: Wrong code generated by tree vectorizer

2021-01-04 Thread martin--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98516

Bug ID: 98516
   Summary: Wrong code generated by tree vectorizer
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: mar...@mpa-garching.mpg.de
  Target Milestone: ---

Created attachment 49879
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49879&action=edit
Code to reproduce the problem

The attached code is a distilled test case from an FFT library, which works
fine with current released GCC versions, but produces incorrect results with
current trunk when tree optimization is switched on via -O3:

martin@debian:~/codes/ducc$ g++ -I src/ -std=c++17 -O3 -march=native
-ffast-math bug.cc
martin@debian:~/codes/ducc$ ./a.out
(0.362978,0.601326) (0.362155,0.18782) (1.63193,-0.0779749) (1.26662,0.327246)
(-1.0024,1.03302) 

The third complex number in the result line is wrong.
When disabling tree vectorization (or when using a released GCC version), the
correct answer is produced:

martin@debian:~/codes/ducc$ g++ -I src/ -std=c++17 -O3 -march=native
-ffast-math bug.cc -fno-tree-vectorize
martin@debian:~/codes/ducc$ ./a.out
(0.362978,0.601326) (0.362155,0.18782) (0.380433,0.228703) (1.26662,0.327246)
(-1.0024,1.03302) 

My gcc version is

martin@debian:~/codes/ducc$ g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/home/martin/codes/umaster/libexec/gcc/x86_64-pc-linux-gnu/11.0.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /home/martin/codes/gccgit/configure --disable-bootstrap
--disable-multilib --prefix=/home/martin/codes/umaster
--enable-languages=c++,fortran --enable-target=all --enable-checking=release
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 11.0.0 20210103 (experimental) [master revision
37d0bb1f5b5:d78f978936b:3335c9f954f8939403eabb5ad3d8739be9984f81] (GCC) 

I have tried to narrow down this failure further, but without success so far.
It's quite possible that the mistake is on my side, but using the sanitizers
and valgrind I have not been able to find anything.

Maybe a git bisect could locate the commit that introduced the change in
behaviour.

[Bug tree-optimization/98516] Wrong code generated by tree vectorizer

2021-01-04 Thread martin--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98516

--- Comment #1 from Martin Reinecke  ---
Minimal set of flags to trigger the problem seems to be

g++ -std=c++17 -O1 -ftree-vectorize -fno-signed-zeros bug.cc

[Bug tree-optimization/98516] [11 Regression] Wrong code generated by tree vectorizer since r11-3823-g126ed72b9f48f853

2021-01-05 Thread martin--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98516

--- Comment #9 from Martin Reinecke  ---
Thanks, this fixes the reduced test case for me as well!

Unfortunately there seems to be more where this one came from, since my
comprehensive test suite still fails ... I'll try to produce test cases and
open another bug report.

[Bug c++/98544] New: [11 regression] Wrong code generated by tree vectorizer

2021-01-05 Thread martin--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

Bug ID: 98544
   Summary: [11 regression] Wrong code generated by tree
vectorizer
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: mar...@mpa-garching.mpg.de
  Target Milestone: ---

Created attachment 49893
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49893&action=edit
test case

The attached test case (sorry, I don't have the time to reduce this properly at
the moment ...) tests real-valued FFTs of different lengths and should not
output anything if compiled properly.

However, with current trunk and the following compiler command line, some of
the results are really wrong:

martin@debian:~/codes/ducc/bug$ g++ -std=c++17 -O1 -mavx -ftree-vectorize
bug2.cc
martin@debian:~/codes/ducc/bug$ ./a.out
problem at length 15; discrepancy is 4.40898
problem at length 20; discrepancy is 4.54691
problem at length 21; discrepancy is 3.70442
problem at length 25; discrepancy is 8.40318
problem at length 27; discrepancy is 8.44956
problem at length 28; discrepancy is 3.12203
problem at length 30; discrepancy is 8.81795

The discrepancies should be below 1e-15.

The failure goes away when removing either "-ftree-vectorize" or "-mavx". On
released versions of gcc, it runs fine.

[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer

2021-01-05 Thread martin--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

--- Comment #1 from Martin Reinecke  ---
Problem seems to be related to the use of __restrict__.

If I remove the DUCC0_RESTRICT from the function definitions of "radb3",
"radb4" etc., the problem goes away.

However I don't see where I'm violating the promise made by __restrict__ in
these functions ...
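
For context, the promise made by __restrict__ is roughly that, during the
call, memory written through one qualified pointer is not accessed through
another. The following minimal sketch illustrates that contract. It assumes
that DUCC0_RESTRICT expands to __restrict__; SKETCH_RESTRICT, scale_add and
ranges_disjoint are hypothetical names and are not taken from the test case.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a macro like DUCC0_RESTRICT.
#define SKETCH_RESTRICT __restrict__

// True if [a, a+n) and [b, b+n) share no element.
bool ranges_disjoint(const double *a, const double *b, std::size_t n) {
  const auto pa = reinterpret_cast<std::uintptr_t>(a);
  const auto pb = reinterpret_cast<std::uintptr_t>(b);
  const std::uintptr_t len = n * sizeof(double);
  return pa + len <= pb || pb + len <= pa;
}

// out[i] += f * in[i]. The qualifiers promise that `in` and `out` do not
// overlap, which lets the compiler reorder and vectorize loads and stores.
void scale_add(std::size_t n, const double *SKETCH_RESTRICT in,
               double *SKETCH_RESTRICT out, double f) {
  assert(ranges_disjoint(in, out, n));  // the caller's side of the promise
  for (std::size_t i = 0; i < n; ++i)
    out[i] += f * in[i];
}
```

A call with two separate arrays keeps the promise. A call such as
scale_add(n, buf, buf + 1, f) would break it, and the assertion catches that
case in debug builds.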

[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87

2021-01-07 Thread martin--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

--- Comment #13 from Martin Reinecke  ---
> What kind of shape (w/o too much guessing) is the function expecting for its 
> input arrays?

For radb the size of the cc and ch arrays is l1*ido*x.
Size of wa is (x-1)*ido.

[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87

2021-01-07 Thread martin--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

--- Comment #15 from Martin Reinecke  ---
"Problem at length N" means that the FFT of length N is computed incorrectly.
Also, N==l1*ido*x.

For an FFT of length N, the computation is broken down into several passes.
Let's take N=15.
First the prime factors of N are computed, in this case 3 and 5.

The (simplified) procedure is then:

l1=1;

// first pass with x=3
x=3;
ido=N/(x*l1);
radb3(ido, l1, p1, p2, );
swap (p1,p2);

l1*=x;
x=5;
ido=N/(x*l1);
radb5(ido, l1, p1, p2, );
swap (p1,p2);
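
The bookkeeping in the procedure above, together with the array sizes given in
comment #13, can be sketched as a small runnable helper. This is a simplified
illustration, not the library's code: Pass, prime_factors, plan_passes and
wa_size are hypothetical names, and the factorization uses plain trial
division with the smallest factor first.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Parameters of one pass: radix x, and the l1 and ido values passed to radbX.
struct Pass {
  std::size_t x, l1, ido;
};

// Trial-division factorization, smallest factor first (15 -> 3, 5).
std::vector<std::size_t> prime_factors(std::size_t n) {
  std::vector<std::size_t> factors;
  for (std::size_t p = 2; p * p <= n; ++p)
    while (n % p == 0) {
      factors.push_back(p);
      n /= p;
    }
  if (n > 1) factors.push_back(n);
  return factors;
}

// Computes (x, l1, ido) for every pass of a length-N transform.
std::vector<Pass> plan_passes(std::size_t N) {
  std::vector<Pass> passes;
  std::size_t l1 = 1;
  for (std::size_t x : prime_factors(N)) {
    const std::size_t ido = N / (x * l1);
    assert(l1 * ido * x == N);  // size of the cc and ch arrays
    passes.push_back({x, l1, ido});
    l1 *= x;
  }
  return passes;
}

// Size of the wa array for one pass: (x-1)*ido.
std::size_t wa_size(const Pass &p) { return (p.x - 1) * p.ido; }
```

For N=15 this gives the pass x=3, l1=1, ido=5 followed by x=5, l1=3, ido=1,
matching the two calls shown above.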

[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87

2021-01-08 Thread martin--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

--- Comment #22 from Martin Reinecke  ---
Brilliant, thank you very much for tracking this one down!
My FFT library now works correctly again with all optimizations enabled, which
is a great relief. The scipy maintainers will be happy that they won't need to
fiddle with flags depending on gcc versions :)

[Bug libstdc++/103805] New: Inconsistent exception specifications

2021-12-22 Thread martin--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103805

Bug ID: 103805
   Summary: Inconsistent exception specifications
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: mar...@mpa-garching.mpg.de
  Target Milestone: ---

I'm not really sure how to report this one properly, so please let me know if
crucial information is missing!

It seems that a few functions in the libstdc++ header files shipped with g++
11.2.0 have inconsistent exception specifications. g++ itself doesn't seem to
care, but clang++-13 is unhappy, providing the error message:

clang-13 -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g
-fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g
-fstack-protector-strong -Wformat -Werror=format-security -Wdate-time
-D_FORTIFY_SOURCE=2 -fPIC -DPKGNAME=ducc0 -DPKGVERSION=0.22.0 -I. -I./src/
-I/home/martin/.local/lib/python3.9/site-packages/pybind11/include
-I/home/martin/.local/lib/python3.9/site-packages/pybind11/include
-I/usr/include/python3.9 -c python/ducc.cc -o
build/temp.linux-x86_64-3.9/python/ducc.o -std=c++17 -fvisibility=hidden -g0
-ffast-math -O3 -march=native -Wfatal-errors -Wfloat-conversion -W -Wall
-Wstrict-aliasing -Wwrite-strings -Wredundant-decls -Woverloaded-virtual
-Wcast-qual -Wcast-align -Wpointer-arith -pthread
In file included from python/ducc.cc:12:
In file included from ./python/fft_pymod.cc:41:
In file included from
/home/martin/.local/lib/python3.9/site-packages/pybind11/include/pybind11/stl.h:21:
/usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/valarray:1215:5:
fatal error: exception specification in declaration does not match previous
declaration
begin(valarray<_Tp>& __va) noexcept
^
/usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/range_access.h:107:31:
note: previous declaration is here
  template _Tp* begin(valarray<_Tp>&);
  ^

At first glance, clang seems to be perfectly right in complaining about this,
but I'm not sure how much libstdc++ is supposed to be interoperable with other
compilers.

Anyway, if the C++ standard mandates that all declarations have the same
exception specification and g++ just doesn't enforce this at the moment, it
might still be good to update the headers to be more future-proof.
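
As a small illustration of what a consistent pair of declarations looks like,
here is a sketch with a forward declaration and a later definition that carry
the same exception specification. Buffer and first are hypothetical names;
this is not the libstdc++ code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical container, standing in for valarray.
template <typename T>
struct Buffer {
  T *data;
  std::size_t size;
};

// Forward declaration, as a range-access style header might provide it.
template <typename T>
T *first(Buffer<T> &) noexcept;

// Definition in another header. The noexcept matches the declaration above;
// omitting it from only one of the two gives the kind of mismatch that
// clang reports in the message quoted earlier.
template <typename T>
T *first(Buffer<T> &b) noexcept {
  return b.data;
}
```

With both declarations marked noexcept, noexcept(first(b)) evaluates to true
and the code is accepted by both compilers.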

[Bug libstdc++/103805] Inconsistent exception specifications

2021-12-22 Thread martin--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103805

--- Comment #4 from Martin Reinecke  ---
Sorry if I specified the wrong version. My local (Debian unstable) g++ reports

martin@marvin:~/codes/ducc$ g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/11/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 11.2.0-12'
--with-bugurl=file:///usr/share/doc/gcc-11/README.Bugs
--enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,m2 --prefix=/usr
--with-gcc-major-version-only --program-suffix=-11
--program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object
--disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib
--enable-libphobos-checking=release --with-target-system-zlib=auto
--enable-objc-gc=auto --enable-multiarch --disable-werror --enable-cet
--with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32
--enable-multilib --with-tune=generic
--enable-offload-targets=nvptx-none=/build/gcc-11-RMIFfM/gcc-11-11.2.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-11-RMIFfM/gcc-11-11.2.0/debian/tmp-gcn/usr
--without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu
--host=x86_64-linux-gnu --target=x86_64-linux-gnu
--with-build-config=bootstrap-lto-lean --enable-link-serialization=2
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 11.2.0 (Debian 11.2.0-12) 

Not sure how I got the libstdc++ 11.2.1 then, maybe some Debian packaging
issue.

Anyway I'm very glad that this is already fixed!

[Bug libstdc++/103805] Inconsistent exception specifications

2021-12-22 Thread martin--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103805

--- Comment #6 from Martin Reinecke  ---
Ouch. That reminds me when Redhat(?) did the same many years ago and caused no
end of confusion. Anyway, sorry for the noise!

[Bug c/103850] New: missed optimization in AVX code

2021-12-28 Thread martin--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850

Bug ID: 103850
   Summary: missed optimization in AVX code
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: mar...@mpa-garching.mpg.de
  Target Milestone: ---

Created attachment 52076
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52076&action=edit
test case

(I'm reporting this under "C" because I don't know which optimizer is
responsible for this, but I observe the same behaviour in C++ programs as well.)

This test case was distilled from a hot loop in a library computing spherical
harmonic transforms. Apparently it can be compiled in a way that gives close to
theoretical peak performance at least on my hardware (Zen 2), but this only
happens if the statements in the inner loop are arranged in a specific way.
Trivial rearrangements result in a performance which is about 30% lower.

I would have expected that gcc would be able to spot this kind of rearrangement
and do it by itself, but this doesn't seem to be the case at the moment. If that
could be fixed, that would obviously be great, but if not, I'd be grateful for
any tips how the most "efficient" arrangements can be found for such critical
loops without resorting to trial and error.

The loops in question start at lines 27 and 78 in the attached test case.
On my machine the code reports

slow kernel version: 45.317578 GFlops/s
fast kernel version: 67.083952 GFlops/s

when compiled with "-O3 -march=znver2 -ffast-math -W -Wall"

Clang and Intel icx show the same discrepancy, so it seems that the required
re-ordering is indeed hard to do.

[Bug target/103850] missed optimization in AVX code

2021-12-28 Thread martin--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850

--- Comment #2 from Martin Reinecke  ---
Thanks! This flag indeed causes both kernels to have the same speed, but (at
least for me) it's slower than both original versions...

slow kernel version: 29.027915 GFlops/s
fast kernel version: 29.008313 GFlops/s

Strange.

[Bug target/103850] missed optimization in AVX code

2021-12-28 Thread martin--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850

--- Comment #3 from Martin Reinecke  ---
Just for completeness, this is the CPU I'm running on:

vendor_id   : AuthenticAMD
cpu family  : 23
model   : 96
model name  : AMD Ryzen 7 4800H with Radeon Graphics
stepping: 1
microcode   : 0x8600103

[Bug target/103850] missed optimization in AVX code

2022-01-04 Thread martin--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850

--- Comment #6 from Martin Reinecke  ---
I would have expected that this does not make a significant difference,
assuming that speculative execution works and the branch predictor takes the
jump backwards at the loop's end. In that picture both versions of the loop
should look exactly the same.
But my knowledge about all this is admittedly really vague...

[Bug tree-optimization/99728] code pessimization when using wrapper classes around SIMD types

2021-07-02 Thread martin--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

--- Comment #12 from Martin Reinecke  ---
Any hope of addressing this for gcc 12?
I have a real-world test case where this effect causes roughly 15-20% slowdown,
and I expect that with the wider availability of std::simd types more people
will encounter this soon. (And the workaround pretty much defeats the purpose
of having such convenient functionality as std::simd in the first place...)

I'm happy to help out in any way I can, but unfortunately I'm more of a
numerics guy than a compiler expert.