https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108030
Bug ID: 108030
Summary: `std::experimental::simd` not inlined
Product: gcc
Version: 12.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libstdc++
Assignee: unassigned at gcc dot gnu.org
Reporter: bernhardmgruber at gmail dot com
Target Milestone: ---

Created attachment 54052
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54052&action=edit
Diff we applied to a local copy of the <experimental/simd> headers.

We tried to explicitly vectorize a C++ function using `std::experimental::simd` in our particle-in-cell simulation [picongpu](https://github.com/ComputationalRadiationPhysics/picongpu). The function is already called from a long call tree of functions marked `__attribute__((always_inline))`. Profiling the code shows that several constructs of `std::experimental::simd` were not inlined, leading to catastrophic performance (several times slower than scalar code).

We compiled, among other flags, with:
```
-g -march=native -mtune=native -fopenmp -O3 -DNDEBUG -pthread -std=c++17
```

We mostly used multiplication/addition as well as the broadcast and generator constructors of SIMD types. We saw several calls to `_S_multiplies` (IIRC) and `_S_generate`/`_S_generator` that were not inlined, depending on whether we used `std::experimental::native_simd` or `std::experimental::fixed_size_simd`.

Upon inspection of the `<experimental/bits/simd_*>` headers, we saw that several functions are neither annotated with `_GLIBCXX_SIMD_INTRINSIC` nor otherwise forced to inline. We think this is a missed optimization opportunity. We tried `-finline-limit=1000000` without success. We thus applied `_GLIBCXX_SIMD_INTRINSIC` and `__attribute__((always_inline))` to functions from the SIMD headers that showed up in the profiler (perf) until all calls were inlined.

Please apply further attributes to SIMD intrinsics to force their inlining. Note that this also affects lambda expressions.
I have attached a diff with our changes to the SIMD headers, but we also bulk-replaced several declaration specifiers, so we may have added more force-inlines than necessary.
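
For illustration, here is a minimal sketch of the usage pattern described above: a force-inlined function that uses the broadcast constructor, the generator constructor, and multiplication/addition on `std::experimental::native_simd`. The names `floatv`, `scale_and_offset`, and `sum_kernel` are hypothetical and not taken from picongpu, and the placement of `__attribute__((always_inline))` on the lambda assumes the GNU attribute syntax accepted by GCC. Whether out-of-line calls such as `_S_generator` actually show up for this small example likely depends on the GCC version and flags, so it should be checked with perf or a disassembly rather than assumed.

```cpp
#include <cstddef>
#include <experimental/simd>

namespace stdx = std::experimental;

// Hypothetical alias for the native-width float vector.
using floatv = stdx::native_simd<float>;

// Hypothetical kernel: computes scale * x + lane_index per lane.
// The caller-side attribute mirrors the always_inline call tree from the report;
// it does not by itself force inlining of functions inside the SIMD headers.
__attribute__((always_inline)) inline floatv scale_and_offset(floatv x, float scale)
{
    // Broadcast constructor: every lane holds `scale`.
    const floatv a(scale);

    // Generator constructor: lane i holds float(i).
    // The lambda is also marked always_inline (GNU attribute placement, assumed).
    const floatv lane([](auto i) __attribute__((always_inline)) {
        return static_cast<float>(i);
    });

    // Multiplication and addition on SIMD types.
    return a * x + lane;
}

// Non-inlined entry point, convenient for inspecting the generated code.
float sum_kernel(float scale)
{
    const floatv ones(1.0f);
    // Horizontal sum over all lanes.
    return stdx::reduce(scale_and_offset(ones, scale));
}
```

Compiling this with the flags listed in the report and disassembling `sum_kernel` should show whether any `call` instructions into the SIMD implementation remain.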