[Bug libstdc++/108030] `std::experimental::simd` not inlined

2023-02-20 Thread bernhardmgruber at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108030

--- Comment #7 from Bernhard Manfred Gruber ---
I just want to tell you that I very much appreciate the work you are doing
here! Thank you for the time and thorough commitment!

[Bug c++/102855] New: #pragma GCC unroll n should support n being a template parameter

2021-10-20 Thread bernhardmgruber at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102855

            Bug ID: 102855
           Summary: #pragma GCC unroll n should support n being a template
                    parameter
           Product: gcc
           Version: 11.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bernhardmgruber at gmail dot com
  Target Milestone: ---

Using `#pragma GCC unroll n` for a loop fails to compile when `n` is a template
parameter. Given the following code:

```c++
void f(int);

template <int n>
void g() {
#pragma GCC unroll n
    for (int i = 0; i < n; i++)
        f(i);
}

int main() {
    constexpr auto n = 100;
    g<n>();
}
```
g++-11.2 produces the following diagnostic:
```
<source>: In function 'void g()':
<source>:5:24: error: '#pragma GCC unroll' requires an assignment-expression
that evaluates to a non-negative integral constant less than 65535
    5 | #pragma GCC unroll n
      |                        ^
```
If `g` is turned into a normal function and the variable `n` is moved from
`main` to `g`, the example compiles fine and produces the expected behavior.
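
For comparison, the non-template variant described above might look like the following minimal sketch. The body of `f` (which merely accumulates its argument), the helper `f_sum` and the name `g_plain` are assumptions added so that the sketch is self-contained; they are not part of the original report.

```cpp
// Hypothetical stand-in for the report's `void f(int)`: it sums its arguments
// so that the effect of the loop can be observed.
inline int& f_sum() {
    static int sum = 0;
    return sum;
}

void f(int i) { f_sum() += i; }

// Non-template variant: `n` is a local constant expression, which GCC accepts
// as the argument of the pragma.
void g_plain() {
    constexpr int n = 100;
#pragma GCC unroll n
    for (int i = 0; i < n; i++)
        f(i);
}
```

Calling `g_plain()` once invokes `f` with the values 0 through 99.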

Allowing `n` inside the pragma to be a template parameter helps tremendously in
optimizing hot loops in templated numeric codes. Such a feature is also
supported by clang, icc and nvcc. I want to kindly ask you to provide this
additional functionality.
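
As long as the pragma rejects a template parameter, one possible workaround is to expand the loop body at compile time with a C++17 fold expression over `std::index_sequence`. This is a different technique from the pragma requested here, and the helper names `unrolled_for`, `unrolled_for_impl` and `sum_of_squares` are hypothetical; treat it as a sketch, not as a drop-in replacement.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical helper: calls body(0), body(1), ..., body(N-1) through a fold
// expression over the comma operator, so the calls are spelled out at compile
// time instead of being produced by a runtime loop.
template <typename F, std::size_t... Is>
constexpr void unrolled_for_impl(F&& body, std::index_sequence<Is...>) {
    (body(static_cast<int>(Is)), ...);
}

// Entry point: `n` may be a template parameter of the caller.
template <int n, typename F>
constexpr void unrolled_for(F&& body) {
    unrolled_for_impl(body, std::make_index_sequence<n>{});
}

// Example use inside a template, mirroring `g` from the report.
template <int n>
int sum_of_squares() {
    int sum = 0;
    unrolled_for<n>([&](int i) { sum += i * i; });
    return sum;
}
```

For large `n` this approach is likely to increase compile time and code size, so it probably suits small trip counts best.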

Thank you!

Example on godbolt:
https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(filename:'1',fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,selection:(endColumn:2,endLineNumber:8,positionColumn:2,positionLineNumber:8,selectionStartColumn:2,selectionStartLineNumber:8,startColumn:2,startLineNumber:8),source:'void+f(int)%3B%0A%0Atemplate+%3Cint+n%3E%0Avoid+g()+%7B%0A%23pragma+GCC+unroll+n%0Afor+(int+i+%3D+0%3B+i+%3C+n%3B+i%2B%2B)%0Af(i)%3B%0A%7D%0A%0Aint+main()+%7B%0Aconstexpr+auto+n+%3D+100%3B%0Ag%3Cn%3E()%3B%0A%7D'),l:'5',n:'0',o:'C%2B%2B+source+%231',t:'0')),k:54.105263157894754,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((g:!((h:compiler,i:(compiler:g112,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'0',trim:'1'),flagsViewOpen:'1',fontScale:14,fontUsePx:'0',j:4,lang:c%2B%2B,libs:!(),options:'-O3+-Wall+-Wextra',selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1,tree:'1'),l:'5',n:'0',o:'x86-64+gcc+11.2+(C%2B%2B,+Editor+%231,+Compiler+%234)',t:'0')),k:20.004,l:'4',m:50,n:'0',o:'',s:0,t:'0'),(g:!((h:output,i:(compiler:4,editor:1,fontScale:14,fontUsePx:'0',tree:'1',wrap:'1'),l:'5',n:'0',o:'Output+of+x86-64+gcc+11.2+(Compiler+%234)',t:'0')),header:(),l:'4',m:50,n:'0',o:'',s:0,t:'0')),k:45.894736842105274,l:'3',n:'0',o:'',t:'0')),l:'2',n:'0',o:'',t:'0')),version:4

[Bug libstdc++/108030] New: `std::experimental::simd` not inlined

2022-12-09 Thread bernhardmgruber at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108030

            Bug ID: 108030
           Summary: `std::experimental::simd` not inlined
           Product: gcc
           Version: 12.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bernhardmgruber at gmail dot com
  Target Milestone: ---

Created attachment 54052
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54052&action=edit
Diff we applied to a local copy of the headers.

We tried to explicitly vectorize a C++ function using
`std::experimental::simd` in our particle-in-cell simulation
[picongpu](https://github.com/ComputationalRadiationPhysics/picongpu). The
function is already called from a long call tree of functions marked
`__attribute__((always_inline))`. Profiling the code shows that several
constructs of `std::experimental::simd` were not inlined, leading to
catastrophic performance (several times slower than scalar code).

We compiled, among other flags, with:
```
-g
-march=native
-mtune=native
-fopenmp
-O3
-DNDEBUG
-pthread
-std=c++17
```

We mostly used multiplication/addition as well as the broadcast and generator
constructors of SIMD types. We saw several calls to `_S_multiplies` (IIRC) and
`_S_generate`/`_S_generator` that were not inlined, depending on whether we
used `std::experimental::native_simd` or `std::experimental::fixed_size_simd`.
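
To make these constructs concrete, here is a minimal sketch that combines the broadcast and generator constructors with multiplication and addition. It is not taken from PIConGPU; the function name `scaled_iota`, the alias `floatv` and the constants are assumptions for illustration. It assumes libstdc++'s `<experimental/simd>` header and `-std=c++17`.

```cpp
#include <cstddef>
#include <experimental/simd>

namespace stdx = std::experimental;
using floatv = stdx::native_simd<float>;

// Returns a vector whose lane i holds 2 * i + 1.
floatv scaled_iota() {
    // Generator constructor: the callable is invoked once per lane with an
    // integral-constant index.
    const floatv idx([](auto i) { return static_cast<float>(i); });
    // Broadcast constructors: every lane holds the same value.
    const floatv two(2.0f);
    const floatv one(1.0f);
    // Element-wise multiplication and addition.
    return two * idx + one;
}
```

Each of these operations is expected to compile down to a few vector instructions; an out-of-line call for any of them is the symptom described in this report.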

Upon inspection of the SIMD headers, we saw that
several functions are not annotated with `_GLIBCXX_SIMD_INTRINSIC` or other
ways to force inlining. We think this is a missed optimization opportunity.

We tried `-finline-limit=100` without success.

We thus applied `_GLIBCXX_SIMD_INTRINSIC` and `__attribute__((always_inline))`
to functions from the SIMD headers that showed up in the profiler (perf) until
all calls were inlined.

Please apply further attributes to SIMD intrinsics to force their inlining.
Note that this also affects lambda expressions.
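
As an illustration of the kind of annotation meant here, the following minimal sketch applies the GNU `always_inline` attribute to an ordinary function and to a lambda. The names `fma_scalar` and `apply_twice` are hypothetical and unrelated to libstdc++ internals, which use their own macros such as `_GLIBCXX_SIMD_INTRINSIC`.

```cpp
// Hypothetical helper forced inline with the GNU attribute. Declaring it
// `inline` as well avoids GCC's "might not be inlinable" warning.
__attribute__((always_inline)) inline float fma_scalar(float a, float x, float y) {
    return a * x + y;
}

// The attribute can also be attached to a lambda's call operator; GCC accepts
// it after the parameter list.
inline float apply_twice(float v) {
    auto step = [](float x) __attribute__((always_inline)) {
        return fma_scalar(2.0f, x, 1.0f);
    };
    return step(step(v));
}
```

Without the attribute on the lambda, the compiler's inlining heuristics decide whether the call operator is inlined, even when the enclosing function is itself force-inlined.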

I have attached a diff with our changes to the SIMD headers, but we also bulk
replaced several declaration specifiers, so we may have added more
force-inlines than strictly necessary.