[Bug libstdc++/114417] New: simd parameters are passed by memory on x64 , not using the available sse registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114417

Bug ID: 114417
Summary: simd parameters are passed by memory on x64 , not using the available sse registers
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libstdc++
Assignee: unassigned at gcc dot gnu.org
Reporter: lee.imple at gmail dot com
Target Milestone: ---

https://godbolt.org/z/3GYnadqc1

In the current implementation, SIMD parameters are passed in memory, whereas the equivalent vector parameters are passed in SSE registers. Since the vector parameters can be passed in SSE registers, could SIMD parameters use them as well? The performance difference may not be significant, but I would like to keep everything in registers.
[Bug target/114417] simd parameters are passed by memory on x64 , not using the available sse registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114417

--- Comment #2 from Imple Lee ---
(In reply to Andrew Pinski from comment #1)
> I doubt this can change since this is the abi gcc decided on a long time ago.

If the simd class is implemented as a wrapper around a vector, the parameter can still be passed in SSE registers, so I think there may be an issue in libstdc++'s implementation of stdx::simd.

https://godbolt.org/z/a6s67zzc7
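To make the "wrapper around a vector" idea concrete, here is a minimal sketch. `Float4`, `float4_v` and `add` are hypothetical names invented for illustration; this is not the code behind the godbolt link and not how libstdc++ implements stdx::simd. The assumption is that a trivially copyable class whose only member is a GNU vector is classified like that member by the x86-64 psABI, so it can travel in an SSE register. The sketch only checks the trivially-copyable part, not the calling convention itself.

```cpp
#include <cassert>
#include <type_traits>

// GNU vector extension: four floats packed into one 16-byte vector.
typedef float float4_v __attribute__((vector_size(16)));

// Hypothetical thin wrapper (illustration only, not stdx::simd).
struct Float4 {
    float4_v v;

    float operator[](int i) const {
        assert(i >= 0 && i < 4);
        return v[i];
    }
};

// No user-provided copy operations or destructor, so the wrapper stays
// trivially copyable and has the same size as the vector it wraps.
static_assert(std::is_trivially_copyable_v<Float4>);
static_assert(sizeof(Float4) == sizeof(float4_v));

// Takes and returns the wrapper by value.
Float4 add(Float4 a, Float4 b) {
    return Float4{a.v + b.v};
}
```

Whether `add` really receives its arguments in registers has to be confirmed by reading the generated assembly, for example on Compiler Explorer.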
[Bug target/114417] simd parameters are passed by memory on x64 , not using the available sse registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114417 --- Comment #3 from Imple Lee --- Oh, I didn't make it clear. I am describing libstdc++'s std::experimental::simd class.
[Bug libstdc++/114417] std::experimental::simd is not a POD (by ABI definitions) and is always passed by reference instead of by value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114417

--- Comment #7 from Imple Lee ---
I dug into the source code, and it seems that it was designed to be "passed via the stack". I am not sure whether this is required by the specification (I did not find relevant requirements, but I am not very familiar with it) or is just an implementation choice.

In the GCC git tree, [libstdc++-v3/include/experimental/bits/simd_fixed_size.h, line 27](https://gcc.gnu.org/git?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/include/experimental/bits/simd_fixed_size.h;h=408855212979cc32699db0805079ac74f495a8fa;hb=HEAD#l27):

...
 * The fixed_size ABI gives the following guarantees:
 *  - simd objects are passed via the stack
...
[Bug libstdc++/114417] std::experimental::simd is not a POD (by ABI definitions) and is always passed by reference instead of by value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114417

--- Comment #11 from Imple Lee ---
> What you want to use instead is std::experimental::simd_abi::deduce_t.
> That'll give you a not-fixed_size ABI if one exists. And those will likely be
> passed via registers (as long as the psABI allows).

Great! It does work as intended. Thank you for telling me that.

Maybe all I need is just to read the docs on cppref more carefully :|
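For reference, the difference between the two ABI tags can be sketched as follows. The alias names and the `sum_lanes` helper are made up for this illustration. Which ABI `deduce_t` actually selects depends on the `-march` in effect, so the sketch only relies on properties that hold on any target.

```cpp
#include <cstdint>
#include <experimental/simd>

namespace stdx = std::experimental;

// fixed_size<N>: the ABI documented in simd_fixed_size.h as "passed via the stack".
using fixed_u64x4 = stdx::fixed_size_simd<std::uint64_t, 4>;

// deduce_t<T, N>: yields a non-fixed_size ABI when the target has one,
// and is expected to fall back to fixed_size otherwise.
using deduced_u64x4 =
    stdx::simd<std::uint64_t, stdx::simd_abi::deduce_t<std::uint64_t, 4>>;

// Both types hold four lanes regardless of the ABI tag.
static_assert(fixed_u64x4::size() == 4);
static_assert(deduced_u64x4::size() == 4);

// Hypothetical helper, generic over the ABI tag; takes the simd by value.
template <class Abi>
std::uint64_t sum_lanes(stdx::simd<std::uint64_t, Abi> v) {
    return stdx::reduce(v);  // default binary operation is std::plus<>
}
```

Compiling `sum_lanes` for both aliases with `-march=x86-64-v3` and comparing the assembly is one way to see the calling-convention difference discussed in this thread.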
[Bug target/114908] New: fails to optimize avx2 in-register permute written with std::experimental::simd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114908

Bug ID: 114908
Summary: fails to optimize avx2 in-register permute written with std::experimental::simd
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: lee.imple at gmail dot com
Target Milestone: ---

I am trying to write SIMD code with std::experimental::simd. Here is the same function written with both std::experimental::simd and the GNU vector extension (available online at https://godbolt.org/z/dc169rY3o ). The purpose is to permute the register from [w, x, y, z] into [0, w, x, y].

```c++
#include <cstdint>
#include <experimental/simd>

namespace stdx = std::experimental;

using data_t = std::uint64_t;
constexpr std::size_t data_size = 4;

template <std::size_t N>
using simd_of = std::experimental::simd<data_t, std::experimental::simd_abi::deduce_t<data_t, N>>;
using simd_t = simd_of<data_size>;

template <std::size_t N>
constexpr simd_of<N> zero = {};

// stdx version
simd_t permute_simd(simd_t data) {
    auto [carry, _] = split<data_size - 1, 1>(data);
    return concat(zero<1>, carry);
}

typedef data_t vector_t [[gnu::vector_size(data_size * sizeof(data_t))]];
constexpr vector_t zero_v = {0};

// gnu vector extension version
vector_t permute_vector(vector_t data) {
    return __builtin_shufflevector(data, zero_v, 4, 0, 1, 2);
}
```

The code is compiled with the options `-O3 -march=x86-64-v3 -std=c++20`. Although the two functions should behave identically, the assembly GCC generates for them is very different.
```asm
permute_simd(std::experimental::parallelism_v2::simd >):
        pushq   %rbp
        vpxor   %xmm1, %xmm1, %xmm1
        movq    %rsp, %rbp
        andq    $-32, %rsp
        subq    $8, %rsp
        vmovdqa %ymm0, -120(%rsp)
        vmovdqa %ymm1, -56(%rsp)
        movq    -104(%rsp), %rax
        vmovdqa %xmm0, -56(%rsp)
        movq    -48(%rsp), %rdx
        movq    $0, -88(%rsp)
        movq    %rax, -40(%rsp)
        movq    -56(%rsp), %rax
        vmovdqa -56(%rsp), %ymm2
        vmovq   %rax, %xmm0
        vmovdqa %ymm2, -24(%rsp)
        movq    -8(%rsp), %rax
        vpinsrq $1, %rdx, %xmm0, %xmm0
        vmovdqu %xmm0, -80(%rsp)
        movq    %rax, -64(%rsp)
        vmovdqa -88(%rsp), %ymm0
        leave
        ret
permute_vector(unsigned long __vector(4)):
        vpxor   %xmm1, %xmm1, %xmm1
        vpermq  $144, %ymm0, %ymm0
        vpblendd $3, %ymm1, %ymm0, %ymm0
        ret
```

However, Clang can optimize `permute_simd` into the same assembly as `permute_vector`, so I think this is a missed optimization in GCC rather than a bug in std::experimental::simd.

```asm
permute_simd(std::experimental::parallelism_v2::simd >): # @permute_simd(std::experimental::parallelism_v2::simd >)
        vpermpd $144, %ymm0, %ymm0              # ymm0 = ymm0[0,0,1,2]
        vxorps  %xmm1, %xmm1, %xmm1
        vblendps $3, %ymm1, %ymm0, %ymm0        # ymm0 = ymm1[0,1],ymm0[2,3,4,5,6,7]
        retq
permute_vector(unsigned long __vector(4)): # @permute_vector(unsigned long __vector(4))
        vpermpd $144, %ymm0, %ymm0              # ymm0 = ymm0[0,0,1,2]
        vxorps  %xmm1, %xmm1, %xmm1
        vblendps $3, %ymm1, %ymm0, %ymm0        # ymm0 = ymm1[0,1],ymm0[2,3,4,5,6,7]
        retq
```
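As a reading aid for the index list `4, 0, 1, 2` in the `__builtin_shufflevector` call, here is a scalar model of a two-input shuffle. The names `lanes4`, `shuffle2` and `shift_in_zero` are hypothetical and exist only in this sketch. It assumes the usual convention that indices 0..3 select from the first operand and 4..7 from the second.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using lanes4 = std::array<std::uint64_t, 4>;

// Scalar model of a two-input, four-lane shuffle:
// index k in [0, 4) picks a[k], index k in [4, 8) picks b[k - 4].
constexpr lanes4 shuffle2(const lanes4& a, const lanes4& b,
                          const std::array<int, 4>& idx) {
    lanes4 out{};
    for (std::size_t i = 0; i < 4; ++i) {
        const int k = idx[i];
        assert(k >= 0 && k < 8);
        out[i] = k < 4 ? a[static_cast<std::size_t>(k)]
                       : b[static_cast<std::size_t>(k - 4)];
    }
    return out;
}

// [w, x, y, z] -> [0, w, x, y]: lane 0 comes from the all-zero second
// operand (index 4), lanes 1..3 come from data[0..2].
constexpr lanes4 shift_in_zero(const lanes4& data) {
    return shuffle2(data, lanes4{}, std::array<int, 4>{4, 0, 1, 2});
}
```

For example, `shift_in_zero(lanes4{1, 2, 3, 4})` yields `{0, 1, 2, 3}`; the last input lane is dropped.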
[Bug tree-optimization/114966] New: fails to optimize avx2 in-register permute written with std::experimental::simd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114966

Bug ID: 114966
Summary: fails to optimize avx2 in-register permute written with std::experimental::simd
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: lee.imple at gmail dot com
Target Milestone: ---

This is another attempt to permute a SIMD register with std::experimental::simd, like PR114908, but written differently. The following is the same function written with both std::experimental::simd and the GNU vector extension (available online at https://godbolt.org/z/n3WvqcePo ). The purpose is to permute the register from [w, x, y, z] into [0, w, x, y].

```c++
#include <cstdint>
#include <experimental/simd>

namespace stdx = std::experimental;

using data_t = std::uint64_t;
constexpr std::size_t data_size = 4;

template <std::size_t N>
using simd_of = std::experimental::simd<data_t, std::experimental::simd_abi::deduce_t<data_t, N>>;
using simd_t = simd_of<data_size>;

// stdx version
simd_t permute_simd(simd_t data) {
    return simd_t([=](auto i) -> data_t {
        constexpr size_t index = i - 1;
        if constexpr (index < data_size) {
            return data[index];
        } else {
            return 0;
        }
    });
}

typedef data_t vector_t [[gnu::vector_size(data_size * sizeof(data_t))]];

// gnu vector extension version
vector_t permute_vector(vector_t data) {
    return __builtin_shufflevector(data, vector_t{0}, 4, 0, 1, 2);
}
```

The code is compiled with the options `-O3 -march=x86-64-v3 -std=c++20`. Although the two functions should behave identically, the assembly GCC generates for them is very different.
```asm
permute_simd(std::experimental::parallelism_v2::simd >):
        vmovq   %xmm0, %rax
        vpsrldq $8, %xmm0, %xmm1
        vextracti128 $0x1, %ymm0, %xmm0
        vpunpcklqdq %xmm0, %xmm1, %xmm1
        vpxor   %xmm0, %xmm0, %xmm0
        vpinsrq $1, %rax, %xmm0, %xmm0
        vinserti128 $0x1, %xmm1, %ymm0, %ymm0
        ret
permute_vector(unsigned long __vector(4)):
        vpxor   %xmm1, %xmm1, %xmm1
        vpermq  $144, %ymm0, %ymm0
        vpblendd $3, %ymm1, %ymm0, %ymm0
        ret
```

However, Clang can optimize `permute_simd` into the same assembly as `permute_vector`, so I think this is a missed optimization in GCC rather than a bug in std::experimental::simd.

```asm
permute_simd(std::experimental::parallelism_v2::simd >): # @permute_simd(std::experimental::parallelism_v2::simd >)
        vpermpd $144, %ymm0, %ymm0              # ymm0 = ymm0[0,0,1,2]
        vxorps  %xmm1, %xmm1, %xmm1
        vblendps $3, %ymm1, %ymm0, %ymm0        # ymm0 = ymm1[0,1],ymm0[2,3,4,5,6,7]
        retq
permute_vector(unsigned long __vector(4)): # @permute_vector(unsigned long __vector(4))
        vpermpd $144, %ymm0, %ymm0              # ymm0 = ymm0[0,0,1,2]
        vxorps  %xmm1, %xmm1, %xmm1
        vblendps $3, %ymm1, %ymm0, %ymm0        # ymm0 = ymm1[0,1],ymm0[2,3,4,5,6,7]
        retq
```
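The generator-constructor approach in this report can be tried in isolation with the sketch below. It deliberately differs from the reported code: it uses `fixed_size_simd` so that the example does not depend on `-march`, and it reads the lane index through `decltype(i)::value`. The names `u64x4` and `shift_in_zero` are assumptions of this sketch, not part of the report.

```cpp
#include <cstddef>
#include <cstdint>
#include <experimental/simd>

namespace stdx = std::experimental;

// Assumption for this sketch: fixed_size instead of deduce_t.
using u64x4 = stdx::fixed_size_simd<std::uint64_t, 4>;

// [w, x, y, z] -> [0, w, x, y], written with the generator constructor.
// The generator is called once per lane with an integral-constant-like
// argument, so the lane index is available at compile time.
u64x4 shift_in_zero(u64x4 data) {
    return u64x4([=](auto i) -> std::uint64_t {
        constexpr std::size_t lane = decltype(i)::value;
        if constexpr (lane == 0) {
            return 0;               // lane 0 is filled with zero
        } else {
            return data[lane - 1];  // lanes 1..3 take the previous input lane
        }
    });
}
```

This mirrors the logic of `permute_simd` above; it says nothing about which instructions a given compiler will emit for it.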
[Bug tree-optimization/114908] fails to optimize avx2 in-register permute written with std::experimental::simd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114908 --- Comment #8 from Imple Lee --- I tried another way to permute the register. Although GCC does generate simd instructions, the generated code is sub-optimal. I opened PR114966 for that.
[Bug tree-optimization/114966] fails to optimize avx2 in-register permute written with std::experimental::simd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114966 --- Comment #1 from Imple Lee --- This is probably a regression. GCC 13.2 can generate optimal code. See https://godbolt.org/z/4n8ovr7jr .
[Bug libstdc++/115454] New: std::experimental::find_last_set is buggy on x86-64-v4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115454

Bug ID: 115454
Summary: std::experimental::find_last_set is buggy on x86-64-v4
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libstdc++
Assignee: unassigned at gcc dot gnu.org
Reporter: lee.imple at gmail dot com
Target Milestone: ---

We are using a simd with 4 uint64_t elements (with `deduce_t`) on x86-64-v4. We are trying to find the last element with the value -1 (i.e. all 1s). The code is as follows (available at https://godbolt.org/z/3f1nszf8E ), compiled with the options `-O3 -march=x86-64-v4 -std=c++20`.

```c++
#include <cstdint>
#include <experimental/simd>

using deduce_t_element = std::experimental::simd<
    std::uint64_t,
    std::experimental::simd_abi::deduce_t<std::uint64_t, 4>
>;
using fixed_size_element = std::experimental::simd<
    std::uint64_t,
    std::experimental::simd_abi::fixed_size<4>
>;

int f_deduce(deduce_t_element e) {
    return find_last_set(e != -1);
}
int f_fixed_size(fixed_size_element e) {
    return find_last_set(e != -1);
}
```

G++ trunk gives the following assembly (the comments are mine).

```asm
f_deduce(std::experimental::parallelism_v2::simd >):
        vpcmpeqd %ymm1, %ymm1, %ymm1
        movl    $-1, %eax
        vpcmpq  $4, %ymm1, %ymm0, %k0
        kmovb   %k0, %edx
        orb     $-16, %dl       # %dl |= 0xf0
        je      .L1
        movzbl  %dl, %edx
        movl    $31, %eax
        lzcntl  %edx, %edx      # leading zeros is 24
                                # because the next byte is always 0b
        subl    %edx, %eax      # so we get the result is 31 - 24 = 7
.L1:
        ret
f_fixed_size(std::experimental::parallelism_v2::simd >):
        vpcmpeqd %ymm0, %ymm0, %ymm0
        movl    $63, %edx
        vpcmpq  $4, (%rdi), %ymm0, %k0
        kmovb   %k0, %eax
        andl    $15, %eax
        lzcntq  %rax, %rax
        subl    %eax, %edx
        movl    %edx, %eax
        vzeroupper
        ret
```

In fact, the first function always returns 7 regardless of its argument, which is more obvious from clang++'s output.
```asm
f_deduce(std::experimental::parallelism_v2::simd>): # @f_deduce(std::experimental::parallelism_v2::simd>)
        movl    $7, %eax
        retq
f_fixed_size(std::experimental::parallelism_v2::simd>): # @f_fixed_size(std::experimental::parallelism_v2::simd>)
        vpcmpeqd %ymm0, %ymm0, %ymm0
        vpcmpeqq (%rdi), %ymm0, %k0
        kmovd   %k0, %eax
        xorb    $15, %al
        movzbl  %al, %eax
        lzcntq  %rax, %rcx
        movl    $63, %eax
        subl    %ecx, %eax
        vzeroupper
        retq
```

I don't know why, but the compiled result for `fixed_size_simd` appears to be different.
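The arithmetic in the assembly comments can be checked with a small scalar model of an 8-bit mask register holding a 4-lane mask. Both function names are hypothetical, and returning -1 for an empty mask is a choice of this sketch (it mirrors the `movl $-1, %eax` path above, not a documented library guarantee). The first function clears the four unused high bits before searching, as the `andl $15` in `f_fixed_size` does. The second sets them, as the `orb $-16` in `f_deduce` does.

```cpp
#include <cassert>
#include <cstdint>

// Index of the highest set bit among the low `lanes` bits, or -1 if none.
// Padding bits above `lanes` are cleared first.
constexpr int find_last_set_bits(std::uint8_t mask, int lanes) {
    assert(lanes >= 1 && lanes <= 8);
    const unsigned valid = (1u << lanes) - 1u;
    const unsigned bits = mask & valid;
    for (int i = lanes - 1; i >= 0; --i) {
        if ((bits >> i) & 1u) {
            return i;
        }
    }
    return -1;
}

// Model of the faulty sequence: the padding bits are set (mask | 0xf0)
// instead of cleared, so bit 7 is always the highest set bit.
constexpr int find_last_set_padding_set(std::uint8_t mask) {
    const unsigned bits = mask | 0xF0u;
    for (int i = 7; i >= 0; --i) {
        if ((bits >> i) & 1u) {
            return i;
        }
    }
    return -1;
}
```

With the mask `0b0101`, the first function returns 2, while the second returns 7 for every input, matching the constant `movl $7, %eax` that clang++ produces.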
[Bug libstdc++/118416] New: std::experimental::simd code detecting all zero is not optimized to simple ptest on x86-64 avx
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118416

Bug ID: 118416
Summary: std::experimental::simd code detecting all zero is not optimized to simple ptest on x86-64 avx
Product: gcc
Version: 14.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libstdc++
Assignee: unassigned at gcc dot gnu.org
Reporter: lee.imple at gmail dot com
Target Milestone: ---

The following code uses the experimental C++ standard library simd and aims to detect several all-zero patterns that are easy to test with the vptest instruction. All the code is available at https://godbolt.org/z/Kx68E1T6v .

```c++
#include
#include
namespace stdx = std::experimental;
template using simd_of = stdx::simd>;
using data_t = simd_of;

bool simple_ptest(data_t x) {
    return all_of(x == 0);
}
bool ptest_and(data_t a, data_t b) {
    return all_of((a & b) == 0);
}
bool ptest_andn(data_t a, data_t b) {
    return all_of((a & ~b) == 0);
}
```

Equivalent assembly (hand-written):

```asm
simple_ptest:
        vptest  %xmm0, %xmm0
        sete    %al
        ret
ptest_and:
        vptest  %xmm0, %xmm1
        sete    %al
        ret
ptest_andn:
        vptest  %xmm0, %xmm1
        setc    %al
        ret
```

But g++ generates the following code at `-O3 -march=x86-64-v3`, and clang++ and even Intel icpx generate almost the same assembly.

```asm
simple_ptest(std::experimental::parallelism_v2::simd >):
        vpxor   %xmm1, %xmm1, %xmm1
        vpcmpeqd %xmm1, %xmm0, %xmm0
        vpcmpeqd %xmm1, %xmm1, %xmm1
        vptest  %xmm1, %xmm0
        setc    %al
        ret
ptest_and(std::experimental::parallelism_v2::simd >, std::experimental::parallelism_v2::simd >):
        vpand   %xmm1, %xmm0, %xmm0
        vpxor   %xmm1, %xmm1, %xmm1
        vpcmpeqd %xmm1, %xmm0, %xmm0
        vpcmpeqd %xmm1, %xmm1, %xmm1
        vptest  %xmm1, %xmm0
        setc    %al
        ret
ptest_andn(std::experimental::parallelism_v2::simd >, std::experimental::parallelism_v2::simd >):
        vpandn  %xmm0, %xmm1, %xmm1
        vpxor   %xmm0, %xmm0, %xmm0
        vpcmpeqd %xmm0, %xmm1, %xmm1
        vpcmpeqd %xmm0, %xmm0, %xmm0
        vptest  %xmm0, %xmm1
        setc    %al
        ret
```

I don't know whether this is a missed optimization in g++ or a libstdc++ issue.
Since these compilers generate the same output from the same library code, I guess this is probably a library issue.

Possibly related: PR58790 ?
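For readers unfamiliar with the two flags involved, the following scalar model spells out what the hand-written versions rely on. `bits128`, `testz` and `testc` are hypothetical stand-ins that follow the operand convention of the SSE4.1 intrinsics `_mm_testz_si128` and `_mm_testc_si128`: ZF is set when `a & b` is all zero, and CF is set when `~a & b` is all zero. The model works on plain integers rather than vector registers, so it shows the logic only.

```cpp
#include <cstdint>

// 128 bits modelled as two 64-bit halves.
struct bits128 {
    std::uint64_t lo;
    std::uint64_t hi;
};

// Models the ZF result of PTEST: true when (a & b) is all zero.
constexpr bool testz(bits128 a, bits128 b) {
    return ((a.lo & b.lo) | (a.hi & b.hi)) == 0;
}

// Models the CF result of PTEST: true when (~a & b) is all zero.
constexpr bool testc(bits128 a, bits128 b) {
    return ((~a.lo & b.lo) | (~a.hi & b.hi)) == 0;
}

// all_of(x == 0): x & x is zero exactly when x is zero.
constexpr bool simple_ptest(bits128 x) { return testz(x, x); }

// all_of((a & b) == 0): read ZF directly.
constexpr bool ptest_and(bits128 a, bits128 b) { return testz(a, b); }

// all_of((a & ~b) == 0): read CF with the operands swapped.
constexpr bool ptest_andn(bits128 a, bits128 b) { return testc(b, a); }
```

One caveat of this model: it tests the whole 128-bit value at once, which matches a per-element `all_of(... == 0)` only because a vector is all zero exactly when every element is zero.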
[Bug libstdc++/118416] std::experimental::simd code detecting all zero is not optimized to simple ptest on x86-64 avx
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118416 --- Comment #1 from Imple Lee --- Possibly related: PR90483 .
[Bug libstdc++/118546] New: std::experimental::simd operator== fails to compile with clang++ 19.1.0 on x86-64-v4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118546

Bug ID: 118546
Summary: std::experimental::simd operator== fails to compile with clang++ 19.1.0 on x86-64-v4
Product: gcc
Version: 14.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libstdc++
Assignee: unassigned at gcc dot gnu.org
Reporter: lee.imple at gmail dot com
Target Milestone: ---

The following code fails to compile with clang++ 19.1.0 with the compilation options `-march=x86-64-v4 -std=c++20`. Online link: https://godbolt.org/z/sxqsKrv9e .

```c++
#include
#include

using deduce_t_element = std::experimental::simd<
    std::uint64_t,
    std::experimental::simd_abi::deduce_t
>;

auto f(deduce_t_element e) {
    return e == 0;
}
```

The error message (copied from godbolt):

```
In file included from :1:
In file included from /opt/compiler-explorer/gcc-14.2.0/lib/gcc/x86_64-linux-gnu/14.2.0/../../../../include/c++/14.2.0/experimental/simd:80:
/opt/compiler-explorer/gcc-14.2.0/lib/gcc/x86_64-linux-gnu/14.2.0/../../../../include/c++/14.2.0/experimental/bits/simd_x86.h:4232:18: error: static assertion failed due to requirement 'is_same_v'
 4232 |   static_assert(is_same_v<_Tp, __int_for_sizeof_t<_Tp>>);
      |   ^~~
/opt/compiler-explorer/gcc-14.2.0/lib/gcc/x86_64-linux-gnu/14.2.0/../../../../include/c++/14.2.0/experimental/bits/simd_x86.h:2244:26: note: in instantiation of function template specialization 'std::experimental::_MaskImplX86Mixin::_S_to_bits' requested here
 2244 |   return _MaskImpl::_S_to_bits(
      |   ^
/opt/compiler-explorer/gcc-14.2.0/lib/gcc/x86_64-linux-gnu/14.2.0/../../../../include/c++/14.2.0/experimental/bits/simd.h:5625:40: note: in instantiation of function template specialization 'std::experimental::_SimdImplX86>::_S_equal_to' requested here
 5625 |   { return simd::_S_make_mask(_Impl::_S_equal_to(__x._M_data, __y._M_data)); }
      |   ^
:10:14: note: in instantiation of member function 'std::experimental::operator==' requested here
   10 |   return e == 0;
      |   ^
1 error generated.
Compiler returned: 1
```
[Bug libstdc++/118546] std::experimental::simd operator== fails to compile with clang++ 19.1.0 on x86-64-v4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118546

Imple Lee changed:

           What    |Removed                     |Added
         Resolution|---                         |INVALID
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #1 from Imple Lee ---
Probably a clang++ bug rather than a libstdc++ bug. Reported as https://github.com/llvm/llvm-project/issues/132604 .