On 7/30/25 11:59, Tomasz Kaminski wrote:
On Wed, Jul 30, 2025 at 10:56 AM Luc Grosheintz <luc.groshei...@gmail.com> wrote:On 7/28/25 13:04, Tomasz Kaminski wrote:On Mon, Jul 28, 2025 at 10:24 AM Tomasz Kaminski <tkami...@redhat.com> wrote:On Mon, Jul 28, 2025 at 10:03 AM Luc Grosheintz <luc.groshei...@gmail.com>wrote:On 7/28/25 08:02, Tomasz Kaminski wrote:On Sun, Jul 27, 2025 at 2:47 PM Luc Grosheintz <luc.groshei...@gmail.com>wrote:The methods layout_{left,right}::mapping::stride are defined as \prod_{i = 0}^r E[i] \prod_{i = r+1}^n E[i] This is computed as the product of a pre-comupted static product andtheproduct of the required dynamic extents. Disassembly shows that even for low-rank extents, i.e. rank == 1 and rank == 2, with at least one dynamic extent, the generated codeloadstwo values; and then runs the loop over at most one element, e.g. 220: 48 8b 0c f5 00 00 00 mov rcx,QWORD PTR [rsi*8+0x0] 227: 00 228: 48 8b 04 f5 00 00 00 mov rax,QWORD PTR [rsi*8+0x0] 22f: 00 230: 48 c1 e1 02 shl rcx,0x2 234: 74 1a je 250 <stride_left_d5+0x30> 236: 48 01 f9 add rcx,rdi 239: 0f 1f 80 00 00 00 00 nop DWORD PTR [rax+0x0] 240: 48 63 17 movsxd rdx,DWORD PTR [rdi] 243: 48 83 c7 04 add rdi,0x4 247: 48 0f af c2 imul rax,rdx 24b: 48 39 f9 cmp rcx,rdi 24e: 75 f0 jne 240 <stride_left_d5+0x20> 250: c3 ret If there's no dynamic extents, it simply loads the precomputedproductof static extents. For rank == 1 the answer is constant `1`; for rank == 2 it's either 1orextents.extent(k), with k == 0 for layout_left and k == 1 for layout_right. Consider, using Ed = std::extents<int, dyn>; int stride_left_d(const std::layout_left::mapping<Ed>& m, size_tr){ return m.stride(r); } using E3d = std::extents<int, 3, dyn>; int stride_left_3d(const std::layout_left::mapping<E3d>& m,size_tr){ return m.stride(r); } using Ed5 = std::extents<int, dyn, 5>; int stride_left_d5(const std::layout_left::mapping<Ed5>& m,size_tr){ return m.stride(r); } The optimized code for these three cases is: 0000000000000060 <stride_left_d>: 60: b8 01 00 00 00 mov eax,0x1 65: c3 ret 0000000000000090 <stride_left_3d>: 90: 48 83 fe 01 cmp rsi,0x1 94: 19 c0 sbb eax,eax 96: 83 e0 fe and eax,0xfffffffe 99: 83 c0 03 add eax,0x3 9c: c3 ret 00000000000000a0 <stride_left_d5>: a0: b8 01 00 00 00 mov eax,0x1 a5: 48 85 f6 test rsi,rsi a8: 74 02 je ac <stride_left_d5+0xc> aa: 8b 07 mov eax,DWORD PTR [rdi] i ac: c3 ret For rank == 1 it simply returns 1 (as expected). For rank == 2, it either implements a branchless formula, or conditionally loads one value. In all cases involving a dynamic extent this seems like it's always doing clearly less work, both in terms of computation andloads.For rank == 2, it trades loading one value for a branchless sequenceoffour instructions that don't require loading any values.I will put this optimization into the __fwd_prod and __rev_pordfunctions,so it will be applied for all uses. This will also avoid us creatingthiscaching tables for such small ranks.The problem is that we don't have the same amount of information in the stride and __fwd_prod. The valid values for __r are 1, ..., rank; for __fwd_prod it's inclusive, while in stride it's exclusive.Therefore,we can do the optimization with one comparison less in stride than in __fwd_prod.Make sense, and I am OK having this optimization there. However, out of curiosity, if we put always_inline on the __fwd_prod, wouldn't compilerbeable to eliminate __rank == __i. Also once we have separate call to __size,wecould put assertions (just as a comment) that for __fwd_prod and __rev_prod__ris required to be smaller than rank().I think I would pursue that direction, and have similar if constexpr inthe__size function: __rank == 0 -> 1 __rank == 1 -> extent(0) __rank == 2 -> extent(0) * extent(1) __rank == rank_dynamic() -> __ext_prod with __sta_ext == 1 otherwise -> use __sta_extr.We know that __size is optimized very aggressively and effectively. On -O2 everything is inlined and no symbols are created, no static storage is needed to compute __size. This is regardless of rank.Are you referring to the final binary here, or does the same happen even if we compile an object file? I am also interested in impact of the linker.
The output below is for object files. The command would be: g++ size.cc -O2 -c size.o (but there's a lot more clutter to make it work with xg++).
The following: using E1 = std::extents<int, dyn>; int size1(const std::mdspan<double, E1>& md) { return md.size(); } using E2 = std::extents<int, 3, 5, dyn, dyn, 7, dyn, 11>; int size2(const std::mdspan<double, E2>& md) { return md.size(); } using E3 = std::extents<int, 3, dyn>; int size3(const std::mdspan<double, E3>& md) { return md.size(); } when compiled with -O2 -c size.o $ nm size.o 0000000000000000 T size1 0000000000000010 T size2 0000000000000030 T size3 $ objdump -h size.o size.o: file format elf64-x86-64 Sections: Idx Name Size VMA LMA File off Algn 0 .text 00000037 0000000000000000 0000000000000000 00000040 2**4 CONTENTS, ALLOC, LOAD, READONLY, CODE 1 .data 00000000 0000000000000000 0000000000000000 00000077 2**0 CONTENTS, ALLOC, LOAD, DATA 2 .bss 00000000 0000000000000000 0000000000000000 00000077 2**0 ALLOC 3 .comment 0000002b 0000000000000000 0000000000000000 00000077 2**0 CONTENTS, READONLY 4 .note.GNU-stack 00000000 0000000000000000 0000000000000000 000000a2 2**0 CONTENTS, READONLY 5 .note.gnu.property 00000030 0000000000000000 0000000000000000 000000a8 2**3 CONTENTS, ALLOC, LOAD, READONLY, DATA 6 .eh_frame 00000058 0000000000000000 0000000000000000 000000d8 2**3 CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA $ objdump -d size.o 0000000000000000 <size1>: 0: 8b 07 mov (%rdi),%eax 2: c3 ret 3: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1) a: 00 00 00 00 e: 66 90 xchg %ax,%ax 0000000000000010 <size2>: 10: 48 63 07 movslq (%rdi),%rax 13: 48 63 57 04 movslq 0x4(%rdi),%rdx 17: 48 0f af c2 imul %rdx,%rax 1b: 0f af 47 08 imul 0x8(%rdi),%eax 1f: 69 c0 83 04 00 00 imul $0x483,%eax,%eax 25: c3 ret 26: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1) 2d: 00 00 00 0000000000000030 <size3>: 30: 48 63 07 movslq (%rdi),%rax 33: 8d 04 40 lea (%rax,%rax,2),%eax 36: c3 ret Hence, I'm not sure where I need to look to find the cost of not putting the constexpr-if in __size. If I can see why it's slower, I'm happy to add them. However, since I've looked and can't see a cost, I'm reluctant to add them because they require test coverage for each branch (otherwise they're never compiled and could contain complete non-sense).Indeed, form you checks it looks like if constexpr are not necessary at all, however I would like __size to still avoid referencing _Rev_prod/_Fwd_prod at all. As alternative, what you think about defining the size as follows: constexpr size_t __rank = _Extents::rank(); constexpr size_t __sta_prod =__static_extents_prod(__exts); return __extents_prod(__exts, __sta_prod, 0u, __rank); This will alleviate or my concerns about creating _Rev_prod/_Fws_prod symbols, and allows us to decuple __fwd_prod/_rev_prod functions from it, and give them stronger preconditions.
Yes, this is much more palatable, thank you :)
I purposefully checked mdspan::size (part of the previous commit); and on optimized builds it fully unrolls the loop and does everything automatically, meaning it doesn't need help and we're not repeating the same optimization several times. As for avoiding the tables for small ranks, we can refactor _RevProd as follows: template<size_t... _Extents> struct _RevProd { size_t _S_value(size_t i) { return _S_data[i]; } private: array _S_data = consteval ...; } and use partial specialization to eliminate _S_data.libstdc++-v3/ChangeLog: * include/std/mdspan (layout_left::mapping::stride):Optimizefor rank <= 2. (layout_right::mapping::stride): Ditto. Signed-off-by: Luc Grosheintz <luc.groshei...@gmail.com> --- libstdc++-v3/include/std/mdspan | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/libstdc++-v3/include/std/mdspan b/libstdc++-v3/include/std/mdspan index 06ccf3e3827..f288af96cdb 100644 --- a/libstdc++-v3/include/std/mdspan +++ b/libstdc++-v3/include/std/mdspan @@ -652,7 +652,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION requires (extents_type::rank() > 0) { __glibcxx_assert(__i < extents_type::rank()); - return __mdspan::__fwd_prod(_M_extents, __i); + if constexpr (extents_type::rank() == 1) + return 1; + else if constexpr (extents_type::rank() == 2) + return __i == 0 ? 1 : _M_extents.extent(0); + else + return __mdspan::__fwd_prod(_M_extents, __i); } template<typename _OExtents> @@ -797,7 +802,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION requires (extents_type::rank() > 0) { __glibcxx_assert(__i < extents_type::rank()); - return __mdspan::__rev_prod(_M_extents, __i); + if constexpr (extents_type::rank() == 1) + return 1; + else if constexpr (extents_type::rank() == 2) + return __i == 0 ? _M_extents.extent(1) : 1; + else + return __mdspan::__rev_prod(_M_extents, __i); } template<typename _OExtents> -- 2.50.0