On 7/28/25 13:04, Tomasz Kaminski wrote:
On Mon, Jul 28, 2025 at 10:24 AM Tomasz Kaminski <tkami...@redhat.com>
wrote:
On Mon, Jul 28, 2025 at 10:03 AM Luc Grosheintz <luc.groshei...@gmail.com>
wrote:
On 7/28/25 08:02, Tomasz Kaminski wrote:
On Sun, Jul 27, 2025 at 2:47 PM Luc Grosheintz <
luc.groshei...@gmail.com>
wrote:
The methods layout_{left,right}::mapping::stride(r) are defined as

  layout_left:  \prod_{i = 0}^{r-1} E[i]
  layout_right: \prod_{i = r+1}^{n-1} E[i]

where n is the rank. Each is computed as the product of a pre-computed
static product and the product of the required dynamic extents.
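For reference, here is a minimal sketch (my own illustration, not the
libstdc++ implementation; the name fwd_prod_sketch is made up) of what
the layout_left formula amounts to for stride(r):

  #include <cstddef>

  // Product of extents 0 .. r-1, computed naively at runtime.
  template<typename _Extents>
    constexpr std::size_t
    fwd_prod_sketch(const _Extents& __exts, std::size_t __r)
    {
      std::size_t __prod = 1;
      for (std::size_t __i = 0; __i < __r; ++__i)
        __prod *= __exts.extent(__i);
      return __prod;
    }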
Disassembly shows that even for low-rank extents, i.e. rank == 1 and
rank == 2, with at least one dynamic extent, the generated code loads
two values and then runs a loop over at most one element, e.g.
220: 48 8b 0c f5 00 00 00 mov rcx,QWORD PTR [rsi*8+0x0]
227: 00
228: 48 8b 04 f5 00 00 00 mov rax,QWORD PTR [rsi*8+0x0]
22f: 00
230: 48 c1 e1 02 shl rcx,0x2
234: 74 1a je 250 <stride_left_d5+0x30>
236: 48 01 f9 add rcx,rdi
239: 0f 1f 80 00 00 00 00 nop DWORD PTR [rax+0x0]
240: 48 63 17 movsxd rdx,DWORD PTR [rdi]
243: 48 83 c7 04 add rdi,0x4
247: 48 0f af c2 imul rax,rdx
24b: 48 39 f9 cmp rcx,rdi
24e: 75 f0 jne 240 <stride_left_d5+0x20>
250: c3 ret
If there are no dynamic extents, it simply loads the precomputed product
of static extents.

For rank == 1 the answer is the constant `1`; for rank == 2 it's either
1 or extents.extent(k), with k == 0 for layout_left and k == 1 for
layout_right.
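These rank == 2 closed forms can be checked at compile time with a fully
static extents type (this snippet is my own addition, not part of the
patch):

  #include <mdspan>

  using E23 = std::extents<int, 2, 3>;

  static_assert(std::layout_left::mapping<E23>().stride(0) == 1);
  static_assert(std::layout_left::mapping<E23>().stride(1) == 2);  // extent(0)
  static_assert(std::layout_right::mapping<E23>().stride(0) == 3); // extent(1)
  static_assert(std::layout_right::mapping<E23>().stride(1) == 1);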
Consider,

using Ed = std::extents<int, dyn>;
int stride_left_d(const std::layout_left::mapping<Ed>& m, size_t r)
{ return m.stride(r); }

using E3d = std::extents<int, 3, dyn>;
int stride_left_3d(const std::layout_left::mapping<E3d>& m, size_t r)
{ return m.stride(r); }

using Ed5 = std::extents<int, dyn, 5>;
int stride_left_d5(const std::layout_left::mapping<Ed5>& m, size_t r)
{ return m.stride(r); }
The optimized code for these three cases is:
0000000000000060 <stride_left_d>:
60: b8 01 00 00 00 mov eax,0x1
65: c3 ret
0000000000000090 <stride_left_3d>:
90: 48 83 fe 01 cmp rsi,0x1
94: 19 c0 sbb eax,eax
96: 83 e0 fe and eax,0xfffffffe
99: 83 c0 03 add eax,0x3
9c: c3 ret
00000000000000a0 <stride_left_d5>:
a0: b8 01 00 00 00 mov eax,0x1
a5: 48 85 f6 test rsi,rsi
a8: 74 02 je ac <stride_left_d5+0xc>
aa: 8b 07 mov eax,DWORD PTR [rdi]
ac: c3 ret
For rank == 1 it simply returns 1 (as expected). For rank == 2, it
either implements a branchless formula or conditionally loads one value.
In all cases involving a dynamic extent it is clearly doing less work,
both in terms of computation and loads. In the rank == 2 case where the
needed extent is static (stride_left_3d), it trades loading one value
for a branchless sequence of four instructions that don't require
loading any values.
I would put this optimization into the __fwd_prod and __rev_prod
functions, so it is applied for all uses. This would also avoid creating
the caching tables for such small ranks.
The problem is that we don't have the same amount of information in
stride and __fwd_prod. The valid values of __r go up to rank: for
__fwd_prod the upper bound is inclusive, while in stride it is
exclusive. Therefore, we can do the optimization with one comparison
fewer in stride than in __fwd_prod.
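To illustrate the point (my own sketch with made-up names, not library
code): a rank == 2 product helper whose __r may equal rank() needs two
comparisons, while stride only ever sees __r < rank():

  #include <cstddef>

  // __r may be 0, 1 or 2 here, so two comparisons are needed.
  constexpr std::size_t
  fwd_prod_rank2(std::size_t e0, std::size_t e1, std::size_t r)
  { return r == 0 ? 1 : (r == 1 ? e0 : e0 * e1); }

  // stride() only sees r == 0 or r == 1, so one comparison suffices.
  constexpr std::size_t
  stride_left_rank2(std::size_t e0, std::size_t r)
  { return r == 0 ? 1 : e0; }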
Makes sense, and I am OK with having this optimization there. However,
out of curiosity: if we put always_inline on __fwd_prod, wouldn't the
compiler be able to eliminate the __rank == __i comparison? Also, once
we have a separate call to __size, we could put assertions (even if just
as a comment) that for __fwd_prod and __rev_prod __r is required to be
smaller than rank().
I think I would pursue that direction, and have a similar if constexpr
in the __size function:
__rank == 0 -> 1
__rank == 1 -> extent(0)
__rank == 2 -> extent(0) * extent(1)
__rank == rank_dynamic() -> __ext_prod with __sta_ext == 1
otherwise -> use __sta_ext.
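A rough sketch of that shape (the name size_sketch is made up, and the
__ext_prod / __sta_ext machinery is stood in by a plain loop, since only
the structure of the if constexpr chain matters here):

  #include <cstddef>

  template<typename _Extents>
    constexpr std::size_t
    size_sketch(const _Extents& __exts)
    {
      constexpr std::size_t __rank = _Extents::rank();
      if constexpr (__rank == 0)
        return 1;
      else if constexpr (__rank == 1)
        return __exts.extent(0);
      else if constexpr (__rank == 2)
        return std::size_t(__exts.extent(0)) * __exts.extent(1);
      else
        {
          // General case: product of all extents (stands in for the
          // __ext_prod / static-product split described above).
          std::size_t __prod = 1;
          for (std::size_t __i = 0; __i < __rank; ++__i)
            __prod *= __exts.extent(__i);
          return __prod;
        }
    }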
We know that __size is already optimized very aggressively and
effectively: at -O2 everything is inlined, no symbols are created, and
no static storage is needed to compute __size, regardless of rank.
The following:
using E1 = std::extents<int, dyn>;
int size1(const std::mdspan<double, E1>& md)
{ return md.size(); }
using E2 = std::extents<int, 3, 5, dyn, dyn, 7, dyn, 11>;
int size2(const std::mdspan<double, E2>& md)
{ return md.size(); }
using E3 = std::extents<int, 3, dyn>;
int size3(const std::mdspan<double, E3>& md)
{ return md.size(); }
when compiled with -O2 -c produces a size.o with:
$ nm size.o
0000000000000000 T size1
0000000000000010 T size2
0000000000000030 T size3
$ objdump -h size.o
size.o: file format elf64-x86-64
Sections:
Idx Name Size VMA LMA File off Algn
0 .text 00000037 0000000000000000 0000000000000000 00000040 2**4
CONTENTS, ALLOC, LOAD, READONLY, CODE
1 .data 00000000 0000000000000000 0000000000000000 00000077 2**0
CONTENTS, ALLOC, LOAD, DATA
2 .bss 00000000 0000000000000000 0000000000000000 00000077 2**0
ALLOC
3 .comment 0000002b 0000000000000000 0000000000000000 00000077 2**0
CONTENTS, READONLY
4 .note.GNU-stack 00000000 0000000000000000 0000000000000000 000000a2 2**0
CONTENTS, READONLY
5 .note.gnu.property 00000030 0000000000000000 0000000000000000 000000a8 2**3
CONTENTS, ALLOC, LOAD, READONLY, DATA
6 .eh_frame 00000058 0000000000000000 0000000000000000 000000d8 2**3
CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
$ objdump -d size.o
0000000000000000 <size1>:
0: 8b 07 mov (%rdi),%eax
2: c3 ret
3: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1)
a: 00 00 00 00
e: 66 90 xchg %ax,%ax
0000000000000010 <size2>:
10: 48 63 07 movslq (%rdi),%rax
13: 48 63 57 04 movslq 0x4(%rdi),%rdx
17: 48 0f af c2 imul %rdx,%rax
1b: 0f af 47 08 imul 0x8(%rdi),%eax
1f: 69 c0 83 04 00 00 imul $0x483,%eax,%eax
25: c3 ret
26: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
2d: 00 00 00
0000000000000030 <size3>:
30: 48 63 07 movslq (%rdi),%rax
33: 8d 04 40 lea (%rax,%rax,2),%eax
36: c3 ret
Hence, I'm not sure where to look to find the cost of not putting the
constexpr-if in __size. If I can see why it's slower, I'm happy to add
them. However, since I've looked and can't see a cost, I'm reluctant to
add them, because each branch requires test coverage (otherwise it's
never compiled and could contain complete nonsense).

I deliberately checked mdspan::size (part of the previous commit): on
optimized builds it fully unrolls the loop and does everything
automatically, meaning it doesn't need help and we're not repeating the
same optimization several times.
As for avoiding the tables for small ranks, we can refactor _RevProd
along these lines:

template<size_t... _Extents>
  struct _RevProd
  {
    static constexpr size_t
    _S_value(size_t __i)
    { return _S_data[__i]; }

  private:
    // Table of reverse products, computed at compile time.
    static constexpr array<size_t, sizeof...(_Extents)> _S_data = /* consteval ... */;
  };

and use partial specialization to eliminate _S_data for small ranks.
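For example (a hedged sketch, assuming _S_data[i] holds the product of
the static extents after index i), a specialization for a single extent
needs no table at all, since the reverse product past the only extent is
the empty product:

  template<size_t _E0>
    struct _RevProd<_E0>
    {
      static constexpr size_t
      _S_value(size_t)
      { return 1; } // no extents follow the only one
    };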
libstdc++-v3/ChangeLog:
* include/std/mdspan (layout_left::mapping::stride): Optimize
for rank <= 2.
(layout_right::mapping::stride): Ditto.
Signed-off-by: Luc Grosheintz <luc.groshei...@gmail.com>
---
libstdc++-v3/include/std/mdspan | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/libstdc++-v3/include/std/mdspan
b/libstdc++-v3/include/std/mdspan
index 06ccf3e3827..f288af96cdb 100644
--- a/libstdc++-v3/include/std/mdspan
+++ b/libstdc++-v3/include/std/mdspan
@@ -652,7 +652,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
requires (extents_type::rank() > 0)
{
__glibcxx_assert(__i < extents_type::rank());
- return __mdspan::__fwd_prod(_M_extents, __i);
+ if constexpr (extents_type::rank() == 1)
+ return 1;
+ else if constexpr (extents_type::rank() == 2)
+ return __i == 0 ? 1 : _M_extents.extent(0);
+ else
+ return __mdspan::__fwd_prod(_M_extents, __i);
}
template<typename _OExtents>
@@ -797,7 +802,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
requires (extents_type::rank() > 0)
{
__glibcxx_assert(__i < extents_type::rank());
- return __mdspan::__rev_prod(_M_extents, __i);
+ if constexpr (extents_type::rank() == 1)
+ return 1;
+ else if constexpr (extents_type::rank() == 2)
+ return __i == 0 ? _M_extents.extent(1) : 1;
+ else
+ return __mdspan::__rev_prod(_M_extents, __i);
}
template<typename _OExtents>
--
2.50.0