Re: [PATCH v1 3/6] libstdc++: Improve low-rank layout_{left, right}::stride.

Luc Grosheintz Wed, 30 Jul 2025 04:03:41 -0700



On 7/30/25 11:59, Tomasz Kaminski wrote:

On Wed, Jul 30, 2025 at 10:56 AM Luc Grosheintz <luc.groshei...@gmail.com>
wrote:



On 7/28/25 13:04, Tomasz Kaminski wrote:

On Mon, Jul 28, 2025 at 10:24 AM Tomasz Kaminski <tkami...@redhat.com>
wrote:




On Mon, Jul 28, 2025 at 10:03 AM Luc Grosheintz <

luc.groshei...@gmail.com>

wrote:



On 7/28/25 08:02, Tomasz Kaminski wrote:

On Sun, Jul 27, 2025 at 2:47 PM Luc Grosheintz <

luc.groshei...@gmail.com>

wrote:

The methods layout_{left,right}::mapping::stride are defined
as

     \prod_{i = 0}^r E[i]
     \prod_{i = r+1}^n E[i]

This is computed as the product of a pre-comupted static product and

the

product of the required dynamic extents.

Disassembly shows that even for low-rank extents, i.e. rank == 1 and
rank == 2, with at least one  dynamic extent, the generated code

loads

two values; and then runs the loop over at most one element, e.g.

    220:  48 8b 0c f5 00 00 00   mov    rcx,QWORD PTR [rsi*8+0x0]
    227:  00
    228:  48 8b 04 f5 00 00 00   mov    rax,QWORD PTR [rsi*8+0x0]
    22f:  00
    230:  48 c1 e1 02            shl    rcx,0x2
    234:  74 1a                  je     250 <stride_left_d5+0x30>
    236:  48 01 f9               add    rcx,rdi
    239:  0f 1f 80 00 00 00 00   nop    DWORD PTR [rax+0x0]
    240:  48 63 17               movsxd rdx,DWORD PTR [rdi]
    243:  48 83 c7 04            add    rdi,0x4
    247:  48 0f af c2            imul   rax,rdx
    24b:  48 39 f9               cmp    rcx,rdi
    24e:  75 f0                  jne    240 <stride_left_d5+0x20>
    250:  c3                     ret

If there's no dynamic extents, it simply loads the precomputed

product

of static extents.

For rank == 1 the answer is constant `1`; for rank == 2 it's either 1

or

extents.extent(k), with k == 0 for layout_left and k == 1 for
layout_right.

Consider,

     using Ed = std::extents<int, dyn>;
     int stride_left_d(const std::layout_left::mapping<Ed>& m, size_t

r)

     { return m.stride(r); }

     using E3d = std::extents<int, 3, dyn>;
     int stride_left_3d(const std::layout_left::mapping<E3d>& m,

size_t

r)

     { return m.stride(r); }

     using Ed5 = std::extents<int, dyn, 5>;
     int stride_left_d5(const std::layout_left::mapping<Ed5>& m,

size_t

r)

     { return m.stride(r); }

The optimized code for these three cases is:

     0000000000000060 <stride_left_d>:
     60:  b8 01 00 00 00         mov    eax,0x1
     65:  c3                     ret

     0000000000000090 <stride_left_3d>:
     90:  48 83 fe 01            cmp    rsi,0x1
     94:  19 c0                  sbb    eax,eax
     96:  83 e0 fe               and    eax,0xfffffffe
     99:  83 c0 03               add    eax,0x3
     9c:  c3                     ret

     00000000000000a0 <stride_left_d5>:
     a0:  b8 01 00 00 00         mov    eax,0x1
     a5:  48 85 f6               test   rsi,rsi
     a8:  74 02                  je     ac <stride_left_d5+0xc>
     aa:  8b 07                  mov    eax,DWORD PTR [rdi]
    i ac:  c3                     ret

For rank == 1 it simply returns 1 (as expected). For rank == 2, it
either implements a branchless formula, or conditionally loads one
value. In all cases involving a dynamic extent this seems like it's
always doing clearly less work, both in terms of computation and

loads.


For rank == 2, it trades loading one value for a branchless sequence

of

four instructions that don't require loading any values.

I will put this optimization into the __fwd_prod and __rev_pord

functions,

so it will be applied for all uses. This will also avoid us creating

this

caching
tables for such small ranks.


The problem is that we don't have the same amount of information in
the stride and __fwd_prod. The valid values for __r are 1, ..., rank;
for __fwd_prod it's inclusive, while in stride it's exclusive.

Therefore,

we can do the optimization with one comparison less in stride than in
__fwd_prod.

Make sense, and I am OK having this optimization there. However, out of
curiosity, if we put always_inline on the __fwd_prod, wouldn't compiler

be

able
to eliminate __rank == __i. Also once we have separate call to __size,

we

could
put assertions (just as a comment) that for __fwd_prod and __rev_prod

__r

is
required to be smaller than rank().

I think I would pursue that direction, and have similar if constexpr in

the

__size function:
__rank == 0 -> 1
__rank == 1 -> extent(0)
__rank == 2 -> extent(0) * extent(1)
__rank == rank_dynamic() -> __ext_prod with __sta_ext == 1
otherwise -> use __sta_extr.


We know that __size is optimized very aggressively and effectively. On
-O2 everything is inlined and no symbols are created, no static storage
is needed to compute __size. This is regardless of rank.

Are you referring to the final binary here, or does the same happen even if
we compile an object file? I am also interested in impact of the linker.


The output below is for object files. The command would be:

  g++ size.cc -O2 -c size.o

(but there's a lot more clutter to make it work with xg++).


The following:

    using E1 = std::extents<int, dyn>;
    int size1(const std::mdspan<double, E1>& md)
    { return md.size(); }

    using E2 = std::extents<int, 3, 5, dyn, dyn, 7, dyn, 11>;
    int size2(const std::mdspan<double, E2>& md)
    { return md.size(); }

    using E3 = std::extents<int, 3, dyn>;
    int size3(const std::mdspan<double, E3>& md)
    { return md.size(); }

when compiled with -O2 -c size.o

$ nm size.o
0000000000000000 T size1
0000000000000010 T size2
0000000000000030 T size3

$ objdump -h size.o
size.o:     file format elf64-x86-64

Sections:
Idx Name          Size      VMA               LMA               File off
Algn
    0 .text         00000037  0000000000000000  0000000000000000  00000040
2**4
                    CONTENTS, ALLOC, LOAD, READONLY, CODE
    1 .data         00000000  0000000000000000  0000000000000000  00000077
2**0
                    CONTENTS, ALLOC, LOAD, DATA
    2 .bss          00000000  0000000000000000  0000000000000000  00000077
2**0
                    ALLOC
    3 .comment      0000002b  0000000000000000  0000000000000000  00000077
2**0
                    CONTENTS, READONLY
    4 .note.GNU-stack 00000000  0000000000000000  0000000000000000
000000a2  2**0
                    CONTENTS, READONLY
    5 .note.gnu.property 00000030  0000000000000000  0000000000000000
000000a8  2**3
                    CONTENTS, ALLOC, LOAD, READONLY, DATA
    6 .eh_frame     00000058  0000000000000000  0000000000000000  000000d8
2**3
                    CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA

$ objdump -d size.o
0000000000000000 <size1>:
     0:  8b 07                   mov    (%rdi),%eax
     2:  c3                      ret
     3:  66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1)
     a:  00 00 00 00
     e:  66 90                   xchg   %ax,%ax

0000000000000010 <size2>:
    10:  48 63 07                movslq (%rdi),%rax
    13:  48 63 57 04             movslq 0x4(%rdi),%rdx
    17:  48 0f af c2             imul   %rdx,%rax
    1b:  0f af 47 08             imul   0x8(%rdi),%eax
    1f:  69 c0 83 04 00 00       imul   $0x483,%eax,%eax
    25:  c3                      ret
    26:  66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
    2d:  00 00 00

0000000000000030 <size3>:
    30:  48 63 07                movslq (%rdi),%rax
    33:  8d 04 40                lea    (%rax,%rax,2),%eax
    36:  c3                      ret

Hence, I'm not sure where I need to look to find the cost of not putting
the constexpr-if in __size. If I can see why it's slower, I'm happy to
add them. However, since I've looked and can't see a cost, I'm reluctant
to add them because they require test coverage for each branch (otherwise
they're never compiled and could contain complete non-sense).

Indeed, form you checks it looks like if constexpr are not necessary at all,
however I would like __size to still avoid referencing _Rev_prod/_Fwd_prod
at all.
As alternative, what you think about defining the size as follows:
      constexpr size_t __rank = _Extents::rank();
      constexpr size_t __sta_prod =__static_extents_prod(__exts);
      return __extents_prod(__exts, __sta_prod, 0u, __rank);
This will alleviate or my concerns about creating _Rev_prod/_Fws_prod
symbols,
and allows us to decuple __fwd_prod/_rev_prod functions from it, and give
them
stronger preconditions.


Yes, this is much more palatable, thank you :)

I purposefully checked mdspan::size (part of the previous commit); and
on optimized builds it fully unrolls the loop and does everything
automatically, meaning it doesn't need help and we're not repeating
the same optimization several times.

As for avoiding the tables for small ranks, we can refactor _RevProd
as follows:

     template<size_t... _Extents>
       struct _RevProd
       {
          size_t _S_value(size_t i)
          { return _S_data[i]; }

       private:
          array _S_data = consteval ...;
       }

and use partial specialization to eliminate _S_data.


libstdc++-v3/ChangeLog:

           * include/std/mdspan (layout_left::mapping::stride):

Optimize

           for rank <= 2.
           (layout_right::mapping::stride): Ditto.

Signed-off-by: Luc Grosheintz <luc.groshei...@gmail.com>
---
    libstdc++-v3/include/std/mdspan | 14 ++++++++++++--
    1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/libstdc++-v3/include/std/mdspan
b/libstdc++-v3/include/std/mdspan
index 06ccf3e3827..f288af96cdb 100644
--- a/libstdc++-v3/include/std/mdspan
+++ b/libstdc++-v3/include/std/mdspan
@@ -652,7 +652,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
          requires (extents_type::rank() > 0)
          {
           __glibcxx_assert(__i < extents_type::rank());
-       return __mdspan::__fwd_prod(_M_extents, __i);
+       if constexpr (extents_type::rank() == 1)
+         return 1;
+       else if constexpr (extents_type::rank() == 2)
+         return __i == 0 ? 1 : _M_extents.extent(0);
+       else
+         return __mdspan::__fwd_prod(_M_extents, __i);
          }

          template<typename _OExtents>
@@ -797,7 +802,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
          requires (extents_type::rank() > 0)
          {
           __glibcxx_assert(__i < extents_type::rank());
-       return __mdspan::__rev_prod(_M_extents, __i);
+       if constexpr (extents_type::rank() == 1)
+         return 1;
+       else if constexpr (extents_type::rank() == 2)
+         return __i == 0 ? _M_extents.extent(1) : 1;
+       else
+         return __mdspan::__rev_prod(_M_extents, __i);
          }

          template<typename _OExtents>
--
2.50.0

Re: [PATCH v1 3/6] libstdc++: Improve low-rank layout_{left, right}::stride.

Reply via email to