On 7/28/25 13:04, Tomasz Kaminski wrote:
On Mon, Jul 28, 2025 at 10:24 AM Tomasz Kaminski <tkami...@redhat.com>
wrote:
On Mon, Jul 28, 2025 at 10:03 AM Luc Grosheintz <luc.groshei...@gmail.com>
wrote:
On 7/28/25 08:02, Tomasz Kaminski wrote:
On Sun, Jul 27, 2025 at 2:47 PM Luc Grosheintz <
luc.groshei...@gmail.com>
wrote:
The methods layout_{left,right}::mapping::stride(r) are defined as

  layout_left:  \prod_{i = 0}^{r-1} E[i]
  layout_right: \prod_{i = r+1}^{n-1} E[i]

where n is the rank. Each is computed as the product of a pre-computed
static product and the product of the required dynamic extents.
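For reference, here is a minimal sketch (my own illustration, not the
libstdc++ implementation; the name fwd_prod_sketch is made up) of what
the layout_left formula amounts to for stride(r):

  #include <cstddef>

  // Product of extents 0 .. r-1, computed naively at runtime.
  template<typename _Extents>
    constexpr std::size_t
    fwd_prod_sketch(const _Extents& __exts, std::size_t __r)
    {
      std::size_t __prod = 1;
      for (std::size_t __i = 0; __i < __r; ++__i)
        __prod *= __exts.extent(__i);
      return __prod;
    }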
Disassembly shows that even for low-rank extents, i.e. rank == 1 and
rank == 2, with at least one dynamic extent, the generated code loads
two values and then runs a loop over at most one element, e.g.
220: 48 8b 0c f5 00 00 00 mov rcx,QWORD PTR [rsi*8+0x0]
227: 00
228: 48 8b 04 f5 00 00 00 mov rax,QWORD PTR [rsi*8+0x0]
22f: 00
230: 48 c1 e1 02 shl rcx,0x2
234: 74 1a je 250 <stride_left_d5+0x30>
236: 48 01 f9 add rcx,rdi
239: 0f 1f 80 00 00 00 00 nop DWORD PTR [rax+0x0]
240: 48 63 17 movsxd rdx,DWORD PTR [rdi]
243: 48 83 c7 04 add rdi,0x4
247: 48 0f af c2 imul rax,rdx
24b: 48 39 f9 cmp rcx,rdi
24e: 75 f0 jne 240 <stride_left_d5+0x20>
250: c3 ret
If there are no dynamic extents, it simply loads the precomputed product
of static extents.

For rank == 1 the answer is the constant `1`; for rank == 2 it's either
1 or extents.extent(k), with k == 0 for layout_left and k == 1 for
layout_right.
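These rank == 2 closed forms can be checked at compile time with a fully
static extents type (this snippet is my own addition, not part of the
patch):

  #include <mdspan>

  using E23 = std::extents<int, 2, 3>;

  static_assert(std::layout_left::mapping<E23>().stride(0) == 1);
  static_assert(std::layout_left::mapping<E23>().stride(1) == 2);  // extent(0)
  static_assert(std::layout_right::mapping<E23>().stride(0) == 3); // extent(1)
  static_assert(std::layout_right::mapping<E23>().stride(1) == 1);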
Consider,

using Ed = std::extents<int, dyn>;
int stride_left_d(const std::layout_left::mapping<Ed>& m, size_t r)
{ return m.stride(r); }

using E3d = std::extents<int, 3, dyn>;
int stride_left_3d(const std::layout_left::mapping<E3d>& m, size_t r)
{ return m.stride(r); }

using Ed5 = std::extents<int, dyn, 5>;
int stride_left_d5(const std::layout_left::mapping<Ed5>& m, size_t r)
{ return m.stride(r); }
The optimized code for these three cases is:
0000000000000060 <stride_left_d>:
60: b8 01 00 00 00 mov eax,0x1
65: c3 ret
0000000000000090 <stride_left_3d>:
90: 48 83 fe 01 cmp rsi,0x1
94: 19 c0 sbb eax,eax
96: 83 e0 fe and eax,0xfffffffe
99: 83 c0 03 add eax,0x3
9c: c3 ret
00000000000000a0 <stride_left_d5>:
a0: b8 01 00 00 00 mov eax,0x1
a5: 48 85 f6 test rsi,rsi
a8: 74 02 je ac <stride_left_d5+0xc>
aa: 8b 07 mov eax,DWORD PTR [rdi]
ac: c3 ret
For rank == 1 it simply returns 1 (as expected). For rank == 2, it
either implements a branchless formula or conditionally loads one value.
In all cases involving a dynamic extent it is clearly doing less work,
both in terms of computation and loads. In the rank == 2 case where the
needed extent is static (stride_left_3d), it trades loading one value
for a branchless sequence of four instructions that don't require
loading any values.
I would put this optimization into the __fwd_prod and __rev_prod
functions, so it is applied for all uses. This would also avoid creating
the caching tables for such small ranks.
The problem is that we don't have the same amount of information in
stride and __fwd_prod. The valid values of __r go up to rank: for
__fwd_prod the upper bound is inclusive, while in stride it is
exclusive. Therefore, we can do the optimization with one comparison
fewer in stride than in __fwd_prod.
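To illustrate the point (my own sketch with made-up names, not library
code): a rank == 2 product helper whose __r may equal rank() needs two
comparisons, while stride only ever sees __r < rank():

  #include <cstddef>

  // __r may be 0, 1 or 2 here, so two comparisons are needed.
  constexpr std::size_t
  fwd_prod_rank2(std::size_t e0, std::size_t e1, std::size_t r)
  { return r == 0 ? 1 : (r == 1 ? e0 : e0 * e1); }

  // stride() only sees r == 0 or r == 1, so one comparison suffices.
  constexpr std::size_t
  stride_left_rank2(std::size_t e0, std::size_t r)
  { return r == 0 ? 1 : e0; }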
Makes sense, and I am OK with having this optimization there. However,
out of curiosity: if we put always_inline on __fwd_prod, wouldn't the
compiler be able to eliminate the __rank == __i comparison? Also, once
we have a separate call to __size, we could put assertions (even if just
as a comment) that for __fwd_prod and __rev_prod __r is required to be
smaller than rank().
I think I would pursue that direction, and have a similar if constexpr
in the __size function:
__rank == 0 -> 1
__rank == 1 -> extent(0)
__rank == 2 -> extent(0) * extent(1)
__rank == rank_dynamic() -> __ext_prod with __sta_ext == 1
otherwise -> use __sta_ext.
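A rough sketch of that shape (the name size_sketch is made up, and the
__ext_prod / __sta_ext machinery is stood in by a plain loop, since only
the structure of the if constexpr chain matters here):

  #include <cstddef>

  template<typename _Extents>
    constexpr std::size_t
    size_sketch(const _Extents& __exts)
    {
      constexpr std::size_t __rank = _Extents::rank();
      if constexpr (__rank == 0)
        return 1;
      else if constexpr (__rank == 1)
        return __exts.extent(0);
      else if constexpr (__rank == 2)
        return std::size_t(__exts.extent(0)) * __exts.extent(1);
      else
        {
          // General case: product of all extents (stands in for the
          // __ext_prod / static-product split described above).
          std::size_t __prod = 1;
          for (std::size_t __i = 0; __i < __rank; ++__i)
            __prod *= __exts.extent(__i);
          return __prod;
        }
    }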
We know that __size is already optimized very aggressively and
effectively: at -O2 everything is inlined, no symbols are created, and
no static storage is needed to compute __size, regardless of rank.
The following:
using E1 = std::extents<int, dyn>;
int size1(const std::mdspan<double, E1>& md)
{ return md.size(); }
using E2 = std::extents<int, 3, 5, dyn, dyn, 7, dyn, 11>;
int size2(const std::mdspan<double, E2>& md)
{ return md.size(); }
using E3 = std::extents<int, 3, dyn>;
int size3(const std::mdspan<double, E3>& md)
{ return md.size(); }
when compiled with -O2 -c produces a size.o with:
$ nm size.o
0000000000000000 T size1
0000000000000010 T size2
0000000000000030 T size3
$ objdump -h size.o
size.o: file format elf64-x86-64
Sections:
Idx Name Size VMA LMA File off Algn
0 .text 00000037 0000000000000000 0000000000000000 00000040 2**4
CONTENTS, ALLOC, LOAD, READONLY, CODE
1 .data 00000000 0000000000000000 0000000000000000 00000077 2**0
CONTENTS, ALLOC, LOAD, DATA
2 .bss 00000000 0000000000000000 0000000000000000 00000077 2**0
ALLOC
3 .comment 0000002b 0000000000000000 0000000000000000 00000077 2**0
CONTENTS, READONLY
4 .note.GNU-stack 00000000 0000000000000000 0000000000000000 000000a2 2**0
CONTENTS, READONLY
5 .note.gnu.property 00000030 0000000000000000 0000000000000000 000000a8 2**3
CONTENTS, ALLOC, LOAD, READONLY, DATA
6 .eh_frame 00000058 0000000000000000 0000000000000000 000000d8 2**3
CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
$ objdump -d size.o
0000000000000000 <size1>:
0: 8b 07 mov (%rdi),%eax
2: c3 ret
3: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1)
a: 00 00 00 00
e: 66 90 xchg %ax,%ax
0000000000000010 <size2>:
10: 48 63 07 movslq (%rdi),%rax
13: 48 63 57 04 movslq 0x4(%rdi),%rdx
17: 48 0f af c2 imul %rdx,%rax
1b: 0f af 47 08 imul 0x8(%rdi),%eax
1f: 69 c0 83 04 00 00 imul $0x483,%eax,%eax
25: c3 ret
26: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
2d: 00 00 00
0000000000000030 <size3>:
30: 48 63 07 movslq (%rdi),%rax
33: 8d 04 40 lea (%rax,%rax,2),%eax
36: c3 ret
Hence, I'm not sure where to look to find the cost of not putting the
constexpr-if in __size. If I can see why it's slower, I'm happy to add
them. However, since I've looked and can't see a cost, I'm reluctant to
add them, because each branch requires test coverage (otherwise it's
never compiled and could contain complete nonsense).

I deliberately checked mdspan::size (part of the previous commit): on
optimized builds it fully unrolls the loop and does everything
automatically, meaning it doesn't need help and we're not repeating the
same optimization several times.
As for avoiding the tables for small ranks, we can refactor _RevProd
along these lines:

template<size_t... _Extents>
  struct _RevProd
  {
    static constexpr size_t
    _S_value(size_t __i)
    { return _S_data[__i]; }

  private:
    // Table of reverse products, computed at compile time.
    static constexpr array<size_t, sizeof...(_Extents)> _S_data = /* consteval ... */;
  };

and use partial specialization to eliminate _S_data for small ranks.
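For example (a hedged sketch, assuming _S_data[i] holds the product of
the static extents after index i), a specialization for a single extent
needs no table at all, since the reverse product past the only extent is
the empty product:

  template<size_t _E0>
    struct _RevProd<_E0>
    {
      static constexpr size_t
      _S_value(size_t)
      { return 1; } // no extents follow the only one
    };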
libstdc++-v3/ChangeLog:
* include/std/mdspan (layout_left::mapping::stride): Optimize
for rank <= 2.
(layout_right::mapping::stride): Ditto.
Signed-off-by: Luc Grosheintz <luc.groshei...@gmail.com>
---
libstdc++-v3/include/std/mdspan | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/libstdc++-v3/include/std/mdspan
b/libstdc++-v3/include/std/mdspan
index 06ccf3e3827..f288af96cdb 100644
--- a/libstdc++-v3/include/std/mdspan
+++ b/libstdc++-v3/include/std/mdspan
@@ -652,7 +652,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
requires (extents_type::rank() > 0)
{
__glibcxx_assert(__i < extents_type::rank());
- return __mdspan::__fwd_prod(_M_extents, __i);
+ if constexpr (extents_type::rank() == 1)
+ return 1;
+ else if constexpr (extents_type::rank() == 2)
+ return __i == 0 ? 1 : _M_extents.extent(0);
+ else
+ return __mdspan::__fwd_prod(_M_extents, __i);
}
template<typename _OExtents>
@@ -797,7 +802,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
requires (extents_type::rank() > 0)
{
__glibcxx_assert(__i < extents_type::rank());
- return __mdspan::__rev_prod(_M_extents, __i);
+ if constexpr (extents_type::rank() == 1)
+ return 1;
+ else if constexpr (extents_type::rank() == 2)
+ return __i == 0 ? _M_extents.extent(1) : 1;
+ else
+ return __mdspan::__rev_prod(_M_extents, __i);
}
template<typename _OExtents>
--
2.50.0