Re: [PATCH v2 0/8] libstdc++: Sequence of improvements to .

Luc Grosheintz Tue, 05 Aug 2025 06:13:37 -0700



On 8/5/25 14:55, Tomasz Kaminski wrote:

On Tue, Aug 5, 2025 at 2:49 PM Tomasz Kaminski <tkami...@redhat.com> wrote:



On Tue, Aug 5, 2025 at 2:14 PM Luc Grosheintz <luc.groshei...@gmail.com>
wrote:



On 8/5/25 13:25, Tomasz Kaminski wrote:

On Tue, Aug 5, 2025 at 1:14 PM Luc Grosheintz <luc.groshei...@gmail.com

wrote:



On 8/5/25 10:16, Tomasz Kaminski wrote:

Hi,

I have posted v3 patches with changes I have made locally for first 6
patches, and I think this series
is ready to land, in addition to
https://gcc.gnu.org/pipermail/libstdc++/2025-July/062727.html, that I
already reviewed.
I will keep aligned accessor on top of it.

As the separate commit, we should update the __fwd_partial_prod and
__rev_partial_prod computation
to use fact that this is partial prod, and we have previous partial
product computed:
       template<array _Extents>
         constexpr auto __fwd_partial_prods = [] consteval
           {
             constexpr size_t __rank = _Extents.size();
             std::array<size_t, __rank + 1> __ret;
             __ret[0] = 1;
             for(size_t __r = 1; __r < __rank + 1; ++__r)
               if (_Extents[__r] != dynamic_extents)
                 __ret[__r] = __ret[0] * _Extents[__r];
             return __ret;
           }();
We are doing this at compile time, but this should help composition

speed.

I would then inline __static_prod loop into the __mdspan::__size

function

and remove that function
entirely.


I checked both approaches, and it didn't affect the compile time in
the slightest, so I decided it wasn't worth it; and went with the
less error prone solution. OTOH, you're right, knowingly leaving in
something that's accidentally quadratic is begging for some obscure
problem to arise. Moreover, measuring these can be deceptive, because
they have a tendency to be fast enough, right up to the point where it
becomes prohibitively expensive.

Note that we are talking about the compile-time computation, that  is

done

by an interpreter. It will never be costfull at runtime. And remove the
unnecessary
element from __fwd_partial_prod that should be the size of __rank

instead

of __rank + 1.


Yes, but leaving in code that could bring the compiler / compilation
to a grinding halt doesn't seems smart either (in hindsight) :)


If you want I can fix this up in a separate patch.



I would be interested if we could reduce the __fwd/rev_partial_prod

sizes,

but not keep the outermost dimensions there,
i.e. adding a check for __r == 0 early in the __fwd_prod funciton.
           constexpr size_t __rank = _Extents::rank();
           constexpr auto& __sta_exts = __static_extents<_Extents>();
           if constexpr (__rank == 1)
             return 1;
          // new if here, that is extracted from __rank == 2
           else if (__r == 0)
             return 1;
           else if constexpr (__rank == 2)
             return _exts.extent(0);
           else if constexpr
(__all_dynamic(std::span(__sta_exts).first(__rank-1)))
             return __extents_prod(__exts, 1, 0, __r);
           else
             {
               size_t __sta_prod =

__fwd_partial_prods<__sta_exts>[__r-1];

//

size reduce by one here
               return __extents_prod(__exts, __sta_prod, 0, __r);
             }


To me it's not obvious how this one will go. Intuitively, this adds a
comparison and branching to avoid loading 8 bytes. Compared to the

other

size reductions, this has a lower upper limit as to how effective it's
going to be.

My idea here is that we will avoid storing 8 bytes per possible

combination

of extents values, to store a constant that always has a value of 1.
This seems wasteful to me. So, I would go for it if:
* produces the same code for rank == 0, as before
* does not prevent using vectorization for all dynamic extents with,

just

with
    addition of single if.


Since you really don't like wasting space: in many cases the 7 high
bytes of size_t will be zero (or 0xff...ff). What's stopping us from
storing [3, 5, 7] in an array<uint8_t>? We need to be very careful
handling dynamic_extent; but I think it's only used in checks, e.g.

    _Extent[i] == dynamic_extent

and that can be translated to

    _Extent[i] == decltype(_Extent[i])(-1)

and we've saved 8x .rodata space =)

Actually, using the product of all static extents to determine the element
type
__fwd_partial_prod/__rev_partial_prod sounds like something we could
explore
if someone complains about the sizes. Same for the _S_dynamic_index_data.

As I was unclear, I agree that changing the code to remove 1 extra element
of the array
is premature, and we may have better avenues to reduce that.
So let's just clear up the __fwd_prod computations and remove unused
__rank+1
element.


Thank you for clarifying, I'll send the __rank + 1 mistake soon.

It's very tempting to think one could load all the static extents of
a 4D array in a single cache line =)

If we ever get to it, we can also strip _S_dynamic_index, because
we only need k[i] if i is a dynamic index. Hence, anything after the
last dynamic index can be eliminated. And if we're allowed to subtract
a constant, then we can also strip from the front.

I don't think I'll work on this, since I don't have the required test
cases to properly measure this to justify the change; and I don't trust
myself to make the right call without very exhaustive measurements. I
also don't have access to a suitably large variety of hardware.


Similarly for __rev_prod:
           constexpr size_t __rank = _Extents::rank();
           constexpr auto& __sta_exts = __static_extents<_Extents>();
           if constexpr (__rank == 1)
             return 1;
          else if (__r == __rank - 1)
             return 1;
           else if constexpr (__rank == 2)
             return __exts.extent(1);
           else if constexpr
(__all_dynamic(std::span(__sta_exts).last(__rank-1)))
             return __extents_prod(__exts, 1, __r + 1, __rank);
           else
             {
               size_t __sta_prod =

__rev_partial_prods<__sta_exts>[__r-1];

//

size reduced by one here.
               return __extents_prod(__exts, __sta_prod, __r + 1,

__rank);

             }

Regards,
Tomasz


On Mon, Aug 4, 2025 at 7:51 PM Luc Grosheintz <

luc.groshei...@gmail.com>

wrote:



On 8/4/25 17:42, Tomasz Kaminski wrote:

On Mon, Aug 4, 2025 at 1:14 PM Tomasz Kaminski <tkami...@redhat.com

wrote:



On Mon, Aug 4, 2025 at 1:08 PM Luc Grosheintz <

luc.groshei...@gmail.com

wrote:

Hi Tomasz,

Thank you for the review! Sorry about the missing parens, even

after

"Ctrl+F"ing each of the emails I can't find the requires clause

(the

two I found both need the parens).

I was mentioning this one, but it is possible that I am wrong on

this

one,

I haven't yet checked it, as I was planning to do it during full

review.

        template<array _Extents>
          requires (__all_dynamic<_Extents>())
          class _StaticExtents<_Extents>;
But regardless, I will make that change locally if it does work, so

you

do

not
need to update patch series for it.

Parenthesis seem to be indeed required here, so my comment was a red
herring.


I'm pretty sure that this is the reason why I wrongly concluded
that the pattern is `requires(cond)`, i.e. the parens are always
needed.

Do you think it's possible to merge this series before the other

two

outstanding patch series? There's some risk of collision; making

changes

to this patch series is much more time consuming than rebasing

(and

retesting the other two patch series).

This sounds very reasonable. I will prepare a stack of patches in

your

suggested
order locally, and push them once they are approved. But it still

make


Thank you,

Thank you for your continued contributions.

Luc

On 8/4/25 12:43, Tomasz Kaminski wrote:

Hi,

I this time I made a quick pass through all changes, before

commenting

on

first commits,
and they look solid to me, and I haven't noticed anything I would

like

to

change (except parentheses
around requires, but I will handle that locally). I will try to

do a

full

review during this week.

Regards,
Tomasz

On Sun, Aug 3, 2025 at 10:59 PM Luc Grosheintz <

luc.groshei...@gmail.com>

wrote:

The combined effect of this sequence of change is:

       * a reduction in the number of template instantiations, by
         - avoiding needless dependency of IndexType,
         - special formulas for low-rank extents,
         - special formulas for (nearly) fully dynamic extents.

       * improved code quality, by
         - precomputing partial products of the static extents,
         - special cases for low-rank extents,
         - rewriting the condition E[i] == dynamic_extent in a

more

           optimizer friendly manner.
         - effectively loop-unrolling extents::operator==.

While simplistic micro-benchmarking shows the effectiveness of

these

changes, likely the stronger argument is presented in each

commit:

       a) each change removes needless complexity,
       b) before/after examples of generated code show the

effectiveness.


Luc Grosheintz (8):
       libstdc++: Reduce template instantiations in <mdspan>.
       libstdc++: Precompute products of static extents.
       libstdc++: Improve low-rank layout_{left,right}::stride.
       libstdc++: Improve fully dynamic extents in mdspan.
       libstdc++: Improve nearly fully dynamic extents in mdspan.
       libstdc++: Reduce indirection in extents::extent.
       libstdc++: Improve extents::operator==.
       libstdc++: Replace numeric_limit with __int_traits in

mdspan.


      libstdc++-v3/include/std/mdspan               | 282

+++++++++++++-----

      .../mdspan/extents/class_mandates_neg.cc      |   3 +
      2 files changed, 208 insertions(+), 77 deletions(-)

--
2.50.0

Re: [PATCH v2 0/8] libstdc++: Sequence of improvements to .

Reply via email to