On 8/5/25 14:55, Tomasz Kaminski wrote:
On Tue, Aug 5, 2025 at 2:49 PM Tomasz Kaminski <tkami...@redhat.com> wrote:
On Tue, Aug 5, 2025 at 2:14 PM Luc Grosheintz <luc.groshei...@gmail.com>
wrote:
On 8/5/25 13:25, Tomasz Kaminski wrote:
On Tue, Aug 5, 2025 at 1:14 PM Luc Grosheintz <luc.groshei...@gmail.com
wrote:
On 8/5/25 10:16, Tomasz Kaminski wrote:
Hi,
I have posted v3 patches with changes I have made locally for first 6
patches, and I think this series
is ready to land, in addition to
https://gcc.gnu.org/pipermail/libstdc++/2025-July/062727.html, that I
already reviewed.
I will keep aligned accessor on top of it.
As the separate commit, we should update the __fwd_partial_prod and
__rev_partial_prod computation
to use fact that this is partial prod, and we have previous partial
product computed:
template<array _Extents>
constexpr auto __fwd_partial_prods = [] consteval
{
constexpr size_t __rank = _Extents.size();
std::array<size_t, __rank + 1> __ret;
__ret[0] = 1;
for(size_t __r = 1; __r < __rank + 1; ++__r)
if (_Extents[__r] != dynamic_extents)
__ret[__r] = __ret[0] * _Extents[__r];
return __ret;
}();
We are doing this at compile time, but this should help composition
speed.
I would then inline __static_prod loop into the __mdspan::__size
function
and remove that function
entirely.
I checked both approaches, and it didn't affect the compile time in
the slightest, so I decided it wasn't worth it; and went with the
less error prone solution. OTOH, you're right, knowingly leaving in
something that's accidentally quadratic is begging for some obscure
problem to arise. Moreover, measuring these can be deceptive, because
they have a tendency to be fast enough, right up to the point where it
becomes prohibitively expensive.
Note that we are talking about the compile-time computation, that is
done
by an interpreter. It will never be costfull at runtime. And remove the
unnecessary
element from __fwd_partial_prod that should be the size of __rank
instead
of __rank + 1.
Yes, but leaving in code that could bring the compiler / compilation
to a grinding halt doesn't seems smart either (in hindsight) :)
If you want I can fix this up in a separate patch.
I would be interested if we could reduce the __fwd/rev_partial_prod
sizes,
but not keep the outermost dimensions there,
i.e. adding a check for __r == 0 early in the __fwd_prod funciton.
constexpr size_t __rank = _Extents::rank();
constexpr auto& __sta_exts = __static_extents<_Extents>();
if constexpr (__rank == 1)
return 1;
// new if here, that is extracted from __rank == 2
else if (__r == 0)
return 1;
else if constexpr (__rank == 2)
return _exts.extent(0);
else if constexpr
(__all_dynamic(std::span(__sta_exts).first(__rank-1)))
return __extents_prod(__exts, 1, 0, __r);
else
{
size_t __sta_prod =
__fwd_partial_prods<__sta_exts>[__r-1];
//
size reduce by one here
return __extents_prod(__exts, __sta_prod, 0, __r);
}
To me it's not obvious how this one will go. Intuitively, this adds a
comparison and branching to avoid loading 8 bytes. Compared to the
other
size reductions, this has a lower upper limit as to how effective it's
going to be.
My idea here is that we will avoid storing 8 bytes per possible
combination
of extents values, to store a constant that always has a value of 1.
This seems wasteful to me. So, I would go for it if:
* produces the same code for rank == 0, as before
* does not prevent using vectorization for all dynamic extents with,
just
with
addition of single if.
Since you really don't like wasting space: in many cases the 7 high
bytes of size_t will be zero (or 0xff...ff). What's stopping us from
storing [3, 5, 7] in an array<uint8_t>? We need to be very careful
handling dynamic_extent; but I think it's only used in checks, e.g.
_Extent[i] == dynamic_extent
and that can be translated to
_Extent[i] == decltype(_Extent[i])(-1)
and we've saved 8x .rodata space =)
Actually, using the product of all static extents to determine the element
type
__fwd_partial_prod/__rev_partial_prod sounds like something we could
explore
if someone complains about the sizes. Same for the _S_dynamic_index_data.
As I was unclear, I agree that changing the code to remove 1 extra element
of the array
is premature, and we may have better avenues to reduce that.
So let's just clear up the __fwd_prod computations and remove unused
__rank+1
element.
Thank you for clarifying, I'll send the __rank + 1 mistake soon.
It's very tempting to think one could load all the static extents of
a 4D array in a single cache line =)
If we ever get to it, we can also strip _S_dynamic_index, because
we only need k[i] if i is a dynamic index. Hence, anything after the
last dynamic index can be eliminated. And if we're allowed to subtract
a constant, then we can also strip from the front.
I don't think I'll work on this, since I don't have the required test
cases to properly measure this to justify the change; and I don't trust
myself to make the right call without very exhaustive measurements. I
also don't have access to a suitably large variety of hardware.
Similarly for __rev_prod:
constexpr size_t __rank = _Extents::rank();
constexpr auto& __sta_exts = __static_extents<_Extents>();
if constexpr (__rank == 1)
return 1;
else if (__r == __rank - 1)
return 1;
else if constexpr (__rank == 2)
return __exts.extent(1);
else if constexpr
(__all_dynamic(std::span(__sta_exts).last(__rank-1)))
return __extents_prod(__exts, 1, __r + 1, __rank);
else
{
size_t __sta_prod =
__rev_partial_prods<__sta_exts>[__r-1];
//
size reduced by one here.
return __extents_prod(__exts, __sta_prod, __r + 1,
__rank);
}
Regards,
Tomasz
On Mon, Aug 4, 2025 at 7:51 PM Luc Grosheintz <
luc.groshei...@gmail.com>
wrote:
On 8/4/25 17:42, Tomasz Kaminski wrote:
On Mon, Aug 4, 2025 at 1:14 PM Tomasz Kaminski <tkami...@redhat.com
wrote:
On Mon, Aug 4, 2025 at 1:08 PM Luc Grosheintz <
luc.groshei...@gmail.com
wrote:
Hi Tomasz,
Thank you for the review! Sorry about the missing parens, even
after
"Ctrl+F"ing each of the emails I can't find the requires clause
(the
two I found both need the parens).
I was mentioning this one, but it is possible that I am wrong on
this
one,
I haven't yet checked it, as I was planning to do it during full
review.
template<array _Extents>
requires (__all_dynamic<_Extents>())
class _StaticExtents<_Extents>;
But regardless, I will make that change locally if it does work, so
you
do
not
need to update patch series for it.
Parenthesis seem to be indeed required here, so my comment was a red
herring.
I'm pretty sure that this is the reason why I wrongly concluded
that the pattern is `requires(cond)`, i.e. the parens are always
needed.
Do you think it's possible to merge this series before the other
two
outstanding patch series? There's some risk of collision; making
changes
to this patch series is much more time consuming than rebasing
(and
retesting the other two patch series).
This sounds very reasonable. I will prepare a stack of patches in
your
suggested
order locally, and push them once they are approved. But it still
make
Thank you,
Thank you for your continued contributions.
Luc
On 8/4/25 12:43, Tomasz Kaminski wrote:
Hi,
I this time I made a quick pass through all changes, before
commenting
on
first commits,
and they look solid to me, and I haven't noticed anything I would
like
to
change (except parentheses
around requires, but I will handle that locally). I will try to
do a
full
review during this week.
Regards,
Tomasz
On Sun, Aug 3, 2025 at 10:59 PM Luc Grosheintz <
luc.groshei...@gmail.com>
wrote:
The combined effect of this sequence of change is:
* a reduction in the number of template instantiations, by
- avoiding needless dependency of IndexType,
- special formulas for low-rank extents,
- special formulas for (nearly) fully dynamic extents.
* improved code quality, by
- precomputing partial products of the static extents,
- special cases for low-rank extents,
- rewriting the condition E[i] == dynamic_extent in a
more
optimizer friendly manner.
- effectively loop-unrolling extents::operator==.
While simplistic micro-benchmarking shows the effectiveness of
these
changes, likely the stronger argument is presented in each
commit:
a) each change removes needless complexity,
b) before/after examples of generated code show the
effectiveness.
Luc Grosheintz (8):
libstdc++: Reduce template instantiations in <mdspan>.
libstdc++: Precompute products of static extents.
libstdc++: Improve low-rank layout_{left,right}::stride.
libstdc++: Improve fully dynamic extents in mdspan.
libstdc++: Improve nearly fully dynamic extents in mdspan.
libstdc++: Reduce indirection in extents::extent.
libstdc++: Improve extents::operator==.
libstdc++: Replace numeric_limit with __int_traits in
mdspan.
libstdc++-v3/include/std/mdspan | 282
+++++++++++++-----
.../mdspan/extents/class_mandates_neg.cc | 3 +
2 files changed, 208 insertions(+), 77 deletions(-)
--
2.50.0