[PATCH] libstdc++: Use canonical loop form in std::reduce

2025-01-31 Thread Abhishek Kaushik
>From 4ac7c7e56e23ed2f4dd2dafdfab6cfa110c14260 Mon Sep 17 00:00:00 2001
From: Abhishek Kaushik 
Date: Fri, 31 Jan 2025 01:28:48 -0800
Subject: [PATCH] libstdc++: Use canonical loop form in std::reduce

The current while loop in std::reduce and related functions is hard to
vectorize because the loop control variable is hard to detect.

`while ((__last - __first) >= 4)`

Changing the loop header to a for loop following the OpenMP canonical
form allows easy vectorization, resulting in improved performance.

`for (; __first <= __last - 4; __first += 4)`

This patch modifies the loop header for std::reduce & std::transform_reduce.
---
 libstdc++-v3/include/std/numeric | 10 +++---
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/libstdc++-v3/include/std/numeric b/libstdc++-v3/include/std/numeric
index 4d36fcd36d9..9c38ad89e21 100644
--- a/libstdc++-v3/include/std/numeric
+++ b/libstdc++-v3/include/std/numeric
@@ -300,13 +300,12 @@ namespace __detail
   static_assert(is_invocable_r_v<_Tp, _BinaryOperation&, __ref, __ref>);
   if constexpr (__is_random_access_iter<_InputIterator>::value)
{
- while ((__last - __first) >= 4)
+ for (; __first <= __last - 4; __first += 4)
{
  _Tp __v1 = __binary_op(__first[0], __first[1]);
  _Tp __v2 = __binary_op(__first[2], __first[3]);
  _Tp __v3 = __binary_op(__v1, __v2);
  __init = __binary_op(__init, __v3);
- __first += 4;
}
}
   for (; __first != __last; ++__first)
@@ -381,7 +380,7 @@ namespace __detail
   if constexpr (__and_v<__is_random_access_iter<_InputIterator1>,
__is_random_access_iter<_InputIterator2>>)
{
- while ((__last1 - __first1) >= 4)
+ for (; __first1 <= __last1 - 4; __first1 += 4, __first2 += 4)
{
  _Tp __v1 = __binary_op1(__binary_op2(__first1[0], __first2[0]),
  __binary_op2(__first1[1], __first2[1]));
@@ -389,8 +388,6 @@ namespace __detail
  __binary_op2(__first1[3], __first2[3]));
  _Tp __v3 = __binary_op1(__v1, __v2);
  __init = __binary_op1(__init, __v3);
- __first1 += 4;
- __first2 += 4;
}
}
   for (; __first1 != __last1; ++__first1, (void) ++__first2)
@@ -447,7 +444,7 @@ namespace __detail
 {
   if constexpr (__is_random_access_iter<_InputIterator>::value)
{
- while ((__last - __first) >= 4)
+ for (; __first <= __last - 4; __first += 4)
{
  _Tp __v1 = __binary_op(__unary_op(__first[0]),
 __unary_op(__first[1]));
@@ -455,7 +452,6 @@ namespace __detail
 __unary_op(__first[3]));
  _Tp __v3 = __binary_op(__v1, __v2);
  __init = __binary_op(__init, __v3);
- __first += 4;
}
}
   for (; __first != __last; ++__first)
--
2.31.1






Re: [PATCH] libstdc++: Use canonical loop form in std::reduce

2025-01-31 Thread Abhishek Kaushik
Sorry for the confusion, the change is for the intel compiler which is not able 
to vectorize correctly the while loop.
I'll change the commit message to show this clearly.

But it looks like the change still might be beneficial to g++: 
https://godbolt.org/z/Mo3PdxbaY



From: Jonathan Wakely 
Sent: Friday, January 31, 2025 7:19 PM
To: Richard Biener 
Cc: Abhishek Kaushik ; libstd...@gcc.gnu.org 
; gcc-patches@gcc.gnu.org 
Subject: Re: [PATCH] libstdc++: Use canonical loop form in std::reduce

On Fri, 31 Jan 2025 at 12:48, Richard Biener  wrote:
>
> On Fri, Jan 31, 2025 at 12:01 PM Abhishek Kaushik
>  wrote:
> >
> > From 4ac7c7e56e23ed2f4dd2dafdfab6cfa110c14260 Mon Sep 17 00:00:00 2001
> > From: Abhishek Kaushik 
> > Date: Fri, 31 Jan 2025 01:28:48 -0800
> > Subject: [PATCH] libstdc++: Use canonical loop form in std::reduce
> >
> > The current while loop in std::reduce and related functions is hard to
> > vectorize because the loop control variable is hard to detect.
> >
> > `while ((__last - __first) >= 4)`
> >
> > Changing the loop header to a for loop following the OpenMP canonical
> > form allows easy vectorization, resulting in improved performance.
> >
> > `for (; __first <= __last - 4; __first += 4)`
> >
> > This patch modifies the loop header for std::reduce & std::transform_reduce.
>
> Can you add a testcase to g++.dg/vect/ that is now vectorized but not before?

According to https://gcc.gnu.org/pipermail/libstdc++/2025-January/060353.html
this is only a problem for the Intel compiler, not for GCC. So a GCC
testcase doesn't help.

But if it's only for Intel, then the commit msg should say that.


>
> Thanks,
> Richard.
>
> > ---
> >  libstdc++-v3/include/std/numeric | 10 +++---
> >  1 file changed, 3 insertions(+), 7 deletions(-)
> >
> > diff --git a/libstdc++-v3/include/std/numeric 
> > b/libstdc++-v3/include/std/numeric
> > index 4d36fcd36d9..9c38ad89e21 100644
> > --- a/libstdc++-v3/include/std/numeric
> > +++ b/libstdc++-v3/include/std/numeric
> > @@ -300,13 +300,12 @@ namespace __detail
> >static_assert(is_invocable_r_v<_Tp, _BinaryOperation&, __ref, 
> > __ref>);
> >if constexpr (__is_random_access_iter<_InputIterator>::value)
> > {
> > - while ((__last - __first) >= 4)
> > + for (; __first <= __last - 4; __first += 4)
> > {
> >   _Tp __v1 = __binary_op(__first[0], __first[1]);
> >   _Tp __v2 = __binary_op(__first[2], __first[3]);
> >   _Tp __v3 = __binary_op(__v1, __v2);
> >   __init = __binary_op(__init, __v3);
> > - __first += 4;
> > }
> > }
> >for (; __first != __last; ++__first)
> > @@ -381,7 +380,7 @@ namespace __detail
> >if constexpr (__and_v<__is_random_access_iter<_InputIterator1>,
> > __is_random_access_iter<_InputIterator2>>)
> > {
> > - while ((__last1 - __first1) >= 4)
> > + for (; __first1 <= __last1 - 4; __first1 += 4, __first2 += 4)
> > {
> >   _Tp __v1 = __binary_op1(__binary_op2(__first1[0], __first2[0]),
> >   __binary_op2(__first1[1], __first2[1]));
> > @@ -389,8 +388,6 @@ namespace __detail
> >   __binary_op2(__first1[3], __first2[3]));
> >   _Tp __v3 = __binary_op1(__v1, __v2);
> >   __init = __binary_op1(__init, __v3);
> > - __first1 += 4;
> > - __first2 += 4;
> > }
> > }
> >for (; __first1 != __last1; ++__first1, (void) ++__first2)
> > @@ -447,7 +444,7 @@ namespace __detail
> >  {
> >if constexpr (__is_random_access_iter<_InputIterator>::value)
> > {
> > - while ((__last - __first) >= 4)
> > + for (; __first <= __last - 4; __first += 4)
> > {
> >   _Tp __v1 = __binary_op(__unary_op(__first[0]),
> >  __unary_op(__first[1]));
> > @@ -455,7 +452,6 @@ namespace __detail
> >  __unary_op(__first[3]));
> >   _Tp __v3 = __binary_op(__v1, __v2);
> >   __init = __binary_op(__init, __v3);
> > - __first += 4;
> > }
> > }
> >for (; __first != __last; ++__first)
> > --
> > 2.31.1
> >
> >
> >
> >
>



[PATCH] libstdc++: Use canonical loop form in std::reduce

2025-01-31 Thread Abhishek Kaushik
>From 7a7c9a2a976fbb29f67c46284e7c1581cbe8cb07 Mon Sep 17 00:00:00 2001
From: Abhishek Kaushik 
Date: Fri, 31 Jan 2025 01:28:48 -0800
Subject: [PATCH] libstdc++: Use canonical loop form in std::reduce

This change is for the INTEL C compiler (icx).

The current while loop in std::reduce and related functions is hard to
vectorize because the loop control variable is hard to detect in icx.

`while ((__last - __first) >= 4)`

Changing the loop header to a for loop following the OpenMP canonical
form allows easy vectorization, resulting in improved performance.

`for (; __first <= __last - 4; __first += 4)`

This patch modifies the loop header for std::reduce & std::transform_reduce.
---
 libstdc++-v3/include/std/numeric | 10 +++---
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/libstdc++-v3/include/std/numeric b/libstdc++-v3/include/std/numeric
index 4d36fcd36d9..9c38ad89e21 100644
--- a/libstdc++-v3/include/std/numeric
+++ b/libstdc++-v3/include/std/numeric
@@ -300,13 +300,12 @@ namespace __detail
   static_assert(is_invocable_r_v<_Tp, _BinaryOperation&, __ref, __ref>);
   if constexpr (__is_random_access_iter<_InputIterator>::value)
{
- while ((__last - __first) >= 4)
+ for (; __first <= __last - 4; __first += 4)
{
  _Tp __v1 = __binary_op(__first[0], __first[1]);
  _Tp __v2 = __binary_op(__first[2], __first[3]);
  _Tp __v3 = __binary_op(__v1, __v2);
  __init = __binary_op(__init, __v3);
- __first += 4;
}
}
   for (; __first != __last; ++__first)
@@ -381,7 +380,7 @@ namespace __detail
   if constexpr (__and_v<__is_random_access_iter<_InputIterator1>,
__is_random_access_iter<_InputIterator2>>)
{
- while ((__last1 - __first1) >= 4)
+ for (; __first1 <= __last1 - 4; __first1 += 4, __first2 += 4)
{
  _Tp __v1 = __binary_op1(__binary_op2(__first1[0], __first2[0]),
  __binary_op2(__first1[1], __first2[1]));
@@ -389,8 +388,6 @@ namespace __detail
  __binary_op2(__first1[3], __first2[3]));
  _Tp __v3 = __binary_op1(__v1, __v2);
  __init = __binary_op1(__init, __v3);
- __first1 += 4;
- __first2 += 4;
}
}
   for (; __first1 != __last1; ++__first1, (void) ++__first2)
@@ -447,7 +444,7 @@ namespace __detail
 {
   if constexpr (__is_random_access_iter<_InputIterator>::value)
{
- while ((__last - __first) >= 4)
+ for (; __first <= __last - 4; __first += 4)
{
  _Tp __v1 = __binary_op(__unary_op(__first[0]),
 __unary_op(__first[1]));
@@ -455,7 +452,6 @@ namespace __detail
 __unary_op(__first[3]));
  _Tp __v3 = __binary_op(__v1, __v2);
  __init = __binary_op(__init, __v3);
- __first += 4;
}
}
   for (; __first != __last; ++__first)
--
2.31.1





From: Abhishek Kaushik
Sent: Friday, January 31, 2025 4:28 PM
To: libstd...@gcc.gnu.org 
Cc: gcc-patches@gcc.gnu.org 
Subject: [PATCH] libstdc++: Use canonical loop form in std::reduce

>From 4ac7c7e56e23ed2f4dd2dafdfab6cfa110c14260 Mon Sep 17 00:00:00 2001
From: Abhishek Kaushik 
Date: Fri, 31 Jan 2025 01:28:48 -0800
Subject: [PATCH] libstdc++: Use canonical loop form in std::reduce

The current while loop in std::reduce and related functions is hard to
vectorize because the loop control variable is hard to detect.

`while ((__last - __first) >= 4)`

Changing the loop header to a for loop following the OpenMP canonical
form allows easy vectorization, resulting in improved performance.

`for (; __first <= __last - 4; __first += 4)`

This patch modifies the loop header for std::reduce & std::transform_reduce.
---
 libstdc++-v3/include/std/numeric | 10 +++---
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/libstdc++-v3/include/std/numeric b/libstdc++-v3/include/std/numeric
index 4d36fcd36d9..9c38ad89e21 100644
--- a/libstdc++-v3/include/std/numeric
+++ b/libstdc++-v3/include/std/numeric
@@ -300,13 +300,12 @@ namespace __detail
   static_assert(is_invocable_r_v<_Tp, _BinaryOperation&, __ref, __ref>);
   if constexpr (__is_random_access_iter<_InputIterator>::value)
{
- while ((__last - __first) >= 4)
+ for (; __first <= __last - 4; __first += 4)
{
  _Tp __v1 = __binary_op(__first[0], __first[1]);
  _Tp __v2 = __binary_op(__first[2], __first[3]);
  _Tp __v3 = __binary_op(__v1, __v2);
  __init = __binary_op(__init, __v3);
- __first += 4;
}
}
   for (; __first != __last; ++__first)
@@ -381,7 +380,7 @@ namespace __detail
   if constexpr (__and_v<__is_random_access_iter<_InputIterator1>,
__is_random_access_iter<_InputIterator2>>)
{
- while ((__last1 - __f

Re: [PATCH] libstdc++: Use canonical loop form in std::reduce

2025-01-31 Thread Abhishek Kaushik
  *
ICX needs to be improved here

Yes, we're trying to fix this but I figure I could also try asking politely.

  *
a user could write such code himself.

But it still makes sense for std::reduce to be faster than a hand-written 
reduce because we assume that as users of stl :)


From: Richard Biener 
Sent: Friday, January 31, 2025 8:27 PM
To: Jonathan Wakely 
Cc: Abhishek Kaushik ; libstd...@gcc.gnu.org 
; gcc-patches@gcc.gnu.org 
Subject: Re: [PATCH] libstdc++: Use canonical loop form in std::reduce

On Fri, Jan 31, 2025 at 2:50 PM Jonathan Wakely  wrote:
>
> On Fri, 31 Jan 2025 at 12:48, Richard Biener  
> wrote:
> >
> > On Fri, Jan 31, 2025 at 12:01 PM Abhishek Kaushik
> >  wrote:
> > >
> > > From 4ac7c7e56e23ed2f4dd2dafdfab6cfa110c14260 Mon Sep 17 00:00:00 2001
> > > From: Abhishek Kaushik 
> > > Date: Fri, 31 Jan 2025 01:28:48 -0800
> > > Subject: [PATCH] libstdc++: Use canonical loop form in std::reduce
> > >
> > > The current while loop in std::reduce and related functions is hard to
> > > vectorize because the loop control variable is hard to detect.
> > >
> > > `while ((__last - __first) >= 4)`
> > >
> > > Changing the loop header to a for loop following the OpenMP canonical
> > > form allows easy vectorization, resulting in improved performance.
> > >
> > > `for (; __first <= __last - 4; __first += 4)`
> > >
> > > This patch modifies the loop header for std::reduce & 
> > > std::transform_reduce.
> >
> > Can you add a testcase to g++.dg/vect/ that is now vectorized but not 
> > before?
>
> According to https://gcc.gnu.org/pipermail/libstdc++/2025-January/060353.html
> this is only a problem for the Intel compiler, not for GCC. So a GCC
> testcase doesn't help.
>
> But if it's only for Intel, then the commit msg should say that.

A testcase that GCC can vectorize the result is still appreciated
(unless we already
have one).  I do wonder why we need to fix our standard library of
course, I'd say
ICX needs to be improved here, a user could write such code himself.

Richard.

>
> >
> > Thanks,
> > Richard.
> >
> > > ---
> > >  libstdc++-v3/include/std/numeric | 10 +++---
> > >  1 file changed, 3 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/libstdc++-v3/include/std/numeric 
> > > b/libstdc++-v3/include/std/numeric
> > > index 4d36fcd36d9..9c38ad89e21 100644
> > > --- a/libstdc++-v3/include/std/numeric
> > > +++ b/libstdc++-v3/include/std/numeric
> > > @@ -300,13 +300,12 @@ namespace __detail
> > >static_assert(is_invocable_r_v<_Tp, _BinaryOperation&, __ref, 
> > > __ref>);
> > >if constexpr (__is_random_access_iter<_InputIterator>::value)
> > > {
> > > - while ((__last - __first) >= 4)
> > > + for (; __first <= __last - 4; __first += 4)
> > > {
> > >   _Tp __v1 = __binary_op(__first[0], __first[1]);
> > >   _Tp __v2 = __binary_op(__first[2], __first[3]);
> > >   _Tp __v3 = __binary_op(__v1, __v2);
> > >   __init = __binary_op(__init, __v3);
> > > - __first += 4;
> > > }
> > > }
> > >for (; __first != __last; ++__first)
> > > @@ -381,7 +380,7 @@ namespace __detail
> > >if constexpr (__and_v<__is_random_access_iter<_InputIterator1>,
> > > __is_random_access_iter<_InputIterator2>>)
> > > {
> > > - while ((__last1 - __first1) >= 4)
> > > + for (; __first1 <= __last1 - 4; __first1 += 4, __first2 += 4)
> > > {
> > >   _Tp __v1 = __binary_op1(__binary_op2(__first1[0], __first2[0]),
> > >   __binary_op2(__first1[1], __first2[1]));
> > > @@ -389,8 +388,6 @@ namespace __detail
> > >   __binary_op2(__first1[3], __first2[3]));
> > >   _Tp __v3 = __binary_op1(__v1, __v2);
> > >   __init = __binary_op1(__init, __v3);
> > > - __first1 += 4;
> > > - __first2 += 4;
> > > }
> > > }
> > >for (; __first1 != __last1; ++__first1, (void) ++__first2)
> > > @@ -447,7 +444,7 @@ namespace __detail
> > >  {
> > >if constexpr (__is_random_access_iter<_InputIterator>::value)
> > > {
> > > - while ((__last - __first) >= 4)
> > > + for (; __first <= __last - 4; __first += 4)
> > > {
> > >   _Tp __v1 = __binary_op(__unary_op(__first[0]),
> > >  __unary_op(__first[1]));
> > > @@ -455,7 +452,6 @@ namespace __detail
> > >  __unary_op(__first[3]));
> > >   _Tp __v3 = __binary_op(__v1, __v2);
> > >   __init = __binary_op(__init, __v3);
> > > - __first += 4;
> > > }
> > > }
> > >for (; __first != __last; ++__first)
> > > --
> > > 2.31.1
> > >
> > >
> > >
> > >
> >
>


Re: [PATCH] libstdc++: Use canonical loop form in std::reduce

2025-01-31 Thread Abhishek Kaushik
Can we do __first + 4 <= __last?


From: Jonathan Wakely 
Sent: Friday, January 31, 2025 8:28 PM
To: libstd...@gcc.gnu.org 
Cc: Abhishek Kaushik ; gcc-patches@gcc.gnu.org 

Subject: Re: [PATCH] libstdc++: Use canonical loop form in std::reduce

On Fri, 31 Jan 2025 at 14:47, Marc Glisse  wrote:
>
> On Fri, 31 Jan 2025, Abhishek Kaushik wrote:
>
> > The current while loop in std::reduce and related functions is hard to
> > vectorize because the loop control variable is hard to detect in icx.
> >
> > `while ((__last - __first) >= 4)`
> >
> > Changing the loop header to a for loop following the OpenMP canonical
> > form allows easy vectorization, resulting in improved performance.
> >
> > `for (; __first <= __last - 4; __first += 4)`
>
> Is that always legal? If the sequence has size 1, is __last - 4 well
> defined?

No. I thought of that and for some reason assumed that since we're
already doing (last - first) it's OK ... but that's nonsense.

We need to check the size before the stride=4 loop.