On 11/12/25 09:28, Richard Biener wrote:
On Wed, 12 Nov 2025, Richard Biener wrote:

On Tue, 11 Nov 2025, Victor Do Nascimento wrote:

Sorry about delayed reply,I've been obsessing on trying to solve one
last snag in the peeling for alignment implementation.

The reduction work was done trying to fix a regression I'd introduced in
the libstdc++-v3 unit tests during the course of the implementing of
these patches.

I reduced the unit test in question and I'm attaching that at the end of this
message.

Ironically, in writing up different testcases in plain C for
gcc.dg/vect, I've since discovered I fail to vectorize far simpler reduction
cases.

For the trivial case of

   sum = 0;
   while (1)
     {
       if (a[i] == 0) break;
       sum += a[i];
       i++;
     }

we get the following CFG:

   <bb 2> [local count: 118111600]:
   _17 = *a_8(D);
   if (_17 == 0)
     goto <bb 7>; [11.00%]
   else
     goto <bb 5>; [89.00%]

   <bb 5> [local count: 105119324]:

   <bb 3> [local count: 955630224]:
   # _18 =
   # sum_19 = PHI <sum_10(6), sum_7(D)(5)>
   # i_21 = PHI <i_11(6), 0(5)>
   sum_10 = _18 + sum_19;
   i_11 = i_21 + 1;
   _1 = (long unsigned int) i_11;
   _2 = _1 * 4;
   _3 = a_8(D) + _2;
   _4 = *_3;
   if (_4 == 0)
     goto <bb 8>; [11.00%]
   else
     goto <bb 6>; [89.00%]

and the vectorizer doesn't quite know how to handle the PHI <_4(6), _17(5)>,
categorized as `vect_unknown_def_type', so I'll figure out what to do about
that.

Generally for reductions we do not support an intermediate (aka _18
here) result to be live because as reductions are re-associated we
cannot compute this specific value.  Now, specifically _this_ value
we _can_ compute, so the restriction is a bit over-eager.

But it's not sth to solve as part of this series.  Would be interesting
to track in a bug though.

Now for the promised testcase...

#include <numeric>
#include <iterator>
#include <cassert>

int a[]  = {4, 5, 6, 7, 8, 9, 10, 11};
double b[] = {0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5};
int N = 8;

template<typename _InputIterator1, typename _InputIterator2, typename _Tp>
_Tp
transform_reduce_a(_InputIterator1 a0, _InputIterator1 aN,
                  _InputIterator2 b0, _Tp accum)
{
   while ((aN - a0) >= 4)

So we fail to analyze this as a counted loop?

Indeed.  But not sure why.  We have

   # a0_46 = PHI <a0_26(11), &a(2)>
...
   a0_26 = a0_46 + 16;
   _28 = &MEM <int[8]> [(void *)&a + 32B] - a0_26;
   if (_28 > 12)
     goto <bb 11>; [89.00%]

as IV.  I'll check.

It works on the 2nd loop but we fail vectorization because the reduction
has a float accumulator but adds double values.  This is currently
not a supported "conversion" around reductions, we only support
sign-conversions here.

So I'm not sure how this is a testcase for the fold-left reduction
as we do not even classify it as reduction?

With all double we vectorize the 2nd loop.

Right.

I reproduced the error that led to the initial development of this
patch, having rolled back the changes in the patch series.

It can be found in `26_numerics/transform_reduce/1.cc'.

It seems that on close inspection I pointed the two of you to the wrong
reduced test, having mixed up the issues I had to fix in the development
of this patch series.

Pleas do accept my sincerest apologies for the faux pas.

Prior to some of the work that went into patch no. 9, which correctly
fixed uncounted PHI handling within one single place in the codebase, I
found myself trying to fix different issues for different PHI nodes and
for the FOLD_LEFT_REDUCTION reduction_type, this was the inelegant
solution that fixed the initial testsuite regression.

It had escaped me that with the special uncounted loop handling that was introduced in `slpeel_tree_duplicate_loop_to_edge_cfg' this change,
effectively a hack, had become obsolete.

Therefore, I'm happy to drop this change from the series, being that it
should be wholly unnecessary.

Many thanks,
Victor



     {
       _Tp __v1 = (a0[0] * b0[0]) + (a0[1] * b0[1]);
       _Tp __v2 = (a0[2] * b0[2]) + (a0[3] * b0[3]);
       _Tp __v3 = (__v1 + __v2);
       accum = (accum + __v3);
       a0 += 4;
       b0 += 4;
     }
   for (; a0 != aN; ++a0, (void) ++b0)
     accum = (accum + (*a0 * *b0));
   return accum;
}

void
test01()
{
   auto res = transform_reduce_a(std::begin(a), std::end(a), std::begin(b),
                               std::move (1.0f));
   assert( res == (float)(1 + 30) );
}
int
main()
{
   test01();
}


Many thanks,
Victor

On 11/11/25 13:59, Richard Biener wrote:
On Tue, 11 Nov 2025, Tamar Christina wrote:

-----Original Message-----
From: Richard Biener <[email protected]>
Sent: 11 November 2025 12:59
To: Tamar Christina <[email protected]>
Cc: Victor Do Nascimento <[email protected]>; gcc-
[email protected]
Subject: RE: [PATCH 08/13] vect: Reclassify early break fold left
reductions as
simple reductions

On Tue, 11 Nov 2025, Tamar Christina wrote:

-----Original Message-----
From: Richard Biener <[email protected]>
Sent: 11 November 2025 12:16
To: Victor Do Nascimento <[email protected]>
Cc: [email protected]; Tamar Christina
<[email protected]>;
Victor Do Nascimento <[email protected]
1.compute.internal>
Subject: Re: [PATCH 08/13] vect: Reclassify early break fold left
reductions
as
simple reductions

On Mon, 10 Nov 2025, Victor Do Nascimento wrote:

From: Victor Do Nascimento <[email protected]
1.compute.internal>

This re-categorization of reductions for uncounted loops involving
reductions leads to the correct calling of
`vect_create_epilog_for_reduction' function.

gcc/ChangeLog:

  * tree-vect-loop.cc (vectorizable_reduction): Reclassify
  uncounted-loop VECT_REDUC_INFO_TYPE as
TREE_CODE_REDUCTION.
---
   gcc/tree-vect-loop.cc | 5 +++--
   1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 901903cfbea..3b038169c95 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -7426,8 +7426,9 @@ vectorizable_reduction (loop_vec_info
loop_vinfo,
                                 "supported.\n");
          return false;
            }
-         VECT_REDUC_INFO_TYPE (reduc_info)
-           = reduction_type = FOLD_LEFT_REDUCTION;
+         VECT_REDUC_INFO_TYPE (reduc_info) = reduction_type
+           = LOOP_VINFO_NITERS_UNCOUNTED_P (loop_vinfo) ?
TREE_CODE_REDUCTION
+           : FOLD_LEFT_REDUCTION;

I don't think this is correct.  We've arrived here with a
needs_fold_left_reduction_p check, if we cannot use a
FOLD_LEFT_REDUCTION
we have to fail.

That said, instead of vect_create_epilog_for_reduction this goes
through vectorize_fold_left_reduction which re-uses the original
scalar reduction PHI and thus any specific early-break handling would
need to go there.

I believe that if this is an issue with respect to re-starting then
that very same issue is present generally for early break vectorization.

Agree, I think vectorizable_reduction is missing support for reducing
from def 0.

Note that we mostly normally fail to analyse the reduction so we never
get here hence the missing support, so I'm somewhat surprised uncounted
loops did.

Is there a testcase that shows this?

Just add a FP reduction w/o -ffast-math to any existing early break
testcase?  You can simply reduce x += 5. or so I think, so no loads
necessary.

You mean like this? https://godbolt.org/z/4jbKx7j5a

At first glance that looks correct to me, the early exits use ret_12 and
the
main exit uses ret_6.

So that's handled correctly.

Indeed.  So I wonder what goes wrong in the uncounted case - and
possibly the peeled case with early break.

So I echo Tamar then, Victor, do you have a testcase that shows what
goes wrong?

Thanks,
Richard.







Reply via email to