On Tue, 2 Sep 2025, Tamar Christina wrote:
> > -----Original Message-----
> > From: Richard Biener <[email protected]>
> > Sent: Tuesday, September 2, 2025 3:08 PM
> > To: Tamar Christina <[email protected]>
> > Cc: [email protected]; nd <[email protected]>
> > Subject: RE: [PATCH v2 3/3]middle-end: Use addhn for compression instead of
> > inclusive OR when reducing comparison values
> >
> > On Tue, 2 Sep 2025, Tamar Christina wrote:
> >
> > > > -----Original Message-----
> > > > From: Richard Biener <[email protected]>
> > > > Sent: Tuesday, September 2, 2025 1:30 PM
> > > > To: Tamar Christina <[email protected]>
> > > > Cc: [email protected]; nd <[email protected]>
> > > > Subject: Re: [PATCH v2 3/3]middle-end: Use addhn for compression
> > > > instead of
> > > > inclusive OR when reducing comparison values
> > > >
> > > > On Tue, 2 Sep 2025, Tamar Christina wrote:
> > > >
> > > > > Given a sequence such as
> > > > >
> > > > > int foo ()
> > > > > {
> > > > > #pragma GCC unroll 4
> > > > > for (int i = 0; i < N; i++)
> > > > > if (a[i] == 124)
> > > > > return 1;
> > > > >
> > > > > return 0;
> > > > > }
> > > > >
> > > > > > where a[i] is long long, we will unroll the loop and use an OR
> > > > > > reduction for early break on Adv. SIMD. Afterwards the sequence is
> > > > > > followed by a compression sequence to compress the 128-bit vectors
> > > > > > into 64-bits for use by the branch.
> > > > >
> > > > > > However if we have support for add halving and narrowing then we can
> > > > > > instead of using an OR, use an ADDHN which will do the combining and
> > > > > > narrowing.
> > > > >
> > > > > > Note that for now I only do the last OR, however if we have more than
> > > > > > one level of unrolling we could technically chain them. I will revisit
> > > > > > this in another upcoming early break series, however an unroll of 2 is
> > > > > > fairly common.
> > > > >
> > > > > > Bootstrapped and regtested on aarch64-none-linux-gnu,
> > > > > > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
> > > > > > -m32, -m64 with no issues, and about a 10% improvement
> > > > > > in this sequence for Adv. SIMD.
> > > > >
> > > > > Ok for master?
> > > >
> > > > Hmm, so you are replacing the last bitwise OR with a
> > > > addhn which produces a "smaller" vector. So like
> > > >
> > > > V4SI tem = V4SI | V4SI;
> > > > if (tem != 0)
> > > >
> > > > ->
> > > >
> > > > V4HI tem = .VEC_ADD_HALVING_NARROW (V4SI, V4SI);
> > > > if (tem != 0)
> > > >
> > > > whatever 'halving' now stands for (isn't that .VEC_ADD_HIGH_NARROW?)
> > > >
> > >
> > > Yeah, but it retrieves the high half, so I'm open to suggestions, but I
> > > think any name would be confusing.
> > >
> > > > I can't see how that's in any way faster? (the aarch64 testcases
> > > > unfortunately stop matching after the addhn)
> > > >
> > >
> > > Which is intentional.
> > >
> > > The original code with the ORR is
> > >
> > > ldp q31, q30, [x0]
> > > cmeq v31.2d, v31.2d, v29.2d
> > > cmeq v30.2d, v30.2d, v29.2d
> > > orr v31.16b, v31.16b, v30.16b
> > > umaxp v31.4s, v31.4s, v31.4s
> > > fmov x3, d31
> > >
> > > because the result of the ORR is a 128-bit vector it needs to be
> > > compressed into 64 bits to be transferred to a GPR so the != 0 can be
> > > performed.
> > >
> > > ADDHN does the combination and compression in one step. i.e.
> > >
> > > Orr + umaxp -> addhn.
> >
> > Ah, I see. So AdvSIMD lacks a ptest, and instead you go to gpr.
> > The above code does a max reduction and the fmov moves the
> > scalar reduction result to a GPR? But with addhn you move
> > the whole (64bit) vector reg to a GPR?
> >
>
> Indeed.
>
> > It seems to me that on the vectorizer side it's not so interesting
> > to know the target can do addhn but that the target can't do
> > a {u,}cmpv4si with EQ? That is, without the patch the vectorizer
> > generates a GIMPLE_COND that isn't supported by the target?
> >
> > We check for cbranch, so currently you say you can do it but emulate
> > it with umaxp + fmov?
>
> Indeed.
>
> >
> > What do you do when there's only one V4SImode vector? You
> > could pack (truncate) that to V4HImode, right? Aka
> > vec_pack_trunc (x, x) -> V8HImode and the lower V4HI / 64bit is
> > then 'x'?
>
> For only one V4SImode vector we still use UMAXP, but with itself.
> It essentially throws away half the lanes of the result, but that's ok
> because for a cbranch all we care about is whether any lane is set or all
> are zero. We use UMAXP because that works for vectors of bytes too,
> whereas narrowing wouldn't.

Ah, I looked up umaxp and it's a concat + reduce of adjacent lanes to
one with MAX. We don't have an optab scheme for such an instruction
either ;) I think the closest are the SAD_EXPR likes, which
reduce the number of lanes because they are widening, but those
have one input only. umaxp is something like a reduc_umax_evenodd;
one could imagine a reduc_umax_hilo that reduces { a0, a1, a2, a3 }
{ b0, b1, b2, b3 } as { umax (a0, b0), umax (a1, b1), umax (a2, b2),
umax (a3, b3) } instead. That said, I can see how addhn is
useful here.
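
To make the lane shuffling concrete, here is a minimal scalar sketch in C of the two reduction shapes described above, for 4 x 32-bit lanes. It is only a model under stated assumptions: umaxp_model follows the pairwise (adjacent lane) behaviour of UMAXP, reduc_umax_hilo_model is the imagined lane-by-lane variant (no such optab exists), and low64 stands in for what the fmov transfers to a GPR. All function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4

static uint32_t
umax32 (uint32_t x, uint32_t y)
{
  return x > y ? x : y;
}

/* Model of UMAXP: adjacent pairs of A fill the low half of the result,
   adjacent pairs of B fill the high half.  */
static void
umaxp_model (const uint32_t a[LANES], const uint32_t b[LANES],
	     uint32_t out[LANES])
{
  for (int i = 0; i < LANES / 2; i++)
    {
      out[i] = umax32 (a[2 * i], a[2 * i + 1]);
      out[LANES / 2 + i] = umax32 (b[2 * i], b[2 * i + 1]);
    }
}

/* Hypothetical "hilo" shape: lane i of A is combined with lane i of B.
   The result still occupies all four lanes.  */
static void
reduc_umax_hilo_model (const uint32_t a[LANES], const uint32_t b[LANES],
		       uint32_t out[LANES])
{
  for (int i = 0; i < LANES; i++)
    out[i] = umax32 (a[i], b[i]);
}

/* Low 64 bits of the vector, i.e. the part a 64-bit move would see.  */
static uint64_t
low64 (const uint32_t v[LANES])
{
  return (uint64_t) v[0] | ((uint64_t) v[1] << 32);
}

void
umaxp_demo (void)
{
  const uint32_t mask[LANES] = { 0, 0, 0, UINT32_MAX };
  const uint32_t zero[LANES] = { 0, 0, 0, 0 };
  uint32_t r[LANES];

  /* UMAXP of a vector with itself folds all four lanes into the low
     64 bits, so the set lane 3 is visible there.  */
  umaxp_model (mask, mask, r);
  assert (low64 (r) != 0);

  umaxp_model (zero, zero, r);
  assert (low64 (r) == 0);

  /* The lane-by-lane shape keeps lane 3 in the high half, so the low
     64 bits alone would miss it.  */
  reduc_umax_hilo_model (mask, zero, r);
  assert (low64 (r) == 0 && r[3] == UINT32_MAX);
}
```

The last assertion is the point of the comparison: a lane-by-lane max on its own does not compress the result into 64 bits, whereas the pairwise form does.
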
> However when SVE is available (just wasn't chosen due to costing)
> the goal is to use the SVE comparison predicated to 128-bits instead.
>
> This is where my upcoming patch for vec_cbranch_any and
> vec_cbranch_all comes in. With only one comparison we can replace
> it with an SVE compare + branch, removing the need for the reduction.
Hmm, it all feels like somewhat of a delicate target costing thing
to me.
> >
> > > > Also the inputs are vector bools(?), so you should V_C_E them to
> > > > data vectors before "adding" them. And check that they have
> > > > a vector mode that's not VnBImode for which I guess the addhn
> > > > semantics wouldn't be necessarily good enough.
> > >
> > > ADDHN can't be used for SVE (and so the optab isn't implemented on SVE
> > > modes) because SVE's version is even/odd. But for SVE we also don't want
> > > this codegen because SVE can branch on the result of the data compare.
> > > So we don't want the intermediate forced compression. So this is strictly
> > > for Adv. SIMD.
> > >
> > > >
> > > > How would you scale this to workset.length () > 2? I suppose
> > > > for an even number reduce to the half element size first, for
> > > > odd you could make it even by first reducing two vectors with IOR?
> > > > If small, either check for another narrowing addhn operation or
> > > > continue with IOR?
> > > >
> > >
> > > Because the instruction can't work on bytes, having > 2 just uses ORR
> > > until we have == 2 and then ADDHN. You could use ADDHN for the
> > > intermediate steps, but ADDHN hits a limit when you reach bytes.
> > >
> > > However one big benefit of using the ADDHN even in > 2 cases is that
> > > it prevents reassoc from breaking the ORR order we created in the
> > > vectorizer as it can't reassociate them back to a linear form as it does
> > > today.
> > >
> > > And the reason we can't match the ADDHN in the backend is that in
> > > order for us to know that the inputs are Boolean vectors we also have
> > > to match the compares. This means that the chain is longer than what
> > > combine tries since it has to match everything including the if_then_else
> > > and the reduction to set the CC.
> > >
> > > > That said, I still fail to see how addhn reduces the critical
> > > > latency?
> > > >
> > >
> > > Because it replaces 2 instructions on the critical reduction path with 1
> > > that is half the latency of the two it replaced. See example above.
> >
> > So it's good enough to indeed combine the last two elements like your
> > patch does.
> >
> > That said, I still wonder about the trigger - it shouldn't be
> > availability of the instruction as I'd think a addhn should be
> > not cheaper than a simple bitwise OR. Instead it's that
> > cbranch on the wider vector isn't available?
>
> The addhn is actually the same latency/throughput as ORR as they're both
> simple vector ALU operations on all cores. But yeah the reason this is
> beneficial is because of the reduction. So from that point of view it is
> beneficial to always use it if available.
Is it? At least only when cbranch on that smaller mode is available?
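
For reference, a minimal scalar sketch in C of why the narrowed result can stand in for the OR at all, assuming 4 x 32-bit inputs whose lanes are 0 or -1 (the comparison results). addhn_model adds the lanes modulo 2^32 and keeps the top 16 bits of each sum; pack64 models the 64-bit value that then feeds the != 0 test. The function names are made up for illustration and this is not the patch's code.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4

/* Model of ADDHN: add each pair of lanes, keep the most significant half.  */
static void
addhn_model (const uint32_t a[LANES], const uint32_t b[LANES],
	     uint16_t out[LANES])
{
  for (int i = 0; i < LANES; i++)
    out[i] = (uint16_t) ((uint32_t) (a[i] + b[i]) >> 16);
}

/* Pack the four 16-bit lanes into the 64-bit value moved to a GPR.  */
static uint64_t
pack64 (const uint16_t v[LANES])
{
  uint64_t r = 0;
  for (int i = 0; i < LANES; i++)
    r |= (uint64_t) v[i] << (16 * i);
  return r;
}

/* Try every combination of 0 / -1 lanes in both inputs and check that
   "narrowed sum != 0" agrees with "a | b != 0".  Returns 1 on success.  */
int
addhn_matches_or_on_masks (void)
{
  for (unsigned bits = 0; bits < 256; bits++)
    {
      uint32_t a[LANES], b[LANES];
      uint16_t n[LANES];
      int or_any = 0;
      for (int i = 0; i < LANES; i++)
	{
	  a[i] = ((bits >> i) & 1) ? UINT32_MAX : 0;
	  b[i] = ((bits >> (LANES + i)) & 1) ? UINT32_MAX : 0;
	  or_any |= (a[i] | b[i]) != 0;
	}
      addhn_model (a, b, n);
      if ((pack64 (n) != 0) != or_any)
	return 0;
    }
  return 1;
}

/* With arbitrary data the equivalence breaks: a lane holding 1 has an
   all-zero high half.  Returns 1 if the narrowed result is zero.  */
int
addhn_misses_non_mask_value (void)
{
  const uint32_t a[LANES] = { 1, 0, 0, 0 };
  const uint32_t b[LANES] = { 0, 0, 0, 0 };
  uint16_t n[LANES];
  addhn_model (a, b, n);
  return pack64 (n) == 0;
}

void
addhn_demo (void)
{
  assert (addhn_matches_or_on_masks () == 1);
  assert (addhn_misses_non_mask_value () == 1);
}
```

The sum of two mask lanes is 0, 0xffffffff or 0xfffffffe, so its high half is zero exactly when both lanes are zero. The second check shows why this only holds for 0 / -1 lanes.
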
> The reason why I don't think the vectorizer should generate anything
> different than the cbranch is that as mentioned above when SVE is
> available (which is on all modern Arm produced cores) we can replace
> the cbranch with an SVE compare.
>
> So testing for cbranch is temporary until it's replaced with testing for
> vec_cbranch_any and vec_cbranch_all (and deprecate cbranch) which
> allows us more flexibility in the back end.

Right.

Still an unconditional use of addhn when available looks wrong to me,
why'd we have the cbranch check then, anyway?

Richard.

> Thanks,
> Tamar
>
> >
> > Richard.
> >
> > > Thanks,
> > > Tamar
> > >
> > > >
> > > > > Thanks,
> > > > > Tamar
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > > * internal-fn.def (VEC_ADD_HALVING_NARROW): New.
> > > > > * doc/generic.texi: Document it.
> > > > > * optabs.def (vec_addh_narrow): New.
> > > > > * doc/md.texi: Document it.
> > > > > * tree-vect-stmts.cc (vectorizable_early_exit): Use addhn if
> > > > > supported.
> > > > >
> > > > > gcc/testsuite/ChangeLog:
> > > > >
> > > > > * gcc.target/aarch64/vect-early-break-addhn_1.c: New test.
> > > > > * gcc.target/aarch64/vect-early-break-addhn_2.c: New test.
> > > > > * gcc.target/aarch64/vect-early-break-addhn_3.c: New test.
> > > > > * gcc.target/aarch64/vect-early-break-addhn_4.c: New test.
> > > > >
> > > > > ---
> > > > > diff --git a/gcc/doc/generic.texi b/gcc/doc/generic.texi
> > > > > index
> > > >
> > d4ac580a7a8b9cd339d26cb97f7eb963f83746a4..ff16ff47bbf45e795df0d230e9
> > > > a885d9d218d9af 100644
> > > > > --- a/gcc/doc/generic.texi
> > > > > +++ b/gcc/doc/generic.texi
> > > > > @@ -1834,6 +1834,7 @@ a value from @code{enum annot_expr_kind}, the
> > > > third is an @code{INTEGER_CST}.
> > > > > @tindex IFN_VEC_WIDEN_MINUS_LO
> > > > > @tindex IFN_VEC_WIDEN_MINUS_EVEN
> > > > > @tindex IFN_VEC_WIDEN_MINUS_ODD
> > > > > +@tindex IFN_VEC_ADD_HALVING_NARROW
> > > > > @tindex VEC_UNPACK_HI_EXPR
> > > > > @tindex VEC_UNPACK_LO_EXPR
> > > > > @tindex VEC_UNPACK_FLOAT_HI_EXPR
> > > > > @@ -1956,6 +1957,24 @@ vector of @code{N/2} subtractions. In the case
> > of
> > > > > vector are subtracted from the odd @code{N/2} of the first to
> > > > > produce the
> > > > > vector of @code{N/2} subtractions.
> > > > >
> > > > > +@item IFN_VEC_ADD_HALVING_NARROW
> > > > > +This internal function performs an addition of two input vectors,
> > > > > +then extracts the most significant half of each result element and
> > > > > > +narrows it to half the original element width.
> > > > > +
> > > > > +Concretely, it computes:
> > > > > > +@code{(bits(a)/2)((a + b) >> (bits(a)/2))}
> > > > > +
> > > > > +where @code{bits(a)} is the width in bits of each input element.
> > > > > +
> > > > > +Its operands are vectors containing the same number of elements
> > (@code{N})
> > > > > +of the same integral type. The result is a vector of length
> > > > > @code{N}, with
> > > > > +elements of an integral type whose size is half that of the input
> > > > > element
> > > > > +type.
> > > > > +
> > > > > > +This operation is currently only used for early break result compression
> > > > > > +when the result of a vector boolean can be represented as 0 or -1.
> > > > > +
> > > > > @item VEC_UNPACK_HI_EXPR
> > > > > @itemx VEC_UNPACK_LO_EXPR
> > > > > These nodes represent unpacking of the high and low parts of the
> > > > > input
> > vector,
> > > > > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> > > > > index
> > > >
> > aba93f606eca59d31c103a05b2567fd4f3be55f3..ec0193e4eee079e00168bbaf9
> > > > b28ba8d52e5d464 100644
> > > > > --- a/gcc/doc/md.texi
> > > > > +++ b/gcc/doc/md.texi
> > > > > @@ -6087,6 +6087,25 @@ vectors with N signed/unsigned elements of size
> > > > S@. Find the absolute
> > > > > difference between operands 1 and 2 and widen the resulting elements.
> > > > > Put the N/2 results of size 2*S in the output vector (operand 0).
> > > > >
> > > > > +@cindex @code{vec_addh_narrow@var{m}} instruction pattern
> > > > > +@item @samp{vec_addh_narrow@var{m}}
> > > > > +Signed or unsigned addition of two input vectors, then extracts the
> > > > > > +most significant half of each result element and narrows it to half the
> > > > > > +original element width.
> > > > > +
> > > > > +Concretely, it computes:
> > > > > > +@code{(bits(a)/2)((a + b) >> (bits(a)/2))}
> > > > > +
> > > > > +where @code{bits(a)} is the width in bits of each input element.
> > > > > +
> > > > > +Its operands (@code{1} and @code{2}) are vectors containing the same
> > > > number
> > > > > +of signed or unsigned integral elements (@code{N}) of size @code{S}.
> > > > > The
> > > > > +result (operand @code{0}) is a vector of length @code{N}, with
> > > > > elements of
> > > > > +an integral type whose size is half that of @code{S}.
> > > > > +
> > > > > > +This operation is currently only used for early break result compression
> > > > > > +when the result of a vector boolean can be represented as 0 or -1.
> > > > > +
> > > > > @cindex @code{vec_addsub@var{m}3} instruction pattern
> > > > > @item @samp{vec_addsub@var{m}3}
> > > > > Alternating subtract, add with even lanes doing subtract and odd
> > > > > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> > > > > index
> > > >
> > d2480a1bf7927476215bc7bb99c0b74197d2b7e9..cb18058d9f48cc0dff96ed4b
> > > > 31d0abc9adb67867 100644
> > > > > --- a/gcc/internal-fn.def
> > > > > +++ b/gcc/internal-fn.def
> > > > > @@ -422,6 +422,8 @@ DEF_INTERNAL_OPTAB_FN
> > (COMPLEX_ADD_ROT270,
> > > > ECF_CONST, cadd270, binary)
> > > > > DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL, ECF_CONST, cmul, binary)
> > > > > DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL_CONJ, ECF_CONST, cmul_conj,
> > > > binary)
> > > > > DEF_INTERNAL_OPTAB_FN (VEC_ADDSUB, ECF_CONST, vec_addsub, binary)
> > > > > +DEF_INTERNAL_OPTAB_FN (VEC_ADD_HALVING_NARROW, ECF_CONST |
> > > > ECF_NOTHROW,
> > > > > + vec_addh_narrow, binary)
> > > > > DEF_INTERNAL_WIDENING_OPTAB_FN (VEC_WIDEN_PLUS,
> > > > > ECF_CONST | ECF_NOTHROW,
> > > > > first,
> > > > > diff --git a/gcc/optabs.def b/gcc/optabs.def
> > > > > index
> > > >
> > 87a8b85da1592646d0a3447572e842ceb158cd97..b2bedc3692f914c2b80d797
> > > > 2db81b542b32c9eb8 100644
> > > > > --- a/gcc/optabs.def
> > > > > +++ b/gcc/optabs.def
> > > > > @@ -492,6 +492,7 @@ OPTAB_D (vec_widen_uabd_hi_optab,
> > > > "vec_widen_uabd_hi_$a")
> > > > > OPTAB_D (vec_widen_uabd_lo_optab, "vec_widen_uabd_lo_$a")
> > > > > OPTAB_D (vec_widen_uabd_odd_optab, "vec_widen_uabd_odd_$a")
> > > > > OPTAB_D (vec_widen_uabd_even_optab, "vec_widen_uabd_even_$a")
> > > > > +OPTAB_D (vec_addh_narrow_optab, "vec_addh_narrow$a")
> > > > > OPTAB_D (vec_addsub_optab, "vec_addsub$a3")
> > > > > OPTAB_D (vec_fmaddsub_optab, "vec_fmaddsub$a4")
> > > > > OPTAB_D (vec_fmsubadd_optab, "vec_fmsubadd$a4")
> > > > > diff --git
> > > > > a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_1.c
> > > > b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_1.c
> > > > > new file mode 100644
> > > > > index
> > > >
> > 0000000000000000000000000000000000000000..4ecb187513e525e0cd9b8
> > > > b063e418a75a23c525d
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_1.c
> > > > > @@ -0,0 +1,33 @@
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" }
> > > > > */
> > > > > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > > > > +
> > > > > +#define TYPE int
> > > > > +#define N 800
> > > > > +
> > > > > +#pragma GCC target "+nosve"
> > > > > +
> > > > > +TYPE a[N];
> > > > > +
> > > > > +/*
> > > > > +** foo:
> > > > > +** ...
> > > > > +** ldp q[0-9]+, q[0-9]+, \[x[0-9]+\], 32
> > > > > +** cmeq v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
> > > > > +** cmeq v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
> > > > > +** addhn v[0-9]+.4h, v[0-9]+.4s, v[0-9]+.4s
> > > > > +** fmov x[0-9]+, d[0-9]+
> > > > > +** ...
> > > > > +*/
> > > > > +
> > > > > +int foo ()
> > > > > +{
> > > > > +#pragma GCC unroll 8
> > > > > + for (int i = 0; i < N; i++)
> > > > > + if (a[i] == 124)
> > > > > + return 1;
> > > > > +
> > > > > + return 0;
> > > > > +}
> > > > > +
> > > > > > +/* { dg-final { scan-tree-dump "VEC_ADD_HALVING_NARROW" "vect" } } */
> > > > > diff --git
> > > > > a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_2.c
> > > > b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_2.c
> > > > > new file mode 100644
> > > > > index
> > > >
> > 0000000000000000000000000000000000000000..d67d0d13d1733935aaf80
> > > > 5e59188eb8155cb5f06
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_2.c
> > > > > @@ -0,0 +1,33 @@
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" }
> > > > > */
> > > > > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > > > > +
> > > > > +#define TYPE long long
> > > > > +#define N 800
> > > > > +
> > > > > +#pragma GCC target "+nosve"
> > > > > +
> > > > > +TYPE a[N];
> > > > > +
> > > > > +/*
> > > > > +** foo:
> > > > > +** ...
> > > > > +** ldp q[0-9]+, q[0-9]+, \[x[0-9]+\], 32
> > > > > +** cmeq v[0-9]+.2d, v[0-9]+.2d, v[0-9]+.2d
> > > > > +** cmeq v[0-9]+.2d, v[0-9]+.2d, v[0-9]+.2d
> > > > > +** addhn v[0-9]+.2s, v[0-9]+.2d, v[0-9]+.2d
> > > > > +** fmov x[0-9]+, d[0-9]+
> > > > > +** ...
> > > > > +*/
> > > > > +
> > > > > +int foo ()
> > > > > +{
> > > > > +#pragma GCC unroll 4
> > > > > + for (int i = 0; i < N; i++)
> > > > > + if (a[i] == 124)
> > > > > + return 1;
> > > > > +
> > > > > + return 0;
> > > > > +}
> > > > > +
> > > > > > +/* { dg-final { scan-tree-dump "VEC_ADD_HALVING_NARROW" "vect" } } */
> > > > > diff --git
> > > > > a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_3.c
> > > > b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_3.c
> > > > > new file mode 100644
> > > > > index
> > > >
> > 0000000000000000000000000000000000000000..57dbc44ae0cdcbcdccd3d8
> > > > dbe98c79713eaf5607
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_3.c
> > > > > @@ -0,0 +1,33 @@
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" }
> > > > > */
> > > > > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > > > > +
> > > > > +#define TYPE short
> > > > > +#define N 800
> > > > > +
> > > > > +#pragma GCC target "+nosve"
> > > > > +
> > > > > +TYPE a[N];
> > > > > +
> > > > > +/*
> > > > > +** foo:
> > > > > +** ...
> > > > > +** ldp q[0-9]+, q[0-9]+, \[x[0-9]+\], 32
> > > > > +** cmeq v[0-9]+.8h, v[0-9]+.8h, v[0-9]+.8h
> > > > > +** cmeq v[0-9]+.8h, v[0-9]+.8h, v[0-9]+.8h
> > > > > +** addhn v[0-9]+.8b, v[0-9]+.8h, v[0-9]+.8h
> > > > > +** fmov x[0-9]+, d[0-9]+
> > > > > +** ...
> > > > > +*/
> > > > > +
> > > > > +int foo ()
> > > > > +{
> > > > > +#pragma GCC unroll 16
> > > > > + for (int i = 0; i < N; i++)
> > > > > + if (a[i] == 124)
> > > > > + return 1;
> > > > > +
> > > > > + return 0;
> > > > > +}
> > > > > +
> > > > > > +/* { dg-final { scan-tree-dump "VEC_ADD_HALVING_NARROW" "vect" } } */
> > > > > diff --git
> > > > > a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_4.c
> > > > b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_4.c
> > > > > new file mode 100644
> > > > > index
> > > >
> > 0000000000000000000000000000000000000000..8ad42b22024479283d681
> > > > 4d815ef1dce411d1c72
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_4.c
> > > > > @@ -0,0 +1,21 @@
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" }
> > > > > */
> > > > > +
> > > > > +#define TYPE char
> > > > > +#define N 800
> > > > > +
> > > > > +#pragma GCC target "+nosve"
> > > > > +
> > > > > +TYPE a[N];
> > > > > +
> > > > > +int foo ()
> > > > > +{
> > > > > +#pragma GCC unroll 32
> > > > > + for (int i = 0; i < N; i++)
> > > > > + if (a[i] == 124)
> > > > > + return 1;
> > > > > +
> > > > > + return 0;
> > > > > +}
> > > > > +
> > > > > > +/* { dg-final { scan-tree-dump-not "VEC_ADD_HALVING_NARROW" "vect" }
> > > > > }
> > */
> > > > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > > > > index
> > > >
> > 1545fab364792f75bcc786ba1311b8bdc82edd70..179ce5e0a66b6f88976ffb54
> > > > 4c6874d7bec999a8 100644
> > > > > --- a/gcc/tree-vect-stmts.cc
> > > > > +++ b/gcc/tree-vect-stmts.cc
> > > > > @@ -12328,7 +12328,7 @@ vectorizable_early_exit (loop_vec_info
> > loop_vinfo,
> > > > stmt_vec_info stmt_info,
> > > > > gimple *orig_stmt = STMT_VINFO_STMT (vect_orig_stmt (stmt_info));
> > > > > gcond *cond_stmt = as_a <gcond *>(orig_stmt);
> > > > >
> > > > > - tree cst = build_zero_cst (vectype);
> > > > > + tree vectype_out = vectype;
> > > > > auto bb = gimple_bb (cond_stmt);
> > > > > edge exit_true_edge = EDGE_SUCC (bb, 0);
> > > > > if (exit_true_edge->flags & EDGE_FALSE_VALUE)
> > > > > @@ -12452,12 +12452,40 @@ vectorizable_early_exit (loop_vec_info
> > > > loop_vinfo, stmt_vec_info stmt_info,
> > > > > else
> > > > > workset.splice (stmts);
> > > > >
> > > > > + /* See if we support ADDHN and use that for the reduction. */
> > > > > + internal_fn ifn = IFN_VEC_ADD_HALVING_NARROW;
> > > > > + bool addhn_supported_p
> > > > > + = direct_internal_fn_supported_p (ifn, vectype,
> > > > > OPTIMIZE_FOR_SPEED);
> > > > > + tree narrow_type = NULL_TREE;
> > > > > + if (addhn_supported_p)
> > > > > + {
> > > > > + /* Calculate the narrowing type for the result. */
> > > > > + auto halfprec = TYPE_PRECISION (TREE_TYPE (vectype)) / 2;
> > > > > + auto unsignedp = TYPE_UNSIGNED (TREE_TYPE (vectype));
> > > > > + tree itype = build_nonstandard_integer_type (halfprec,
> > > > > unsignedp);
> > > > > + poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
> > > > > + tree tmp_type = build_vector_type (itype, nunits);
> > > > > + narrow_type = truth_type_for (tmp_type);
> > > > > + }
> > > > > +
> > > > > while (workset.length () > 1)
> > > > > {
> > > > > - new_temp = make_temp_ssa_name (vectype, NULL, "vexit_reduc");
> > > > > tree arg0 = workset.pop ();
> > > > > tree arg1 = workset.pop ();
> > > > > - new_stmt = gimple_build_assign (new_temp, BIT_IOR_EXPR, arg0,
> > > > > arg1);
> > > > > + if (addhn_supported_p && workset.length () == 0)
> > > > > + {
> > > > > + new_stmt = gimple_build_call_internal (ifn, 2, arg0,
> > > > > arg1);
> > > > > + vectype_out = narrow_type;
> > > > > + new_temp = make_temp_ssa_name (vectype_out, NULL,
> > > > "vexit_reduc");
> > > > > + gimple_call_set_lhs (as_a <gcall *> (new_stmt), new_temp);
> > > > > + gimple_call_set_nothrow (as_a <gcall *> (new_stmt), true);
> > > > > + }
> > > > > + else
> > > > > + {
> > > > > + new_temp = make_temp_ssa_name (vectype_out, NULL,
> > > > "vexit_reduc");
> > > > > + new_stmt
> > > > > + = gimple_build_assign (new_temp, BIT_IOR_EXPR, arg0,
> > > > > arg1);
> > > > > + }
> > > > > vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt,
> > > > > &cond_gsi);
> > > > > workset.quick_insert (0, new_temp);
> > > > > @@ -12480,6 +12508,7 @@ vectorizable_early_exit (loop_vec_info
> > loop_vinfo,
> > > > stmt_vec_info stmt_info,
> > > > >
> > > > > gcc_assert (new_temp);
> > > > >
> > > > > + tree cst = build_zero_cst (vectype_out);
> > > > > gimple_cond_set_condition (cond_stmt, NE_EXPR, new_temp, cst);
> > > > > update_stmt (orig_stmt);
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > > --
> > > > Richard Biener <[email protected]>
> > > > SUSE Software Solutions Germany GmbH,
> > > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG
> > Nuernberg)
> > >
> >
> > --
> > Richard Biener <[email protected]>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
>
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)