On Tue, 2 Sep 2025, Tamar Christina wrote:
> > -----Original Message-----
> > From: Richard Biener <[email protected]>
> > Sent: Tuesday, September 2, 2025 1:30 PM
> > To: Tamar Christina <[email protected]>
> > Cc: [email protected]; nd <[email protected]>
> > Subject: Re: [PATCH v2 3/3]middle-end: Use addhn for compression instead of
> > inclusive OR when reducing comparison values
> >
> > On Tue, 2 Sep 2025, Tamar Christina wrote:
> >
> > > Given a sequence such as
> > >
> > > int foo ()
> > > {
> > > #pragma GCC unroll 4
> > > for (int i = 0; i < N; i++)
> > > if (a[i] == 124)
> > > return 1;
> > >
> > > return 0;
> > > }
> > >
> > > where a[i] is long long, we will unroll the loop and use an OR reduction
> > > for
> > > early break on Adv. SIMD. Afterwards the sequence is followed by a
> > > compression
> > > sequence to compress the 128-bit vectors into 64-bits for use by the
> > > branch.
> > >
> > > However if we have support for add halving and narrowing then we can
> > > instead
> > of
> > > using an OR, use an ADDHN which will do the combining and narrowing.
> > >
> > > Note that for now I only do the last OR, however if we have more than one
> > > level
> > > of unrolling we could technically chain them. I will revisit this in
> > > another
> > > up coming early break series, however an unroll of 2 is fairly common.
> > >
> > > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
> > > -m32, -m64 and no issues and about a 10% improvements
> > > in this sequence for Adv. SIMD.
> > >
> > > Ok for master?
> >
> > Hmm, so you are replacing the last bitwise OR with a
> > addhn which produces a "smaller" vector. So like
> >
> > V4SI tem = V4SI | V4SI;
> > if (tem != 0)
> >
> > ->
> >
> > V4HI tem = .VEC_ADD_HALVING_NARROW (V4SI, V4SI);
> > if (tem != 0)
> >
> > whatever 'halving' now stands for (isn't that .VEC_ADD_HIGH_NARROW?)
> >
>
> Yeah, but it retrieved the high half, so open to suggestion but I think any
> name
> would be confusion..
>
> > I can't see how that's in any way faster? (the aarch64 testcases
> > unfortunately stop matching after the addhn)
> >
>
> Which is intentional.
>
> The original code with the ORR is
>
> ldp q31, q30, [x0]
> cmeq v31.2d, v31.2d, v29.2d
> cmeq v30.2d, v30.2d, v29.2d
> orr v31.16b, v31.16b, v30.16b
> umaxp v31.4s, v31.4s, v31.4s
> fmov x3, d31
>
> because the result of the ORR is a 128-bit vector it needs to be compressed
> into 64 bits to be transferred to GPR so the != 0 can be performed.
>
> ADDHN does the combination and compression in one step. i.e.
>
> Orr + umaxp -> addhn.
Ah, I see. So AdvSIMD lacks a ptest, and instead you go to gpr.
The above code does a max reduction and the fmov moves the
scalar reduction result to a GPR? But with addhn you move
the whole (64bit) vector reg to a GPR?
It seems to me that on the vectorizer side it's not so interesting
to know the target can do addhn but that the target can't do
a {u,}cmpv4si with EQ? That is, without the patch the vectorizer
generates a GIMPLE_COND that isn't supported by the target?
We check for cbranch, so currently you say you can do it but emulate
it with umaxp + fmov?
What do you do when there's only one V4SImode vector? You
could pack (truncate) that to V4HImode, right? Aka
vec_pack_trunc (x, x) -> V8HImode and the lower V4HI / 64bit is
then 'x'?
> > Also the inputs are vector bools(?), so you should V_C_E them to
> > data vectors before "adding" them. And check that they have
> > a vector mode that's not VnBImode for which I guess the addhn
> > semantics wouldn't be necessarily good enough.
>
> ADDHN can't be used for SVE (and so the optab isn't implemented on SVE modes)
> because SVE's version is even/odd. But for SVE we also don't want this
> codegen
> because SVE can branch on the result of the data compare. So we don't want
> the
> intermediate forced compression. So this is strictly for Adv. SIMD.
>
> >
> > How would you scale this to workset.length () > 2? I suppose
> > for an even number reduce to the half element size first, for
> > odd you could make it even by first reducing two vectors with IOR?
> > If small, either check for another narrowing addhn operation or
> > continue with IOR?
> >
>
> Because the instruction can't work on bytes, having > 2 just uses ORR
> Until we have == 2 and then ADDHN. You could use ADDHN for the
> Intermediate steps, but ADDHN hits a limit when you reach bytes.
>
> However one big benefit of using the ADDHN even in > 2 cases is that
> it prevents reassoc from breaking the ORR order we created in the
> vectorizer as it can't reassociate them back to a linear form as it does
> today.
>
> And the reason we can't match the ADDHN in the backend is that in
> order for us to know that the inputs are Boolean vectors we also have
> to match the compares. This means that the chain is longer than what
> combine tries since it has to match everything including the if_then_else
> and the reduction to set the CC.
>
> > That said, I still fail to see how addhn reduces the critical
> > latency?
> >
>
> Because it replaces 2 instruction on the critical reduction path with 1
> that is half the latency of the two it replaced. See example above.
So it's good enough to indeed combine the last two elements like your
patch does.
That said, I still wonder about the trigger - it shouldn't be
availability of the instruction as I'd think a addhn should be
not cheaper than a simple bitwise OR. Instead it's that
cbranch on the wider vector isn't available?
Richard.
> Thanks,
> Tamar
>
> >
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > > * internal-fn.def (VEC_ADD_HALVING_NARROW): New.
> > > * doc/generic.texi: Document it.
> > > * optabs.def (vec_addh_narrow): New.
> > > * doc/md.texi: Document it.
> > > * tree-vect-stmts.cc (vectorizable_early_exit): Use addhn if supported.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > * gcc.target/aarch64/vect-early-break-addhn_1.c: New test.
> > > * gcc.target/aarch64/vect-early-break-addhn_2.c: New test.
> > > * gcc.target/aarch64/vect-early-break-addhn_3.c: New test.
> > > * gcc.target/aarch64/vect-early-break-addhn_4.c: New test.
> > >
> > > ---
> > > diff --git a/gcc/doc/generic.texi b/gcc/doc/generic.texi
> > > index
> > d4ac580a7a8b9cd339d26cb97f7eb963f83746a4..ff16ff47bbf45e795df0d230e9
> > a885d9d218d9af 100644
> > > --- a/gcc/doc/generic.texi
> > > +++ b/gcc/doc/generic.texi
> > > @@ -1834,6 +1834,7 @@ a value from @code{enum annot_expr_kind}, the
> > third is an @code{INTEGER_CST}.
> > > @tindex IFN_VEC_WIDEN_MINUS_LO
> > > @tindex IFN_VEC_WIDEN_MINUS_EVEN
> > > @tindex IFN_VEC_WIDEN_MINUS_ODD
> > > +@tindex IFN_VEC_ADD_HALVING_NARROW
> > > @tindex VEC_UNPACK_HI_EXPR
> > > @tindex VEC_UNPACK_LO_EXPR
> > > @tindex VEC_UNPACK_FLOAT_HI_EXPR
> > > @@ -1956,6 +1957,24 @@ vector of @code{N/2} subtractions. In the case of
> > > vector are subtracted from the odd @code{N/2} of the first to produce the
> > > vector of @code{N/2} subtractions.
> > >
> > > +@item IFN_VEC_ADD_HALVING_NARROW
> > > +This internal function performs an addition of two input vectors,
> > > +then extracts the most significant half of each result element and
> > > +narrows it back to the original element width.
> > > +
> > > +Concretely, it computes:
> > > +@code{(bits(a)/2)((a + b) >> bits(a))}
> > > +
> > > +where @code{bits(a)} is the width in bits of each input element.
> > > +
> > > +Its operands are vectors containing the same number of elements
> > > (@code{N})
> > > +of the same integral type. The result is a vector of length @code{N},
> > > with
> > > +elements of an integral type whose size is half that of the input element
> > > +type.
> > > +
> > > +This operation currently only used for early break result compression
> > > when the
> > > +result of a vector boolean can be represented as 0 or -1.
> > > +
> > > @item VEC_UNPACK_HI_EXPR
> > > @itemx VEC_UNPACK_LO_EXPR
> > > These nodes represent unpacking of the high and low parts of the input
> > > vector,
> > > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> > > index
> > aba93f606eca59d31c103a05b2567fd4f3be55f3..ec0193e4eee079e00168bbaf9
> > b28ba8d52e5d464 100644
> > > --- a/gcc/doc/md.texi
> > > +++ b/gcc/doc/md.texi
> > > @@ -6087,6 +6087,25 @@ vectors with N signed/unsigned elements of size
> > S@. Find the absolute
> > > difference between operands 1 and 2 and widen the resulting elements.
> > > Put the N/2 results of size 2*S in the output vector (operand 0).
> > >
> > > +@cindex @code{vec_addh_narrow@var{m}} instruction pattern
> > > +@item @samp{vec_addh_narrow@var{m}}
> > > +Signed or unsigned addition of two input vectors, then extracts the
> > > +most significant half of each result element and narrows it back to the
> > > +original element width.
> > > +
> > > +Concretely, it computes:
> > > +@code{(bits(a)/2)((a + b) >> bits(a))}
> > > +
> > > +where @code{bits(a)} is the width in bits of each input element.
> > > +
> > > +Its operands (@code{1} and @code{2}) are vectors containing the same
> > number
> > > +of signed or unsigned integral elements (@code{N}) of size @code{S}. The
> > > +result (operand @code{0}) is a vector of length @code{N}, with elements
> > > of
> > > +an integral type whose size is half that of @code{S}.
> > > +
> > > +This operation currently only used for early break result compression
> > > when the
> > > +result of a vector boolean can be represented as 0 or -1.
> > > +
> > > @cindex @code{vec_addsub@var{m}3} instruction pattern
> > > @item @samp{vec_addsub@var{m}3}
> > > Alternating subtract, add with even lanes doing subtract and odd
> > > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> > > index
> > d2480a1bf7927476215bc7bb99c0b74197d2b7e9..cb18058d9f48cc0dff96ed4b
> > 31d0abc9adb67867 100644
> > > --- a/gcc/internal-fn.def
> > > +++ b/gcc/internal-fn.def
> > > @@ -422,6 +422,8 @@ DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT270,
> > ECF_CONST, cadd270, binary)
> > > DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL, ECF_CONST, cmul, binary)
> > > DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL_CONJ, ECF_CONST, cmul_conj,
> > binary)
> > > DEF_INTERNAL_OPTAB_FN (VEC_ADDSUB, ECF_CONST, vec_addsub, binary)
> > > +DEF_INTERNAL_OPTAB_FN (VEC_ADD_HALVING_NARROW, ECF_CONST |
> > ECF_NOTHROW,
> > > + vec_addh_narrow, binary)
> > > DEF_INTERNAL_WIDENING_OPTAB_FN (VEC_WIDEN_PLUS,
> > > ECF_CONST | ECF_NOTHROW,
> > > first,
> > > diff --git a/gcc/optabs.def b/gcc/optabs.def
> > > index
> > 87a8b85da1592646d0a3447572e842ceb158cd97..b2bedc3692f914c2b80d797
> > 2db81b542b32c9eb8 100644
> > > --- a/gcc/optabs.def
> > > +++ b/gcc/optabs.def
> > > @@ -492,6 +492,7 @@ OPTAB_D (vec_widen_uabd_hi_optab,
> > "vec_widen_uabd_hi_$a")
> > > OPTAB_D (vec_widen_uabd_lo_optab, "vec_widen_uabd_lo_$a")
> > > OPTAB_D (vec_widen_uabd_odd_optab, "vec_widen_uabd_odd_$a")
> > > OPTAB_D (vec_widen_uabd_even_optab, "vec_widen_uabd_even_$a")
> > > +OPTAB_D (vec_addh_narrow_optab, "vec_addh_narrow$a")
> > > OPTAB_D (vec_addsub_optab, "vec_addsub$a3")
> > > OPTAB_D (vec_fmaddsub_optab, "vec_fmaddsub$a4")
> > > OPTAB_D (vec_fmsubadd_optab, "vec_fmsubadd$a4")
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_1.c
> > b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_1.c
> > > new file mode 100644
> > > index
> > 0000000000000000000000000000000000000000..4ecb187513e525e0cd9b8
> > b063e418a75a23c525d
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_1.c
> > > @@ -0,0 +1,33 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" } */
> > > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > > +
> > > +#define TYPE int
> > > +#define N 800
> > > +
> > > +#pragma GCC target "+nosve"
> > > +
> > > +TYPE a[N];
> > > +
> > > +/*
> > > +** foo:
> > > +** ...
> > > +** ldp q[0-9]+, q[0-9]+, \[x[0-9]+\], 32
> > > +** cmeq v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
> > > +** cmeq v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
> > > +** addhn v[0-9]+.4h, v[0-9]+.4s, v[0-9]+.4s
> > > +** fmov x[0-9]+, d[0-9]+
> > > +** ...
> > > +*/
> > > +
> > > +int foo ()
> > > +{
> > > +#pragma GCC unroll 8
> > > + for (int i = 0; i < N; i++)
> > > + if (a[i] == 124)
> > > + return 1;
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump "VEC_ADD_HALFING_NARROW" "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_2.c
> > b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_2.c
> > > new file mode 100644
> > > index
> > 0000000000000000000000000000000000000000..d67d0d13d1733935aaf80
> > 5e59188eb8155cb5f06
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_2.c
> > > @@ -0,0 +1,33 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" } */
> > > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > > +
> > > +#define TYPE long long
> > > +#define N 800
> > > +
> > > +#pragma GCC target "+nosve"
> > > +
> > > +TYPE a[N];
> > > +
> > > +/*
> > > +** foo:
> > > +** ...
> > > +** ldp q[0-9]+, q[0-9]+, \[x[0-9]+\], 32
> > > +** cmeq v[0-9]+.2d, v[0-9]+.2d, v[0-9]+.2d
> > > +** cmeq v[0-9]+.2d, v[0-9]+.2d, v[0-9]+.2d
> > > +** addhn v[0-9]+.2s, v[0-9]+.2d, v[0-9]+.2d
> > > +** fmov x[0-9]+, d[0-9]+
> > > +** ...
> > > +*/
> > > +
> > > +int foo ()
> > > +{
> > > +#pragma GCC unroll 4
> > > + for (int i = 0; i < N; i++)
> > > + if (a[i] == 124)
> > > + return 1;
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump "VEC_ADD_HALFING_NARROW" "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_3.c
> > b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_3.c
> > > new file mode 100644
> > > index
> > 0000000000000000000000000000000000000000..57dbc44ae0cdcbcdccd3d8
> > dbe98c79713eaf5607
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_3.c
> > > @@ -0,0 +1,33 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" } */
> > > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > > +
> > > +#define TYPE short
> > > +#define N 800
> > > +
> > > +#pragma GCC target "+nosve"
> > > +
> > > +TYPE a[N];
> > > +
> > > +/*
> > > +** foo:
> > > +** ...
> > > +** ldp q[0-9]+, q[0-9]+, \[x[0-9]+\], 32
> > > +** cmeq v[0-9]+.8h, v[0-9]+.8h, v[0-9]+.8h
> > > +** cmeq v[0-9]+.8h, v[0-9]+.8h, v[0-9]+.8h
> > > +** addhn v[0-9]+.8b, v[0-9]+.8h, v[0-9]+.8h
> > > +** fmov x[0-9]+, d[0-9]+
> > > +** ...
> > > +*/
> > > +
> > > +int foo ()
> > > +{
> > > +#pragma GCC unroll 16
> > > + for (int i = 0; i < N; i++)
> > > + if (a[i] == 124)
> > > + return 1;
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump "VEC_ADD_HALFING_NARROW" "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_4.c
> > b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_4.c
> > > new file mode 100644
> > > index
> > 0000000000000000000000000000000000000000..8ad42b22024479283d681
> > 4d815ef1dce411d1c72
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_4.c
> > > @@ -0,0 +1,21 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" } */
> > > +
> > > +#define TYPE char
> > > +#define N 800
> > > +
> > > +#pragma GCC target "+nosve"
> > > +
> > > +TYPE a[N];
> > > +
> > > +int foo ()
> > > +{
> > > +#pragma GCC unroll 32
> > > + for (int i = 0; i < N; i++)
> > > + if (a[i] == 124)
> > > + return 1;
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump-not "VEC_ADD_HALFING_NARROW" "vect" } } */
> > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > > index
> > 1545fab364792f75bcc786ba1311b8bdc82edd70..179ce5e0a66b6f88976ffb54
> > 4c6874d7bec999a8 100644
> > > --- a/gcc/tree-vect-stmts.cc
> > > +++ b/gcc/tree-vect-stmts.cc
> > > @@ -12328,7 +12328,7 @@ vectorizable_early_exit (loop_vec_info loop_vinfo,
> > stmt_vec_info stmt_info,
> > > gimple *orig_stmt = STMT_VINFO_STMT (vect_orig_stmt (stmt_info));
> > > gcond *cond_stmt = as_a <gcond *>(orig_stmt);
> > >
> > > - tree cst = build_zero_cst (vectype);
> > > + tree vectype_out = vectype;
> > > auto bb = gimple_bb (cond_stmt);
> > > edge exit_true_edge = EDGE_SUCC (bb, 0);
> > > if (exit_true_edge->flags & EDGE_FALSE_VALUE)
> > > @@ -12452,12 +12452,40 @@ vectorizable_early_exit (loop_vec_info
> > loop_vinfo, stmt_vec_info stmt_info,
> > > else
> > > workset.splice (stmts);
> > >
> > > + /* See if we support ADDHN and use that for the reduction. */
> > > + internal_fn ifn = IFN_VEC_ADD_HALVING_NARROW;
> > > + bool addhn_supported_p
> > > + = direct_internal_fn_supported_p (ifn, vectype, OPTIMIZE_FOR_SPEED);
> > > + tree narrow_type = NULL_TREE;
> > > + if (addhn_supported_p)
> > > + {
> > > + /* Calculate the narrowing type for the result. */
> > > + auto halfprec = TYPE_PRECISION (TREE_TYPE (vectype)) / 2;
> > > + auto unsignedp = TYPE_UNSIGNED (TREE_TYPE (vectype));
> > > + tree itype = build_nonstandard_integer_type (halfprec, unsignedp);
> > > + poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
> > > + tree tmp_type = build_vector_type (itype, nunits);
> > > + narrow_type = truth_type_for (tmp_type);
> > > + }
> > > +
> > > while (workset.length () > 1)
> > > {
> > > - new_temp = make_temp_ssa_name (vectype, NULL, "vexit_reduc");
> > > tree arg0 = workset.pop ();
> > > tree arg1 = workset.pop ();
> > > - new_stmt = gimple_build_assign (new_temp, BIT_IOR_EXPR, arg0, arg1);
> > > + if (addhn_supported_p && workset.length () == 0)
> > > + {
> > > + new_stmt = gimple_build_call_internal (ifn, 2, arg0, arg1);
> > > + vectype_out = narrow_type;
> > > + new_temp = make_temp_ssa_name (vectype_out, NULL,
> > "vexit_reduc");
> > > + gimple_call_set_lhs (as_a <gcall *> (new_stmt), new_temp);
> > > + gimple_call_set_nothrow (as_a <gcall *> (new_stmt), true);
> > > + }
> > > + else
> > > + {
> > > + new_temp = make_temp_ssa_name (vectype_out, NULL,
> > "vexit_reduc");
> > > + new_stmt
> > > + = gimple_build_assign (new_temp, BIT_IOR_EXPR, arg0, arg1);
> > > + }
> > > vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt,
> > > &cond_gsi);
> > > workset.quick_insert (0, new_temp);
> > > @@ -12480,6 +12508,7 @@ vectorizable_early_exit (loop_vec_info loop_vinfo,
> > stmt_vec_info stmt_info,
> > >
> > > gcc_assert (new_temp);
> > >
> > > + tree cst = build_zero_cst (vectype_out);
> > > gimple_cond_set_condition (cond_stmt, NE_EXPR, new_temp, cst);
> > > update_stmt (orig_stmt);
> > >
> > >
> > >
> > >
> >
> > --
> > Richard Biener <[email protected]>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
>
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)