On Tue, 2 Sep 2025, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <[email protected]>
> > Sent: Tuesday, September 2, 2025 1:30 PM
> > To: Tamar Christina <[email protected]>
> > Cc: [email protected]; nd <[email protected]>
> > Subject: Re: [PATCH v2 3/3]middle-end: Use addhn for compression instead of
> > inclusive OR when reducing comparison values
> > 
> > On Tue, 2 Sep 2025, Tamar Christina wrote:
> > 
> > > Given a sequence such as
> > >
> > > int foo ()
> > > {
> > > #pragma GCC unroll 4
> > >   for (int i = 0; i < N; i++)
> > >     if (a[i] == 124)
> > >       return 1;
> > >
> > >   return 0;
> > > }
> > >
> > > where a[i] is long long, we will unroll the loop and use an OR reduction
> > > for early break on Adv. SIMD.  Afterwards the sequence is followed by a
> > > compression sequence to compress the 128-bit vectors into 64 bits for use
> > > by the branch.
> > >
> > > However, if we have support for add halving and narrowing then we can,
> > > instead of using an OR, use an ADDHN which will do the combining and
> > > narrowing.
> > >
> > > Note that for now I only do the last OR; however, if we have more than one
> > > level of unrolling we could technically chain them.  I will revisit this in
> > > another upcoming early break series, but an unroll of 2 is fairly common.
> > >
> > > Bootstrapped and regtested on aarch64-none-linux-gnu,
> > > arm-none-linux-gnueabihf, and x86_64-pc-linux-gnu (-m32, -m64) with no
> > > issues, and about a 10% improvement in this sequence for Adv. SIMD.
> > >
> > > Ok for master?
> > 
> > Hmm, so you are replacing the last bitwise OR with an
> > addhn which produces a "smaller" vector.  So like
> > 
> >  V4SI tem = V4SI | V4SI;
> >  if (tem != 0)
> > 
> > ->
> > 
> >  V4HI tem = .VEC_ADD_HALVING_NARROW (V4SI, V4SI);
> >  if (tem != 0)
> > 
> > whatever 'halving' now stands for (isn't that .VEC_ADD_HIGH_NARROW?)
> > 
> 
> Yeah, but it retrieves the high half, so I'm open to suggestions, but I think
> any name would be confusing.
> 
> > I can't see how that's in any way faster?  (the aarch64 testcases
> > unfortunately stop matching after the addhn)
> > 
> 
> Which is intentional.
> 
> The original code with the ORR is
> 
>         ldp     q31, q30, [x0]
>         cmeq    v31.2d, v31.2d, v29.2d
>         cmeq    v30.2d, v30.2d, v29.2d
>         orr     v31.16b, v31.16b, v30.16b
>         umaxp   v31.4s, v31.4s, v31.4s
>         fmov    x3, d31
> 
> because the result of the ORR is a 128-bit vector, it needs to be compressed
> into 64 bits to be transferred to a GPR so the != 0 can be performed.
> 
> ADDHN does the combination and compression in one step, i.e.
> 
> orr + umaxp -> addhn.
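> 
> Roughly, the resulting sequence then looks like the following (a sketch only;
> the register numbers are illustrative and follow what the new testcases
> check for):
> 
>         ldp     q31, q30, [x0]
>         cmeq    v31.2d, v31.2d, v29.2d
>         cmeq    v30.2d, v30.2d, v29.2d
>         addhn   v31.2s, v31.2d, v30.2d
>         fmov    x3, d31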

Ah, I see.  So AdvSIMD lacks a ptest, and instead you go to a GPR.
The above code does a max reduction and the fmov moves the
scalar reduction result to a GPR?  But with addhn you move
the whole (64-bit) vector reg to a GPR?

It seems to me that on the vectorizer side it's not so interesting
to know that the target can do addhn, but rather that the target can't do
a {u,}cmpv4si with EQ?  That is, without the patch the vectorizer
generates a GIMPLE_COND that isn't supported by the target?

We check for cbranch, so currently you say you can do it but emulate
it with umaxp + fmov?

What do you do when there's only one V4SImode vector?  You
could pack (truncate) that to V4HImode, right?  Aka
vec_pack_trunc (x, x) -> V8HImode and the lower V4HI / 64bit is
then 'x'?

> > Also the inputs are vector bools(?), so you should V_C_E them to
> > data vectors before "adding" them.  And check that they have
> > a vector mode that's not VnBImode, for which I guess the addhn
> > semantics wouldn't necessarily be good enough.
> 
> ADDHN can't be used for SVE (and so the optab isn't implemented on SVE modes)
> because SVE's version is even/odd.  But for SVE we also don't want this
> codegen because SVE can branch on the result of the data compare.  So we
> don't want the intermediate forced compression.  So this is strictly for
> Adv. SIMD.
> 
> > 
> > How would you scale this to workset.length () > 2?  I suppose
> > for an even number reduce to the half element size first, for
> > odd you could make it even by first reducing two vectors with IOR?
> > If small, either check for another narrowing addhn operation or
> > continue with IOR?
> > 
> 
> Because the instruction can't work on bytes, having > 2 just uses ORR
> until we have == 2 and then ADDHN.  You could use ADDHN for the
> intermediate steps, but ADDHN hits a limit when you reach bytes.
>
> However one big benefit of using the ADDHN even in > 2 cases is that
> it prevents reassoc from breaking the ORR order we created in the
> vectorizer, as it can't reassociate them back to a linear form as it does
> today.
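> 
> For example, with four V4SI compare results the chain would look roughly
> like this (a sketch only; register numbers and the exact association order
> are illustrative, ORR until two vectors remain and then ADDHN):
> 
>         orr     v0.16b, v0.16b, v1.16b
>         orr     v0.16b, v0.16b, v2.16b
>         addhn   v0.4h, v0.4s, v3.4s
>         fmov    x0, d0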
> 
> And the reason we can't match the ADDHN in the backend is that in
> order for us to know that the inputs are Boolean vectors we also have
> to match the compares.  This means that the chain is longer than what
> combine tries since it has to match everything including the if_then_else
> and the reduction to set the CC.
> 
> > That said, I still fail to see how addhn reduces the critical
> > latency?
> > 
> 
> Because it replaces 2 instructions on the critical reduction path with 1
> that is half the latency of the two it replaced.  See example above.

So it's good enough to indeed combine the last two elements like your
patch does.

That said, I still wonder about the trigger - it shouldn't be
the availability of the instruction, as I'd think an addhn should be
no cheaper than a simple bitwise OR.  Instead it's that
cbranch on the wider vector isn't available?

Richard.

> Thanks,
> Tamar
> 
> > 
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > >   * internal-fn.def (VEC_ADD_HALVING_NARROW): New.
> > >   * doc/generic.texi: Document it.
> > >   * optabs.def (vec_addh_narrow): New.
> > >   * doc/md.texi: Document it.
> > >   * tree-vect-stmts.cc (vectorizable_early_exit): Use addhn if supported.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >   * gcc.target/aarch64/vect-early-break-addhn_1.c: New test.
> > >   * gcc.target/aarch64/vect-early-break-addhn_2.c: New test.
> > >   * gcc.target/aarch64/vect-early-break-addhn_3.c: New test.
> > >   * gcc.target/aarch64/vect-early-break-addhn_4.c: New test.
> > >
> > > ---
> > > diff --git a/gcc/doc/generic.texi b/gcc/doc/generic.texi
> > > index d4ac580a7a8b9cd339d26cb97f7eb963f83746a4..ff16ff47bbf45e795df0d230e9a885d9d218d9af 100644
> > > --- a/gcc/doc/generic.texi
> > > +++ b/gcc/doc/generic.texi
> > > @@ -1834,6 +1834,7 @@ a value from @code{enum annot_expr_kind}, the third is an @code{INTEGER_CST}.
> > >  @tindex IFN_VEC_WIDEN_MINUS_LO
> > >  @tindex IFN_VEC_WIDEN_MINUS_EVEN
> > >  @tindex IFN_VEC_WIDEN_MINUS_ODD
> > > +@tindex IFN_VEC_ADD_HALVING_NARROW
> > >  @tindex VEC_UNPACK_HI_EXPR
> > >  @tindex VEC_UNPACK_LO_EXPR
> > >  @tindex VEC_UNPACK_FLOAT_HI_EXPR
> > > @@ -1956,6 +1957,24 @@ vector of @code{N/2} subtractions.  In the case of
> > >  vector are subtracted from the odd @code{N/2} of the first to produce the
> > >  vector of @code{N/2} subtractions.
> > >
> > > +@item IFN_VEC_ADD_HALVING_NARROW
> > > +This internal function performs an addition of two input vectors,
> > > +then extracts the most significant half of each result element,
> > > +yielding elements of half the original width.
> > > +
> > > +Concretely, it computes:
> > > +@code{(bits(a)/2)((a + b) >> (bits(a)/2))}
> > > +
> > > +where @code{bits(a)} is the width in bits of each input element.
> > > +
> > > +Its operands are vectors containing the same number of elements (@code{N})
> > > +of the same integral type.  The result is a vector of length @code{N}, with
> > > +elements of an integral type whose size is half that of the input element
> > > +type.
> > > +
> > > +This operation is currently only used for early break result compression
> > > +when the result of a vector boolean can be represented as 0 or -1.
> > > +
> > >  @item VEC_UNPACK_HI_EXPR
> > >  @itemx VEC_UNPACK_LO_EXPR
> > >  These nodes represent unpacking of the high and low parts of the input 
> > > vector,
> > > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> > > index aba93f606eca59d31c103a05b2567fd4f3be55f3..ec0193e4eee079e00168bbaf9b28ba8d52e5d464 100644
> > > --- a/gcc/doc/md.texi
> > > +++ b/gcc/doc/md.texi
> > > @@ -6087,6 +6087,25 @@ vectors with N signed/unsigned elements of size S@.  Find the absolute
> > >  difference between operands 1 and 2 and widen the resulting elements.
> > >  Put the N/2 results of size 2*S in the output vector (operand 0).
> > >
> > > +@cindex @code{vec_addh_narrow@var{m}} instruction pattern
> > > +@item @samp{vec_addh_narrow@var{m}}
> > > +Signed or unsigned addition of two input vectors, then extraction of the
> > > +most significant half of each result element, yielding elements of half
> > > +the original width.
> > > +
> > > +Concretely, it computes:
> > > +@code{(bits(a)/2)((a + b) >> (bits(a)/2))}
> > > +
> > > +where @code{bits(a)} is the width in bits of each input element.
> > > +
> > > +Its operands (@code{1} and @code{2}) are vectors containing the same number
> > > +of signed or unsigned integral elements (@code{N}) of size @code{S}.  The
> > > +result (operand @code{0}) is a vector of length @code{N}, with elements of
> > > +an integral type whose size is half that of @code{S}.
> > > +
> > > +This operation is currently only used for early break result compression
> > > +when the result of a vector boolean can be represented as 0 or -1.
> > > +
> > >  @cindex @code{vec_addsub@var{m}3} instruction pattern
> > >  @item @samp{vec_addsub@var{m}3}
> > >  Alternating subtract, add with even lanes doing subtract and odd
> > > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> > > index d2480a1bf7927476215bc7bb99c0b74197d2b7e9..cb18058d9f48cc0dff96ed4b31d0abc9adb67867 100644
> > > --- a/gcc/internal-fn.def
> > > +++ b/gcc/internal-fn.def
> > > @@ -422,6 +422,8 @@ DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT270, ECF_CONST, cadd270, binary)
> > >  DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL, ECF_CONST, cmul, binary)
> > >  DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL_CONJ, ECF_CONST, cmul_conj, binary)
> > >  DEF_INTERNAL_OPTAB_FN (VEC_ADDSUB, ECF_CONST, vec_addsub, binary)
> > > +DEF_INTERNAL_OPTAB_FN (VEC_ADD_HALVING_NARROW, ECF_CONST | ECF_NOTHROW,
> > > +                vec_addh_narrow, binary)
> > >  DEF_INTERNAL_WIDENING_OPTAB_FN (VEC_WIDEN_PLUS,
> > >                           ECF_CONST | ECF_NOTHROW,
> > >                           first,
> > > diff --git a/gcc/optabs.def b/gcc/optabs.def
> > > index 87a8b85da1592646d0a3447572e842ceb158cd97..b2bedc3692f914c2b80d7972db81b542b32c9eb8 100644
> > > --- a/gcc/optabs.def
> > > +++ b/gcc/optabs.def
> > > @@ -492,6 +492,7 @@ OPTAB_D (vec_widen_uabd_hi_optab, "vec_widen_uabd_hi_$a")
> > >  OPTAB_D (vec_widen_uabd_lo_optab, "vec_widen_uabd_lo_$a")
> > >  OPTAB_D (vec_widen_uabd_odd_optab, "vec_widen_uabd_odd_$a")
> > >  OPTAB_D (vec_widen_uabd_even_optab, "vec_widen_uabd_even_$a")
> > > +OPTAB_D (vec_addh_narrow_optab, "vec_addh_narrow$a")
> > >  OPTAB_D (vec_addsub_optab, "vec_addsub$a3")
> > >  OPTAB_D (vec_fmaddsub_optab, "vec_fmaddsub$a4")
> > >  OPTAB_D (vec_fmsubadd_optab, "vec_fmsubadd$a4")
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_1.c b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_1.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..4ecb187513e525e0cd9b8b063e418a75a23c525d
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_1.c
> > > @@ -0,0 +1,33 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" } */
> > > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > > +
> > > +#define TYPE int
> > > +#define N 800
> > > +
> > > +#pragma GCC target "+nosve"
> > > +
> > > +TYPE a[N];
> > > +
> > > +/*
> > > +** foo:
> > > +**       ...
> > > +**       ldp     q[0-9]+, q[0-9]+, \[x[0-9]+\], 32
> > > +**       cmeq    v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
> > > +**       cmeq    v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
> > > +**       addhn   v[0-9]+.4h, v[0-9]+.4s, v[0-9]+.4s
> > > +**       fmov    x[0-9]+, d[0-9]+
> > > +**       ...
> > > +*/
> > > +
> > > +int foo ()
> > > +{
> > > +#pragma GCC unroll 8
> > > +  for (int i = 0; i < N; i++)
> > > +    if (a[i] == 124)
> > > +      return 1;
> > > +
> > > +  return 0;
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump "VEC_ADD_HALVING_NARROW" "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_2.c b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_2.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..d67d0d13d1733935aaf805e59188eb8155cb5f06
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_2.c
> > > @@ -0,0 +1,33 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" } */
> > > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > > +
> > > +#define TYPE long long
> > > +#define N 800
> > > +
> > > +#pragma GCC target "+nosve"
> > > +
> > > +TYPE a[N];
> > > +
> > > +/*
> > > +** foo:
> > > +**       ...
> > > +**       ldp     q[0-9]+, q[0-9]+, \[x[0-9]+\], 32
> > > +**       cmeq    v[0-9]+.2d, v[0-9]+.2d, v[0-9]+.2d
> > > +**       cmeq    v[0-9]+.2d, v[0-9]+.2d, v[0-9]+.2d
> > > +**       addhn   v[0-9]+.2s, v[0-9]+.2d, v[0-9]+.2d
> > > +**       fmov    x[0-9]+, d[0-9]+
> > > +**       ...
> > > +*/
> > > +
> > > +int foo ()
> > > +{
> > > +#pragma GCC unroll 4
> > > +  for (int i = 0; i < N; i++)
> > > +    if (a[i] == 124)
> > > +      return 1;
> > > +
> > > +  return 0;
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump "VEC_ADD_HALVING_NARROW" "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_3.c b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_3.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..57dbc44ae0cdcbcdccd3d8dbe98c79713eaf5607
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_3.c
> > > @@ -0,0 +1,33 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" } */
> > > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > > +
> > > +#define TYPE short
> > > +#define N 800
> > > +
> > > +#pragma GCC target "+nosve"
> > > +
> > > +TYPE a[N];
> > > +
> > > +/*
> > > +** foo:
> > > +**       ...
> > > +**       ldp     q[0-9]+, q[0-9]+, \[x[0-9]+\], 32
> > > +**       cmeq    v[0-9]+.8h, v[0-9]+.8h, v[0-9]+.8h
> > > +**       cmeq    v[0-9]+.8h, v[0-9]+.8h, v[0-9]+.8h
> > > +**       addhn   v[0-9]+.8b, v[0-9]+.8h, v[0-9]+.8h
> > > +**       fmov    x[0-9]+, d[0-9]+
> > > +**       ...
> > > +*/
> > > +
> > > +int foo ()
> > > +{
> > > +#pragma GCC unroll 16
> > > +  for (int i = 0; i < N; i++)
> > > +    if (a[i] == 124)
> > > +      return 1;
> > > +
> > > +  return 0;
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump "VEC_ADD_HALVING_NARROW" "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_4.c b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_4.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..8ad42b22024479283d6814d815ef1dce411d1c72
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_4.c
> > > @@ -0,0 +1,21 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" } */
> > > +
> > > +#define TYPE char
> > > +#define N 800
> > > +
> > > +#pragma GCC target "+nosve"
> > > +
> > > +TYPE a[N];
> > > +
> > > +int foo ()
> > > +{
> > > +#pragma GCC unroll 32
> > > +  for (int i = 0; i < N; i++)
> > > +    if (a[i] == 124)
> > > +      return 1;
> > > +
> > > +  return 0;
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump-not "VEC_ADD_HALVING_NARROW" "vect" } } */
> > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > > index 1545fab364792f75bcc786ba1311b8bdc82edd70..179ce5e0a66b6f88976ffb544c6874d7bec999a8 100644
> > > --- a/gcc/tree-vect-stmts.cc
> > > +++ b/gcc/tree-vect-stmts.cc
> > > @@ -12328,7 +12328,7 @@ vectorizable_early_exit (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
> > >    gimple *orig_stmt = STMT_VINFO_STMT (vect_orig_stmt (stmt_info));
> > >    gcond *cond_stmt = as_a <gcond *>(orig_stmt);
> > >
> > > -  tree cst = build_zero_cst (vectype);
> > > +  tree vectype_out = vectype;
> > >    auto bb = gimple_bb (cond_stmt);
> > >    edge exit_true_edge = EDGE_SUCC (bb, 0);
> > >    if (exit_true_edge->flags & EDGE_FALSE_VALUE)
> > > @@ -12452,12 +12452,40 @@ vectorizable_early_exit (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
> > >        else
> > >   workset.splice (stmts);
> > >
> > > +      /* See if we support ADDHN and use that for the reduction.  */
> > > +      internal_fn ifn = IFN_VEC_ADD_HALVING_NARROW;
> > > +      bool addhn_supported_p
> > > + = direct_internal_fn_supported_p (ifn, vectype, OPTIMIZE_FOR_SPEED);
> > > +      tree narrow_type = NULL_TREE;
> > > +      if (addhn_supported_p)
> > > + {
> > > +   /* Calculate the narrowing type for the result.  */
> > > +   auto halfprec = TYPE_PRECISION (TREE_TYPE (vectype)) / 2;
> > > +   auto unsignedp = TYPE_UNSIGNED (TREE_TYPE (vectype));
> > > +   tree itype = build_nonstandard_integer_type (halfprec, unsignedp);
> > > +   poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
> > > +   tree tmp_type = build_vector_type (itype, nunits);
> > > +   narrow_type = truth_type_for (tmp_type);
> > > + }
> > > +
> > >        while (workset.length () > 1)
> > >   {
> > > -   new_temp = make_temp_ssa_name (vectype, NULL, "vexit_reduc");
> > >     tree arg0 = workset.pop ();
> > >     tree arg1 = workset.pop ();
> > > -   new_stmt = gimple_build_assign (new_temp, BIT_IOR_EXPR, arg0, arg1);
> > > +   if (addhn_supported_p && workset.length () == 0)
> > > +     {
> > > +       new_stmt = gimple_build_call_internal (ifn, 2, arg0, arg1);
> > > +       vectype_out = narrow_type;
> > > +       new_temp = make_temp_ssa_name (vectype_out, NULL, "vexit_reduc");
> > > +       gimple_call_set_lhs (as_a <gcall *> (new_stmt), new_temp);
> > > +       gimple_call_set_nothrow (as_a <gcall *> (new_stmt), true);
> > > +     }
> > > +   else
> > > +     {
> > > +       new_temp = make_temp_ssa_name (vectype_out, NULL, "vexit_reduc");
> > > +       new_stmt
> > > +         = gimple_build_assign (new_temp, BIT_IOR_EXPR, arg0, arg1);
> > > +     }
> > >     vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt,
> > >                                  &cond_gsi);
> > >     workset.quick_insert (0, new_temp);
> > > @@ -12480,6 +12508,7 @@ vectorizable_early_exit (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
> > >
> > >    gcc_assert (new_temp);
> > >
> > > +  tree cst = build_zero_cst (vectype_out);
> > >    gimple_cond_set_condition (cond_stmt, NE_EXPR, new_temp, cst);
> > >    update_stmt (orig_stmt);
> > >
> > >
> > >
> > >
> > 
> > --
> > Richard Biener <[email protected]>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> 

-- 
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
