Re: [PATCH v2 3/3]middle-end: Use addhn for compression instead of inclusive OR when reducing comparison values

Richard Biener Tue, 02 Sep 2025 04:35:40 -0700

On Tue, 2 Sep 2025, Tamar Christina wrote:

> Given a sequence such as
> 
> int foo ()
> {
> #pragma GCC unroll 4
>   for (int i = 0; i < N; i++)
>     if (a[i] == 124)
>       return 1;
> 
>   return 0;
> }
> 
> where a[i] is long long, we will unroll the loop and use an OR reduction for
> early break on Adv. SIMD.  Afterwards the sequence is followed by a 
> compression
> sequence to compress the 128-bit vectors into 64-bits for use by the branch.
> 
> However if we have support for add halving and narrowing then we can instead 
> of
> using an OR, use an ADDHN which will do the combining and narrowing.
> 
> Note that for now I only do the last OR, however if we have more than one 
> level
> of unrolling we could technically chain them.  I will revisit this in another
> up coming early break series, however an unroll of 2 is fairly common.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu,
> arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
> -m32, -m64 and no issues and about a 10% improvements
> in this sequence for Adv. SIMD.
> 
> Ok for master?


Hmm, so you are replacing the last bitwise OR with a
addhn which produces a "smaller" vector.  So like

 V4SI tem = V4SI | V4SI;
 if (tem != 0)

->

 V4HI tem = .VEC_ADD_HALVING_NARROW (V4SI, V4SI);
 if (tem != 0)

whatever 'halving' now stands for (isn't that .VEC_ADD_HIGH_NARROW?)

I can't see how that's in any way faster?  (the aarch64 testcases
unfortunately stop matching after the addhn)

Also the inputs are vector bools(?), so you should V_C_E them to
data vectors before "adding" them.  And check that they have
a vector mode that's not VnBImode for which I guess the addhn
semantics wouldn't be necessarily good enough.

How would you scale this to workset.length () > 2?  I suppose
for an even number reduce to the half element size first, for
odd you could make it even by first reducing two vectors with IOR?
If small, either check for another narrowing addhn operation or
continue with IOR?

That said, I still fail to see how addhn reduces the critical
latency?
  

> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
>       * internal-fn.def (VEC_ADD_HALVING_NARROW): New.
>       * doc/generic.texi: Document it.
>       * optabs.def (vec_addh_narrow): New.
>       * doc/md.texi: Document it.
>       * tree-vect-stmts.cc (vectorizable_early_exit): Use addhn if supported.
> 
> gcc/testsuite/ChangeLog:
> 
>       * gcc.target/aarch64/vect-early-break-addhn_1.c: New test.
>       * gcc.target/aarch64/vect-early-break-addhn_2.c: New test.
>       * gcc.target/aarch64/vect-early-break-addhn_3.c: New test.
>       * gcc.target/aarch64/vect-early-break-addhn_4.c: New test.
> 
> ---
> diff --git a/gcc/doc/generic.texi b/gcc/doc/generic.texi
> index 
> d4ac580a7a8b9cd339d26cb97f7eb963f83746a4..ff16ff47bbf45e795df0d230e9a885d9d218d9af
>  100644
> --- a/gcc/doc/generic.texi
> +++ b/gcc/doc/generic.texi
> @@ -1834,6 +1834,7 @@ a value from @code{enum annot_expr_kind}, the third is 
> an @code{INTEGER_CST}.
>  @tindex IFN_VEC_WIDEN_MINUS_LO
>  @tindex IFN_VEC_WIDEN_MINUS_EVEN
>  @tindex IFN_VEC_WIDEN_MINUS_ODD
> +@tindex IFN_VEC_ADD_HALVING_NARROW
>  @tindex VEC_UNPACK_HI_EXPR
>  @tindex VEC_UNPACK_LO_EXPR
>  @tindex VEC_UNPACK_FLOAT_HI_EXPR
> @@ -1956,6 +1957,24 @@ vector of @code{N/2} subtractions.  In the case of
>  vector are subtracted from the odd @code{N/2} of the first to produce the
>  vector of @code{N/2} subtractions.
>  
> +@item IFN_VEC_ADD_HALVING_NARROW
> +This internal function performs an addition of two input vectors,
> +then extracts the most significant half of each result element and
> +narrows it back to the original element width.
> +
> +Concretely, it computes:
> +@code{(bits(a)/2)((a + b) >> bits(a))}
> +
> +where @code{bits(a)} is the width in bits of each input element.
> +
> +Its operands are vectors containing the same number of elements (@code{N})
> +of the same integral type.  The result is a vector of length @code{N}, with
> +elements of an integral type whose size is half that of the input element
> +type.
> +
> +This operation currently only used for early break result compression when 
> the
> +result of a vector boolean can be represented as 0 or -1.
> +
>  @item VEC_UNPACK_HI_EXPR
>  @itemx VEC_UNPACK_LO_EXPR
>  These nodes represent unpacking of the high and low parts of the input 
> vector,
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index 
> aba93f606eca59d31c103a05b2567fd4f3be55f3..ec0193e4eee079e00168bbaf9b28ba8d52e5d464
>  100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -6087,6 +6087,25 @@ vectors with N signed/unsigned elements of size S@.  
> Find the absolute
>  difference between operands 1 and 2 and widen the resulting elements.
>  Put the N/2 results of size 2*S in the output vector (operand 0).
>  
> +@cindex @code{vec_addh_narrow@var{m}} instruction pattern
> +@item @samp{vec_addh_narrow@var{m}}
> +Signed or unsigned addition of two input vectors, then extracts the
> +most significant half of each result element and narrows it back to the
> +original element width.
> +
> +Concretely, it computes:
> +@code{(bits(a)/2)((a + b) >> bits(a))}
> +
> +where @code{bits(a)} is the width in bits of each input element.
> +
> +Its operands (@code{1} and @code{2}) are vectors containing the same number
> +of signed or unsigned integral elements (@code{N}) of size @code{S}.  The
> +result (operand @code{0}) is a vector of length @code{N}, with elements of
> +an integral type whose size is half that of @code{S}.
> +
> +This operation currently only used for early break result compression when 
> the
> +result of a vector boolean can be represented as 0 or -1.
> +
>  @cindex @code{vec_addsub@var{m}3} instruction pattern
>  @item @samp{vec_addsub@var{m}3}
>  Alternating subtract, add with even lanes doing subtract and odd
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index 
> d2480a1bf7927476215bc7bb99c0b74197d2b7e9..cb18058d9f48cc0dff96ed4b31d0abc9adb67867
>  100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -422,6 +422,8 @@ DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT270, ECF_CONST, 
> cadd270, binary)
>  DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL, ECF_CONST, cmul, binary)
>  DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL_CONJ, ECF_CONST, cmul_conj, binary)
>  DEF_INTERNAL_OPTAB_FN (VEC_ADDSUB, ECF_CONST, vec_addsub, binary)
> +DEF_INTERNAL_OPTAB_FN (VEC_ADD_HALVING_NARROW, ECF_CONST | ECF_NOTHROW,
> +                    vec_addh_narrow, binary)
>  DEF_INTERNAL_WIDENING_OPTAB_FN (VEC_WIDEN_PLUS,
>                               ECF_CONST | ECF_NOTHROW,
>                               first,
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index 
> 87a8b85da1592646d0a3447572e842ceb158cd97..b2bedc3692f914c2b80d7972db81b542b32c9eb8
>  100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -492,6 +492,7 @@ OPTAB_D (vec_widen_uabd_hi_optab, "vec_widen_uabd_hi_$a")
>  OPTAB_D (vec_widen_uabd_lo_optab, "vec_widen_uabd_lo_$a")
>  OPTAB_D (vec_widen_uabd_odd_optab, "vec_widen_uabd_odd_$a")
>  OPTAB_D (vec_widen_uabd_even_optab, "vec_widen_uabd_even_$a")
> +OPTAB_D (vec_addh_narrow_optab, "vec_addh_narrow$a")
>  OPTAB_D (vec_addsub_optab, "vec_addsub$a3")
>  OPTAB_D (vec_fmaddsub_optab, "vec_fmaddsub$a4")
>  OPTAB_D (vec_fmsubadd_optab, "vec_fmsubadd$a4")
> diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_1.c 
> b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_1.c
> new file mode 100644
> index 
> 0000000000000000000000000000000000000000..4ecb187513e525e0cd9b8b063e418a75a23c525d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_1.c
> @@ -0,0 +1,33 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +#define TYPE int
> +#define N 800
> +
> +#pragma GCC target "+nosve"
> +
> +TYPE a[N];
> +
> +/*
> +** foo:
> +**   ...
> +**   ldp     q[0-9]+, q[0-9]+, \[x[0-9]+\], 32
> +**   cmeq    v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
> +**   cmeq    v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
> +**   addhn   v[0-9]+.4h, v[0-9]+.4s, v[0-9]+.4s
> +**   fmov    x[0-9]+, d[0-9]+
> +**   ...
> +*/
> +
> +int foo ()
> +{
> +#pragma GCC unroll 8
> +  for (int i = 0; i < N; i++)
> +    if (a[i] == 124)
> +      return 1;
> +
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump "VEC_ADD_HALFING_NARROW" "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_2.c 
> b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_2.c
> new file mode 100644
> index 
> 0000000000000000000000000000000000000000..d67d0d13d1733935aaf805e59188eb8155cb5f06
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_2.c
> @@ -0,0 +1,33 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +#define TYPE long long
> +#define N 800
> +
> +#pragma GCC target "+nosve"
> +
> +TYPE a[N];
> +
> +/*
> +** foo:
> +**   ...
> +**   ldp     q[0-9]+, q[0-9]+, \[x[0-9]+\], 32
> +**   cmeq    v[0-9]+.2d, v[0-9]+.2d, v[0-9]+.2d
> +**   cmeq    v[0-9]+.2d, v[0-9]+.2d, v[0-9]+.2d
> +**   addhn   v[0-9]+.2s, v[0-9]+.2d, v[0-9]+.2d
> +**   fmov    x[0-9]+, d[0-9]+
> +**   ...
> +*/
> +
> +int foo ()
> +{
> +#pragma GCC unroll 4
> +  for (int i = 0; i < N; i++)
> +    if (a[i] == 124)
> +      return 1;
> +
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump "VEC_ADD_HALFING_NARROW" "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_3.c 
> b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_3.c
> new file mode 100644
> index 
> 0000000000000000000000000000000000000000..57dbc44ae0cdcbcdccd3d8dbe98c79713eaf5607
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_3.c
> @@ -0,0 +1,33 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +#define TYPE short
> +#define N 800
> +
> +#pragma GCC target "+nosve"
> +
> +TYPE a[N];
> +
> +/*
> +** foo:
> +**   ...
> +**   ldp     q[0-9]+, q[0-9]+, \[x[0-9]+\], 32
> +**   cmeq    v[0-9]+.8h, v[0-9]+.8h, v[0-9]+.8h
> +**   cmeq    v[0-9]+.8h, v[0-9]+.8h, v[0-9]+.8h
> +**   addhn   v[0-9]+.8b, v[0-9]+.8h, v[0-9]+.8h
> +**   fmov    x[0-9]+, d[0-9]+
> +**   ...
> +*/
> +
> +int foo ()
> +{
> +#pragma GCC unroll 16
> +  for (int i = 0; i < N; i++)
> +    if (a[i] == 124)
> +      return 1;
> +
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump "VEC_ADD_HALFING_NARROW" "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_4.c 
> b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_4.c
> new file mode 100644
> index 
> 0000000000000000000000000000000000000000..8ad42b22024479283d6814d815ef1dce411d1c72
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_4.c
> @@ -0,0 +1,21 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" } */
> +
> +#define TYPE char
> +#define N 800
> +
> +#pragma GCC target "+nosve"
> +
> +TYPE a[N];
> +
> +int foo ()
> +{
> +#pragma GCC unroll 32
> +  for (int i = 0; i < N; i++)
> +    if (a[i] == 124)
> +      return 1;
> +
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-not "VEC_ADD_HALFING_NARROW" "vect" } } */
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 
> 1545fab364792f75bcc786ba1311b8bdc82edd70..179ce5e0a66b6f88976ffb544c6874d7bec999a8
>  100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -12328,7 +12328,7 @@ vectorizable_early_exit (loop_vec_info loop_vinfo, 
> stmt_vec_info stmt_info,
>    gimple *orig_stmt = STMT_VINFO_STMT (vect_orig_stmt (stmt_info));
>    gcond *cond_stmt = as_a <gcond *>(orig_stmt);
>  
> -  tree cst = build_zero_cst (vectype);
> +  tree vectype_out = vectype;
>    auto bb = gimple_bb (cond_stmt);
>    edge exit_true_edge = EDGE_SUCC (bb, 0);
>    if (exit_true_edge->flags & EDGE_FALSE_VALUE)
> @@ -12452,12 +12452,40 @@ vectorizable_early_exit (loop_vec_info loop_vinfo, 
> stmt_vec_info stmt_info,
>        else
>       workset.splice (stmts);
>  
> +      /* See if we support ADDHN and use that for the reduction.  */
> +      internal_fn ifn = IFN_VEC_ADD_HALVING_NARROW;
> +      bool addhn_supported_p
> +     = direct_internal_fn_supported_p (ifn, vectype, OPTIMIZE_FOR_SPEED);
> +      tree narrow_type = NULL_TREE;
> +      if (addhn_supported_p)
> +     {
> +       /* Calculate the narrowing type for the result.  */
> +       auto halfprec = TYPE_PRECISION (TREE_TYPE (vectype)) / 2;
> +       auto unsignedp = TYPE_UNSIGNED (TREE_TYPE (vectype));
> +       tree itype = build_nonstandard_integer_type (halfprec, unsignedp);
> +       poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
> +       tree tmp_type = build_vector_type (itype, nunits);
> +       narrow_type = truth_type_for (tmp_type);
> +     }
> +
>        while (workset.length () > 1)
>       {
> -       new_temp = make_temp_ssa_name (vectype, NULL, "vexit_reduc");
>         tree arg0 = workset.pop ();
>         tree arg1 = workset.pop ();
> -       new_stmt = gimple_build_assign (new_temp, BIT_IOR_EXPR, arg0, arg1);
> +       if (addhn_supported_p && workset.length () == 0)
> +         {
> +           new_stmt = gimple_build_call_internal (ifn, 2, arg0, arg1);
> +           vectype_out = narrow_type;
> +           new_temp = make_temp_ssa_name (vectype_out, NULL, "vexit_reduc");
> +           gimple_call_set_lhs (as_a <gcall *> (new_stmt), new_temp);
> +           gimple_call_set_nothrow (as_a <gcall *> (new_stmt), true);
> +         }
> +       else
> +         {
> +           new_temp = make_temp_ssa_name (vectype_out, NULL, "vexit_reduc");
> +           new_stmt
> +             = gimple_build_assign (new_temp, BIT_IOR_EXPR, arg0, arg1);
> +         }
>         vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt,
>                                      &cond_gsi);
>         workset.quick_insert (0, new_temp);
> @@ -12480,6 +12508,7 @@ vectorizable_early_exit (loop_vec_info loop_vinfo, 
> stmt_vec_info stmt_info,
>  
>    gcc_assert (new_temp);
>  
> +  tree cst = build_zero_cst (vectype_out);
>    gimple_cond_set_condition (cond_stmt, NE_EXPR, new_temp, cst);
>    update_stmt (orig_stmt);
>  
> 
> 
> 

-- 
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

Re: [PATCH v2 3/3]middle-end: Use addhn for compression instead of inclusive OR when reducing comparison values

Reply via email to