On Wed, 20 Aug 2025, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <richard.guent...@gmail.com>
> > Sent: Wednesday, August 20, 2025 1:48 PM
> > To: Tamar Christina <tamar.christ...@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; rguent...@suse.de
> > Subject: Re: [PATCH 2/5]middle-end: Add detection for add halfing and 
> > narrowing
> > instruction
> > 
> > On Tue, Aug 19, 2025 at 6:29 AM Tamar Christina <tamar.christ...@arm.com>
> > wrote:
> > >
> > > This adds support for detection of the ADDHN pattern in the vectorizer.
> > >
> > > Concretely try to detect
> > >
> > >  _1 = (W)a
> > >  _2 = (W)b
> > >  _3 = _1 + _2
> > >  _4 = _3 >> (precision(a) / 2)
> > >  _5 = (N)_4
> > >
> > >  where
> > >    W = precision (a) * 2
> > >    N = precision (a) / 2
> > 
> > Hmm.  Is the widening because of UB with signed overflow?  The
> > actual carry of a + b doesn't end up in (N)(_3 >> (precision(a) / 2)).
> > I'd expect that for unsigned a and b you could see just
> > (N)((a + b) >> (precision(a) / 2)), no?  Integer promotion would make
> > this difficult to write, of course, unless the patterns exist for SImode
> > -> HImode add-high.
> > 
> 
> I guess the description is inaccurate: addhn explicitly extracts the high
> bits of the result, so the high bits end up in the low part.
> 
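For reference, a scalar sketch of the computation being matched (helper names are mine; 16-bit inputs, so W = 32, N = 8, and the shift is precision(a)/2 = 8):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar form of the pattern, with the widening casts written out:
     _1 = (W)a, _2 = (W)b, _3 = _1 + _2,
     _4 = _3 >> (precision(a)/2), _5 = (N)_4.  */
static uint8_t
addhn_explicit (uint16_t a, uint16_t b)
{
  uint32_t wa = a, wb = b;   /* _1 = (W)a, _2 = (W)b */
  uint32_t sum = wa + wb;    /* _3 = _1 + _2 */
  uint32_t hi = sum >> 8;    /* _4 = _3 >> (precision(a)/2) */
  return (uint8_t) hi;       /* _5 = (N)_4 */
}

/* With unsigned operands, integer promotion already widens the add to
   int, so the source-level form can omit the casts.  Note the carry out
   of bit 15 lands in bit 8 after the shift and is dropped by the
   narrowing conversion: only the high half of the 16-bit sum survives.  */
static uint8_t
addhn_promoted (uint16_t a, uint16_t b)
{
  return (uint8_t) ((a + b) >> 8);
}
```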
> > Also ...
> > 
> > > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
> > > -m32, -m64 and no issues.
> > >
> > > Ok for master? Tests in the next patch which adds the optabs to AArch64.
> > >
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > >         * internal-fn.def (VEC_ADD_HALFING_NARROW,
> > >         IFN_VEC_ADD_HALFING_NARROW_LO, IFN_VEC_ADD_HALFING_NARROW_HI,
> > >         IFN_VEC_ADD_HALFING_NARROW_EVEN,
> > >         IFN_VEC_ADD_HALFING_NARROW_ODD): New.
> > >         * internal-fn.cc (commutative_binary_fn_p): Add
> > >         IFN_VEC_ADD_HALFING_NARROW, IFN_VEC_ADD_HALFING_NARROW_LO
> > >         and IFN_VEC_ADD_HALFING_NARROW_EVEN.
> > >         (commutative_ternary_fn_p): Add IFN_VEC_ADD_HALFING_NARROW_HI,
> > >         IFN_VEC_ADD_HALFING_NARROW_ODD.
> > >         * match.pd (add_half_narrowing_p): New.
> > >         * optabs.def (vec_saddh_narrow_optab, vec_saddh_narrow_hi_optab,
> > >         vec_saddh_narrow_lo_optab, vec_saddh_narrow_odd_optab,
> > >         vec_saddh_narrow_even_optab, vec_uaddh_narrow_optab,
> > >         vec_uaddh_narrow_hi_optab, vec_uaddh_narrow_lo_optab,
> > >         vec_uaddh_narrow_odd_optab, vec_uaddh_narrow_even_optab): New.
> > >         * tree-vect-patterns.cc (gimple_add_half_narrowing_p): New.
> > >         (vect_recog_add_halfing_narrow_pattern): New.
> > >         (vect_vect_recog_func_ptrs): Use it.
> > >         * doc/generic.texi: Document them.
> > >         * doc/md.texi: Likewise.
> > >
> > > ---
> > > diff --git a/gcc/doc/generic.texi b/gcc/doc/generic.texi
> > > index
> > d4ac580a7a8b9cd339d26cb97f7eb963f83746a4..b32d99d4d1aad244a493d8f
> > 67b66151ff5363d0e 100644
> > > --- a/gcc/doc/generic.texi
> > > +++ b/gcc/doc/generic.texi
> > > @@ -1834,6 +1834,11 @@ a value from @code{enum annot_expr_kind}, the
> > third is an @code{INTEGER_CST}.
> > >  @tindex IFN_VEC_WIDEN_MINUS_LO
> > >  @tindex IFN_VEC_WIDEN_MINUS_EVEN
> > >  @tindex IFN_VEC_WIDEN_MINUS_ODD
> > > +@tindex IFN_VEC_ADD_HALFING_NARROW
> > > +@tindex IFN_VEC_ADD_HALFING_NARROW_HI
> > > +@tindex IFN_VEC_ADD_HALFING_NARROW_LO
> > > +@tindex IFN_VEC_ADD_HALFING_NARROW_EVEN
> > > +@tindex IFN_VEC_ADD_HALFING_NARROW_ODD
> > >  @tindex VEC_UNPACK_HI_EXPR
> > >  @tindex VEC_UNPACK_LO_EXPR
> > >  @tindex VEC_UNPACK_FLOAT_HI_EXPR
> > > @@ -1956,6 +1961,51 @@ vector of @code{N/2} subtractions.  In the case of
> > >  vector are subtracted from the odd @code{N/2} of the first to produce the
> > >  vector of @code{N/2} subtractions.
> > >
> > > +@item IFN_VEC_ADD_HALFING_NARROW
> > > +This internal function represents a widening vector addition of two input
> > > +vectors, extracting the top half of the result and narrowing that value
> > > +to a type half the width of the original input.
> > > +Concretely it computes @code{(|bits(a)/2|)((a w+ b) >> |bits(a)/2|)}.
> > > +Its operands are vectors that contain the same number of elements
> > > +(@code{N}) of the same integral type.  The result is a vector that
> > > +contains the same number of elements (@code{N}), of an integral type
> > > +half the width of the input vectors' element type.  If the current
> > > +target does not implement the corresponding optabs the vectorizer may
> > > +choose to split it into either a pair of
> > > +@code{IFN_VEC_ADD_HALFING_NARROW_HI} and
> > > +@code{IFN_VEC_ADD_HALFING_NARROW_LO}
> > > +or @code{IFN_VEC_ADD_HALFING_NARROW_EVEN} and
> > > +@code{IFN_VEC_ADD_HALFING_NARROW_ODD}, depending on which optabs
> > > +the target implements.
> > > +
> > > +@item IFN_VEC_ADD_HALFING_NARROW_HI
> > > +@itemx IFN_VEC_ADD_HALFING_NARROW_LO
> > > +These internal functions represent a widening vector addition of two
> > > +input vectors, extracting the top half of the result, narrowing that
> > > +value to a type half the width of the original input, and inserting the
> > > +result as the high or low half of the result vector.
> > > +Concretely they compute @code{(|bits(a)/2|)((a w+ b) >> |bits(a)/2|)}.
> > > +Their operands are vectors that contain the same number of elements
> > > +(@code{N}) of the same integral type.  The result is a vector that
> > > +contains half as many elements, of an integral type half the width of
> > > +the input element type.  In the case of
> > > +@code{IFN_VEC_ADD_HALFING_NARROW_HI} the high @code{N/2} elements of the
> > > +result are inserted into the given result vector with the low elements
> > > +left untouched; the operation is a read-modify-write (RMW).  In the case
> > > +of @code{IFN_VEC_ADD_HALFING_NARROW_LO} the low @code{N/2} elements of
> > > +the result are used as the full result.
> > > +
> > > +@item IFN_VEC_ADD_HALFING_NARROW_EVEN
> > > +@itemx IFN_VEC_ADD_HALFING_NARROW_ODD
> > > +These internal functions represent a widening vector addition of two
> > > +input vectors, extracting the top half of the result, narrowing that
> > > +value to a type half the width of the original input, and inserting the
> > > +result into the even or odd elements of the result vector.
> > > +Concretely they compute @code{(|bits(a)/2|)((a w+ b) >> |bits(a)/2|)}.
> > > +Their operands are vectors that contain the same number of elements
> > > +(@code{N}) of the same integral type.  The result is a vector that
> > > +contains half as many elements, of an integral type half the width of
> > > +the input element type.  In the case of
> > > +@code{IFN_VEC_ADD_HALFING_NARROW_ODD} the odd @code{N/2} elements of the
> > > +result are inserted into the given result vector with the even elements
> > > +left untouched; the operation is a read-modify-write (RMW).  In the case
> > > +of @code{IFN_VEC_ADD_HALFING_NARROW_EVEN} the even @code{N/2} elements
> > > +of the result are used as the full result.
> > > +
> > >  @item VEC_UNPACK_HI_EXPR
> > >  @itemx VEC_UNPACK_LO_EXPR
> > >  These nodes represent unpacking of the high and low parts of the input 
> > > vector,
> > > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> > > index
> > aba93f606eca59d31c103a05b2567fd4f3be55f3..cb691b56f137a0037f5178ba8
> > 53911df5a65e5a7 100644
> > > --- a/gcc/doc/md.texi
> > > +++ b/gcc/doc/md.texi
> > > @@ -6087,6 +6087,21 @@ vectors with N signed/unsigned elements of size
> > S@.  Find the absolute
> > >  difference between operands 1 and 2 and widen the resulting elements.
> > >  Put the N/2 results of size 2*S in the output vector (operand 0).
> > >
> > > +@cindex @code{vec_saddh_narrow_hi_@var{m}} instruction pattern
> > > +@cindex @code{vec_saddh_narrow_lo_@var{m}} instruction pattern
> > > +@cindex @code{vec_uaddh_narrow_hi_@var{m}} instruction pattern
> > > +@cindex @code{vec_uaddh_narrow_lo_@var{m}} instruction pattern
> > > +@cindex @code{vec_saddh_narrow_even_@var{m}} instruction pattern
> > > +@cindex @code{vec_saddh_narrow_odd_@var{m}} instruction pattern
> > > +@cindex @code{vec_uaddh_narrow_even_@var{m}} instruction pattern
> > > +@cindex @code{vec_uaddh_narrow_odd_@var{m}} instruction pattern
> > > +@item @samp{vec_uaddh_narrow_hi_@var{m}}, @samp{vec_uaddh_narrow_lo_@var{m}}
> > > +@itemx @samp{vec_saddh_narrow_hi_@var{m}}, @samp{vec_saddh_narrow_lo_@var{m}}
> > > +@itemx @samp{vec_uaddh_narrow_even_@var{m}}, @samp{vec_uaddh_narrow_odd_@var{m}}
> > > +@itemx @samp{vec_saddh_narrow_even_@var{m}}, @samp{vec_saddh_narrow_odd_@var{m}}
> > > +Signed/Unsigned widening add long, extract high half and narrow.
> > > +Operands 1 and 2 are vectors with N signed/unsigned elements of size
> > > +S@.  Add the high/low elements of 1 and 2 together in a widened
> > > +precision, extract the top half, narrow the result to half the size of
> > > +S@ and store the results in the output vector (operand 0).  Concretely
> > > +it computes @code{(|bits(a)/2|)((a w+ b) >> |bits(a)/2|)}.
> > > +
> > >  @cindex @code{vec_addsub@var{m}3} instruction pattern
> > >  @item @samp{vec_addsub@var{m}3}
> > >  Alternating subtract, add with even lanes doing subtract and odd
> > > diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> > > index
> > 83438dd2ff57474cec999adaeabe92c0540e2a51..e600dbc4b3a0b27f78be00d5
> > 2f7f6a54a13d7241 100644
> > > --- a/gcc/internal-fn.cc
> > > +++ b/gcc/internal-fn.cc
> > > @@ -4442,6 +4442,9 @@ commutative_binary_fn_p (internal_fn fn)
> > >      case IFN_VEC_WIDEN_PLUS_HI:
> > >      case IFN_VEC_WIDEN_PLUS_EVEN:
> > >      case IFN_VEC_WIDEN_PLUS_ODD:
> > > +    case IFN_VEC_ADD_HALFING_NARROW:
> > > +    case IFN_VEC_ADD_HALFING_NARROW_LO:
> > > +    case IFN_VEC_ADD_HALFING_NARROW_EVEN:
> > >        return true;
> > >
> > >      default:
> > > @@ -4462,6 +4465,8 @@ commutative_ternary_fn_p (internal_fn fn)
> > >      case IFN_FNMA:
> > >      case IFN_FNMS:
> > >      case IFN_UADDC:
> > > +    case IFN_VEC_ADD_HALFING_NARROW_HI:
> > > +    case IFN_VEC_ADD_HALFING_NARROW_ODD:
> > 
> > Huh, how can this be correct?  Are they not binary?
> 
> Correct, they're ternary.
> 
> > 
> > >        return true;
> > >
> > >      default:
> > > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> > > index
> > 69677dd10b980c83dec36487b1214ff066f4789b..152895f043b3ca60294b79c
> > 8301c6ff4014b955d 100644
> > > --- a/gcc/internal-fn.def
> > > +++ b/gcc/internal-fn.def
> > > @@ -463,6 +463,12 @@ DEF_INTERNAL_WIDENING_OPTAB_FN
> > (VEC_WIDEN_ABD,
> > >                                 first,
> > >                                 vec_widen_sabd, vec_widen_uabd,
> > >                                 binary)
> > > +DEF_INTERNAL_NARROWING_OPTAB_FN (VEC_ADD_HALFING_NARROW,
> > > +                               ECF_CONST | ECF_NOTHROW,
> > > +                               first,
> > > +                               vec_saddh_narrow, vec_uaddh_narrow,
> > > +                               binary, ternary)
> > 
> > OK, I guess should have started to look at 1/n.  Doing that now in parallel.
> > 
> > > +
> > >  DEF_INTERNAL_OPTAB_FN (VEC_FMADDSUB, ECF_CONST, vec_fmaddsub,
> > ternary)
> > >  DEF_INTERNAL_OPTAB_FN (VEC_FMSUBADD, ECF_CONST, vec_fmsubadd,
> > ternary)
> > >
> > > diff --git a/gcc/match.pd b/gcc/match.pd
> > > index
> > 66e8a78744931c0137b83c5633c3a273fb69f003..d9d9046a8dcb7e5ca7cdf7c8
> > 3e1945289950dc51 100644
> > > --- a/gcc/match.pd
> > > +++ b/gcc/match.pd
> > > @@ -3181,6 +3181,18 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> > >         || POINTER_TYPE_P (itype))
> > >        && wi::eq_p (wi::to_wide (int_cst), wi::max_value (itype))))))
> > >
> > > +/* Detect (n)(((w)x + (w)y) >> bitsize(y)) where w is twice the bitsize
> > > +   of x and y, and n is half the bitsize of x and y.  */
> > > +(match (add_half_narrowing_p @0 @1)
> > > + (convert1? (rshift (plus:c (convert@3 @0) (convert @1)) INTEGER_CST@2))
> > 
> > why's the outer convert optional?  The checks on n and w would make
> > a conversion required I think.  Just use (convert (rshift (... here.
> 
> Because match.pd wouldn't let me do it without the optional conversion.
> The test on the bitsize essentially mandates it's there anyway.

I think using (convert (rshift (plus:c (convert@3 @0) (convert @1)) 
INTEGER_CST@2)) will just work.  Just using convert1 does not.

> > 
> > > + (with { unsigned n = TYPE_PRECISION (type);
> > > +        unsigned w = TYPE_PRECISION (TREE_TYPE (@3));
> > > +        unsigned x = TYPE_PRECISION (TREE_TYPE (@0)); }
> > > +  (if (INTEGRAL_TYPE_P (type)
> > > +       && n == x / 2
> > 
> > Now, because of weird types it would be safer to check n * 2 == x,
> > just in case of odd x ...
> > 
> > Alternatively/additionally check && type_has_mode_precision_p (type)
> > 
> > > +       && w == x * 2
> > > +       && wi::eq_p (wi::to_wide (@2), x / 2)))))
> > > +
> > >  /* Saturation add for unsigned integer.  */
> > >  (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type))
> > >   (match (usadd_overflow_mask @0 @1)
> > > diff --git a/gcc/optabs.def b/gcc/optabs.def
> > > index
> > 87a8b85da1592646d0a3447572e842ceb158cd97..e226d85ddba7e43dd801fae
> > c61cac0372286314a 100644
> > > --- a/gcc/optabs.def
> > > +++ b/gcc/optabs.def
> > > @@ -492,6 +492,16 @@ OPTAB_D (vec_widen_uabd_hi_optab,
> > "vec_widen_uabd_hi_$a")
> > >  OPTAB_D (vec_widen_uabd_lo_optab, "vec_widen_uabd_lo_$a")
> > >  OPTAB_D (vec_widen_uabd_odd_optab, "vec_widen_uabd_odd_$a")
> > >  OPTAB_D (vec_widen_uabd_even_optab, "vec_widen_uabd_even_$a")
> > > +OPTAB_D (vec_saddh_narrow_optab, "vec_saddh_narrow$a")
> > > +OPTAB_D (vec_saddh_narrow_hi_optab, "vec_saddh_narrow_hi_$a")
> > > +OPTAB_D (vec_saddh_narrow_lo_optab, "vec_saddh_narrow_lo_$a")
> > > +OPTAB_D (vec_saddh_narrow_odd_optab, "vec_saddh_narrow_odd_$a")
> > > +OPTAB_D (vec_saddh_narrow_even_optab, "vec_saddh_narrow_even_$a")
> > > +OPTAB_D (vec_uaddh_narrow_optab, "vec_uaddh_narrow$a")
> > > +OPTAB_D (vec_uaddh_narrow_hi_optab, "vec_uaddh_narrow_hi_$a")
> > > +OPTAB_D (vec_uaddh_narrow_lo_optab, "vec_uaddh_narrow_lo_$a")
> > > +OPTAB_D (vec_uaddh_narrow_odd_optab, "vec_uaddh_narrow_odd_$a")
> > > +OPTAB_D (vec_uaddh_narrow_even_optab, "vec_uaddh_narrow_even_$a")
> > >  OPTAB_D (vec_addsub_optab, "vec_addsub$a3")
> > >  OPTAB_D (vec_fmaddsub_optab, "vec_fmaddsub$a4")
> > >  OPTAB_D (vec_fmsubadd_optab, "vec_fmsubadd$a4")
> > > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> > > index
> > ffb320fbf2330522f25a9f4380f4744079a42306..b590c36fad23e44ec3fb954a4d
> > 2bb856ce3fc139 100644
> > > --- a/gcc/tree-vect-patterns.cc
> > > +++ b/gcc/tree-vect-patterns.cc
> > > @@ -4768,6 +4768,64 @@ vect_recog_sat_trunc_pattern (vec_info *vinfo,
> > stmt_vec_info stmt_vinfo,
> > >    return NULL;
> > >  }
> > >
> > > +extern bool gimple_add_half_narrowing_p (tree, tree*, tree (*)(tree));
> > > +
> > > +/*
> > > + * Try to detect add halfing and narrowing pattern.
> > > + *
> > > + * _1 = (W)a
> > > + * _2 = (W)b
> > > + * _3 = _1 + _2
> > > + * _4 = _3 >> (precision(a) / 2)
> > > + * _5 = (N)_4
> > > + *
> > > + * where
> > > + *   W = precision (a) * 2
> > > + *   N = precision (a) / 2
> > > + */
> > > +
> > > +static gimple *
> > > +vect_recog_add_halfing_narrow_pattern (vec_info *vinfo,
> > > +                                      stmt_vec_info stmt_vinfo,
> > > +                                      tree *type_out)
> > > +{
> > > +  gimple *last_stmt = STMT_VINFO_STMT (stmt_vinfo);
> > > +
> > > +  if (!is_gimple_assign (last_stmt))
> > > +    return NULL;
> > > +
> > > +  tree ops[2];
> > > +  tree lhs = gimple_assign_lhs (last_stmt);
> > > +
> > > +  if (gimple_add_half_narrowing_p (lhs, ops, NULL))
> > > +    {
> > > +      tree itype = TREE_TYPE (ops[0]);
> > > +      tree otype = TREE_TYPE (lhs);
> > > +      tree v_itype = get_vectype_for_scalar_type (vinfo, itype);
> > > +      tree v_otype = get_vectype_for_scalar_type (vinfo, otype);
> > > +      internal_fn ifn = IFN_VEC_ADD_HALFING_NARROW;
> > > +
> > > +      if (v_itype != NULL_TREE && v_otype != NULL_TREE
> > > +         && direct_internal_fn_supported_p (ifn, v_itype, 
> > > OPTIMIZE_FOR_BOTH))
> > 
> > why have the HI/LO and EVEN/ODD variants when you check for
> > IFN_VEC_ADD_HALFING_NARROW
> > only?
> > 
> 
> Because without HI/LO we would have to pass quite a few arguments to the
> actual instruction.  VEC_ADD_HALFING_NARROW does arithmetic as well, so
> the inputs are spread out over the operands.  VEC_ADD_HALFING_NARROW
> would require 4 inputs, where the first two and last two are used
> together.  This would be completely unclear from the use of the
> instruction itself.  I could, but then it also means that if you have a
> narrowing instruction which needs 3 inputs, the IFN needs 6.  It did not
> seem logical to do so.

I am asking why you require support for a single out of the 5 IFNs during
pattern recog when, for example, the target might only support _hi/_lo.

Yes, the pattern has to use the "scalar" VEC_ADD_HALFING_NARROW
(not in the packing way you implemented, but in the {V4SI,V4SI}->V4HI
way that's also "compatible" with scalar types).  vectorizable_* will
then select the appropriate supported variant, also based on vector
types.  Usually patterns call vect_supportable_narrowing_operation
(in case we have that, we do for widening), which then checks the 
variants.

> The alternative would have been to use just two inputs and use
> VEC_PERM_EXPR to combine them.  This would work for HI/LO, but would
> then require backends to recognize the permute back into hi/lo
> operations, taking into account endianness.  Possible, but it seemed a
> roundabout way of doing it.
> 
> Secondly, it doesn't work for even/odd.  VEC_PERM would fill in only a
> strided value of the vector at a time.  This becomes difficult for VLA,
> and then you have to do tricks like discounting the costing of the
> permute if it follows an instruction you have an even/odd variant of.
> 
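To make the two pairings concrete, a scalar sketch of the lane layouts being discussed (array-based stand-ins; names and little-endian lane numbering are my assumptions):

```c
#include <assert.h>
#include <stdint.h>

enum { N = 4 };   /* lanes per input vector */

/* One-lane addhn: (uint8_t)(((uint32_t)a + b) >> 8).  */
static uint8_t
addhn1 (uint16_t a, uint16_t b)
{
  return (uint8_t) (((uint32_t) a + (uint32_t) b) >> 8);
}

/* _LO/_HI pairing: the first input pair fills lanes 0..N-1 and the
   second pair read-modify-writes lanes N..2N-1 of the same vector.  */
static void
addhn_lo_hi (const uint16_t a[N], const uint16_t b[N],
             const uint16_t c[N], const uint16_t d[N], uint8_t out[2 * N])
{
  for (int i = 0; i < N; i++)
    {
      out[i] = addhn1 (a[i], b[i]);       /* _LO half */
      out[N + i] = addhn1 (c[i], d[i]);   /* _HI half, RMW insert */
    }
}

/* _EVEN/_ODD pairing: results from the two input pairs interleave into
   the even and odd lanes instead, a strided layout no single
   VEC_PERM_EXPR of the two halves reproduces cheaply.  */
static void
addhn_even_odd (const uint16_t a[N], const uint16_t b[N],
                const uint16_t c[N], const uint16_t d[N], uint8_t out[2 * N])
{
  for (int i = 0; i < N; i++)
    {
      out[2 * i] = addhn1 (a[i], b[i]);        /* _EVEN lanes */
      out[2 * i + 1] = addhn1 (c[i], d[i]);    /* _ODD lanes, RMW insert */
    }
}
```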
> Concretely using VEC_ADD_HALFING_NARROW creates more issues than it solves, 
> but if
> you want that variant I will respin.
> 
> Tamar
> 
> > > +       {
> > > +         gcall *call = gimple_build_call_internal (ifn, 2, ops[0], 
> > > ops[1]);
> > > +         tree in_ssa = vect_recog_temp_ssa_var (otype, NULL);
> > > +
> > > +         gimple_call_set_lhs (call, in_ssa);
> > > +         gimple_call_set_nothrow (call, /* nothrow_p */ false);
> > > +         gimple_set_location (call,
> > > +                              gimple_location (STMT_VINFO_STMT 
> > > (stmt_vinfo)));
> > > +
> > > +         *type_out = v_otype;
> > > +         vect_pattern_detected ("vect_recog_add_halfing_narrow_pattern",
> > > +                                last_stmt);
> > > +         return call;
> > > +       }
> > > +    }
> > > +
> > > +  return NULL;
> > > +}
> > > +
> > >  /* Detect a signed division by a constant that wouldn't be
> > >     otherwise vectorized:
> > >
> > > @@ -6896,6 +6954,7 @@ static vect_recog_func vect_vect_recog_func_ptrs[] =
> > {
> > >    { vect_recog_bitfield_ref_pattern, "bitfield_ref" },
> > >    { vect_recog_bit_insert_pattern, "bit_insert" },
> > >    { vect_recog_abd_pattern, "abd" },
> > > +  { vect_recog_add_halfing_narrow_pattern, "addhn" },
> > >    { vect_recog_over_widening_pattern, "over_widening" },
> > >    /* Must come after over_widening, which narrows the shift as much as
> > >       possible beforehand.  */
> > >
> > >
> > > --
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
