On Fri, 9 May 2025, Tamar Christina wrote:

> 
> 
> > -----Original Message-----
> > From: Richard Biener <rguent...@suse.de>
> > Sent: Friday, May 9, 2025 8:31 AM
> > To: Pengfei Li <pengfei....@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; Richard Sandiford <richard.sandif...@arm.com>
> > Subject: Re: [PATCH v2] match.pd: Fold (x + y) >> 1 into IFN_AVG_FLOOR (x, y) for vectors
> > 
> > On Thu, 8 May 2025, Pengfei Li wrote:
> > 
> > > This patch folds vector expressions of the form (x + y) >> 1 into
> > > IFN_AVG_FLOOR (x, y), reducing instruction count on platforms that
> > > support averaging operations. For example, it can help improve the
> > > codegen on AArch64 from:
> > >   add     v0.4s, v0.4s, v31.4s
> > >   ushr    v0.4s, v0.4s, 1
> > > to:
> > >   uhadd   v0.4s, v0.4s, v31.4s
> > >
> > > As this folding is only valid when the most significant bit of each
> > > element in both x and y is known to be zero, this patch checks leading
> > > zero bits of elements in x and y, and extends get_nonzero_bits_1() to
> > > handle uniform vectors. When the input is a uniform vector, the function
> > > now returns the nonzero bits of its element.
> > >
> > > Additionally, this patch adds more checks to reject vector types in bit
> > > constant propagation (tree-bit-ccp), since tree-bit-ccp was designed for
> > > scalar values only, and the new vector logic in get_nonzero_bits_1()
> > > could lead to incorrect propagation results.
> > >
> > > Bootstrapped and tested on aarch64-linux-gnu and x86_64-linux-gnu.
> > >
> > > gcc/ChangeLog:
> > >
> > >   * match.pd: Add folding rule for vector average.
> > >   * tree-ssa-ccp.cc (get_default_value): Reject vector types.
> > >   (evaluate_stmt): Reject vector types.
> > >   * tree-ssanames.cc (get_nonzero_bits_1): Extend to handle
> > >   uniform vectors.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >   * gcc.target/aarch64/acle/uhadd_1.c: New test.
> > > ---
> > >  gcc/match.pd                                  |  9 +++++
> > >  .../gcc.target/aarch64/acle/uhadd_1.c         | 34 +++++++++++++++++++
> > >  gcc/tree-ssa-ccp.cc                           |  8 ++---
> > >  gcc/tree-ssanames.cc                          |  8 +++++
> > >  4 files changed, 55 insertions(+), 4 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/acle/uhadd_1.c
> > >
> > > diff --git a/gcc/match.pd b/gcc/match.pd
> > > index ab496d923cc..ddd16a10944 100644
> > > --- a/gcc/match.pd
> > > +++ b/gcc/match.pd
> > > @@ -2177,6 +2177,15 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> > >      (view_convert (rshift (view_convert:ntype @0) @1))
> > >      (convert (rshift (convert:ntype @0) @1))))))
> > >
> > > + /* Fold (x + y) >> 1 into IFN_AVG_FLOOR (x, y) if x and y are vectors in
> > > +    which each element is known to have at least one leading zero bit.  */
> > > +(simplify
> > > + (rshift (plus:cs @0 @1) integer_onep)
> > > + (if (VECTOR_TYPE_P (type)
> > > +      && wi::clz (get_nonzero_bits (@0)) > 0
> > > +      && wi::clz (get_nonzero_bits (@1)) > 0)
> > > +  (IFN_AVG_FLOOR @0 @1)))
> > 
> > You need to check that IFN_AVG_FLOOR is supported using
> > direct_internal_fn_supported_p here.
> > 
> 
> Is this actually needed? The match.pd machinery already rejects it
> if not supported.
> 
> For gimple you end up in maybe_push_res_to_seq in gimple-match-exports.cc
> which calls build_call_internal, which would refuse to build the call,
> returning NULL as a result and stopping the simplification.

Ah, yeah - I forgot about this.
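
For reference, the bail-out Tamar describes looks roughly like this (a
from-memory sketch of build_call_internal in gimple-match-exports.cc, not
the verbatim source):

  if (direct_internal_fn_p (fn))
    {
      tree_pair types = direct_internal_fn_types (fn, res_op->type,
                                                  res_op->ops);
      if (!direct_internal_fn_supported_p (fn, types, OPTIMIZE_FOR_BOTH))
        /* Refuse to build the call; maybe_push_res_to_seq then returns
           NULL and the simplification is dropped.  */
        return NULL;
    }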

> For generic you end up in maybe_build_call_expr_loc in tree.cc which also
> fails with NULL_TREE if the IFN isn't supported.
> 
> I think the other usages of direct_internal_fn_supported_p are there because
> they predate these additions.  Or am I missing something?

Some are there because we decide between simplification variants, I think.

Also consider you'd have the above pattern and a following

(simplify
 (rshift (plus:cs @0 @1) integer_onep)
 (if (VECTOR_TYPE_P (type)
     && wi::clz (get_nonzero_bits (@0)) > 0
     && wi::clz (get_nonzero_bits (@1)) > 0)
  (SOMETHING_ELSE @0 @1)))

then the first would match but ultimately be rejected, and the 2nd
also-matching pattern would not be tried.  Unlikely in the case in
question, but in general I think this could happen when the
maybe_push_res_to_seq that rejects the simplification happens in
the caller of the simplification (the outermost expression
is open-coded in res_ops).
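
In such a case, guarding the pattern itself would keep a later pattern
reachable - something along these lines (just a sketch, not something I'm
insisting on for this patch):

(simplify
 (rshift (plus:cs @0 @1) integer_onep)
 (if (VECTOR_TYPE_P (type)
      && direct_internal_fn_supported_p (IFN_AVG_FLOOR, type,
                                         OPTIMIZE_FOR_BOTH)
      && wi::clz (get_nonzero_bits (@0)) > 0
      && wi::clz (get_nonzero_bits (@1)) > 0)
  (IFN_AVG_FLOOR @0 @1)))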

Richard.

> Regards,
> Tamar
> 
> > Otherwise this is OK with me.
> > 
> > Richard.
> > 
> > > +
> > >  /* Try to fold (type) X op CST -> (type) (X op ((type-x) CST))
> > >     when profitable.
> > >     For bitwise binary operations apply operand conversions to the
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/acle/uhadd_1.c b/gcc/testsuite/gcc.target/aarch64/acle/uhadd_1.c
> > > new file mode 100644
> > > index 00000000000..f1748a199ad
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/acle/uhadd_1.c
> > > @@ -0,0 +1,34 @@
> > > +/* Test if SIMD fused unsigned halving adds are generated */
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2" } */
> > > +
> > > +#include <arm_neon.h>
> > > +
> > > +#define FUSED_SIMD_UHADD(vectype, q, ts, mask) \
> > > +  vectype simd_uhadd ## q ## _ ## ts ## _1 (vectype a) \
> > > +  { \
> > > +    vectype v1 = vand ## q ## _ ## ts (a, vdup ## q ## _n_ ## ts (mask)); \
> > > +    vectype v2 = vdup ## q ## _n_ ## ts (mask); \
> > > +    return vshr ## q ## _n_ ## ts (vadd ## q ## _ ## ts (v1, v2), 1); \
> > > +  } \
> > > +  \
> > > +  vectype simd_uhadd ## q ## _ ## ts ## _2 (vectype a, vectype b) \
> > > +  { \
> > > +    vectype v1 = vand ## q ## _ ## ts (a, vdup ## q ## _n_ ## ts (mask)); \
> > > +    vectype v2 = vand ## q ## _ ## ts (b, vdup ## q ## _n_ ## ts (mask)); \
> > > +    return vshr ## q ## _n_ ## ts (vadd ## q ## _ ## ts (v1, v2), 1); \
> > > +  }
> > > +
> > > +FUSED_SIMD_UHADD (uint8x8_t, , u8, 0x7f)
> > > +FUSED_SIMD_UHADD (uint8x16_t, q, u8, 0x7f)
> > > +FUSED_SIMD_UHADD (uint16x4_t, , u16, 0x7fff)
> > > +FUSED_SIMD_UHADD (uint16x8_t, q, u16, 0x7fff)
> > > +FUSED_SIMD_UHADD (uint32x2_t, , u32, 0x7fffffff)
> > > +FUSED_SIMD_UHADD (uint32x4_t, q, u32, 0x7fffffff)
> > > +
> > > +/* { dg-final { scan-assembler-times {\tuhadd\tv[0-9]+\.8b,} 2 } } */
> > > +/* { dg-final { scan-assembler-times {\tuhadd\tv[0-9]+\.16b,} 2 } } */
> > > +/* { dg-final { scan-assembler-times {\tuhadd\tv[0-9]+\.4h,} 2 } } */
> > > +/* { dg-final { scan-assembler-times {\tuhadd\tv[0-9]+\.8h,} 2 } } */
> > > +/* { dg-final { scan-assembler-times {\tuhadd\tv[0-9]+\.2s,} 2 } } */
> > > +/* { dg-final { scan-assembler-times {\tuhadd\tv[0-9]+\.4s,} 2 } } */
> > > diff --git a/gcc/tree-ssa-ccp.cc b/gcc/tree-ssa-ccp.cc
> > > index 8d2cbb384c4..3e0c75cf2be 100644
> > > --- a/gcc/tree-ssa-ccp.cc
> > > +++ b/gcc/tree-ssa-ccp.cc
> > > @@ -298,7 +298,7 @@ get_default_value (tree var)
> > >   {
> > >     val.lattice_val = VARYING;
> > >     val.mask = -1;
> > > -   if (flag_tree_bit_ccp)
> > > +   if (flag_tree_bit_ccp && !VECTOR_TYPE_P (TREE_TYPE (var)))
> > >       {
> > >         wide_int nonzero_bits = get_nonzero_bits (var);
> > >         tree value;
> > > @@ -2491,11 +2491,11 @@ evaluate_stmt (gimple *stmt)
> > >        is_constant = (val.lattice_val == CONSTANT);
> > >      }
> > >
> > > +  tree lhs = gimple_get_lhs (stmt);
> > >    if (flag_tree_bit_ccp
> > > +      && lhs && TREE_CODE (lhs) == SSA_NAME && !VECTOR_TYPE_P (TREE_TYPE (lhs))
> > >        && ((is_constant && TREE_CODE (val.value) == INTEGER_CST)
> > > -   || !is_constant)
> > > -      && gimple_get_lhs (stmt)
> > > -      && TREE_CODE (gimple_get_lhs (stmt)) == SSA_NAME)
> > > +   || !is_constant))
> > >      {
> > >        tree lhs = gimple_get_lhs (stmt);
> > >        wide_int nonzero_bits = get_nonzero_bits (lhs);
> > > diff --git a/gcc/tree-ssanames.cc b/gcc/tree-ssanames.cc
> > > index de7b9b79f94..99613411624 100644
> > > --- a/gcc/tree-ssanames.cc
> > > +++ b/gcc/tree-ssanames.cc
> > > @@ -508,6 +508,14 @@ get_nonzero_bits_1 (const_tree name)
> > >    /* Use element_precision instead of TYPE_PRECISION so complex and
> > >       vector types get a non-zero precision.  */
> > >    unsigned int precision = element_precision (TREE_TYPE (name));
> > > +
> > > +  if (VECTOR_TYPE_P (TREE_TYPE (name)))
> > > +    {
> > > +      tree elem = uniform_vector_p (name);
> > > +      if (elem)
> > > + return get_nonzero_bits_1 (elem);
> > > +    }
> > > +
> > >    if (TREE_CODE (name) != SSA_NAME)
> > >      return wi::shwi (-1, precision);
> > >
> > >
> > 
> > --
> > Richard Biener <rguent...@suse.de>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
