On Tue, 2 Sep 2025, Tamar Christina wrote:
> > -----Original Message-----
> > From: Richard Biener <[email protected]>
> > Sent: Tuesday, September 2, 2025 3:45 PM
> > To: Tamar Christina <[email protected]>
> > Cc: [email protected]; nd <[email protected]>
> > Subject: RE: [PATCH v2 3/3]middle-end: Use addhn for compression instead of
> > inclusive OR when reducing comparison values
> >
> > On Tue, 2 Sep 2025, Tamar Christina wrote:
> >
> > > > -----Original Message-----
> > > > From: Richard Biener <[email protected]>
> > > > Sent: Tuesday, September 2, 2025 3:08 PM
> > > > To: Tamar Christina <[email protected]>
> > > > Cc: [email protected]; nd <[email protected]>
> > > > Subject: RE: [PATCH v2 3/3]middle-end: Use addhn for compression
> > > > instead of
> > > > inclusive OR when reducing comparison values
> > > >
> > > > On Tue, 2 Sep 2025, Tamar Christina wrote:
> > > >
> > > > > > -----Original Message-----
> > > > > > From: Richard Biener <[email protected]>
> > > > > > Sent: Tuesday, September 2, 2025 1:30 PM
> > > > > > To: Tamar Christina <[email protected]>
> > > > > > Cc: [email protected]; nd <[email protected]>
> > > > > > Subject: Re: [PATCH v2 3/3]middle-end: Use addhn for compression instead of
> > > > > > inclusive OR when reducing comparison values
> > > > > >
> > > > > > On Tue, 2 Sep 2025, Tamar Christina wrote:
> > > > > >
> > > > > > > Given a sequence such as
> > > > > > >
> > > > > > > int foo ()
> > > > > > > {
> > > > > > > #pragma GCC unroll 4
> > > > > > > for (int i = 0; i < N; i++)
> > > > > > > if (a[i] == 124)
> > > > > > > return 1;
> > > > > > >
> > > > > > > return 0;
> > > > > > > }
> > > > > > >
> > > > > > > where a[i] is long long, we will unroll the loop and use an OR reduction
> > > > > > > for early break on Adv. SIMD.  Afterwards the sequence is followed by a
> > > > > > > compression sequence to compress the 128-bit vectors into 64 bits for use
> > > > > > > by the branch.
> > > > > > >
> > > > > > > However, if we have support for add halving and narrowing, then instead
> > > > > > > of using an OR we can use an ADDHN, which does the combining and
> > > > > > > narrowing in one step.
> > > > > > >
> > > > > > > Note that for now I only do the last OR; however, if we have more than
> > > > > > > one level of unrolling we could technically chain them.  I will revisit
> > > > > > > this in another upcoming early break series; in any case an unroll of 2
> > > > > > > is fairly common.
> > > > > > >
> > > > > > > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > > > > > > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
> > > > > > > -m32, -m64 with no issues and about a 10% improvement
> > > > > > > in this sequence for Adv. SIMD.
> > > > > > >
> > > > > > > Ok for master?
> > > > > >
> > > > > > Hmm, so you are replacing the last bitwise OR with a
> > > > > > addhn which produces a "smaller" vector. So like
> > > > > >
> > > > > > V4SI tem = V4SI | V4SI;
> > > > > > if (tem != 0)
> > > > > >
> > > > > > ->
> > > > > >
> > > > > > V4HI tem = .VEC_ADD_HALVING_NARROW (V4SI, V4SI);
> > > > > > if (tem != 0)
> > > > > >
> > > > > > whatever 'halving' now stands for (isn't that .VEC_ADD_HIGH_NARROW?)
> > > > > >
> > > > >
> > > > > Yeah, but it retrieves the high half, so I'm open to suggestions, though I
> > > > > think any name would be confusing..
> > > > >
> > > > > > I can't see how that's in any way faster? (the aarch64 testcases
> > > > > > unfortunately stop matching after the addhn)
> > > > > >
> > > > >
> > > > > Which is intentional.
> > > > >
> > > > > The original code with the ORR is
> > > > >
> > > > > ldp q31, q30, [x0]
> > > > > cmeq v31.2d, v31.2d, v29.2d
> > > > > cmeq v30.2d, v30.2d, v29.2d
> > > > > orr v31.16b, v31.16b, v30.16b
> > > > > umaxp v31.4s, v31.4s, v31.4s
> > > > > fmov x3, d31
> > > > >
> > > > > because the result of the ORR is a 128-bit vector, it needs to be
> > > > > compressed into 64 bits to be transferred to a GPR so the != 0 can
> > > > > be performed.
> > > > >
> > > > > ADDHN does the combination and compression in one step. i.e.
> > > > >
> > > > > Orr + umaxp -> addhn.
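> > > > >
> > > > > With the patch the whole sequence (matching what the new tests below
> > > > > expect) becomes roughly:
> > > > >
> > > > > ldp     q31, q30, [x0]
> > > > > cmeq    v31.2d, v31.2d, v29.2d
> > > > > cmeq    v30.2d, v30.2d, v29.2d
> > > > > addhn   v31.2s, v31.2d, v30.2d
> > > > > fmov    x3, d31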
> > > >
> > > > Ah, I see. So AdvSIMD lacks a ptest, and instead you go to gpr.
> > > > The above code does a max reduction and the fmov moves the
> > > > scalar reduction result to a GPR? But with addhn you move
> > > > the whole (64bit) vector reg to a GPR?
> > > >
> > >
> > > Indeed.
> > >
> > > > It seems to me that on the vectorizer side it's not so interesting
> > > > to know the target can do addhn but that the target can't do
> > > > a {u,}cmpv4si with EQ? That is, without the patch the vectorizer
> > > > generates a GIMPLE_COND that isn't supported by the target?
> > > >
> > > > We check for cbranch, so currently you say you can do it but emulate
> > > > it with umaxp + fmov?
> > >
> > > Indeed.
> > >
> > > >
> > > > What do you do when there's only one V4SImode vector? You
> > > > could pack (truncate) that to V4HImode, right? Aka
> > > > vec_pack_trunc (x, x) -> V8HImode and the lower V4HI / 64bit is
> > > > then 'x'?
> > >
> > > For only one V4SImode vector we still use UMAXP, but with itself.
> > > It essentially throws away half the lanes of the result, but that's ok
> > > because for a cbranch all we care about is whether any lane is set or
> > > all are zero.  We use UMAXP because that works for vectors of bytes too,
> > > whereas narrowing wouldn't; see the sketch below.
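> > >
> > > As an illustration, here is a minimal C model of that single-vector case
> > > (illustration only, not part of the patch; the helper name is made up):
> > >
> > > #include <stdint.h>
> > >
> > > /* Scalar model of UMAXP (x, x) for a 4-lane vector: the low half of
> > >    the result holds the pairwise maxima, so it is nonzero iff any lane
> > >    of x is nonzero.  */
> > > static void umaxp_self (const uint32_t x[4], uint32_t lo[2])
> > > {
> > >   lo[0] = x[0] > x[1] ? x[0] : x[1];
> > >   lo[1] = x[2] > x[3] ? x[2] : x[3];
> > > }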
> >
> > Ah, I looked up umaxp and it's a concat + reduce adjacent lanes to
> > one with MAX.  We don't have an optab scheme for such an instruction
> > either ;)  I think the closest are the SAD_EXPR-likes, which
> > reduce the number of lanes because they are widening, but those
> > have one input only.  umaxp is something like a reduc_umax_evenodd;
> > one could imagine a reduc_umax_hilo that instead reduces { a0, a1, a2, a3 }
> > { b0, b1, b2, b3 } as { umax (a0, b0), umax (a1, b1), umax (a2, b2),
> > umax (a3, b3) }.  That said, I can see how addhn is
> > useful here.
> >
> > > However, when SVE is available (it just wasn't chosen here due to costing)
> > > the goal is to use the SVE comparison, predicated to 128 bits, instead.
> > >
> > > This is where my upcoming patch for vec_cbranch_any and
> > > vec_cbranch_all comes in. With only one comparison we can replace
> > > it with an SVE compare + branch, removing the need for the reduction.
> >
> > Hmm, it all feels like somewhat of a delicate target costing thing
> > to me.
> >
> > > >
> > > > > > Also the inputs are vector bools(?), so you should V_C_E them to
> > > > > > data vectors before "adding" them.  And check that they have
> > > > > > a vector mode that's not VnBImode, for which I guess the addhn
> > > > > > semantics wouldn't necessarily be good enough.
> > > > >
> > > > > ADDHN can't be used for SVE (and so the optab isn't implemented on SVE
> > > > > modes) because SVE's version is even/odd.  But for SVE we also don't want
> > > > > this codegen because SVE can branch on the result of the data compare, so
> > > > > we don't want the intermediate forced compression.  This is strictly for
> > > > > Adv. SIMD.
> > > > >
> > > > > >
> > > > > > How would you scale this to workset.length () > 2? I suppose
> > > > > > for an even number reduce to the half element size first, for
> > > > > > odd you could make it even by first reducing two vectors with IOR?
> > > > > > If small, either check for another narrowing addhn operation or
> > > > > > continue with IOR?
> > > > > >
> > > > >
> > > > > Because the instruction can't work on bytes, having > 2 just uses ORR
> > > > > until we have == 2, and then ADDHN.  You could use ADDHN for the
> > > > > intermediate steps, but ADDHN hits a limit when you reach bytes, since
> > > > > there is no element type narrower than a byte to narrow to.
> > > > >
> > > > > However one big benefit of using the ADDHN even in > 2 cases is that
> > > > > it prevents reassoc from breaking the ORR order we created in the
> > > > > vectorizer as it can't reassociate them back to a linear form as it
> > > > > does
> > > > > today.
> > > > >
> > > > > And the reason we can't match the ADDHN in the backend is that in
> > > > > order for us to know that the inputs are Boolean vectors we also have
> > > > > to match the compares. This means that the chain is longer than what
> > > > > combine tries since it has to match everything including the
> > > > > if_then_else
> > > > > and the reduction to set the CC.
> > > > >
> > > > > > That said, I still fail to see how addhn reduces the critical
> > > > > > latency?
> > > > > >
> > > > >
> > > > > Because it replaces 2 instructions on the critical reduction path with 1,
> > > > > which has half the latency of the two it replaced.  See the example above.
> > > >
> > > > So it's good enough to indeed combine the last two elements like your
> > > > patch does.
> > > >
> > > > That said, I still wonder about the trigger - it shouldn't be
> > > > availability of the instruction, as I'd think an addhn should not be
> > > > cheaper than a simple bitwise OR.  Instead it's that
> > > > cbranch on the wider vector isn't available?
> > >
> > > The addhn is actually the same latency/throughput as ORR, as they're both
> > > simple vector ALU operations on all cores.  But yeah, the reason this is
> > > beneficial is because of the reduction.  So from that point of view it is
> > > beneficial to always use it if available.
> >
> > Is it? At least only when cbranch on that smaller mode is available?
>
> Yes, it's what we use in our glibc routines like strchr etc., which are
> inline assembly.
>
> >
> > > The reason why I don't think the vectorizer should generate anything
> > > different from the cbranch is that, as mentioned above, when SVE is
> > > available (which it is on all modern Arm-produced cores) we can replace
> > > the cbranch with an SVE compare.
> > >
> > > So testing for cbranch is temporary until it's replaced with testing for
> > > vec_cbranch_any and vec_cbranch_all (and deprecating cbranch), which
> > > allows us more flexibility in the back end.
> >
> > Right.
> >
> > Still an unconditional use of addhn when available looks wrong to me,
> > why'd we have the cbranch check then, anyway?
>
> Because the result of the addhn is still a vector, and the cbranch check is
> there to ask if the target can do the vector comparison and branch.  The
> vectorizer doesn't particularly care how it does it though.
Huh, but we ask
if (direct_optab_handler (cbranch_optab, mode) == CODE_FOR_nothing)
in this case for V4SI, but now with availability of addhn you
instead generate a branch on V4HI. So what's the point of asking for
V4SI when you use V4HI in the end?
You should check for addhn availability plus verify you can cbranch
on the half-size vector mode I think (and otherwise not use addhn).
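Something like the following, reusing the names from the patch hunk below
(just a sketch, untested; narrow_type would have to be computed before the
check):

  bool addhn_supported_p
    = direct_internal_fn_supported_p (ifn, vectype, OPTIMIZE_FOR_SPEED)
      && direct_optab_handler (cbranch_optab, TYPE_MODE (narrow_type))
	 != CODE_FOR_nothing;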
> Some targets need a reduction of sorts (Adv. SIMD); we just utilize the fact
> that we can move the lower 64 bits of the vector as a bit pattern to a GPR
> in one go.
>
> Other targets don't need any of this and just branch on the CC flags already
> set (SVE).
>
> Other targets have to reduce to scalar using some in-order reduction or
> similar.
>
> And some targets may just not be able to.  The cbranch is there to abstract
> these away.
>
> The ADDHN is essentially asking whether you prefer a 64-bit or a 128-bit
> vector of booleans for the reduction of >= 2 compares, with the expectation
> that the 64-bit vector will not be more expensive to use than the 128-bit
> one.  This works because the compare results are all-0 or all-1 masks; see
> the sketch below.
>
> I mean, I could instead add a target hook that asks the target how it wants
> to combine the two elements?  But that feels like an abstraction that won't
> really be used by anyone else..
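>
> As a sanity check, a minimal C model of that mask property for one 32-bit
> lane (illustration only, not part of the patch; the helper name is made up):
>
> #include <stdint.h>
> #include <assert.h>
>
> /* Per-lane model of ADDHN on two 32-bit mask lanes: add modulo 2^32,
>    keep the high 16 bits.  For inputs that are 0 or -1, the result is
>    nonzero iff at least one input lane was -1.  */
> static uint16_t addhn_lane (uint32_t a, uint32_t b)
> {
>   return (uint16_t) ((a + b) >> 16);
> }
>
> int main (void)
> {
>   assert (addhn_lane (0, 0) == 0);      /* no match in either vector */
>   assert (addhn_lane (~0u, 0) != 0);    /* match in the first */
>   assert (addhn_lane (0, ~0u) != 0);    /* match in the second */
>   assert (addhn_lane (~0u, ~0u) != 0);  /* match in both */
>   return 0;
> }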
Just properly test we can cbranch on the actually used mode?
Richard.
> Thanks,
> Tamar
> >
> > Richard.
> >
> > > Thanks,
> > > Tamar
> > >
> > > >
> > > > Richard.
> > > >
> > > > > Thanks,
> > > > > Tamar
> > > > >
> > > > > >
> > > > > > > Thanks,
> > > > > > > Tamar
> > > > > > >
> > > > > > > gcc/ChangeLog:
> > > > > > >
> > > > > > > * internal-fn.def (VEC_ADD_HALVING_NARROW): New.
> > > > > > > * doc/generic.texi: Document it.
> > > > > > > * optabs.def (vec_addh_narrow): New.
> > > > > > > * doc/md.texi: Document it.
> > > > > > > 	* tree-vect-stmts.cc (vectorizable_early_exit): Use addhn if supported.
> > > > > > >
> > > > > > > gcc/testsuite/ChangeLog:
> > > > > > >
> > > > > > > * gcc.target/aarch64/vect-early-break-addhn_1.c: New test.
> > > > > > > * gcc.target/aarch64/vect-early-break-addhn_2.c: New test.
> > > > > > > * gcc.target/aarch64/vect-early-break-addhn_3.c: New test.
> > > > > > > * gcc.target/aarch64/vect-early-break-addhn_4.c: New test.
> > > > > > >
> > > > > > > ---
> > > > > > > diff --git a/gcc/doc/generic.texi b/gcc/doc/generic.texi
> > > > > > > index d4ac580a7a8b9cd339d26cb97f7eb963f83746a4..ff16ff47bbf45e795df0d230e9a885d9d218d9af 100644
> > > > > > > --- a/gcc/doc/generic.texi
> > > > > > > +++ b/gcc/doc/generic.texi
> > > > > > > @@ -1834,6 +1834,7 @@ a value from @code{enum annot_expr_kind}, the third is an @code{INTEGER_CST}.
> > > > > > > @tindex IFN_VEC_WIDEN_MINUS_LO
> > > > > > > @tindex IFN_VEC_WIDEN_MINUS_EVEN
> > > > > > > @tindex IFN_VEC_WIDEN_MINUS_ODD
> > > > > > > +@tindex IFN_VEC_ADD_HALVING_NARROW
> > > > > > > @tindex VEC_UNPACK_HI_EXPR
> > > > > > > @tindex VEC_UNPACK_LO_EXPR
> > > > > > > @tindex VEC_UNPACK_FLOAT_HI_EXPR
> > > > > > > @@ -1956,6 +1957,24 @@ vector of @code{N/2} subtractions.  In the case of
> > > > > > > vector are subtracted from the odd @code{N/2} of the first to produce the
> > > > > > > vector of @code{N/2} subtractions.
> > > > > > >
> > > > > > > +@item IFN_VEC_ADD_HALVING_NARROW
> > > > > > > +This internal function performs an addition of two input vectors, then
> > > > > > > +extracts the most significant half of each result element, narrowing it
> > > > > > > +to half the original element width.
> > > > > > > +
> > > > > > > +Concretely, it computes:
> > > > > > > +@code{(bits(a)/2)((a + b) >> (bits(a)/2))}
> > > > > > > +
> > > > > > > +where @code{bits(a)} is the width in bits of each input element.
> > > > > > > +
> > > > > > > +Its operands are vectors containing the same number of elements (@code{N})
> > > > > > > +of the same integral type.  The result is a vector of length @code{N}, with
> > > > > > > +elements of an integral type whose size is half that of the input element
> > > > > > > +type.
> > > > > > > +
> > > > > > > +This operation is currently only used for early break result compression
> > > > > > > +when the result of a vector boolean comparison can be represented as 0 or -1.
> > > > > > > +
> > > > > > > @item VEC_UNPACK_HI_EXPR
> > > > > > > @itemx VEC_UNPACK_LO_EXPR
> > > > > > > These nodes represent unpacking of the high and low parts of the input vector,
> > > > > > > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> > > > > > > index aba93f606eca59d31c103a05b2567fd4f3be55f3..ec0193e4eee079e00168bbaf9b28ba8d52e5d464 100644
> > > > > > > --- a/gcc/doc/md.texi
> > > > > > > +++ b/gcc/doc/md.texi
> > > > > > > @@ -6087,6 +6087,25 @@ vectors with N signed/unsigned elements of size S@.  Find the absolute
> > > > > > > difference between operands 1 and 2 and widen the resulting elements.
> > > > > > > Put the N/2 results of size 2*S in the output vector (operand 0).
> > > > > > >
> > > > > > > +@cindex @code{vec_addh_narrow@var{m}} instruction pattern
> > > > > > > +@item @samp{vec_addh_narrow@var{m}}
> > > > > > > +Signed or unsigned addition of two input vectors, then extraction of the
> > > > > > > +most significant half of each result element, narrowing it to half the
> > > > > > > +original element width.
> > > > > > > +
> > > > > > > +Concretely, it computes:
> > > > > > > +@code{(bits(a)/2)((a + b) >> (bits(a)/2))}
> > > > > > > +
> > > > > > > +where @code{bits(a)} is the width in bits of each input element.
> > > > > > > +
> > > > > > > +Its operands (@code{1} and @code{2}) are vectors containing the same number
> > > > > > > +of signed or unsigned integral elements (@code{N}) of size @code{S}.  The
> > > > > > > +result (operand @code{0}) is a vector of length @code{N}, with elements of
> > > > > > > +an integral type whose size is half that of @code{S}.
> > > > > > > +
> > > > > > > +This operation is currently only used for early break result compression
> > > > > > > +when the result of a vector boolean comparison can be represented as 0 or -1.
> > > > > > > +
> > > > > > > @cindex @code{vec_addsub@var{m}3} instruction pattern
> > > > > > > @item @samp{vec_addsub@var{m}3}
> > > > > > > Alternating subtract, add with even lanes doing subtract and odd
> > > > > > > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> > > > > > > index d2480a1bf7927476215bc7bb99c0b74197d2b7e9..cb18058d9f48cc0dff96ed4b31d0abc9adb67867 100644
> > > > > > > --- a/gcc/internal-fn.def
> > > > > > > +++ b/gcc/internal-fn.def
> > > > > > > @@ -422,6 +422,8 @@ DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT270, ECF_CONST, cadd270, binary)
> > > > > > > DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL, ECF_CONST, cmul, binary)
> > > > > > > DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL_CONJ, ECF_CONST, cmul_conj, binary)
> > > > > > > DEF_INTERNAL_OPTAB_FN (VEC_ADDSUB, ECF_CONST, vec_addsub, binary)
> > > > > > > +DEF_INTERNAL_OPTAB_FN (VEC_ADD_HALVING_NARROW, ECF_CONST | ECF_NOTHROW,
> > > > > > > +		       vec_addh_narrow, binary)
> > > > > > > DEF_INTERNAL_WIDENING_OPTAB_FN (VEC_WIDEN_PLUS,
> > > > > > > ECF_CONST | ECF_NOTHROW,
> > > > > > > first,
> > > > > > > diff --git a/gcc/optabs.def b/gcc/optabs.def
> > > > > > > index 87a8b85da1592646d0a3447572e842ceb158cd97..b2bedc3692f914c2b80d7972db81b542b32c9eb8 100644
> > > > > > > --- a/gcc/optabs.def
> > > > > > > +++ b/gcc/optabs.def
> > > > > > > @@ -492,6 +492,7 @@ OPTAB_D (vec_widen_uabd_hi_optab, "vec_widen_uabd_hi_$a")
> > > > > > > OPTAB_D (vec_widen_uabd_lo_optab, "vec_widen_uabd_lo_$a")
> > > > > > > OPTAB_D (vec_widen_uabd_odd_optab, "vec_widen_uabd_odd_$a")
> > > > > > > OPTAB_D (vec_widen_uabd_even_optab, "vec_widen_uabd_even_$a")
> > > > > > > +OPTAB_D (vec_addh_narrow_optab, "vec_addh_narrow$a")
> > > > > > > OPTAB_D (vec_addsub_optab, "vec_addsub$a3")
> > > > > > > OPTAB_D (vec_fmaddsub_optab, "vec_fmaddsub$a4")
> > > > > > > OPTAB_D (vec_fmsubadd_optab, "vec_fmsubadd$a4")
> > > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_1.c b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_1.c
> > > > > > > new file mode 100644
> > > > > > > index 0000000000000000000000000000000000000000..4ecb187513e525e0cd9b8b063e418a75a23c525d
> > > > > > > --- /dev/null
> > > > > > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_1.c
> > > > > > > @@ -0,0 +1,33 @@
> > > > > > > +/* { dg-do compile } */
> > > > > > > +/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" } */
> > > > > > > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > > > > > > +
> > > > > > > +#define TYPE int
> > > > > > > +#define N 800
> > > > > > > +
> > > > > > > +#pragma GCC target "+nosve"
> > > > > > > +
> > > > > > > +TYPE a[N];
> > > > > > > +
> > > > > > > +/*
> > > > > > > +** foo:
> > > > > > > +** ...
> > > > > > > +** ldp q[0-9]+, q[0-9]+, \[x[0-9]+\], 32
> > > > > > > +** cmeq v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
> > > > > > > +** cmeq v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
> > > > > > > +** addhn v[0-9]+.4h, v[0-9]+.4s, v[0-9]+.4s
> > > > > > > +** fmov x[0-9]+, d[0-9]+
> > > > > > > +** ...
> > > > > > > +*/
> > > > > > > +
> > > > > > > +int foo ()
> > > > > > > +{
> > > > > > > +#pragma GCC unroll 8
> > > > > > > + for (int i = 0; i < N; i++)
> > > > > > > + if (a[i] == 124)
> > > > > > > + return 1;
> > > > > > > +
> > > > > > > + return 0;
> > > > > > > +}
> > > > > > > +
> > > > > > > +/* { dg-final { scan-tree-dump "VEC_ADD_HALVING_NARROW" "vect" } } */
> > > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_2.c b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_2.c
> > > > > > > new file mode 100644
> > > > > > > index 0000000000000000000000000000000000000000..d67d0d13d1733935aaf805e59188eb8155cb5f06
> > > > > > > --- /dev/null
> > > > > > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_2.c
> > > > > > > @@ -0,0 +1,33 @@
> > > > > > > +/* { dg-do compile } */
> > > > > > > +/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" } */
> > > > > > > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > > > > > > +
> > > > > > > +#define TYPE long long
> > > > > > > +#define N 800
> > > > > > > +
> > > > > > > +#pragma GCC target "+nosve"
> > > > > > > +
> > > > > > > +TYPE a[N];
> > > > > > > +
> > > > > > > +/*
> > > > > > > +** foo:
> > > > > > > +** ...
> > > > > > > +** ldp q[0-9]+, q[0-9]+, \[x[0-9]+\], 32
> > > > > > > +** cmeq v[0-9]+.2d, v[0-9]+.2d, v[0-9]+.2d
> > > > > > > +** cmeq v[0-9]+.2d, v[0-9]+.2d, v[0-9]+.2d
> > > > > > > +** addhn v[0-9]+.2s, v[0-9]+.2d, v[0-9]+.2d
> > > > > > > +** fmov x[0-9]+, d[0-9]+
> > > > > > > +** ...
> > > > > > > +*/
> > > > > > > +
> > > > > > > +int foo ()
> > > > > > > +{
> > > > > > > +#pragma GCC unroll 4
> > > > > > > + for (int i = 0; i < N; i++)
> > > > > > > + if (a[i] == 124)
> > > > > > > + return 1;
> > > > > > > +
> > > > > > > + return 0;
> > > > > > > +}
> > > > > > > +
> > > > > > > +/* { dg-final { scan-tree-dump "VEC_ADD_HALVING_NARROW" "vect" } } */
> > > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_3.c b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_3.c
> > > > > > > new file mode 100644
> > > > > > > index 0000000000000000000000000000000000000000..57dbc44ae0cdcbcdccd3d8dbe98c79713eaf5607
> > > > > > > --- /dev/null
> > > > > > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_3.c
> > > > > > > @@ -0,0 +1,33 @@
> > > > > > > +/* { dg-do compile } */
> > > > > > > +/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" } */
> > > > > > > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > > > > > > +
> > > > > > > +#define TYPE short
> > > > > > > +#define N 800
> > > > > > > +
> > > > > > > +#pragma GCC target "+nosve"
> > > > > > > +
> > > > > > > +TYPE a[N];
> > > > > > > +
> > > > > > > +/*
> > > > > > > +** foo:
> > > > > > > +** ...
> > > > > > > +** ldp q[0-9]+, q[0-9]+, \[x[0-9]+\], 32
> > > > > > > +** cmeq v[0-9]+.8h, v[0-9]+.8h, v[0-9]+.8h
> > > > > > > +** cmeq v[0-9]+.8h, v[0-9]+.8h, v[0-9]+.8h
> > > > > > > +** addhn v[0-9]+.8b, v[0-9]+.8h, v[0-9]+.8h
> > > > > > > +** fmov x[0-9]+, d[0-9]+
> > > > > > > +** ...
> > > > > > > +*/
> > > > > > > +
> > > > > > > +int foo ()
> > > > > > > +{
> > > > > > > +#pragma GCC unroll 16
> > > > > > > + for (int i = 0; i < N; i++)
> > > > > > > + if (a[i] == 124)
> > > > > > > + return 1;
> > > > > > > +
> > > > > > > + return 0;
> > > > > > > +}
> > > > > > > +
> > > > > > > +/* { dg-final { scan-tree-dump "VEC_ADD_HALVING_NARROW" "vect" } } */
> > > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_4.c b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_4.c
> > > > > > > new file mode 100644
> > > > > > > index 0000000000000000000000000000000000000000..8ad42b22024479283d6814d815ef1dce411d1c72
> > > > > > > --- /dev/null
> > > > > > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_4.c
> > > > > > > @@ -0,0 +1,21 @@
> > > > > > > +/* { dg-do compile } */
> > > > > > > +/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" } */
> > > > > > > +
> > > > > > > +#define TYPE char
> > > > > > > +#define N 800
> > > > > > > +
> > > > > > > +#pragma GCC target "+nosve"
> > > > > > > +
> > > > > > > +TYPE a[N];
> > > > > > > +
> > > > > > > +int foo ()
> > > > > > > +{
> > > > > > > +#pragma GCC unroll 32
> > > > > > > + for (int i = 0; i < N; i++)
> > > > > > > + if (a[i] == 124)
> > > > > > > + return 1;
> > > > > > > +
> > > > > > > + return 0;
> > > > > > > +}
> > > > > > > +
> > > > > > > +/* { dg-final { scan-tree-dump-not "VEC_ADD_HALVING_NARROW" "vect" } } */
> > > > > > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > > > > > > index 1545fab364792f75bcc786ba1311b8bdc82edd70..179ce5e0a66b6f88976ffb544c6874d7bec999a8 100644
> > > > > > > --- a/gcc/tree-vect-stmts.cc
> > > > > > > +++ b/gcc/tree-vect-stmts.cc
> > > > > > > @@ -12328,7 +12328,7 @@ vectorizable_early_exit (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
> > > > > > >    gimple *orig_stmt = STMT_VINFO_STMT (vect_orig_stmt (stmt_info));
> > > > > > > gcond *cond_stmt = as_a <gcond *>(orig_stmt);
> > > > > > >
> > > > > > > - tree cst = build_zero_cst (vectype);
> > > > > > > + tree vectype_out = vectype;
> > > > > > > auto bb = gimple_bb (cond_stmt);
> > > > > > > edge exit_true_edge = EDGE_SUCC (bb, 0);
> > > > > > > if (exit_true_edge->flags & EDGE_FALSE_VALUE)
> > > > > > > @@ -12452,12 +12452,40 @@ vectorizable_early_exit (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
> > > > > > > else
> > > > > > > workset.splice (stmts);
> > > > > > >
> > > > > > > +  /* See if we support ADDHN and use that for the reduction.  */
> > > > > > > +  internal_fn ifn = IFN_VEC_ADD_HALVING_NARROW;
> > > > > > > +  bool addhn_supported_p
> > > > > > > +    = direct_internal_fn_supported_p (ifn, vectype, OPTIMIZE_FOR_SPEED);
> > > > > > > +  tree narrow_type = NULL_TREE;
> > > > > > > +  if (addhn_supported_p)
> > > > > > > +    {
> > > > > > > +      /* Calculate the narrowing type for the result.  */
> > > > > > > +      auto halfprec = TYPE_PRECISION (TREE_TYPE (vectype)) / 2;
> > > > > > > +      auto unsignedp = TYPE_UNSIGNED (TREE_TYPE (vectype));
> > > > > > > +      tree itype = build_nonstandard_integer_type (halfprec, unsignedp);
> > > > > > > +      poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
> > > > > > > +      tree tmp_type = build_vector_type (itype, nunits);
> > > > > > > +      narrow_type = truth_type_for (tmp_type);
> > > > > > > +    }
> > > > > > > +
> > > > > > >    while (workset.length () > 1)
> > > > > > >      {
> > > > > > > -      new_temp = make_temp_ssa_name (vectype, NULL, "vexit_reduc");
> > > > > > >        tree arg0 = workset.pop ();
> > > > > > >        tree arg1 = workset.pop ();
> > > > > > > -      new_stmt = gimple_build_assign (new_temp, BIT_IOR_EXPR, arg0, arg1);
> > > > > > > +      if (addhn_supported_p && workset.length () == 0)
> > > > > > > +	{
> > > > > > > +	  new_stmt = gimple_build_call_internal (ifn, 2, arg0, arg1);
> > > > > > > +	  vectype_out = narrow_type;
> > > > > > > +	  new_temp = make_temp_ssa_name (vectype_out, NULL, "vexit_reduc");
> > > > > > > +	  gimple_call_set_lhs (as_a <gcall *> (new_stmt), new_temp);
> > > > > > > +	  gimple_call_set_nothrow (as_a <gcall *> (new_stmt), true);
> > > > > > > +	}
> > > > > > > +      else
> > > > > > > +	{
> > > > > > > +	  new_temp = make_temp_ssa_name (vectype_out, NULL, "vexit_reduc");
> > > > > > > +	  new_stmt
> > > > > > > +	    = gimple_build_assign (new_temp, BIT_IOR_EXPR, arg0, arg1);
> > > > > > > +	}
> > > > > > > vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt,
> > > > > > > &cond_gsi);
> > > > > > > workset.quick_insert (0, new_temp);
> > > > > > > @@ -12480,6 +12508,7 @@ vectorizable_early_exit (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
> > > > > > >
> > > > > > > gcc_assert (new_temp);
> > > > > > >
> > > > > > > + tree cst = build_zero_cst (vectype_out);
> > > > > > > gimple_cond_set_condition (cond_stmt, NE_EXPR, new_temp, cst);
> > > > > > > update_stmt (orig_stmt);
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > --
> > > > > > Richard Biener <[email protected]>
> > > > > > SUSE Software Solutions Germany GmbH,
> > > > > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> > > > >
> > > >
> > > > --
> > > > Richard Biener <[email protected]>
> > > > SUSE Software Solutions Germany GmbH,
> > > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> > >
> >
> > --
> > Richard Biener <[email protected]>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
>
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)