Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Richard Biener <rguent...@suse.de>
>> Sent: Friday, October 11, 2024 7:52 AM
>> To: Richard Sandiford <richard.sandif...@arm.com>
>> Cc: Jennifer Schmitz <jschm...@nvidia.com>; gcc-patches@gcc.gnu.org; Richard
>> Earnshaw <richard.earns...@arm.com>; Kyrylo Tkachov
>> <ktkac...@nvidia.com>; Tamar Christina <tamar.christ...@arm.com>
>> Subject: Re: [PATCH][PR113816] AArch64: Use SIMD+GPR for logical vector
>> reductions
>> 
>> On Thu, 10 Oct 2024, Richard Sandiford wrote:
>> 
>> > Jennifer Schmitz <jschm...@nvidia.com> writes:
>> > > This patch implements the optabs reduc_and_scal_<mode>,
>> > > reduc_ior_scal_<mode>, and reduc_xor_scal_<mode> for ASIMD modes V8QI,
>> > > V16QI, V4HI, and V8HI for TARGET_SIMD to improve codegen for bitwise
>> > > logical vector reduction operations.
>> > > Previously, either only vector registers or only general purpose
>> > > registers (GPR) were used.  Now, vector registers are used for the
>> > > reduction from 128 to 64 bits; 64-bit GPR are used for the reduction
>> > > from 64 to 32 bits; and 32-bit GPR are used for the rest of the
>> > > reduction steps.
>> > >
>> > > For example, the test case (V8HI)
>> > > int16_t foo (int16_t *a)
>> > > {
>> > >   int16_t b = -1;
>> > >   for (int i = 0; i < 8; ++i)
>> > >     b &= a[i];
>> > >   return b;
>> > > }
>> > >
>> > > was previously compiled to (-O2):
>> > > foo:
>> > >  ldr     q0, [x0]
>> > >  movi    v30.4s, 0
>> > >  ext     v29.16b, v0.16b, v30.16b, #8
>> > >  and     v29.16b, v29.16b, v0.16b
>> > >  ext     v31.16b, v29.16b, v30.16b, #4
>> > >  and     v31.16b, v31.16b, v29.16b
>> > >  ext     v30.16b, v31.16b, v30.16b, #2
>> > >  and     v30.16b, v30.16b, v31.16b
>> > >  umov    w0, v30.h[0]
>> > >  ret
>> > >
>> > > With patch, it is compiled to:
>> > > foo:
>> > >  ldr     q31, [x0]
>> > >  ext     v30.16b, v31.16b, v31.16b, #8
>> > >  and     v31.8b, v30.8b, v31.8b
>> > >  fmov    x0, d31
>> > >  and     x0, x0, x0, lsr 32
>> > >  and     w0, w0, w0, lsr 16
>> > >  ret
>> > >
>> > > For modes V4SI and V2DI, the pattern was not implemented, because the
>> > > current codegen (using only base instructions) is already efficient.
>> > >
>> > > Note that the PR initially suggested to use SVE reduction ops. However,
>> > > they have higher latency than the proposed sequence, which is why using
>> > > neon and base instructions is preferable.
>> > >
>> > > Test cases were added for 8/16-bit integers for all implemented modes
>> > > and all three operations to check the produced assembly.
>> > >
>> > > We also added [istarget aarch64*-*-*] to the selector vect_logical_reduc,
>> > > because for aarch64 vector types, either the logical reduction optabs are
>> > > implemented or the codegen for reduction operations is good as it is.
>> > > This was motivated by failure of a scan-tree-dump directive in the test
>> > > cases gcc.dg/vect/vect-reduc-or_1.c and gcc.dg/vect/vect-reduc-or_2.c.
>> > >
>> > > The patch was bootstrapped and regtested on aarch64-linux-gnu, no
>> > > regression.
>> > > OK for mainline?
>> > >
>> > > Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>> > >
>> > > gcc/
>> > >  PR target/113816
>> > >  * config/aarch64/aarch64-simd.md (reduc_<optab>_scal_<mode>):
>> > >  Implement for logical bitwise operations for VDQV_E.
>> > >
>> > > gcc/testsuite/
>> > >  PR target/113816
>> > >  * lib/target-supports.exp (vect_logical_reduc): Add aarch64*.
>> > >  * gcc.target/aarch64/simd/logical_reduc.c: New test.
>> > >  * gcc.target/aarch64/vect-reduc-or_1.c: Adjust expected outcome.
>> > > ---
>> > >  gcc/config/aarch64/aarch64-simd.md            |  55 +++++
>> > >  .../gcc.target/aarch64/simd/logical_reduc.c   | 208 ++++++++++++++++++
>> > >  .../gcc.target/aarch64/vect-reduc-or_1.c      |   2 +-
>> > >  gcc/testsuite/lib/target-supports.exp         |   4 +-
>> > >  4 files changed, 267 insertions(+), 2 deletions(-)
>> > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/logical_reduc.c
>> > >
>> > > diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
>> > > index 23c03a96371..00286b8b020 100644
>> > > --- a/gcc/config/aarch64/aarch64-simd.md
>> > > +++ b/gcc/config/aarch64/aarch64-simd.md
>> > > @@ -3608,6 +3608,61 @@
>> > >    }
>> > >  )
>> > >
>> > > +;; Emit a sequence for bitwise logical reductions over vectors for V8QI,
>> > > +;; V16QI, V4HI, and V8HI modes.  The reduction is achieved by iteratively
>> > > +;; operating on the two halves of the input.
>> > > +;; If the input has 128 bits, the first operation is performed in vector
>> > > +;; registers.  From 64 bits down, the reduction steps are performed in
>> > > +;; general purpose registers.
>> > > +;; For example, for V8HI and operation AND, the intended sequence is:
>> > > +;; EXT      v1.16b, v0.16b, v0.16b, #8
>> > > +;; AND      v0.8b, v1.8b, v0.8b
>> > > +;; FMOV     x0, d0
>> > > +;; AND      x0, x0, x0, lsr 32
>> > > +;; AND      w0, w0, w0, lsr 16
>> > > +;;
>> > > +;; For V8QI and operation AND, the sequence is:
>> > > +;; AND      x0, x0, x0, lsr 32
>> > > +;; AND      w0, w0, w0, lsr 16
>> > > +;; AND      w0, w0, w0, lsr 8
>> > > +
>> > > +(define_expand "reduc_<optab>_scal_<mode>"
>> > > + [(match_operand:<VEL> 0 "register_operand")
>> > > +  (LOGICAL:VDQV_E (match_operand:VDQV_E 1 "register_operand"))]
>> > > +  "TARGET_SIMD"
>> > > +  {
>> > > +    rtx dst = operands[1];
>> > > +    rtx tdi = gen_reg_rtx (DImode);
>> > > +    rtx tsi = lowpart_subreg (SImode, tdi, DImode);
>> > > +    rtx op1_lo;
>> > > +    if (known_eq (GET_MODE_SIZE (<MODE>mode), 16))
>> > > +      {
>> > > +        rtx t0 = gen_reg_rtx (<MODE>mode);
>> > > +        rtx t1 = gen_reg_rtx (DImode);
>> > > +        rtx t2 = gen_reg_rtx (DImode);
>> > > +        rtx idx = GEN_INT (8 / GET_MODE_UNIT_SIZE (<MODE>mode));
>> > > +        emit_insn (gen_aarch64_ext<mode> (t0, dst, dst, idx));
>> > > +        op1_lo = lowpart_subreg (V2DImode, dst, <MODE>mode);
>> > > +        rtx t0_lo = lowpart_subreg (V2DImode, t0, <MODE>mode);
>> > > +        emit_insn (gen_aarch64_get_lanev2di (t1, op1_lo, GEN_INT (0)));
>> > > +        emit_insn (gen_aarch64_get_lanev2di (t2, t0_lo, GEN_INT (0)));
>> > > +        emit_insn (gen_<optab>di3 (t1, t1, t2));
>> > > +        emit_move_insn (tdi, t1);
>> > > +      }
>> > > +    else
>> > > +      {
>> > > +        op1_lo = lowpart_subreg (DImode, dst, <MODE>mode);
>> > > +        emit_move_insn (tdi, op1_lo);
>> > > +      }
>> > > +    emit_insn (gen_<optab>_lshrdi3 (tdi, tdi, GEN_INT (32), tdi));
>> > > +    emit_insn (gen_<optab>_lshrsi3 (tsi, tsi, GEN_INT (16), tsi));
>> > > +    if (known_eq (GET_MODE_UNIT_BITSIZE (<MODE>mode), 8))
>> > > +      emit_insn (gen_<optab>_lshrsi3 (tsi, tsi, GEN_INT (8), tsi));
>> > > +    emit_move_insn (operands[0], lowpart_subreg (<VEL>mode, tsi, SImode));
>> > > +    DONE;
>> > > +  }
>> > > +)
>> >
>> > Sorry to be awkward, but I've got mixed feelings about this.
>> > The sequence produced looks good.  But we're not defining this
>> > optab because the target has a special instruction for implementing
>> > the operation.  We're just implementing it to tweak the loop emitted
>> > by vect_create_epilog_for_reduction.  Maybe we should change that loop
>> > instead, perhaps with a target hook?
>> 
>> So does it currently not work because there are no v8qi, v4qi or v2qi
>> vector modes?  I'll note there already _is_
>> targetm.vectorize.split_reduction (mode), but it expects a vector
>> mode as its result and it isn't passed the operation.
>> 
>
> The point is that these reductions shouldn't be done on the vector side.
> So even if we had v4qi and v2qi, the reduction shouldn't be done there, as
> the latency and throughput of these sub-64-bit operations are much worse
> on the vector side, so the steps should be done on the integer side,
> i.e. on DI and SI.
>
> This is why I disagree with Richard that this can be done generically for
> all targets.  The decomposition takes advantage of Arm's fused shifted
> logical operations on GPRs, which make the steps really cheap on the
> integer side.

Wouldn't it still be a win without that though?  Having the fused operation
gives us a ~4x improvement in latency on a typical core (EXT + vector AND
-> fused scalar AND).  But the unfused operation would still be a ~2x
improvement.
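
For comparison, roughly what each scalar reduction step looks like
(just a sketch for illustration, not taken from the patch): with the
fused form we get one instruction per step, e.g.:

  and     x0, x0, x0, lsr 32

whereas a target without fused shift+logical would need two, e.g.:

  lsr     x1, x0, 32
  and     x0, x0, x1

which on a typical core should still beat the corresponding EXT plus
vector AND pair.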

> Richard brought up AArch32 as another example.  I don't agree there either,
> because AArch32's register file overlap means you don't need shifts.
> The first reduction step can just be done by accessing the registers
> as even/odd pairs in a smaller mode.
>
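
(As an aside, a rough sketch of that AArch32 point, assuming AND as the
operation and the 128-bit input in q0; purely an illustration, not from
any patch:

  vand    d0, d0, d1      @ q0 is the pair d0:d1, so the 128->64 step
                          @ needs no EXT or extra shift
  vmov    r0, r1, d0      @ both 32-bit halves go straight to core regs
  and     r0, r0, r1      @ 64->32

with any remaining narrower steps done as shifted ANDs on r0.)
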
> To get this codegen generically, the target must support modes that are
> tieable between the vector mode and the smaller sizes, i.e. BIT_FIELD_REF
> from the vector mode to a smaller mode must be cheap or free.
>
> The decision of when to do the transfer from SIMD to GPR is highly
> target specific, and so is how to do it.
>
>> I would expect that we can improve on this existing machinery
>> instead of inventing a new one.
>> 
>
> But we're not.  We're just implementing the optab for the machinery
> that already exists.

I don't think this kind of expansion is what optabs are there for though.
I suppose it was in the pre-gimple days, when all optimisation was done
on RTL.  (Although of course that also predates vectorisation.)  But now,
I think the purpose of optabs is to advertise what the target can do via
some form of direct h/w support.

The original patch (using ANDV) was an ideal use of optabs.  But I think
the request to use a tweaked version of the current open-coded reduction
scheme, although a good suggestion, also changes the way that we should
go about it.

> I think we're just over-engineering a new contributor's patch.  Even if
> we think there is some merit in doing it generically, I don't see why we
> shouldn't take this patch today.

Sorry if it comes across that way.

Thanks,
Richard
