Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Richard Biener <rguent...@suse.de>
>> Sent: Friday, October 11, 2024 7:52 AM
>> To: Richard Sandiford <richard.sandif...@arm.com>
>> Cc: Jennifer Schmitz <jschm...@nvidia.com>; gcc-patches@gcc.gnu.org;
>> Richard Earnshaw <richard.earns...@arm.com>; Kyrylo Tkachov
>> <ktkac...@nvidia.com>; Tamar Christina <tamar.christ...@arm.com>
>> Subject: Re: [PATCH][PR113816] AArch64: Use SIMD+GPR for logical vector
>> reductions
>>
>> On Thu, 10 Oct 2024, Richard Sandiford wrote:
>>
>> > Jennifer Schmitz <jschm...@nvidia.com> writes:
>> > > This patch implements the optabs reduc_and_scal_<mode>,
>> > > reduc_ior_scal_<mode>, and reduc_xor_scal_<mode> for ASIMD modes
>> > > V8QI, V16QI, V4HI, and V8HI for TARGET_SIMD to improve codegen for
>> > > bitwise logical vector reduction operations.
>> > > Previously, either only vector registers or only general purpose
>> > > registers (GPR) were used.  Now, vector registers are used for the
>> > > reduction from 128 to 64 bits; 64-bit GPR are used for the reduction
>> > > from 64 to 32 bits; and 32-bit GPR are used for the rest of the
>> > > reduction steps.
>> > >
>> > > For example, the test case (V8HI)
>> > > int16_t foo (int16_t *a)
>> > > {
>> > >   int16_t b = -1;
>> > >   for (int i = 0; i < 8; ++i)
>> > >     b &= a[i];
>> > >   return b;
>> > > }
>> > >
>> > > was previously compiled to (-O2):
>> > > foo:
>> > >   ldr     q0, [x0]
>> > >   movi    v30.4s, 0
>> > >   ext     v29.16b, v0.16b, v30.16b, #8
>> > >   and     v29.16b, v29.16b, v0.16b
>> > >   ext     v31.16b, v29.16b, v30.16b, #4
>> > >   and     v31.16b, v31.16b, v29.16b
>> > >   ext     v30.16b, v31.16b, v30.16b, #2
>> > >   and     v30.16b, v30.16b, v31.16b
>> > >   umov    w0, v30.h[0]
>> > >   ret
>> > >
>> > > With the patch, it is compiled to:
>> > > foo:
>> > >   ldr     q31, [x0]
>> > >   ext     v30.16b, v31.16b, v31.16b, #8
>> > >   and     v31.8b, v30.8b, v31.8b
>> > >   fmov    x0, d31
>> > >   and     x0, x0, x0, lsr 32
>> > >   and     w0, w0, w0, lsr 16
>> > >   ret
>> > >
>> > > For modes V4SI and V2DI, the pattern was not implemented, because
>> > > the current codegen (using only base instructions) is already
>> > > efficient.
>> > >
>> > > Note that the PR initially suggested using SVE reduction ops.
>> > > However, they have higher latency than the proposed sequence, which
>> > > is why using Neon and base instructions is preferable.
>> > >
>> > > Test cases were added for 8/16-bit integers for all implemented
>> > > modes and all three operations to check the produced assembly.
>> > >
>> > > We also added [istarget aarch64*-*-*] to the selector
>> > > vect_logical_reduc, because for aarch64 vector types, either the
>> > > logical reduction optabs are implemented or the codegen for
>> > > reduction operations is good as it is.  This was motivated by the
>> > > failure of a scan-tree-dump directive in the test cases
>> > > gcc.dg/vect/vect-reduc-or_1.c and gcc.dg/vect/vect-reduc-or_2.c.
>> > >
>> > > The patch was bootstrapped and regtested on aarch64-linux-gnu, no
>> > > regression.  OK for mainline?
>> > >
>> > > Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>> > >
>> > > gcc/
>> > >   PR target/113816
>> > >   * config/aarch64/aarch64-simd.md (reduc_<optab>_scal_<mode>):
>> > >   Implement for logical bitwise operations for VDQV_E.
>> > >
>> > > gcc/testsuite/
>> > >   PR target/113816
>> > >   * lib/target-supports.exp (vect_logical_reduc): Add aarch64*.
>> > >   * gcc.target/aarch64/simd/logical_reduc.c: New test.
>> > >   * gcc.target/aarch64/vect-reduc-or_1.c: Adjust expected outcome.
>> > > ---
>> > >  gcc/config/aarch64/aarch64-simd.md            |  55 +++++
>> > >  .../gcc.target/aarch64/simd/logical_reduc.c   | 208 ++++++++++++++++++
>> > >  .../gcc.target/aarch64/vect-reduc-or_1.c      |   2 +-
>> > >  gcc/testsuite/lib/target-supports.exp         |   4 +-
>> > >  4 files changed, 267 insertions(+), 2 deletions(-)
>> > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/logical_reduc.c
>> > >
>> > > diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
>> > > index 23c03a96371..00286b8b020 100644
>> > > --- a/gcc/config/aarch64/aarch64-simd.md
>> > > +++ b/gcc/config/aarch64/aarch64-simd.md
>> > > @@ -3608,6 +3608,61 @@
>> > >  }
>> > >  )
>> > >
>> > > +;; Emit a sequence for bitwise logical reductions over vectors for
>> > > +;; V8QI, V16QI, V4HI, and V8HI modes.  The reduction is achieved by
>> > > +;; iteratively operating on the two halves of the input.
>> > > +;; If the input has 128 bits, the first operation is performed in
>> > > +;; vector registers.  From 64 bits down, the reduction steps are
>> > > +;; performed in general purpose registers.
>> > > +;; For example, for V8HI and operation AND, the intended sequence is:
>> > > +;;   EXT  v1.16b, v0.16b, v0.16b, #8
>> > > +;;   AND  v0.8b, v1.8b, v0.8b
>> > > +;;   FMOV x0, d0
>> > > +;;   AND  x0, x0, x0, lsr 32
>> > > +;;   AND  w0, w0, w0, lsr 16
>> > > +;;
>> > > +;; For V8QI and operation AND, the sequence is:
>> > > +;;   AND  x0, x0, x0, lsr 32
>> > > +;;   AND  w0, w0, w0, lsr 16
>> > > +;;   AND  w0, w0, w0, lsr 8
>> > > +
>> > > +(define_expand "reduc_<optab>_scal_<mode>"
>> > > + [(match_operand:<VEL> 0 "register_operand")
>> > > +  (LOGICAL:VDQV_E (match_operand:VDQV_E 1 "register_operand"))]
>> > > +  "TARGET_SIMD"
>> > > +  {
>> > > +    rtx dst = operands[1];
>> > > +    rtx tdi = gen_reg_rtx (DImode);
>> > > +    rtx tsi = lowpart_subreg (SImode, tdi, DImode);
>> > > +    rtx op1_lo;
>> > > +    if (known_eq (GET_MODE_SIZE (<MODE>mode), 16))
>> > > +      {
>> > > +        rtx t0 = gen_reg_rtx (<MODE>mode);
>> > > +        rtx t1 = gen_reg_rtx (DImode);
>> > > +        rtx t2 = gen_reg_rtx (DImode);
>> > > +        rtx idx = GEN_INT (8 / GET_MODE_UNIT_SIZE (<MODE>mode));
>> > > +        emit_insn (gen_aarch64_ext<mode> (t0, dst, dst, idx));
>> > > +        op1_lo = lowpart_subreg (V2DImode, dst, <MODE>mode);
>> > > +        rtx t0_lo = lowpart_subreg (V2DImode, t0, <MODE>mode);
>> > > +        emit_insn (gen_aarch64_get_lanev2di (t1, op1_lo, GEN_INT (0)));
>> > > +        emit_insn (gen_aarch64_get_lanev2di (t2, t0_lo, GEN_INT (0)));
>> > > +        emit_insn (gen_<optab>di3 (t1, t1, t2));
>> > > +        emit_move_insn (tdi, t1);
>> > > +      }
>> > > +    else
>> > > +      {
>> > > +        op1_lo = lowpart_subreg (DImode, dst, <MODE>mode);
>> > > +        emit_move_insn (tdi, op1_lo);
>> > > +      }
>> > > +    emit_insn (gen_<optab>_lshrdi3 (tdi, tdi, GEN_INT (32), tdi));
>> > > +    emit_insn (gen_<optab>_lshrsi3 (tsi, tsi, GEN_INT (16), tsi));
>> > > +    if (known_eq (GET_MODE_UNIT_BITSIZE (<MODE>mode), 8))
>> > > +      emit_insn (gen_<optab>_lshrsi3 (tsi, tsi, GEN_INT (8), tsi));
>> > > +    emit_move_insn (operands[0], lowpart_subreg (<VEL>mode, tsi, SImode));
>> > > +    DONE;
>> > > +  }
>> > > +)
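For reference, the dataflow this expander produces for V8HI AND can be
modelled in scalar C.  This is purely illustrative -- the helper name is
made up and the real pattern emits RTL, not C -- but it shows the
intended 128->64 SIMD step followed by the fused shift+AND steps on GPRs:

  #include <stdint.h>
  #include <string.h>

  /* Lane order within the words is endian-dependent, but AND-reducing
     all lanes is insensitive to the order.  */
  int16_t
  andv8hi_model (const int16_t v[8])
  {
    uint64_t lo, hi, x;
    memcpy (&lo, v, 8);       /* low 64-bit half of the vector */
    memcpy (&hi, v + 4, 8);   /* ext  v30.16b, v31.16b, v31.16b, #8 */
    x = lo & hi;              /* and  v31.8b, v30.8b, v31.8b */
    x &= x >> 32;             /* fmov x0, d31; and x0, x0, x0, lsr 32 */
    x &= x >> 16;             /* and  w0, w0, w0, lsr 16 */
    return (int16_t) x;
  }
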
>> > Sorry to be awkward, but I've got mixed feelings about this.
>> > The sequence produced looks good.  But we're not defining this
>> > optab because the target has a special instruction for implementing
>> > the operation.  We're just implementing it to tweak the loop emitted
>> > by vect_create_epilog_for_reduction.  Maybe we should change that
>> > loop instead, perhaps with a target hook?
>>
>> So does it currently not work because there are no v8qi, v4qi or v2qi
>> vector modes?  I'll note there already _is_
>> targetm.vectorize.split_reduction (mode), but it expects a vector
>> mode as the result and it doesn't get the operation passed in.
>>
>
> The point of the reduction is that it shouldn't be done on the vector
> side.  So even if we had v4qi and v2qi, the reduction shouldn't be done
> there: the latency and throughput of these sub-64-bit operations are
> much worse on the vector side, so the steps should be done on the
> integer side, i.e. on DI and SI.
>
> This is why I disagree with Richard that this can be done generically
> for all targets.  The decomposition takes advantage of Arm's fused
> shifted logical operations on GPRs, which make these steps really cheap
> on the integer side.
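In scalar terms, the pure-GPR tail being described here is, for V8QI,
just three folds of the top half onto the bottom half.  As a sketch
(the helper name is made up; each statement corresponds to one fused
shift+AND instruction):

  #include <stdint.h>

  uint8_t
  andv8qi_model (uint64_t x)  /* eight bytes packed in a 64-bit GPR */
  {
    x &= x >> 32;   /* and x0, x0, x0, lsr 32 */
    x &= x >> 16;   /* and w0, w0, w0, lsr 16 */
    x &= x >> 8;    /* and w0, w0, w0, lsr 8  */
    return (uint8_t) x;
  }
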
Wouldn't it still be a win without that though?  Having the fused
operation gives us a ~4x improvement in latency on a typical core
(EXT + vector AND -> fused scalar AND).  But the unfused operation
would still be a ~2x improvement.

> Richard brought up AArch32 as another example.  I don't agree there
> either, because AArch32's register file overlap means you don't need
> shifts.  The first reduction step can just be done by accessing the
> registers as even/odd pairs in a smaller mode.
>
> To get this codegen generically, the target must support modes that are
> tieable between the vector size and the smaller vector sizes, i.e. a
> BIT_FIELD_REF from the vector mode to a smaller mode must be cheap or
> free.
>
> The decision of when to do the transfer from SIMD to GPR is highly
> target-specific, and so is the how.
>
>> I would expect that we can improve on this existing machinery
>> instead of inventing a new one.
>>
>
> But we're not.  We're just implementing the optab on top of the
> machinery that already exists.

I don't think this kind of expansion is what optabs are there for though.
I suppose it was in the pre-gimple days, when all optimisation was done
on RTL.  (Although of course that also predates vectorisation.)  But now,
I think the purpose of optabs is to advertise what the target can do via
some form of direct h/w support.

The original patch (using ANDV) was an ideal use of optabs.  But I think
the request to use a tweaked version of the current open-coded reduction
scheme, although a good suggestion, also changes the way that we should
go about it.

> I think we're just over-engineering a new contributor's patch.  Even if
> we think there is some merit in doing it generically, I don't see why
> we shouldn't take this patch today.

Sorry if it comes across that way.

Thanks,
Richard