Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandif...@arm.com>
>> Sent: Friday, October 8, 2021 5:24 PM
>> To: Tamar Christina <tamar.christ...@arm.com>
>> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>;
>> Richard Earnshaw <richard.earns...@arm.com>;
>> Marcus Shawcroft <marcus.shawcr...@arm.com>;
>> Kyrylo Tkachov <kyrylo.tkac...@arm.com>
>> Subject: Re: [PATCH]AArch64 Make use of FADDP in simple reductions.
>>
>> Tamar Christina <tamar.christ...@arm.com> writes:
>> > Hi All,
>> >
>> > This is a respin of an older patch which never got reviewed upstream
>> > by a maintainer.  It has been updated to fit the current GCC codegen.
>> >
>> > This patch adds a pattern to support the (F)ADDP (scalar) instruction.
>> >
>> > Before the patch, the C code
>> >
>> > typedef float v4sf __attribute__((vector_size (16)));
>> >
>> > float
>> > foo1 (v4sf x)
>> > {
>> >   return x[0] + x[1];
>> > }
>> >
>> > generated:
>> >
>> > foo1:
>> >         dup     s1, v0.s[1]
>> >         fadd    s0, s1, s0
>> >         ret
>> >
>> > After the patch:
>> >
>> > foo1:
>> >         faddp   s0, v0.2s
>> >         ret
>> >
>> > The double case is now handled by SLP, but the remaining cases still
>> > need help from combine.  I have kept the integer and floating-point
>> > patterns separate because the integer one only supports V2DI, and
>> > sharing it with the float one would have required defining a few new
>> > iterators for just a single use.
>> >
>> > I provide support for the case where both elements are subregs as a
>> > separate pattern, as there is no way to tell reload through
>> > constraints alone that the two registers must be equal.
>> >
>> > Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
>> >
>> > Ok for master?
>> >
>> > Thanks,
>> > Tamar
>> >
>> > gcc/ChangeLog:
>> >
>> >     * config/aarch64/aarch64-simd.md (*aarch64_faddp_scalar<mode>,
>> >     *aarch64_addp_scalarv2di, *aarch64_faddp_scalar2<mode>,
>> >     *aarch64_addp_scalar2v2di): New.
>> >
>> > gcc/testsuite/ChangeLog:
>> >
>> >     * gcc.target/aarch64/simd/scalar_faddp.c: New test.
>> >     * gcc.target/aarch64/simd/scalar_faddp2.c: New test.
>> >     * gcc.target/aarch64/simd/scalar_addp.c: New test.
>> >
>> > Co-authored-by: Tamar Christina <tamar.christ...@arm.com>
>> >
>> > --- inline copy of patch --
>> > diff --git a/gcc/config/aarch64/aarch64-simd.md
>> > b/gcc/config/aarch64/aarch64-simd.md
>> > index 6814dae079c9ff40aaa2bb625432bf9eb8906b73..b49f8b79b11cbb1888c503d9a9384424f44bde05 100644
>> > --- a/gcc/config/aarch64/aarch64-simd.md
>> > +++ b/gcc/config/aarch64/aarch64-simd.md
>> > @@ -3414,6 +3414,70 @@ (define_insn "aarch64_faddp<mode>"
>> >    [(set_attr "type" "neon_fp_reduc_add_<stype><q>")]
>> >  )
>> >
>> > +;; For the case where both operands are a subreg we need to use a
>> > +;; match_dup, since reload cannot enforce that the registers are
>> > +;; the same with a constraint in this case.
>> > +(define_insn "*aarch64_faddp_scalar2<mode>"
>> > +  [(set (match_operand:<VEL> 0 "register_operand" "=w")
>> > +    (plus:<VEL>
>> > +      (vec_select:<VEL>
>> > +        (match_operator:<VEL> 1 "subreg_lowpart_operator"
>> > +          [(match_operand:VHSDF 2 "register_operand" "w")])
>> > +        (parallel [(match_operand 3 "const_int_operand" "n")]))
>> > +      (match_dup:<VEL> 2)))]
>> > +  "TARGET_SIMD
>> > +   && ENDIAN_LANE_N (<nunits>, INTVAL (operands[3])) == 1"
>> > +  "faddp\t%<Vetype>0, %2.2<Vetype>"
>> > +  [(set_attr "type" "neon_fp_reduc_add_<stype><q>")]
>> > +)
>>
>> The difficulty with using match_dup here is that the first vec_select
>> operand ought to fold to a REG after reload, rather than stay as a
>> subreg.  From that POV we're forcing the generation of non-canonical
>> rtl.
>>
>> Also…
>>
>> > +(define_insn "*aarch64_faddp_scalar<mode>"
>> > +  [(set (match_operand:<VEL> 0 "register_operand" "=w")
>> > +    (plus:<VEL>
>> > +      (vec_select:<VEL>
>> > +        (match_operand:VHSDF 1 "register_operand" "w")
>> > +        (parallel [(match_operand 2 "const_int_operand" "n")]))
>> > +      (match_operand:<VEL> 3 "register_operand" "1")))]
>> > +  "TARGET_SIMD
>> > +   && ENDIAN_LANE_N (<nunits>, INTVAL (operands[2])) == 1
>> > +   && SUBREG_P (operands[3]) && !SUBREG_P (operands[1])
>> > +   && subreg_lowpart_p (operands[3])"
>> > +  "faddp\t%<Vetype>0, %1.2<Vetype>"
>> > +  [(set_attr "type" "neon_fp_reduc_add_<stype><q>")]
>> > +)

Also: It'd probably be better to use V2F for the iterator, since it
excludes the V4 and V8 modes.  I think we can use vect_par_cnst_hi_half
for operand 2.
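
Something like this might work for the non-subreg pattern (an untested
sketch; V2F restricts the iterator to V2SF/V2DF, and the existing
vect_par_cnst_hi_half predicate replaces the explicit ENDIAN_LANE_N
test):

  ;; Untested sketch only.
  (define_insn "*aarch64_faddp_scalar<mode>"
    [(set (match_operand:<VEL> 0 "register_operand" "=w")
          (plus:<VEL>
            (vec_select:<VEL>
              (match_operand:V2F 1 "register_operand" "w")
              (match_operand:V2F 2 "vect_par_cnst_hi_half" ""))
            (match_operand:<VEL> 3 "register_operand" "w")))]
    "TARGET_SIMD"
    "faddp\t%<Vetype>0, %1.2<Vetype>"
    [(set_attr "type" "neon_fp_reduc_add_<stype><q>")]
  )

(Keeping operands 1 and 3 in the same register is a separate problem;
see below.)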
>> …matching constraints don't work reliably between two inputs:
>> the RA doesn't know how to combine two different inputs into one
>> input in order to make them match.
>>
>> Have you tried doing this as a define_peephole2 instead?  That fits
>> this kind of situation better (from an rtl representation point of
>> view), but peephole2s are admittedly less powerful than combine.
>>
>> If peephole2s don't work then I think we'll have to provide a pattern
>> that accepts two distinct inputs and then split the instruction if the
>> inputs aren't in the same register.  That sounds a bit ugly though, so
>> it'd be good news if the peephole thing works out.
>
> Unfortunately peepholes don't work very well here because e.g. addp can
> be assigned by the regalloc to the integer side instead of simd, in
> which case you can't use the instruction anymore.

Ah, yeah, we wouldn't be able to recover easily from that.

> The peepholes seem to only detect the simple FP cases.
>
> I tried adding something like a post-reload split:
>
> +  "&& reload_completed && REGNO (operands[1]) != REGNO (operands[3])"
> +  [(clobber (match_scratch:<VEL> 4 "=w"))
> +   (set (match_dup 4)
> +     (vec_select:<VEL>
> +       (match_dup 1)
> +       (parallel [(match_dup 2)])))
> +   (set (match_dup 0)
> +     (plus:<VEL>
> +       (match_dup 4)
> +       (match_dup 3)))]
> +  ""
>
> But this doesn't seem like it'll work, as it needs a scratch?

The clobber should go in the pattern itself, rather than in the split.
E.g.:

  [(set (match_operand:<VEL> 0 "register_operand" "=w")
        (plus:<VEL>
          (vec_select:<VEL>
            (match_operand:V2F 1 "register_operand" "w")
            (match_operand 2 "vect_par_cnst_hi_half"))
          (match_operand:<VEL> 3 "register_operand" "w")))
   (clobber (match_scratch:V2F 4 "=&w"))]

recog will add the clobber when needed.  However, it's probably simpler
to make operand 0 itself earlyclobber:

  [(set (match_operand:<VEL> 0 "register_operand" "=&w")
        (plus:<VEL>
          (vec_select:<VEL>
            (match_operand:V2F 1 "register_operand" "w")
            (match_operand 2 "vect_par_cnst_hi_half"))
          (match_operand:<VEL> 3 "register_operand" "w")))]

The output code should return "#" if the register numbers are different.
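
Putting the earlyclobber form, the "#" output and a post-reload split
together might look like this (an untested sketch; it reuses operand 0
as the temporary, which the earlyclobber makes safe, at the cost of
pessimising the case where the registers already match):

  ;; Untested sketch only.
  (define_insn_and_split "*aarch64_faddp_scalar<mode>"
    [(set (match_operand:<VEL> 0 "register_operand" "=&w")
          (plus:<VEL>
            (vec_select:<VEL>
              (match_operand:V2F 1 "register_operand" "w")
              (match_operand:V2F 2 "vect_par_cnst_hi_half" ""))
            (match_operand:<VEL> 3 "register_operand" "w")))]
    "TARGET_SIMD"
  {
    /* The single-instruction form needs operand 3 to be the low half
       of operand 1, i.e. the same register number.  */
    if (REGNO (operands[1]) == REGNO (operands[3]))
      return "faddp\t%<Vetype>0, %1.2<Vetype>";
    return "#";
  }
    "&& reload_completed && REGNO (operands[1]) != REGNO (operands[3])"
    [(set (match_dup 0)
          (vec_select:<VEL> (match_dup 1) (match_dup 2)))
     (set (match_dup 0)
          (plus:<VEL> (match_dup 0) (match_dup 3)))]
    ""
    [(set_attr "type" "neon_fp_reduc_add_<stype><q>")]
  )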
All of this complication makes me think: couldn't we match this in
gimple instead?  I.e. check for an addition of the elements in a
2-element vector and match it to IFN_REDUC_PLUS (when supported)?
A rough match.pd sketch of that idea is included below.

Thanks,
Richard
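
(A hypothetical, untested sketch of the match.pd direction mentioned
above.  It relies on the BIT_FIELD_REF operand order <value, size,
position>, so reusing @1 as both the size and the second position
selects element 1 of a two-element vector; a real pattern would also
need to verify that the extracted size matches the element size.)

  #if GIMPLE
  /* x[0] + x[1] on a two-element vector -> .REDUC_PLUS (x),
     when the target supports it.  */
  (simplify
   (plus:c (BIT_FIELD_REF @0 @1 integer_zerop)
           (BIT_FIELD_REF @0 @1 @1))
   (if (VECTOR_TYPE_P (TREE_TYPE (@0))
        && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (@0)), 2)
        && direct_internal_fn_supported_p (IFN_REDUC_PLUS,
                                           TREE_TYPE (@0),
                                           OPTIMIZE_FOR_BOTH))
    (IFN_REDUC_PLUS @0)))
  #endif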