Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandif...@arm.com>
>> Sent: Friday, October 8, 2021 5:24 PM
>> To: Tamar Christina <tamar.christ...@arm.com>
>> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>;
>> Richard Earnshaw <richard.earns...@arm.com>;
>> Marcus Shawcroft <marcus.shawcr...@arm.com>;
>> Kyrylo Tkachov <kyrylo.tkac...@arm.com>
>> Subject: Re: [PATCH]AArch64 Make use of FADDP in simple reductions.
>>
>> Tamar Christina <tamar.christ...@arm.com> writes:
>> > Hi All,
>> >
>> > This is a respin of an older patch which never got reviewed upstream
>> > by a maintainer.  It has been updated to fit the current GCC codegen.
>> >
>> > This patch adds a pattern to support the (F)ADDP (scalar) instruction.
>> >
>> > Before the patch, the C code
>> >
>> > typedef float v4sf __attribute__((vector_size (16)));
>> >
>> > float
>> > foo1 (v4sf x)
>> > {
>> >   return x[0] + x[1];
>> > }
>> >
>> > generated:
>> >
>> > foo1:
>> >         dup     s1, v0.s[1]
>> >         fadd    s0, s1, s0
>> >         ret
>> >
>> > After the patch:
>> >
>> > foo1:
>> >         faddp   s0, v0.2s
>> >         ret
>> >
>> > The double case is now handled by SLP, but the remaining cases still
>> > need help from combine.  I have kept the integer and floating-point
>> > patterns separate because the integer one only supports V2DI, and
>> > sharing it with the float one would have required defining a few new
>> > iterators for just a single use.
>> >
>> > I provide support for the case where both elements are subregs as a
>> > separate pattern, as there is no way to tell reload through
>> > constraints alone that the two registers must be equal.
>> >
>> > Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
>> >
>> > Ok for master?
>> >
>> > Thanks,
>> > Tamar
>> >
>> > gcc/ChangeLog:
>> >
>> >     * config/aarch64/aarch64-simd.md (*aarch64_faddp_scalar<mode>,
>> >     *aarch64_addp_scalarv2di, *aarch64_faddp_scalar2<mode>,
>> >     *aarch64_addp_scalar2v2di): New.
>> >
>> > gcc/testsuite/ChangeLog:
>> >
>> >     * gcc.target/aarch64/simd/scalar_faddp.c: New test.
>> >     * gcc.target/aarch64/simd/scalar_faddp2.c: New test.
>> >     * gcc.target/aarch64/simd/scalar_addp.c: New test.
>> >
>> > Co-authored-by: Tamar Christina <tamar.christ...@arm.com>
>> >
>> > --- inline copy of patch --
>> > diff --git a/gcc/config/aarch64/aarch64-simd.md
>> > b/gcc/config/aarch64/aarch64-simd.md
>> > index 6814dae079c9ff40aaa2bb625432bf9eb8906b73..b49f8b79b11cbb1888c503d9a9384424f44bde05 100644
>> > --- a/gcc/config/aarch64/aarch64-simd.md
>> > +++ b/gcc/config/aarch64/aarch64-simd.md
>> > @@ -3414,6 +3414,70 @@ (define_insn "aarch64_faddp<mode>"
>> >    [(set_attr "type" "neon_fp_reduc_add_<stype><q>")]
>> >  )
>> >
>> > +;; For the case where both operands are a subreg we need to use a
>> > +;; match_dup, since reload cannot enforce that the registers are
>> > +;; the same with a constraint in this case.
>> > +(define_insn "*aarch64_faddp_scalar2<mode>"
>> > +  [(set (match_operand:<VEL> 0 "register_operand" "=w")
>> > +    (plus:<VEL>
>> > +      (vec_select:<VEL>
>> > +        (match_operator:<VEL> 1 "subreg_lowpart_operator"
>> > +          [(match_operand:VHSDF 2 "register_operand" "w")])
>> > +        (parallel [(match_operand 3 "const_int_operand" "n")]))
>> > +      (match_dup:<VEL> 2)))]
>> > +  "TARGET_SIMD
>> > +   && ENDIAN_LANE_N (<nunits>, INTVAL (operands[3])) == 1"
>> > +  "faddp\t%<Vetype>0, %2.2<Vetype>"
>> > +  [(set_attr "type" "neon_fp_reduc_add_<stype><q>")]
>> > +)
>>
>> The difficulty with using match_dup here is that the first vec_select
>> operand ought to fold to a REG after reload, rather than stay as a
>> subreg.  From that POV we're forcing the generation of non-canonical
>> rtl.
>>
>> Also…
>>
>> > +(define_insn "*aarch64_faddp_scalar<mode>"
>> > +  [(set (match_operand:<VEL> 0 "register_operand" "=w")
>> > +    (plus:<VEL>
>> > +      (vec_select:<VEL>
>> > +        (match_operand:VHSDF 1 "register_operand" "w")
>> > +        (parallel [(match_operand 2 "const_int_operand" "n")]))
>> > +      (match_operand:<VEL> 3 "register_operand" "1")))]
>> > +  "TARGET_SIMD
>> > +   && ENDIAN_LANE_N (<nunits>, INTVAL (operands[2])) == 1
>> > +   && SUBREG_P (operands[3]) && !SUBREG_P (operands[1])
>> > +   && subreg_lowpart_p (operands[3])"
>> > +  "faddp\t%<Vetype>0, %1.2<Vetype>"
>> > +  [(set_attr "type" "neon_fp_reduc_add_<stype><q>")]
>> > +)

Also: It'd probably be better to use V2F for the iterator, since it
excludes the V4 and V8 modes.  I think we can use vect_par_cnst_hi_half
for operand 2.
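
Something like this might work for the non-subreg pattern (an untested
sketch; V2F restricts the iterator to V2SF/V2DF, and the existing
vect_par_cnst_hi_half predicate replaces the explicit ENDIAN_LANE_N
test):

  ;; Untested sketch only.
  (define_insn "*aarch64_faddp_scalar<mode>"
    [(set (match_operand:<VEL> 0 "register_operand" "=w")
          (plus:<VEL>
            (vec_select:<VEL>
              (match_operand:V2F 1 "register_operand" "w")
              (match_operand:V2F 2 "vect_par_cnst_hi_half" ""))
            (match_operand:<VEL> 3 "register_operand" "w")))]
    "TARGET_SIMD"
    "faddp\t%<Vetype>0, %1.2<Vetype>"
    [(set_attr "type" "neon_fp_reduc_add_<stype><q>")]
  )

(Keeping operands 1 and 3 in the same register is a separate problem;
see below.)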
>> …matching constraints don't work reliably between two inputs:
>> the RA doesn't know how to combine two different inputs into one
>> input in order to make them match.
>>
>> Have you tried doing this as a define_peephole2 instead?  That fits
>> this kind of situation better (from an rtl representation point of
>> view), but peephole2s are admittedly less powerful than combine.
>>
>> If peephole2s don't work then I think we'll have to provide a pattern
>> that accepts two distinct inputs and then split the instruction if the
>> inputs aren't in the same register.  That sounds a bit ugly though, so
>> it'd be good news if the peephole thing works out.
>
> Unfortunately peepholes don't work very well here because e.g. addp can
> be assigned by the regalloc to the integer side instead of simd, in
> which case you can't use the instruction anymore.

Ah, yeah, we wouldn't be able to recover easily from that.

> The peepholes seem to only detect the simple FP cases.
>
> I tried adding something like a post-reload split:
>
> +  "&& reload_completed && REGNO (operands[1]) != REGNO (operands[3])"
> +  [(clobber (match_scratch:<VEL> 4 "=w"))
> +   (set (match_dup 4)
> +     (vec_select:<VEL>
> +       (match_dup 1)
> +       (parallel [(match_dup 2)])))
> +   (set (match_dup 0)
> +     (plus:<VEL>
> +       (match_dup 4)
> +       (match_dup 3)))]
> +  ""
>
> But this doesn't seem like it'll work, as it needs a scratch?

The clobber should go in the pattern itself, rather than in the split.
E.g.:

  [(set (match_operand:<VEL> 0 "register_operand" "=w")
        (plus:<VEL>
          (vec_select:<VEL>
            (match_operand:V2F 1 "register_operand" "w")
            (match_operand 2 "vect_par_cnst_hi_half"))
          (match_operand:<VEL> 3 "register_operand" "w")))
   (clobber (match_scratch:V2F 4 "=&w"))]

recog will add the clobber when needed.  However, it's probably simpler
to make operand 0 itself earlyclobber:

  [(set (match_operand:<VEL> 0 "register_operand" "=&w")
        (plus:<VEL>
          (vec_select:<VEL>
            (match_operand:V2F 1 "register_operand" "w")
            (match_operand 2 "vect_par_cnst_hi_half"))
          (match_operand:<VEL> 3 "register_operand" "w")))]

The output code should return "#" if the register numbers are different.
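
Putting the earlyclobber form, the "#" output and a post-reload split
together might look like this (an untested sketch; it reuses operand 0
as the temporary, which the earlyclobber makes safe, at the cost of
pessimising the case where the registers already match):

  ;; Untested sketch only.
  (define_insn_and_split "*aarch64_faddp_scalar<mode>"
    [(set (match_operand:<VEL> 0 "register_operand" "=&w")
          (plus:<VEL>
            (vec_select:<VEL>
              (match_operand:V2F 1 "register_operand" "w")
              (match_operand:V2F 2 "vect_par_cnst_hi_half" ""))
            (match_operand:<VEL> 3 "register_operand" "w")))]
    "TARGET_SIMD"
  {
    /* The single-instruction form needs operand 3 to be the low half
       of operand 1, i.e. the same register number.  */
    if (REGNO (operands[1]) == REGNO (operands[3]))
      return "faddp\t%<Vetype>0, %1.2<Vetype>";
    return "#";
  }
    "&& reload_completed && REGNO (operands[1]) != REGNO (operands[3])"
    [(set (match_dup 0)
          (vec_select:<VEL> (match_dup 1) (match_dup 2)))
     (set (match_dup 0)
          (plus:<VEL> (match_dup 0) (match_dup 3)))]
    ""
    [(set_attr "type" "neon_fp_reduc_add_<stype><q>")]
  )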
All of this complication makes me think: couldn't we match this in
gimple instead?  I.e. check for an addition of the elements in a
2-element vector and match it to IFN_REDUC_PLUS (when supported)?
A rough match.pd sketch of that idea is included below.

Thanks,
Richard
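
(A hypothetical, untested sketch of the match.pd direction mentioned
above.  It relies on the BIT_FIELD_REF operand order <value, size,
position>, so reusing @1 as both the size and the second position
selects element 1 of a two-element vector; a real pattern would also
need to verify that the extracted size matches the element size.)

  #if GIMPLE
  /* x[0] + x[1] on a two-element vector -> .REDUC_PLUS (x),
     when the target supports it.  */
  (simplify
   (plus:c (BIT_FIELD_REF @0 @1 integer_zerop)
           (BIT_FIELD_REF @0 @1 @1))
   (if (VECTOR_TYPE_P (TREE_TYPE (@0))
        && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (@0)), 2)
        && direct_internal_fn_supported_p (IFN_REDUC_PLUS,
                                           TREE_TYPE (@0),
                                           OPTIMIZE_FOR_BOTH))
    (IFN_REDUC_PLUS @0)))
  #endif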