RE: [PATCH v2 10/16]AArch64: Add NEON RTL patterns for Complex Addition, Multiply and FMA.

Tamar Christina via Gcc-patches Mon, 16 Nov 2020 05:10:51 -0800


> -----Original Message-----
> From: Richard Sandiford <[email protected]>
> Sent: Monday, November 16, 2020 12:47 PM
> To: Tamar Christina <[email protected]>
> Cc: [email protected]; Richard Earnshaw
> <[email protected]>; nd <[email protected]>; Marcus Shawcroft
> <[email protected]>
> Subject: Re: [PATCH v2 10/16]AArch64: Add NEON RTL patterns for Complex
> Addition, Multiply and FMA.
> 
> Tamar Christina <[email protected]> writes:
> >> > +(define_int_attr rot_op [(UNSPEC_FCMLS "")
> >> > +                         (UNSPEC_FCMLS180 "_conj")
> >> > +                         (UNSPEC_FCMLA "")
> >> > +                         (UNSPEC_FCMLA180 "_conj")
> >> > +                         (UNSPEC_FCMUL "")
> >> > +                         (UNSPEC_FCMUL180 "_conj")
> >> > +                         (UNSPEC_CMLS "")
> >> > +                         (UNSPEC_CMLA "")
> >> > +                         (UNSPEC_CMLA180 "_conj")
> >> > +                         (UNSPEC_CMUL "")
> >> > +                         (UNSPEC_CMUL180 "_conj")])
> >> > +
> >> > +(define_int_attr rotsplit1 [(UNSPEC_FCMLA "0")
> >> > +                            (UNSPEC_FCMLA180 "0")
> >> > +                            (UNSPEC_FCMUL "0")
> >> > +                            (UNSPEC_FCMUL180 "0")
> >> > +                            (UNSPEC_FCMLS "270")
> >> > +                            (UNSPEC_FCMLS180 "90")])
> >> > +
> >> > +(define_int_attr rotsplit2 [(UNSPEC_FCMLA "90")
> >> > +                            (UNSPEC_FCMLA180 "270")
> >> > +                            (UNSPEC_FCMUL "90")
> >> > +                            (UNSPEC_FCMUL180 "270")
> >> > +                            (UNSPEC_FCMLS "180")
> >> > +                            (UNSPEC_FCMLS180 "180")])
> >>
> >> I think it would be worth clarifying what the patterns do.  AIUI, the
> >> effect of the MUL180 and MLA180 “rot_op” cases as written in the
> >> patch is to multiply the conjugate of the first operand by the second
> >> operand, whereas the documentation in the earlier patches implied
> >> that they should multiply the conjugate of the second operand by the first.
> >> But maybe I've got this wrong.
> >>
> >>   operands to the pattern: a + bi, c + di
> >>
> >>   conjugate of first operand * second operand [A]:
> >>
> >>     (a - bi) * (c + di) = (ac + bd) + (ad - bc)i
> >>
> >>   first operand * conjugate of second operand [B]:
> >>
> >>     (a + bi) * (c - di) = (ac + bd) + (bc - ad)i
> >>
> >>   the FCMUL180 sequence chosen by the patch:
> >>
> >>     0: a * (c + di) = ac + adi
> >>   270: b * (d - ci) = bd - bci
> >>
> >>   which gives A rather than B.
> >
> > Correct, it expects the conjucate In the second operand and corrects
> > the argument order to account for it during vectorization if it's on the 
> > first
> operand.
> 
> What's “it” here though?  The point of my comment above is that the
> AArch64 expansion seems to conjugate the first operand rather than the
> second.  When I read the iterator part of the patch originally, I was
> wondering whether the AArch64 define_expand would swap the inputs to
> ensure that the second operand is conjugated instead, but the
> define_expand seems to preserve the original order.
> 
> So it seemed like it was the other way around from what you said above:
> the optabs interface treats conjugating the first operand as the “native”
> choice and the vectoriser would need to correct the argument order if the
> scalar gimple code conjugated the second operand instead.
> 
> TBH I'm not sure which is the “natural” order in gimple.  Guess that's one for
> Richi.
> 
> Still…
> 
> > i.e. conjucate first returns
> >
> >         mov     v0.16b, v3.16b
> >         ldr     q1, [x0, x3]
> >         ldr     q2, [x1, x3]
> >         fcmla   v0.4s, v1.4s, v2.4s, #0
> >         fcmla   v0.4s, v1.4s, v2.4s, #270
> >         str     q0, [x2, x3]
> >
> > and conjucate second returns
> >
> >         mov     v0.16b, v3.16b
> >         ldr     q1, [x1, x3]
> >         ldr     q2, [x0, x3]
> >         fcmla   v0.4s, v1.4s, v2.4s, #0
> >         fcmla   v0.4s, v1.4s, v2.4s, #270
> >         str     q0, [x2, x3]
> 
> …I agree this looks good, so it seems like everything works out in the end.
> I guess it's just a question of clarifying the interface.
> 
> > I think at some point the documentation in got out of sync with the
> > implementation because it turned out to be easier to conj first and flip 
> > conj
> second.  I'll update the optab docs.
> 
> Thanks.
> 
> >> > +;; The complex mla/mls operations always need to expand to two
> >> instructions.
> >> > +;; The first operation does half the computation and the second
> >> > +does the ;; remainder.  Because of this, expand early.
> >> > +(define_expand "cml<fcmac1><rot_op><mode>4"
> >> > +  [(set (match_operand:VHSDF 0 "register_operand")
> >> > +        (plus:VHSDF (match_operand:VHSDF 1 "register_operand")
> >> > +                    (unspec:VHSDF [(match_operand:VHSDF 2
> >> "register_operand")
> >> > +                                   (match_operand:VHSDF 3
> >> "register_operand")]
> >> > +                                   FCMLA_OP)))]
> >> > +  "TARGET_COMPLEX"
> >> > +{
> >> > +  emit_insn (gen_aarch64_fcmla<rotsplit1><mode> (operands[0],
> >> operands[1],
> >> > +                                                 operands[2], 
> >> > operands[3]));
> >> > +  emit_insn (gen_aarch64_fcmla<rotsplit2><mode> (operands[0],
> >> operands[0],
> >> > +                                                 operands[2], 
> >> > operands[3]));
> >> > +  DONE;
> >> > +})
> >>
> >> The interface doesn't guarantee that operands[0] is distinct from the
> >> inputs, so I think the result of the first instruction should be a 
> >> temporary
> register.
> >
> > Hmm sorry I'm not sure I follow.. operands[0] doesn't have to be
> > distinct from the inputs, and in most cases it wouldn't be.
> >
> > In c = c + (a * b)
> >
> > operands[0] should be the same as operands[1].
> >
> > I'm just trying to figure out why It needs to be different :) since it's an
> accumulation instruction..
> 
> Consider:
> 
>   c = c + (a * c)
> 
> With the expansion above, the first instruction would clobber the operands[3]
> input to the second instruction.


AHHH! I had never considered that.. Thanks! I'll update all the target patches.

Regards,
Tamar

> 
> Thanks,
> Richard

RE: [PATCH v2 10/16]AArch64: Add NEON RTL patterns for Complex Addition, Multiply and FMA.

Reply via email to