Re: [PATCH v2 10/16]AArch64: Add NEON RTL patterns for Complex Addition, Multiply and FMA.

Richard Biener via Gcc-patches Mon, 16 Nov 2020 05:59:53 -0800

On Mon, Nov 16, 2020 at 1:48 PM Richard Sandiford via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Tamar Christina <tamar.christ...@arm.com> writes:
> >> > +(define_int_attr rot_op [(UNSPEC_FCMLS "")
> >> > +                   (UNSPEC_FCMLS180 "_conj")
> >> > +                   (UNSPEC_FCMLA "")
> >> > +                   (UNSPEC_FCMLA180 "_conj")
> >> > +                   (UNSPEC_FCMUL "")
> >> > +                   (UNSPEC_FCMUL180 "_conj")
> >> > +                   (UNSPEC_CMLS "")
> >> > +                   (UNSPEC_CMLA "")
> >> > +                   (UNSPEC_CMLA180 "_conj")
> >> > +                   (UNSPEC_CMUL "")
> >> > +                   (UNSPEC_CMUL180 "_conj")])
> >> > +
> >> > +(define_int_attr rotsplit1 [(UNSPEC_FCMLA "0")
> >> > +                      (UNSPEC_FCMLA180 "0")
> >> > +                      (UNSPEC_FCMUL "0")
> >> > +                      (UNSPEC_FCMUL180 "0")
> >> > +                      (UNSPEC_FCMLS "270")
> >> > +                      (UNSPEC_FCMLS180 "90")])
> >> > +
> >> > +(define_int_attr rotsplit2 [(UNSPEC_FCMLA "90")
> >> > +                      (UNSPEC_FCMLA180 "270")
> >> > +                      (UNSPEC_FCMUL "90")
> >> > +                      (UNSPEC_FCMUL180 "270")
> >> > +                      (UNSPEC_FCMLS "180")
> >> > +                      (UNSPEC_FCMLS180 "180")])
> >>
> >> I think it would be worth clarifying what the patterns do.  AIUI, the 
> >> effect of
> >> the MUL180 and MLA180 “rot_op” cases as written in the patch is to multiply
> >> the conjugate of the first operand by the second operand, whereas the
> >> documentation in the earlier patches implied that they should multiply the
> >> conjugate of the second operand by the first.
> >> But maybe I've got this wrong.
> >>
> >>   operands to the pattern: a + bi, c + di
> >>
> >>   conjugate of first operand * second operand [A]:
> >>
> >>     (a - bi) * (c + di) = (ac + bd) + (ad - bc)i
> >>
> >>   first operand * conjugate of second operand [B]:
> >>
> >>     (a + bi) * (c - di) = (ac + bd) + (bc - ad)i
> >>
> >>   the FCMUL180 sequence chosen by the patch:
> >>
> >>     0: a * (c + di) = ac + adi
> >>   270: b * (d - ci) = bd - bci
> >>
> >>   which gives A rather than B.
> >
> > Correct, it expects the conjucate In the second operand and corrects the 
> > argument order to account
> > for it during vectorization if it's on the first operand.
>
> What's “it” here though?  The point of my comment above is that the
> AArch64 expansion seems to conjugate the first operand rather than the
> second.  When I read the iterator part of the patch originally, I was
> wondering whether the AArch64 define_expand would swap the inputs to ensure
> that the second operand is conjugated instead, but the define_expand
> seems to preserve the original order.
>
> So it seemed like it was the other way around from what you said above:
> the optabs interface treats conjugating the first operand as the
> “native” choice and the vectoriser would need to correct the argument
> order if the scalar gimple code conjugated the second operand instead.
>
> TBH I'm not sure which is the “natural” order in gimple.  Guess that's
> one for Richi.


Not sure but most remotely related ops like
WIDEN_MULT_PLUS_EXPR have the "unmodified" operand
last.  But this is an internal function, so ...

I'd say make it consistent within the group.

And for all means please somewhere write down in complex
components what the ops are supposed to do ;)  Guess most
people know how to conjugate but rotations ... I only realized
_way_ late that x86 has vector operations for all of this
as well just because there it's all called 'plusminus' rather
than based on complex numbers and 'rotates' ...

Richard.

> Still…
>
> > i.e. conjucate first returns
> >
> >         mov     v0.16b, v3.16b
> >         ldr     q1, [x0, x3]
> >         ldr     q2, [x1, x3]
> >         fcmla   v0.4s, v1.4s, v2.4s, #0
> >         fcmla   v0.4s, v1.4s, v2.4s, #270
> >         str     q0, [x2, x3]
> >
> > and conjucate second returns
> >
> >         mov     v0.16b, v3.16b
> >         ldr     q1, [x1, x3]
> >         ldr     q2, [x0, x3]
> >         fcmla   v0.4s, v1.4s, v2.4s, #0
> >         fcmla   v0.4s, v1.4s, v2.4s, #270
> >         str     q0, [x2, x3]
>
> …I agree this looks good, so it seems like everything works out in the end.
> I guess it's just a question of clarifying the interface.
>
> > I think at some point the documentation in got out of sync with the 
> > implementation because it turned out to be easier to conj first
> > and flip conj second.  I'll update the optab docs.
>
> Thanks.
>
> >> > +;; The complex mla/mls operations always need to expand to two
> >> instructions.
> >> > +;; The first operation does half the computation and the second does
> >> > +the ;; remainder.  Because of this, expand early.
> >> > +(define_expand "cml<fcmac1><rot_op><mode>4"
> >> > +  [(set (match_operand:VHSDF 0 "register_operand")
> >> > +  (plus:VHSDF (match_operand:VHSDF 1 "register_operand")
> >> > +              (unspec:VHSDF [(match_operand:VHSDF 2
> >> "register_operand")
> >> > +                             (match_operand:VHSDF 3
> >> "register_operand")]
> >> > +                             FCMLA_OP)))]
> >> > +  "TARGET_COMPLEX"
> >> > +{
> >> > +  emit_insn (gen_aarch64_fcmla<rotsplit1><mode> (operands[0],
> >> operands[1],
> >> > +                                           operands[2], operands[3]));
> >> > +  emit_insn (gen_aarch64_fcmla<rotsplit2><mode> (operands[0],
> >> operands[0],
> >> > +                                           operands[2], operands[3]));
> >> > +  DONE;
> >> > +})
> >>
> >> The interface doesn't guarantee that operands[0] is distinct from the 
> >> inputs,
> >> so I think the result of the first instruction should be a temporary 
> >> register.
> >
> > Hmm sorry I'm not sure I follow.. operands[0] doesn't have to be distinct 
> > from the inputs,
> > and in most cases it wouldn't be.
> >
> > In c = c + (a * b)
> >
> > operands[0] should be the same as operands[1].
> >
> > I'm just trying to figure out why It needs to be different :) since it's an 
> > accumulation instruction..
>
> Consider:
>
>   c = c + (a * c)
>
> With the expansion above, the first instruction would clobber the
> operands[3] input to the second instruction.
>
> Thanks,
> Richard

Re: [PATCH v2 10/16]AArch64: Add NEON RTL patterns for Complex Addition, Multiply and FMA.

Reply via email to