> -----Original Message----- > From: Richard Sandiford <richard.sandif...@arm.com> > Sent: Monday, November 16, 2020 12:47 PM > To: Tamar Christina <tamar.christ...@arm.com> > Cc: gcc-patches@gcc.gnu.org; Richard Earnshaw > <richard.earns...@arm.com>; nd <n...@arm.com>; Marcus Shawcroft > <marcus.shawcr...@arm.com> > Subject: Re: [PATCH v2 10/16]AArch64: Add NEON RTL patterns for Complex > Addition, Multiply and FMA. > > Tamar Christina <tamar.christ...@arm.com> writes: > >> > +(define_int_attr rot_op [(UNSPEC_FCMLS "") > >> > + (UNSPEC_FCMLS180 "_conj") > >> > + (UNSPEC_FCMLA "") > >> > + (UNSPEC_FCMLA180 "_conj") > >> > + (UNSPEC_FCMUL "") > >> > + (UNSPEC_FCMUL180 "_conj") > >> > + (UNSPEC_CMLS "") > >> > + (UNSPEC_CMLA "") > >> > + (UNSPEC_CMLA180 "_conj") > >> > + (UNSPEC_CMUL "") > >> > + (UNSPEC_CMUL180 "_conj")]) > >> > + > >> > +(define_int_attr rotsplit1 [(UNSPEC_FCMLA "0") > >> > + (UNSPEC_FCMLA180 "0") > >> > + (UNSPEC_FCMUL "0") > >> > + (UNSPEC_FCMUL180 "0") > >> > + (UNSPEC_FCMLS "270") > >> > + (UNSPEC_FCMLS180 "90")]) > >> > + > >> > +(define_int_attr rotsplit2 [(UNSPEC_FCMLA "90") > >> > + (UNSPEC_FCMLA180 "270") > >> > + (UNSPEC_FCMUL "90") > >> > + (UNSPEC_FCMUL180 "270") > >> > + (UNSPEC_FCMLS "180") > >> > + (UNSPEC_FCMLS180 "180")]) > >> > >> I think it would be worth clarifying what the patterns do. AIUI, the > >> effect of the MUL180 and MLA180 “rot_op” cases as written in the > >> patch is to multiply the conjugate of the first operand by the second > >> operand, whereas the documentation in the earlier patches implied > >> that they should multiply the conjugate of the second operand by the first. > >> But maybe I've got this wrong. > >> > >> operands to the pattern: a + bi, c + di > >> > >> conjugate of first operand * second operand [A]: > >> > >> (a - bi) * (c + di) = (ac + bd) + (ad - bc)i > >> > >> first operand * conjugate of second operand [B]: > >> > >> (a + bi) * (c - di) = (ac + bd) + (bc - ad)i > >> > >> the FCMUL180 sequence chosen by the patch: > >> > >> 0: a * (c + di) = ac + adi > >> 270: b * (d - ci) = bd - bci > >> > >> which gives A rather than B. > > > > Correct, it expects the conjucate In the second operand and corrects > > the argument order to account for it during vectorization if it's on the > > first > operand. > > What's “it” here though? The point of my comment above is that the > AArch64 expansion seems to conjugate the first operand rather than the > second. When I read the iterator part of the patch originally, I was > wondering whether the AArch64 define_expand would swap the inputs to > ensure that the second operand is conjugated instead, but the > define_expand seems to preserve the original order. > > So it seemed like it was the other way around from what you said above: > the optabs interface treats conjugating the first operand as the “native” > choice and the vectoriser would need to correct the argument order if the > scalar gimple code conjugated the second operand instead. > > TBH I'm not sure which is the “natural” order in gimple. Guess that's one for > Richi. > > Still… > > > i.e. conjucate first returns > > > > mov v0.16b, v3.16b > > ldr q1, [x0, x3] > > ldr q2, [x1, x3] > > fcmla v0.4s, v1.4s, v2.4s, #0 > > fcmla v0.4s, v1.4s, v2.4s, #270 > > str q0, [x2, x3] > > > > and conjucate second returns > > > > mov v0.16b, v3.16b > > ldr q1, [x1, x3] > > ldr q2, [x0, x3] > > fcmla v0.4s, v1.4s, v2.4s, #0 > > fcmla v0.4s, v1.4s, v2.4s, #270 > > str q0, [x2, x3] > > …I agree this looks good, so it seems like everything works out in the end. > I guess it's just a question of clarifying the interface. > > > I think at some point the documentation in got out of sync with the > > implementation because it turned out to be easier to conj first and flip > > conj > second. I'll update the optab docs. > > Thanks. > > >> > +;; The complex mla/mls operations always need to expand to two > >> instructions. > >> > +;; The first operation does half the computation and the second > >> > +does the ;; remainder. Because of this, expand early. > >> > +(define_expand "cml<fcmac1><rot_op><mode>4" > >> > + [(set (match_operand:VHSDF 0 "register_operand") > >> > + (plus:VHSDF (match_operand:VHSDF 1 "register_operand") > >> > + (unspec:VHSDF [(match_operand:VHSDF 2 > >> "register_operand") > >> > + (match_operand:VHSDF 3 > >> "register_operand")] > >> > + FCMLA_OP)))] > >> > + "TARGET_COMPLEX" > >> > +{ > >> > + emit_insn (gen_aarch64_fcmla<rotsplit1><mode> (operands[0], > >> operands[1], > >> > + operands[2], > >> > operands[3])); > >> > + emit_insn (gen_aarch64_fcmla<rotsplit2><mode> (operands[0], > >> operands[0], > >> > + operands[2], > >> > operands[3])); > >> > + DONE; > >> > +}) > >> > >> The interface doesn't guarantee that operands[0] is distinct from the > >> inputs, so I think the result of the first instruction should be a > >> temporary > register. > > > > Hmm sorry I'm not sure I follow.. operands[0] doesn't have to be > > distinct from the inputs, and in most cases it wouldn't be. > > > > In c = c + (a * b) > > > > operands[0] should be the same as operands[1]. > > > > I'm just trying to figure out why It needs to be different :) since it's an > accumulation instruction.. > > Consider: > > c = c + (a * c) > > With the expansion above, the first instruction would clobber the operands[3] > input to the second instruction.
AHHH! I had never considered that.. Thanks! I'll update all the target patches. Regards, Tamar > > Thanks, > Richard