On Thu, Nov 15, 2018 at 1:42 PM Richard Biener <richard.guent...@gmail.com> wrote:
>
> On Wed, Nov 14, 2018 at 4:47 PM Tamar Christina <tamar.christ...@arm.com> wrote:
> >
> > Hi Richard,
> >
> > > > Ok for trunk?
> > >
> > > +;; The complex mla operations always need to expand to two instructions.
> > > +;; The first operation does half the computation and the second does the
> > > +;; remainder.  Because of this, expand early.
> > > +(define_expand "fcmla<rot><mode>4"
> > > +  [(set (match_operand:VF 0 "register_operand")
> > > +        (plus:VF (match_operand:VF 1 "register_operand")
> > > +                 (unspec:VF [(match_operand:VF 2 "register_operand")
> > > +                             (match_operand:VF 3 "register_operand")]
> > > +                            VCMLA)))]
> > > +  "TARGET_COMPLEX"
> > > +{
> > > +  emit_insn (gen_neon_vcmla<rotsplit1><mode> (operands[0], operands[1],
> > > +                                              operands[2], operands[3]));
> > > +  emit_insn (gen_neon_vcmla<rotsplit2><mode> (operands[0], operands[0],
> > > +                                              operands[2], operands[3]));
> > > +  DONE;
> > > +})
> > >
> > > What are the two halves?  Why hide this from the vectorizer if you go down
> > > to all the detail and expose the rotation to it?
> > >
> >
> > The two halves are an implementation detail of the instruction in Armv8.3-A.
> > As far as the vectorizer is concerned, all you want to do is an FMA rotating
> > one of the operands by 0 or 180 degrees.
> >
> > Also note that the "rotations" in these instructions aren't exactly the same
> > as what would fall under rotation of a complex number, as each instruction
> > can only do half of the final computation you want.
> >
> > In the ISA these instructions have to be used in a pair, where the rotations
> > determine the operation you want to perform.  E.g. a rotation of #0 followed
> > by #90 makes it a multiply and accumulate.
> >
> > A rotation of #180 followed by #90 makes this a vector complex subtract,
> > which is intended to be used by the first call using a register cleared
> > with 0 (it becomes an "FMS" essentially if you don't clear the register).
> > Each "rotation" determines what operation is done and which parts of the
> > complex number are used.  You change the "rotations" and the grouping of
> > the instructions to get different operations.
> >
> > I did not expose this to the vectorizer as it seems very ISA specific.
> >
> > > +;; The vcadd and vcmla patterns are made UNSPEC explicitly due to the
> > > +;; fact that their usage needs to guarantee that the source vectors are
> > > +;; contiguous.  It would be wrong to describe the operation without being
> > > +;; able to describe the permute that is also required, but even if that
> > > +;; is done the permute would have been created as a LOAD_LANES which
> > > +;; means the values in the registers are in the wrong order.
> > >
> > > Hmm, it's totally non-obvious to me how this relates to loads or what a
> > > "non-contiguous" register would be?  That is, once you make this an
> > > unspec, combine will never be able to synthesize this from intrinsics
> > > code that doesn't use this form.
> > >
> > > +(define_insn "neon_vcadd<rot><mode>"
> > > +  [(set (match_operand:VF 0 "register_operand" "=w")
> > > +        (unspec:VF [(match_operand:VF 1 "register_operand" "w")
> > > +                    (match_operand:VF 2 "register_operand" "w")]
> > > +                   VCADD))]
> > >
> >
> > Yes, that's my goal: if operand1 and operand2 are loaded by instructions
> > that would have permuted the values in the register, then the instruction
> > doesn't work.
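
For concreteness, the two register layouts being argued about look roughly like
this with the arm_neon.h load intrinsics (just an illustration, nothing from the
patch; the function and variable names are made up):

#include <arm_neon.h>

/* p points at packed complex floats: re0, im0, re1, im1, ...  */
void
layouts (const float *p)
{
  /* Plain contiguous load: lanes are { re0, im0, re1, im1 }, so real and
     imaginary parts stay adjacent - the layout vcadd/vcmla operate on.  */
  float32x4_t contiguous = vld1q_f32 (p);

  /* Lane-interleaving load (the kind of access LOAD_LANES models): the parts
     get split into { re0, re1, re2, re3 } and { im0, im1, im2, im3 }.  */
  float32x4x2_t split = vld2q_f32 (p);

  (void) contiguous;
  (void) split;
}
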
> >
> > The instruction does the permute itself, so it expects the values to have
> > been loaded using a simple load and not a LOAD_LANES.  So I am intending to
> > prevent combine from recognizing the operation for that reason.
>
> But LOAD_LANES is used differently and the ISA probably doesn't really care how
> you set up the register inputs.  You of course have to put in the correct
> values, but how they get there doesn't matter.  So I don't see how combine can
> mess things up here.
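
Coming back to the "two halves" question above: as far as I can tell from the
Armv8.3-A description, per complex element the #0/#90 pair amounts to something
like the following scalar sketch (the type and function names are made up), which
also shows why the operands just need their real and imaginary parts adjacent in
the register:

typedef struct { float re, im; } cplx;

/* First half, rotation #0: uses only the real part of a.  */
static void
fcmla_rot0 (cplx *acc, cplx a, cplx b)
{
  acc->re += a.re * b.re;
  acc->im += a.re * b.im;
}

/* Second half, rotation #90: uses only the imaginary part of a.  */
static void
fcmla_rot90 (cplx *acc, cplx a, cplx b)
{
  acc->re -= a.im * b.im;
  acc->im += a.im * b.re;
}

/* Issuing both halves on the same accumulator gives acc += a * b,
   i.e. the complex multiply-accumulate.  */
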
> > For the ADD, combine can be used, but then you'd have to match the load and
> > store since you have to change these; for the rest you'll run far afoul of
> > combine's 5 instruction limit.
>
> Why do you need to change these?  You assume the vectorizer vectorizes using
> interleaving - yes, in that case all hope is lost.  I assume the
> vectorizer will end up doing SLP with the existing TWO_OPERATORS support

You might be bitten by the fact that you tuned the vectorizer to always prefer
load/store-lanes over SLP when there are permutations.  You could lift that a
bit, allowing rotation/projection permutes as they occur with complex arithmetic.

> , thus for complex subtraction
> you'll see (A and B being complex vectors)
>
>  add = A + B;
>  sub = A - B;
>  resultAcomplex_minusB = vec_merge (add, sub, 1)
>
> basically the vectorizer will perform operations twice and then blend the two
> results.  The add/sub + blend needs to be recognized by combine
> (x86 does this for the vaddsub instructions which were designed to handle
> complex subtraction and parts of the multiply).
>
> For complex multiplication you'll see the pieces your ISA supports.
>
>  mul1 = A * B
>  mul2 = A * rot(B)   (rotation will be a shuffle)
>  add = mul1 + mul2
>  sub = mul1 - mul2
>  result = blend (add, sub, ...)
>
> as usual the combiner is helped by intermediate combiner patterns
> (in this case modeling your ISA's intermediate steps probably already helps).
> The x86 ISA also has fmaddsub/fmsubadd instructions but without the
> embedded rotation, which you have to do explicitly.  For example the
> vectorizer generates for a simple complex FMA loop
>
> _Complex double x[1024];
> _Complex double y[1024];
> _Complex double z[1024];
>
> void foo ()
> {
>   for (int i = 0; i < 1024; ++i)
>     x[i] += y[i] * z[i];
> }
>
>   <bb 3> [local count: 1063004407]:
>   # ivtmp.34_6 = PHI <0(2), ivtmp.34_12(3)>
>   vect__6.5_49 = MEM[symbol: x, index: ivtmp.34_6, offset: 0B];
>   vect__13.11_42 = MEM[symbol: y, index: ivtmp.34_6, offset: 0B];
>   vect__13.12_41 = VEC_PERM_EXPR <vect__13.11_42, vect__13.11_42, { 0, 0 }>;
>   vect__13.17_36 = VEC_PERM_EXPR <vect__13.11_42, vect__13.11_42, { 1, 1 }>;
>   vect__11.8_46 = MEM[symbol: z, index: ivtmp.34_6, offset: 0B];
>   vect__11.21_32 = VEC_PERM_EXPR <vect__11.8_46, vect__11.8_46, { 1, 0 }>;
>   vect__17.13_40 = vect__13.12_41 * vect__11.8_46;
>   vect__18.22_31 = vect__11.21_32 * vect__13.17_36;
>   vect__21.23_30 = vect__17.13_40 - vect__18.22_31;
>   vect__21.24_29 = vect__18.22_31 + vect__17.13_40;
>   _28 = VEC_PERM_EXPR <vect__21.23_30, vect__21.24_29, { 0, 3 }>;
>   vect__23.25_27 = _28 + vect__6.5_49;
>   MEM[symbol: x, index: ivtmp.34_6, offset: 0B] = vect__23.25_27;
>   ivtmp.34_12 = ivtmp.34_6 + 16;
>   if (ivtmp.34_12 != 16384)
>     goto <bb 3>; [99.00%]
>
> which before combine looks like
>
>     8: r92:V2DF=[r82:DI+`y']
>    10: r93:V2DF=[r82:DI+`z']
>    12: r97:V2DF=vec_select(vec_concat(r92:V2DF,r92:V2DF),parallel)
>    13: r90:V2DF=r97:V2DF*r93:V2DF
>    14: r98:V2DF=vec_select(r93:V2DF,parallel)
>    15: r99:V2DF=vec_select(vec_concat(r92:V2DF,r92:V2DF),parallel)
>    16: r87:V2DF=r98:V2DF*r99:V2DF
>    18: r101:V2DF=r90:V2DF-r87:V2DF
>    19: r102:V2DF=r87:V2DF+r90:V2DF
>    20: r103:V2DF=vec_merge(r101:V2DF,r102:V2DF,0x1)
>    22: r105:V2DF=r103:V2DF+[r82:DI+`x']
>    23: [r82:DI+`x']=r105:V2DF
>
> I assume you can combine the multiplications with the selects
> (the selects might be something else for you - that's somewhat target
> dependent) into your half-way operations with the embedded rotates.

(A scalar rendering of that loop in terms of the two half-operations follows at
the bottom of this mail, after the quoted ChangeLog.)

> Richard.
>
> >
> > Kind Regards,
> > Tamar
> >
> > > >
> > > > Thanks,
> > > > Tamar
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > > 2018-11-11  Tamar Christina  <tamar.christ...@arm.com>
> > > >
> > > >         * config/arm/arm.c (arm_arch8_3, arm_arch8_4): New.
> > > >         * config/arm/arm.h (TARGET_COMPLEX, arm_arch8_3, arm_arch8_4): New.
> > > >         (arm_option_reconfigure_globals): Use them.
> > > >         * config/arm/iterators.md (VDF, VQ_HSF): New.
> > > >         (VCADD, VCMLA): New.
> > > >         (VF_constraint, rot, rotsplit1, rotsplit2): Add V4HF and V8HF.
> > > >         * config/arm/neon.md (neon_vcadd<rot><mode>, fcadd<rot><mode>3,
> > > >         neon_vcmla<rot><mode>, fcmla<rot><mode>4): New.
> > > >         * config/arm/unspecs.md (UNSPEC_VCADD90, UNSPEC_VCADD270,
> > > >         UNSPEC_VCMLA, UNSPEC_VCMLA90, UNSPEC_VCMLA180, UNSPEC_VCMLA270): New.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > > 2018-11-11  Tamar Christina  <tamar.christ...@arm.com>
> > > >
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_1.c: Add Arm support.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_2.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_3.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_4.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_5.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_6.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_1.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_2.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_3.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_4.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_5.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_6.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_1.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_1.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_2.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_3.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_2.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_1.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_2.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_3.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_3.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_1.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_2.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_3.c: Likewise.
> > > >
> > > > --
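
Here is the scalar rendering promised above: the quoted FMA loop with the two
half-operations spelled out, viewing x, y and z as plain interleaved doubles
(even index = real part, odd = imaginary part).  Purely illustrative - the
function and parameter names are made up, and it only restates what the quoted
GIMPLE computes.

void
foo_halves (double *restrict x, const double *restrict y,
            const double *restrict z)
{
  /* 1024 complex values = 2048 doubles, as in the quoted example.  */
  for (int i = 0; i < 2048; i += 2)
    {
      /* First half (#0-style): Re(y) times both parts of z; the products
         are the vect__17.13_40 multiply in the quoted dump.  */
      x[i]     += y[i] * z[i];
      x[i + 1] += y[i] * z[i + 1];

      /* Second half (#90-style): Im(y) times the swapped z, with the sign
         flip on the real lane; the products are the vect__18.22_31 multiply,
         and the sign flip is what the sub/add + blend provide.  */
      x[i]     -= y[i + 1] * z[i + 1];
      x[i + 1] += y[i + 1] * z[i];
    }
}

Matching each of those two groups (multiply plus select, then the accumulate)
is roughly what the intermediate combine patterns would need to do.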