On Thu, Nov 15, 2018 at 1:42 PM Richard Biener
<richard.guent...@gmail.com> wrote:
>
> On Wed, Nov 14, 2018 at 4:47 PM Tamar Christina <tamar.christ...@arm.com> 
> wrote:
> >
> > Hi Richard,
> >
> > > > Ok for trunk?
> > >
> > > +;; The complex mla operations always need to expand to two instructions.
> > > +;; The first operation does half the computation and the second does the
> > > +;; remainder.  Because of this, expand early.
> > > +(define_expand "fcmla<rot><mode>4"
> > > +  [(set (match_operand:VF 0 "register_operand")
> > > +       (plus:VF (match_operand:VF 1 "register_operand")
> > > +                (unspec:VF [(match_operand:VF 2 "register_operand")
> > > +                            (match_operand:VF 3 "register_operand")]
> > > +                            VCMLA)))]
> > > +  "TARGET_COMPLEX"
> > > +{
> > > +  emit_insn (gen_neon_vcmla<rotsplit1><mode> (operands[0], operands[1],
> > > +                                              operands[2], operands[3]));
> > > +  emit_insn (gen_neon_vcmla<rotsplit2><mode> (operands[0], operands[0],
> > > +                                              operands[2], operands[3]));
> > > +  DONE;
> > > +})
> > >
> > > What are the two halves?  Why hide this from the vectorizer if you go
> > > down to this level of detail and expose the rotation to it?
> > >
> >
> > The two halves are an implementation detail of the instruction in
> > Armv8.3-a.  As far as the vectorizer is concerned, all you want to do is
> > an FMA that rotates one of the operands by 0 or 180 degrees.
> >
> > Also note that the "rotations" in these instructions aren't exactly the
> > same as what would count as a rotation of a complex number, as each
> > instruction can only do half of the final computation you want.
> >
> > In the ISA these instructions have to be used in a pair, where the
> > rotations determine the operation you want to perform.  E.g. a rotation
> > of #0 followed by #90 makes it a multiply and accumulate.
> >
> > A rotation of #180 followed by #90 makes this a vector complex subtract,
> > which is intended to be used by the first call with a register cleared
> > to 0 (it essentially becomes an "FMS" if you don't clear the register).
> > Each "rotation" determines what operation is done and which parts of the
> > complex number are used.  You change the "rotations" and the grouping of
> > the instructions to get different operations.
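> >
> > As a concrete scalar sketch of the pairing (my naming, following my
> > reading of the Armv8.3-a pseudocode, so take the lane details as an
> > assumption), the #0/#90 pair composes into a full complex FMA:
> >
> >   typedef struct { float re, im; } cf;
> >
> >   static cf fcmla_rot0 (cf acc, cf a, cf b)
> >   {
> >     acc.re += a.re * b.re;  /* first half: real part of a times b */
> >     acc.im += a.re * b.im;
> >     return acc;
> >   }
> >
> >   static cf fcmla_rot90 (cf acc, cf a, cf b)
> >   {
> >     acc.re -= a.im * b.im;  /* second half: imag of a times rotated b */
> >     acc.im += a.im * b.re;
> >     return acc;
> >   }
> >
> >   /* fcmla_rot90 (fcmla_rot0 (acc, a, b), a, b) == acc + a*b.  */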
> >
> > I did not expose this to the vectorizer as it seems very ISA-specific.
> >
> > > +;; The vcadd and vcmla patterns are made UNSPEC explicitly due to the
> > > +;; fact that their usage needs to guarantee that the source vectors are
> > > +;; contiguous.  It would be wrong to describe the operation without
> > > +;; being able to describe the permute that is also required, but even
> > > +;; if that is done the permute would have been created as a LOAD_LANES
> > > +;; which means the values in the registers are in the wrong order.
> > >
> > > Hmm, it's totally non-obvious to me how this relates to loads or what
> > > a "non-contiguous" register would be.  That is, once you make this an
> > > unspec, combine will never be able to synthesize this from intrinsics
> > > code that doesn't use this form.
> > >
> > > +(define_insn "neon_vcadd<rot><mode>"
> > > +  [(set (match_operand:VF 0 "register_operand" "=w")
> > > +       (unspec:VF [(match_operand:VF 1 "register_operand" "w")
> > > +                   (match_operand:VF 2 "register_operand" "w")]
> > > +                   VCADD))]
> > >
> >
> > Yes, that's my goal: if operand1 and operand2 are loaded by
> > instructions that would have permuted the values in the register, then
> > the instruction doesn't work.
> >
> > The instruction does the permute itself, so it expects the values to
> > have been loaded using a simple load and not a LOAD_LANES.  So I intend
> > to prevent combine from recognizing the operation for that reason.
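> >
> > To illustrate the layout difference (illustrative lane names only):
> >
> >   float mem[4] = { re0, im0, re1, im1 };
> >   /* simple load (vld1):  q0 = { re0, im0, re1, im1 } - what vcmla expects */
> >   /* LOAD_LANES (vld2):   d0 = { re0, re1 }, d1 = { im0, im1 } - wrong order */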
>
> But LOAD_LANES is used differently and the ISA probably doesn't really
> care how you set up the register inputs.  You of course have to put in
> the correct values, but how they get there doesn't matter.  So I don't
> see how combine can mess things up here.
>
> >  For the ADD, combine can be used, but then you'd have to match the
> > load and store since you have to change these; for the rest you'll run
> > far afoul of combine's 5-instruction limit.
>
> Why do you need to change these?  You assume the vectorizer vectorizes
> using interleaving - yes, in that case all hope is lost.  I assume the
> vectorizer will end up doing SLP with the existing TWO_OPERATORS support

You might be bitten by the fact that you tuned the vectorizer to always
prefer load/store-lanes over SLP when there are permutations.  You could
lift that a bit, allowing the rotation/projection permutes that occur with
complex arithmetic.

> , thus for complex subtraction
> you'll see (A and B being complex vectors)
>
>    add = A + B;
>    sub = A - B;
>    resultAcomplex_minusB = vec_merge (add, sub, 1)
>
> basically the vectorizer will perform operations twice and then blend the two
> results.  The add/sub + blend needs to be recognized by combine
> (x86 does this for the vaddsub instructions which were designed to handle
> complex subtraction and parts of the multiply).
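>
> (For reference, x86 addsubpd on two-element double vectors computes
> roughly
>
>    dst[0] = a[0] - b[0];   /* even lane: subtract */
>    dst[1] = a[1] + b[1];   /* odd lane: add */
>
> i.e. an add/sub + blend pair collapsed into a single instruction.)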
>
> For complex multiplication you'll see the pieces your ISA supports.
>
>   mul1 = A * B
>   mul2 = A * rot(B)  (rotation will be a shuffle)
>   add = mul1 + mul2
>   sub = mul1 - mul2
>   result = blend (add, sub, ...)
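>
> Spelled out on a single element pair, with A = { ar, ai } and
> B = { br, bi } (the dup/rot shuffles are the ones visible in the dump
> below; the lane names are mine):
>
>   mul1 = { ar, ar } * { br, bi } = { ar*br, ar*bi }
>   mul2 = { bi, br } * { ai, ai } = { ai*bi, ai*br }
>   sub  = mul1 - mul2             = { ar*br - ai*bi, ... }
>   add  = mul2 + mul1             = { ..., ai*br + ar*bi }
>   blend (sub, add, { 0, 3 })     = { ar*br - ai*bi, ar*bi + ai*br } = A*B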
>
> as usual the combiner is helped by intermediate combiner patterns
> (in this case modeling your ISA's intermediate steps probably already
> helps).  The x86 ISA also has fmaddsub/fmsubadd instructions, but without
> the embedded rotation, which you have to do explicitly.  For example, the
> vectorizer generates the following for a simple complex FMA loop:
>
> _Complex double x[1024];
> _Complex double y[1024];
> _Complex double z[1024];
>
> void foo ()
> {
>   for (int i = 0; i < 1024; ++i)
>     x[i] += y[i] * z[i];
> }
>
>   <bb 3> [local count: 1063004407]:
>   # ivtmp.34_6 = PHI <0(2), ivtmp.34_12(3)>
>   vect__6.5_49 = MEM[symbol: x, index: ivtmp.34_6, offset: 0B];
>   vect__13.11_42 = MEM[symbol: y, index: ivtmp.34_6, offset: 0B];
>   vect__13.12_41 = VEC_PERM_EXPR <vect__13.11_42, vect__13.11_42, { 0, 0 }>;
>   vect__13.17_36 = VEC_PERM_EXPR <vect__13.11_42, vect__13.11_42, { 1, 1 }>;
>   vect__11.8_46 = MEM[symbol: z, index: ivtmp.34_6, offset: 0B];
>   vect__11.21_32 = VEC_PERM_EXPR <vect__11.8_46, vect__11.8_46, { 1, 0 }>;
>   vect__17.13_40 = vect__13.12_41 * vect__11.8_46;
>   vect__18.22_31 = vect__11.21_32 * vect__13.17_36;
>   vect__21.23_30 = vect__17.13_40 - vect__18.22_31;
>   vect__21.24_29 = vect__18.22_31 + vect__17.13_40;
>   _28 = VEC_PERM_EXPR <vect__21.23_30, vect__21.24_29, { 0, 3 }>;
>   vect__23.25_27 = _28 + vect__6.5_49;
>   MEM[symbol: x, index: ivtmp.34_6, offset: 0B] = vect__23.25_27;
>   ivtmp.34_12 = ivtmp.34_6 + 16;
>   if (ivtmp.34_12 != 16384)
>     goto <bb 3>; [99.00%]
>
> which before combine looks like
>
>     8: r92:V2DF=[r82:DI+`y']
>    10: r93:V2DF=[r82:DI+`z']
>    12: r97:V2DF=vec_select(vec_concat(r92:V2DF,r92:V2DF),parallel)
>    13: r90:V2DF=r97:V2DF*r93:V2DF
>    14: r98:V2DF=vec_select(r93:V2DF,parallel)
>    15: r99:V2DF=vec_select(vec_concat(r92:V2DF,r92:V2DF),parallel)
>    16: r87:V2DF=r98:V2DF*r99:V2DF
>    18: r101:V2DF=r90:V2DF-r87:V2DF
>    19: r102:V2DF=r87:V2DF+r90:V2DF
>    20: r103:V2DF=vec_merge(r101:V2DF,r102:V2DF,0x1)
>    22: r105:V2DF=r103:V2DF+[r82:DI+`x']
>    23: [r82:DI+`x']=r105:V2DF
>
> I assume you can combine the multiplications with the selects
> (the selects might be something else for you - that's somewhat target
> dependent) into your half-way operations with the embedded rotates.
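>
> To make that mapping explicit (a sketch; my reading of the insns above,
> where the parallels' contents aren't shown):
>
>   12+13: r90 = dup(y.re) * z       ; the vcmla #0 half, zero accumulator
>   14-16: r87 = rot(z) * dup(y.im)
>   18-20: vec_merge (r90 - r87, r87 + r90, 0x1)
>                                    ; together the vcmla #90 half on r90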
>
> Richard.
>
> >
> > Kind Regards,
> > Tamar
> >
> > >
> > > > Thanks,
> > > > Tamar
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > > 2018-11-11  Tamar Christina  <tamar.christ...@arm.com>
> > > >
> > > >         * config/arm/arm.c (arm_arch8_3, arm_arch8_4): New.
> > > >         * config/arm/arm.h (TARGET_COMPLEX, arm_arch8_3, arm_arch8_4): New.
> > > >         (arm_option_reconfigure_globals): Use them.
> > > >         * config/arm/iterators.md (VDF, VQ_HSF): New.
> > > >         (VCADD, VCMLA): New.
> > > >         (VF_constraint, rot, rotsplit1, rotsplit2): Add V4HF and V8HF.
> > > >         * config/arm/neon.md (neon_vcadd<rot><mode>, fcadd<rot><mode>3,
> > > >         neon_vcmla<rot><mode>, fcmla<rot><mode>4): New.
> > > >         * config/arm/unspecs.md (UNSPEC_VCADD90, UNSPEC_VCADD270,
> > > >         UNSPEC_VCMLA, UNSPEC_VCMLA90, UNSPEC_VCMLA180,
> > > >         UNSPEC_VCMLA270): New.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > > 2018-11-11  Tamar Christina  <tamar.christ...@arm.com>
> > > >
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_1.c:
> > > >         Add Arm support.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_2.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_3.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_4.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_5.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_6.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_1.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_2.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_3.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_4.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_5.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_6.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_1.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_1.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_2.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_3.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_2.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_1.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_2.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_3.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_3.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_1.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_2.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_3.c: Likewise.
> > > >
> > > > --
