On Wed, Nov 14, 2018 at 4:47 PM Tamar Christina <tamar.christ...@arm.com> wrote:
>
> Hi Richard,
>
> > > Ok for trunk?
> >
> > +;; The complex mla operations always need to expand to two instructions.
> > +;; The first operation does half the computation and the second does the
> > +;; remainder.  Because of this, expand early.
> > +(define_expand "fcmla<rot><mode>4"
> > +  [(set (match_operand:VF 0 "register_operand")
> > +        (plus:VF (match_operand:VF 1 "register_operand")
> > +                 (unspec:VF [(match_operand:VF 2 "register_operand")
> > +                             (match_operand:VF 3 "register_operand")]
> > +                             VCMLA)))]
> > +  "TARGET_COMPLEX"
> > +{
> > +  emit_insn (gen_neon_vcmla<rotsplit1><mode> (operands[0], operands[1],
> > +                                              operands[2], operands[3]));
> > +  emit_insn (gen_neon_vcmla<rotsplit2><mode> (operands[0], operands[0],
> > +                                              operands[2], operands[3]));
> > +  DONE;
> > +})
> >
> > What's the two halves?  Why hide this from the vectorizer if you go down
> > all to the detail and expose the rotation to it?
>
> The two halves are an implementation detail of the instruction in Armv8.3-a.
> As far as the vectorizer is concerned, all you want to do is an FMA, rotating
> one of the operands by 0 or 180 degrees.
>
> Also note that the "rotations" in these instructions aren't exactly the same
> as what would fall under rotation of a complex number, as each instruction
> can only do half of the final computation you want.
>
> In the ISA these instructions have to be used in a pair, where the rotations
> determine the operation you want to perform.  E.g. a rotation of #0 followed
> by #90 makes it a multiply and accumulate.
>
> A rotation of #180 followed by #90 makes this a vector complex subtract,
> which is intended to be used by the first call using a register cleared
> with 0 (it becomes an "FMS" essentially if you don't clear the register).
> Each "rotation" determines what operation is done and which parts of the
> complex number are used.  You change the "rotations" and the grouping of
> the instructions to get different operations.
>
> I did not expose this to the vectorizer as it seems very ISA-specific.
>
> > +;; The vcadd and vcmla patterns are made UNSPEC for the explicitly due to the
> > +;; fact that their usage need to guarantee that the source vectors are
> > +;; contiguous.  It would be wrong to describe the operation without being able
> > +;; to describe the permute that is also required, but even if that is done
> > +;; the permute would have been created as a LOAD_LANES which means the values
> > +;; in the registers are in the wrong order.
> >
> > Hmm, it's totally non-obvious to me how this relates to loads or what a
> > "non-contiguous" register would be?  That is, once you make this an unspec
> > combine will never be able to synthesize this from intrinsics code that
> > doesn't use this form.
> >
> > +(define_insn "neon_vcadd<rot><mode>"
> > +  [(set (match_operand:VF 0 "register_operand" "=w")
> > +        (unspec:VF [(match_operand:VF 1 "register_operand" "w")
> > +                    (match_operand:VF 2 "register_operand" "w")]
> > +                    VCADD))]
>
> Yes, that's my goal, as if operand1 and operand2 are loaded by instructions
> that would have permuted the values in the register then the instruction
> doesn't work.
>
> The instruction does the permute itself, so it expects the values to have
> been loaded using a simple load and not a LOAD_LANES.  So I intended to
> prevent combine from recognizing the operation for that reason.
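[A minimal scalar sketch of the #0/#90 pairing Tamar describes above, assuming
the usual per-rotation FCMLA behaviour; the type and helper names below are
made up for illustration and come from neither the patch nor the ISA:]

/* Scalar sketch (assumed semantics) of the two FCMLA halves; a complex
   value is modeled as a (re, im) pair.  */

typedef struct { float re, im; } cplxf;

/* Rotation #0: uses only a.re.  */
static cplxf fcmla_rot0 (cplxf acc, cplxf a, cplxf b)
{
  acc.re += a.re * b.re;
  acc.im += a.re * b.im;
  return acc;
}

/* Rotation #90: uses only a.im.  */
static cplxf fcmla_rot90 (cplxf acc, cplxf a, cplxf b)
{
  acc.re += -a.im * b.im;
  acc.im +=  a.im * b.re;
  return acc;
}

/* Chaining the #0 and #90 halves gives the full complex multiply-accumulate
   acc += a * b, i.e.
     acc.re += a.re * b.re - a.im * b.im;
     acc.im += a.re * b.im + a.im * b.re;  */
static cplxf complex_fma (cplxf acc, cplxf a, cplxf b)
{
  return fcmla_rot90 (fcmla_rot0 (acc, a, b), a, b);
}

[The other groupings Tamar mentions (#180 followed by #90, etc.) change, in the
same style, which operand half and which sign feed each lane.]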
But LOAD_LANES is used differently, and the ISA probably doesn't really care
how you set up the register inputs.  You of course have to put in the correct
values, but how they get there doesn't matter.  So I don't see how combine can
mess things up here.

> For the ADD, combine can be used, but then you'd have to match the load and
> store since you have to change these; for the rest you'll run far afoul of
> combine's 5-instruction limit.

Why do you need to change these?  You assume the vectorizer vectorizes using
interleaving - yes, in that case all hope is lost.  I assume the vectorizer
will end up doing SLP with the existing TWO_OPERATORS support, thus for
complex subtraction you'll see (A and B being complex vectors)

 add = A + B;
 sub = A - B;
 result_A_complex_minus_B = vec_merge (add, sub, 1)

Basically the vectorizer will perform both operations across the whole vector
and then blend the two results.  The add/sub + blend needs to be recognized by
combine (x86 does this for the vaddsub instructions, which were designed to
handle complex subtraction and parts of the multiply).

For complex multiplication you'll see the pieces your ISA supports.

 mul1 = A * B
 mul2 = A * rot(B)       (the rotation will be a shuffle)
 add = mul1 + mul2
 sub = mul1 - mul2
 result = blend (add, sub, ...)

As usual the combiner is helped by intermediate combiner patterns (in this
case modeling your ISA's intermediate steps probably already helps).  The x86
ISA also has fmaddsub/fmsubadd instructions, but without the embedded rotation,
which you have to do explicitly.

For example, for a simple complex FMA loop

_Complex double x[1024];
_Complex double y[1024];
_Complex double z[1024];

void foo ()
{
  for (int i = 0; i < 1024; ++i)
    x[i] += y[i] * z[i];
}

the vectorizer generates

  <bb 3> [local count: 1063004407]:
  # ivtmp.34_6 = PHI <0(2), ivtmp.34_12(3)>
  vect__6.5_49 = MEM[symbol: x, index: ivtmp.34_6, offset: 0B];
  vect__13.11_42 = MEM[symbol: y, index: ivtmp.34_6, offset: 0B];
  vect__13.12_41 = VEC_PERM_EXPR <vect__13.11_42, vect__13.11_42, { 0, 0 }>;
  vect__13.17_36 = VEC_PERM_EXPR <vect__13.11_42, vect__13.11_42, { 1, 1 }>;
  vect__11.8_46 = MEM[symbol: z, index: ivtmp.34_6, offset: 0B];
  vect__11.21_32 = VEC_PERM_EXPR <vect__11.8_46, vect__11.8_46, { 1, 0 }>;
  vect__17.13_40 = vect__13.12_41 * vect__11.8_46;
  vect__18.22_31 = vect__11.21_32 * vect__13.17_36;
  vect__21.23_30 = vect__17.13_40 - vect__18.22_31;
  vect__21.24_29 = vect__18.22_31 + vect__17.13_40;
  _28 = VEC_PERM_EXPR <vect__21.23_30, vect__21.24_29, { 0, 3 }>;
  vect__23.25_27 = _28 + vect__6.5_49;
  MEM[symbol: x, index: ivtmp.34_6, offset: 0B] = vect__23.25_27;
  ivtmp.34_12 = ivtmp.34_6 + 16;
  if (ivtmp.34_12 != 16384)
    goto <bb 3>; [99.00%]

which before combine looks like

    8: r92:V2DF=[r82:DI+`y']
   10: r93:V2DF=[r82:DI+`z']
   12: r97:V2DF=vec_select(vec_concat(r92:V2DF,r92:V2DF),parallel)
   13: r90:V2DF=r97:V2DF*r93:V2DF
   14: r98:V2DF=vec_select(r93:V2DF,parallel)
   15: r99:V2DF=vec_select(vec_concat(r92:V2DF,r92:V2DF),parallel)
   16: r87:V2DF=r98:V2DF*r99:V2DF
   18: r101:V2DF=r90:V2DF-r87:V2DF
   19: r102:V2DF=r87:V2DF+r90:V2DF
   20: r103:V2DF=vec_merge(r101:V2DF,r102:V2DF,0x1)
   22: r105:V2DF=r103:V2DF+[r82:DI+`x']
   23: [r82:DI+`x']=r105:V2DF

I assume you can combine the multiplications with the selects (the selects
might be something else for you - that's somewhat target dependent) into your
half-way operations with the embedded rotates.

Richard.
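[A minimal scalar sketch of what the permute/multiply/add-sub/blend sequence in
the dumps above computes for one complex element; the variable names are made
up for illustration and only paraphrase the dump:]

/* Scalar rendering of the vectorized sequence for one complex element
   of x[i] += y[i] * z[i].  */

typedef struct { double re, im; } cplxd;

static cplxd complex_fma_blend (cplxd x, cplxd y, cplxd z)
{
  /* VEC_PERM { 0, 0 } and { 1, 1 }: broadcast the two halves of y.  */
  double y_re = y.re, y_im = y.im;
  /* VEC_PERM { 1, 0 }: swap the two halves of z (the "rotation").  */
  cplxd z_swap = { z.im, z.re };

  /* The two multiplications (insns 13 and 16 above).  */
  double mul1_re = y_re * z.re,      mul1_im = y_re * z.im;
  double mul2_re = y_im * z_swap.re, mul2_im = y_im * z_swap.im;

  /* TWO_OPERATORS: compute both the subtraction and the addition ...  */
  double sub_re = mul1_re - mul2_re, sub_im = mul1_im - mul2_im;
  double add_re = mul2_re + mul1_re, add_im = mul2_im + mul1_im;

  /* ... then blend (vec_merge ..., 0x1 / VEC_PERM { 0, 3 }): keep lane 0
     from the subtraction and lane 1 from the addition; the other two
     lanes (add_re, sub_im) are computed but discarded.  */
  (void) add_re; (void) sub_im;
  cplxd res = { sub_re, add_im };

  /* Finally accumulate into x (insn 22).  */
  x.re += res.re;
  x.im += res.im;
  return x;
}

[The blended result equals y[i] * z[i], so the whole sequence is exactly the
complex multiply-accumulate the loop asks for.]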
>
> Kind Regards,
> Tamar
>
> >
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > > 2018-11-11  Tamar Christina  <tamar.christ...@arm.com>
> > >
> > >         * config/arm/arm.c (arm_arch8_3, arm_arch8_4): New.
> > >         * config/arm/arm.h (TARGET_COMPLEX, arm_arch8_3, arm_arch8_4): New.
> > >         (arm_option_reconfigure_globals): Use them.
> > >         * config/arm/iterators.md (VDF, VQ_HSF): New.
> > >         (VCADD, VCMLA): New.
> > >         (VF_constraint, rot, rotsplit1, rotsplit2): Add V4HF and V8HF.
> > >         * config/arm/neon.md (neon_vcadd<rot><mode>, fcadd<rot><mode>3,
> > >         neon_vcmla<rot><mode>, fcmla<rot><mode>4): New.
> > >         * config/arm/unspecs.md (UNSPEC_VCADD90, UNSPEC_VCADD270,
> > >         UNSPEC_VCMLA, UNSPEC_VCMLA90, UNSPEC_VCMLA180, UNSPEC_VCMLA270): New.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > 2018-11-11  Tamar Christina  <tamar.christ...@arm.com>
> > >
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_1.c: Add Arm support.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_2.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_3.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_4.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_5.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_6.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_1.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_2.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_3.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_4.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_5.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_6.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_1.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_1.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_2.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_3.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_2.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_1.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_2.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_3.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_3.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_1.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_2.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_3.c: Likewise.
> > >
> > > --