Re: [aarch64] Use dup and zip1 for interleaving elements in initializing vector

Prathamesh Kulkarni via Gcc-patches Fri, 21 Apr 2023 00:28:58 -0700

On Wed, 12 Apr 2023 at 14:29, Richard Sandiford
<richard.sandif...@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes:
> > On Thu, 6 Apr 2023 at 16:05, Richard Sandiford
> > <richard.sandif...@arm.com> wrote:
> >>
> >> Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes:
> >> > On Tue, 4 Apr 2023 at 23:35, Richard Sandiford
> >> > <richard.sandif...@arm.com> wrote:
> >> >> > diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
> >> >> > b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> >> >> > index cd9cace3c9b..3de79060619 100644
> >> >> > --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> >> >> > +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> >> >> > @@ -817,6 +817,62 @@ public:
> >> >> >
> >> >> >  class svdupq_impl : public quiet<function_base>
> >> >> >  {
> >> >> > +private:
> >> >> > +  gimple *
> >> >> > +  fold_nonconst_dupq (gimple_folder &f, unsigned factor) const
> >> >> > +  {
> >> >> > +    /* Lower lhs = svdupq (arg0, arg1, ..., argN} into:
> >> >> > +       tmp = {arg0, arg1, ..., arg<N-1>}
> >> >> > +       lhs = VEC_PERM_EXPR (tmp, tmp, {0, 1, 2, N-1, ...})  */
> >> >> > +
> >> >> > +    /* TODO: Revisit to handle factor by padding zeros.  */
> >> >> > +    if (factor > 1)
> >> >> > +      return NULL;
> >> >>
> >> >> Isn't the key thing here predicate vs. vector rather than factor == 1 
> >> >> vs.
> >> >> factor != 1?  Do we generate good code for b8, where factor should be 1?
> >> > Hi,
> >> > It generates the following code for svdup_n_b8:
> >> > https://pastebin.com/ypYt590c
> >>
> >> Hmm, yeah, not pretty :-)  But it's not pretty without either.
> >>
> >> > I suppose lowering to ctor+vec_perm_expr is not really useful
> >> > for this case because it won't simplify ctor, unlike the above case of
> >> > svdupq_s32 (x[0], x[1], x[2], x[3]);
> >> > However I wonder if it's still a good idea to lower svdupq for 
> >> > predicates, for
> >> > representing svdupq (or other intrinsics) using GIMPLE constructs as
> >> > far as possible ?
> >>
> >> It's possible, but I think we'd need an example in which its a clear
> >> benefit.
> > Sorry I posted for wrong test case above.
> > For the following test:
> > svbool_t f(uint8x16_t x)
> > {
> >   return svdupq_n_b8 (x[0], x[1], x[2], x[3], x[4], x[5], x[6], x[7],
> >                                     x[8], x[9], x[10], x[11], x[12],
> > x[13], x[14], x[15]);
> > }
> >
> > Code-gen:
> > https://pastebin.com/maexgeJn
> >
> > I suppose it's equivalent to following ?
> >
> > svbool_t f2(uint8x16_t x)
> > {
> >   svuint8_t tmp = svdupq_n_u8 ((bool) x[0], (bool) x[1], (bool) x[2],
> > (bool) x[3],
> >                                (bool) x[4], (bool) x[5], (bool) x[6],
> > (bool) x[7],
> >                                (bool) x[8], (bool) x[9], (bool) x[10],
> > (bool) x[11],
> >                                (bool) x[12], (bool) x[13], (bool)
> > x[14], (bool) x[15]);
> >   return svcmpne_n_u8 (svptrue_b8 (), tmp, 0);
> > }
>
> Yeah, this is essentially the transformation that the svdupq rtl
> expander uses.  It would probably be a good idea to do that in
> gimple too.
Hi,
I tested the interleave+zip1 for vector init patch and it segfaulted
during bootstrap while trying to build
libgfortran/generated/matmul_i2.c.
Rebuilding with --enable-checking=rtl showed out of bounds access in
aarch64_unzip_vector_init in following hunk:


+  rtvec vec = rtvec_alloc (n / 2);
+  for (int i = 0; i < n; i++)
+    RTVEC_ELT (vec, i) = (even_p) ? XVECEXP (vals, 0, 2 * i)
+                                 : XVECEXP (vals, 0, 2 * i + 1);

which is incorrect since it allocates n/2 but iterates and stores upto n.
The attached patch fixes the issue, which passed bootstrap, however
resulted in following fallout during testsuite run:

1] sve/acle/general/dupq_[1-4].c tests fail.
For the following test:
int32x4_t f(int32_t x)
{
  return (int32x4_t) { x, 1, 2, 3 };
}

Code-gen without patch:
f:
        adrp    x1, .LC0
        ldr     q0, [x1, #:lo12:.LC0]
        ins     v0.s[0], w0
        ret

Code-gen with patch:
f:
        movi    v0.2s, 0x2
        adrp    x1, .LC0
        ldr     d1, [x1, #:lo12:.LC0]
        ins     v0.s[0], w0
        zip1    v0.4s, v0.4s, v1.4s
        ret

It shows, fallback_seq_cost = 20, seq_total_cost = 16
where seq_total_cost determines the cost for interleave+zip1 sequence
and fallback_seq_cost is the cost for fallback sequence.
Altho it shows lesser cost, I am not sure if the interleave+zip1
sequence is better in this case ?

2] sve/acle/general/dupq_[5-6].c tests fail:
int32x4_t f(int32_t x0, int32_t x1, int32_t x2, int32_t x3)
{
  return (int32x4_t) { x0, x1, x2, x3 };
}

code-gen without patch:
f:
        fmov    s0, w0
        ins     v0.s[1], w1
        ins     v0.s[2], w2
        ins     v0.s[3], w3
        ret

code-gen with patch:
f:
        fmov    s0, w0
        fmov    s1, w1
        ins     v0.s[1], w2
        ins     v1.s[1], w3
        zip1    v0.4s, v0.4s, v1.4s
        ret

It shows fallback_seq_cost = 28, seq_total_cost = 16

3] aarch64/ldp_stp_16.c's cons2_8_float test fails.
Test case:
void cons2_8_float(float *x, float val0, float val1)
{
#pragma GCC unroll(8)
  for (int i = 0; i < 8 * 2; i += 2) {
    x[i + 0] = val0;
    x[i + 1] = val1;
  }
}

which is lowered to:
void cons2_8_float (float * x, float val0, float val1)
{
  vector(4) float _86;

  <bb 2> [local count: 119292720]:
  _86 = {val0_11(D), val1_13(D), val0_11(D), val1_13(D)};
  MEM <vector(4) float> [(float *)x_10(D)] = _86;
  MEM <vector(4) float> [(float *)x_10(D) + 16B] = _86;
  MEM <vector(4) float> [(float *)x_10(D) + 32B] = _86;
  MEM <vector(4) float> [(float *)x_10(D) + 48B] = _86;
  return;
}

code-gen without patch:
cons2_8_float:
        dup     v0.4s, v0.s[0]
        ins     v0.s[1], v1.s[0]
        ins     v0.s[3], v1.s[0]
        stp     q0, q0, [x0]
        stp     q0, q0, [x0, 32]
        ret

code-gen with patch:
cons2_8_float:
        dup     v1.2s, v1.s[0]
        dup     v0.2s, v0.s[0]
        zip1    v0.4s, v0.4s, v1.4s
        stp     q0, q0, [x0]
        stp     q0, q0, [x0, 32]
        ret

It shows fallback_seq_cost = 28, seq_total_cost = 16

I think the test fails because it doesn't match:
**      dup     v([0-9]+)\.4s, .*

Shall it be OK to amend the test assuming code-gen with patch is better ?

4] aarch64/pr109072_1.c s32x4_3 test fails:
For the following test:
int32x4_t s32x4_3 (int32_t x, int32_t y)
{
  int32_t arr[] = { x, y, y, y };
  return vld1q_s32 (arr);
}

code-gen without patch:
s32x4_3:
        dup     v0.4s, w1
        ins     v0.s[0], w0
        ret

code-gen with patch:
s32x4_3:
        fmov    s1, w1
        fmov    s0, w0
        ins     v0.s[1], v1.s[0]
        dup     v1.2s, v1.s[0]
        zip1    v0.4s, v0.4s, v1.4s
        ret

It shows fallback_seq_cost = 20, seq_total_cost = 16
I am not sure how interleave+zip1 cost is lesser than fallback seq
cost for this case.
I assume that the fallback sequence is better here ?

PS: The patch for folding svdupq to ctor+vec_perm_expr passes
bootstrap+test without any issues.

Thanks,
Prathamesh

>
> Thanks,
> Richard
>
> >
> > which generates:
> > f2:
> > .LFB3901:
> >         .cfi_startproc
> >         movi    v1.16b, 0x1
> >         ptrue   p0.b, all
> >         cmeq    v0.16b, v0.16b, #0
> >         bic     v0.16b, v1.16b, v0.16b
> >         dup     z0.q, z0.q[0]
> >         cmpne   p0.b, p0/z, z0.b, #0
> >         ret
> >
> > Thanks,
> > Prathamesh

gnu-821-6.diff
Description: Binary data

Re: [aarch64] Use dup and zip1 for interleaving elements in initializing vector

Reply via email to