On Wed, 12 Apr 2023 at 14:29, Richard Sandiford <[email protected]> wrote: > > Prathamesh Kulkarni <[email protected]> writes: > > On Thu, 6 Apr 2023 at 16:05, Richard Sandiford > > <[email protected]> wrote: > >> > >> Prathamesh Kulkarni <[email protected]> writes: > >> > On Tue, 4 Apr 2023 at 23:35, Richard Sandiford > >> > <[email protected]> wrote: > >> >> > diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc > >> >> > b/gcc/config/aarch64/aarch64-sve-builtins-base.cc > >> >> > index cd9cace3c9b..3de79060619 100644 > >> >> > --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc > >> >> > +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc > >> >> > @@ -817,6 +817,62 @@ public: > >> >> > > >> >> > class svdupq_impl : public quiet<function_base> > >> >> > { > >> >> > +private: > >> >> > + gimple * > >> >> > + fold_nonconst_dupq (gimple_folder &f, unsigned factor) const > >> >> > + { > >> >> > + /* Lower lhs = svdupq (arg0, arg1, ..., argN} into: > >> >> > + tmp = {arg0, arg1, ..., arg<N-1>} > >> >> > + lhs = VEC_PERM_EXPR (tmp, tmp, {0, 1, 2, N-1, ...}) */ > >> >> > + > >> >> > + /* TODO: Revisit to handle factor by padding zeros. */ > >> >> > + if (factor > 1) > >> >> > + return NULL; > >> >> > >> >> Isn't the key thing here predicate vs. vector rather than factor == 1 > >> >> vs. > >> >> factor != 1? Do we generate good code for b8, where factor should be 1? > >> > Hi, > >> > It generates the following code for svdup_n_b8: > >> > https://pastebin.com/ypYt590c > >> > >> Hmm, yeah, not pretty :-) But it's not pretty without either. > >> > >> > I suppose lowering to ctor+vec_perm_expr is not really useful > >> > for this case because it won't simplify ctor, unlike the above case of > >> > svdupq_s32 (x[0], x[1], x[2], x[3]); > >> > However I wonder if it's still a good idea to lower svdupq for > >> > predicates, for > >> > representing svdupq (or other intrinsics) using GIMPLE constructs as > >> > far as possible ? > >> > >> It's possible, but I think we'd need an example in which its a clear > >> benefit. > > Sorry I posted for wrong test case above. > > For the following test: > > svbool_t f(uint8x16_t x) > > { > > return svdupq_n_b8 (x[0], x[1], x[2], x[3], x[4], x[5], x[6], x[7], > > x[8], x[9], x[10], x[11], x[12], > > x[13], x[14], x[15]); > > } > > > > Code-gen: > > https://pastebin.com/maexgeJn > > > > I suppose it's equivalent to following ? > > > > svbool_t f2(uint8x16_t x) > > { > > svuint8_t tmp = svdupq_n_u8 ((bool) x[0], (bool) x[1], (bool) x[2], > > (bool) x[3], > > (bool) x[4], (bool) x[5], (bool) x[6], > > (bool) x[7], > > (bool) x[8], (bool) x[9], (bool) x[10], > > (bool) x[11], > > (bool) x[12], (bool) x[13], (bool) > > x[14], (bool) x[15]); > > return svcmpne_n_u8 (svptrue_b8 (), tmp, 0); > > } > > Yeah, this is essentially the transformation that the svdupq rtl > expander uses. It would probably be a good idea to do that in > gimple too. Hi, I tested the interleave+zip1 for vector init patch and it segfaulted during bootstrap while trying to build libgfortran/generated/matmul_i2.c. Rebuilding with --enable-checking=rtl showed out of bounds access in aarch64_unzip_vector_init in following hunk:
+ rtvec vec = rtvec_alloc (n / 2);
+ for (int i = 0; i < n; i++)
+ RTVEC_ELT (vec, i) = (even_p) ? XVECEXP (vals, 0, 2 * i)
+ : XVECEXP (vals, 0, 2 * i + 1);
which is incorrect since it allocates n/2 but iterates and stores upto n.
The attached patch fixes the issue, which passed bootstrap, however
resulted in following fallout during testsuite run:
1] sve/acle/general/dupq_[1-4].c tests fail.
For the following test:
int32x4_t f(int32_t x)
{
return (int32x4_t) { x, 1, 2, 3 };
}
Code-gen without patch:
f:
adrp x1, .LC0
ldr q0, [x1, #:lo12:.LC0]
ins v0.s[0], w0
ret
Code-gen with patch:
f:
movi v0.2s, 0x2
adrp x1, .LC0
ldr d1, [x1, #:lo12:.LC0]
ins v0.s[0], w0
zip1 v0.4s, v0.4s, v1.4s
ret
It shows, fallback_seq_cost = 20, seq_total_cost = 16
where seq_total_cost determines the cost for interleave+zip1 sequence
and fallback_seq_cost is the cost for fallback sequence.
Altho it shows lesser cost, I am not sure if the interleave+zip1
sequence is better in this case ?
2] sve/acle/general/dupq_[5-6].c tests fail:
int32x4_t f(int32_t x0, int32_t x1, int32_t x2, int32_t x3)
{
return (int32x4_t) { x0, x1, x2, x3 };
}
code-gen without patch:
f:
fmov s0, w0
ins v0.s[1], w1
ins v0.s[2], w2
ins v0.s[3], w3
ret
code-gen with patch:
f:
fmov s0, w0
fmov s1, w1
ins v0.s[1], w2
ins v1.s[1], w3
zip1 v0.4s, v0.4s, v1.4s
ret
It shows fallback_seq_cost = 28, seq_total_cost = 16
3] aarch64/ldp_stp_16.c's cons2_8_float test fails.
Test case:
void cons2_8_float(float *x, float val0, float val1)
{
#pragma GCC unroll(8)
for (int i = 0; i < 8 * 2; i += 2) {
x[i + 0] = val0;
x[i + 1] = val1;
}
}
which is lowered to:
void cons2_8_float (float * x, float val0, float val1)
{
vector(4) float _86;
<bb 2> [local count: 119292720]:
_86 = {val0_11(D), val1_13(D), val0_11(D), val1_13(D)};
MEM <vector(4) float> [(float *)x_10(D)] = _86;
MEM <vector(4) float> [(float *)x_10(D) + 16B] = _86;
MEM <vector(4) float> [(float *)x_10(D) + 32B] = _86;
MEM <vector(4) float> [(float *)x_10(D) + 48B] = _86;
return;
}
code-gen without patch:
cons2_8_float:
dup v0.4s, v0.s[0]
ins v0.s[1], v1.s[0]
ins v0.s[3], v1.s[0]
stp q0, q0, [x0]
stp q0, q0, [x0, 32]
ret
code-gen with patch:
cons2_8_float:
dup v1.2s, v1.s[0]
dup v0.2s, v0.s[0]
zip1 v0.4s, v0.4s, v1.4s
stp q0, q0, [x0]
stp q0, q0, [x0, 32]
ret
It shows fallback_seq_cost = 28, seq_total_cost = 16
I think the test fails because it doesn't match:
** dup v([0-9]+)\.4s, .*
Shall it be OK to amend the test assuming code-gen with patch is better ?
4] aarch64/pr109072_1.c s32x4_3 test fails:
For the following test:
int32x4_t s32x4_3 (int32_t x, int32_t y)
{
int32_t arr[] = { x, y, y, y };
return vld1q_s32 (arr);
}
code-gen without patch:
s32x4_3:
dup v0.4s, w1
ins v0.s[0], w0
ret
code-gen with patch:
s32x4_3:
fmov s1, w1
fmov s0, w0
ins v0.s[1], v1.s[0]
dup v1.2s, v1.s[0]
zip1 v0.4s, v0.4s, v1.4s
ret
It shows fallback_seq_cost = 20, seq_total_cost = 16
I am not sure how interleave+zip1 cost is lesser than fallback seq
cost for this case.
I assume that the fallback sequence is better here ?
PS: The patch for folding svdupq to ctor+vec_perm_expr passes
bootstrap+test without any issues.
Thanks,
Prathamesh
>
> Thanks,
> Richard
>
> >
> > which generates:
> > f2:
> > .LFB3901:
> > .cfi_startproc
> > movi v1.16b, 0x1
> > ptrue p0.b, all
> > cmeq v0.16b, v0.16b, #0
> > bic v0.16b, v1.16b, v0.16b
> > dup z0.q, z0.q[0]
> > cmpne p0.b, p0/z, z0.b, #0
> > ret
> >
> > Thanks,
> > Prathamesh
gnu-821-6.diff
Description: Binary data
