> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Friday, April 25, 2025 6:55 PM
> To: Jennifer Schmitz <jschm...@nvidia.com>
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] AArch64: Fold LD1/ST1 with ptrue to LDR/STR for 128-bit VLS
>
> Jennifer Schmitz <jschm...@nvidia.com> writes:
> > If -msve-vector-bits=128, SVE loads and stores (LD1 and ST1) with a
> > ptrue predicate can be replaced by Neon instructions (LDR and STR),
> > thus avoiding the predicate altogether.  This also enables formation
> > of LDP/STP pairs.
> >
> > For example, the test cases
> >
> > svfloat64_t
> > ptrue_load (float64_t *x)
> > {
> >   svbool_t pg = svptrue_b64 ();
> >   return svld1_f64 (pg, x);
> > }
> >
> > void
> > ptrue_store (float64_t *x, svfloat64_t data)
> > {
> >   svbool_t pg = svptrue_b64 ();
> >   return svst1_f64 (pg, x, data);
> > }
> >
> > were previously compiled to
> > (with -O2 -march=armv8.2-a+sve -msve-vector-bits=128):
> >
> > ptrue_load:
> >         ptrue   p3.b, vl16
> >         ld1d    z0.d, p3/z, [x0]
> >         ret
> >
> > ptrue_store:
> >         ptrue   p3.b, vl16
> >         st1d    z0.d, p3, [x0]
> >         ret
> >
> > Now they are compiled to:
> >
> > ptrue_load:
> >         ldr     q0, [x0]
> >         ret
> >
> > ptrue_store:
> >         str     q0, [x0]
> >         ret
> >
> > The implementation includes the if-statement
> >
> >   if (known_eq (BYTES_PER_SVE_VECTOR, 16)
> >       && known_eq (GET_MODE_SIZE (mode), 16))
> >
> > which checks for 128-bit VLS and excludes partial modes with a
> > mode size < 128 bits (e.g. VNx2QI).
>
> I think it would be better to use:
>
>   if (known_eq (GET_MODE_SIZE (mode), 16)
>       && aarch64_classify_vector_mode (mode) == VEC_SVE_DATA)
>
> to defend against any partial structure modes that might be added in
> future.
>
Hi Both,

Just a suggestion so feel free to ignore, but I do wonder if this
optimization shouldn't look at the predicate bits rather than the mode
size, since this is valid for any load where the predicate uses the
lower N bits, where N corresponds to an Adv. SIMD register size.

E.g. it should be valid for:

#include <arm_sve.h>

svfloat64_t
ptrue_load (float64_t *x)
{
  svbool_t pg = svptrue_pat_b8 (SV_VL16);
  return svld1_f64 (pg, x);
}

void
ptrue_store (float64_t *x, svfloat64_t data)
{
  svbool_t pg = svptrue_pat_b8 (SV_VL16);
  return svst1_f64 (pg, x, data);
}

and, in general, also for:

#include <arm_sve.h>

svfloat64_t
ptrue_load (float64_t *x)
{
  svbool_t pg = svptrue_pat_b8 (SV_VL8);
  return svld1_f64 (pg, x);
}

void
ptrue_store (float64_t *x, svfloat64_t data)
{
  svbool_t pg = svptrue_pat_b8 (SV_VL8);
  return svst1_f64 (pg, x, data);
}

It just so happens that at VL128, SV_VL16 == SV_ALL.  Looking at the
predicate bits instead would help optimize all codegen.
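For the SV_VL8 case at VL128 I'd expect the 64-bit Adv. SIMD forms,
i.e. ldr/str of a D register (which for the load also zeroes the
inactive lanes, matching the ld1d semantics).  A hypothetical companion
test could look something like the below (untested sketch; function
names made up, and the scan patterns are my assumption of the expected
codegen):

/* { dg-do compile } */
/* { dg-options "-O2 -msve-vector-bits=128" } */

#include <arm_sve.h>

svfloat64_t
vl8_load (float64_t *x)
{
  /* Only the low 64 bits are active, so the ld1d can use an Adv. SIMD
     ldr of a D register: the inactive element is zeroed either way.  */
  svbool_t pg = svptrue_pat_b8 (SV_VL8);
  return svld1_f64 (pg, x);
}

void
vl8_store (float64_t *x, svfloat64_t data)
{
  /* Likewise, only the low 64 bits are stored.  */
  svbool_t pg = svptrue_pat_b8 (SV_VL8);
  svst1_f64 (pg, x, data);
}

/* { dg-final { scan-assembler {\tldr\td0, \[x0\]} } } */
/* { dg-final { scan-assembler {\tstr\td0, \[x0\]} } } */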
Thanks,
Tamar

> >
> > The patch was bootstrapped and tested on aarch64-linux-gnu, no
> > regression.
> > OK for mainline?
> >
> > Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
> >
> > gcc/
> > 	* config/aarch64/aarch64.cc (aarch64_emit_sve_pred_move):
> > 	Fold LD1/ST1 with ptrue to LDR/STR for 128-bit VLS.
> >
> > gcc/testsuite/
> > 	* gcc.target/aarch64/sve/ldst_ptrue_128_to_neon.c: New test.
> > 	* gcc.target/aarch64/sve/cond_arith_6.c: Adjust expected outcome.
> > 	* gcc.target/aarch64/sve/pcs/return_4_128.c: Likewise.
> > 	* gcc.target/aarch64/sve/pcs/return_5_128.c: Likewise.
> > 	* gcc.target/aarch64/sve/pcs/struct_3_128.c: Likewise.
> > ---
> >  gcc/config/aarch64/aarch64.cc                 | 27 ++++++--
> >  .../gcc.target/aarch64/sve/cond_arith_6.c     |  3 +-
> >  .../aarch64/sve/ldst_ptrue_128_to_neon.c      | 36 +++++++++++
> >  .../gcc.target/aarch64/sve/pcs/return_4_128.c | 39 ++++-------
> >  .../gcc.target/aarch64/sve/pcs/return_5_128.c | 39 ++++-------
> >  .../gcc.target/aarch64/sve/pcs/struct_3_128.c | 64 +++++--------------
> >  6 files changed, 102 insertions(+), 106 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/ldst_ptrue_128_to_neon.c
> >
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index f7bccf532f8..ac01149276b 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -6416,13 +6416,28 @@ aarch64_stack_protect_canary_mem (machine_mode mode, rtx decl_rtl,
> >  void
> >  aarch64_emit_sve_pred_move (rtx dest, rtx pred, rtx src)
> >  {
> > -  expand_operand ops[3];
> >    machine_mode mode = GET_MODE (dest);
> > -  create_output_operand (&ops[0], dest, mode);
> > -  create_input_operand (&ops[1], pred, GET_MODE(pred));
> > -  create_input_operand (&ops[2], src, mode);
> > -  temporary_volatile_ok v (true);
> > -  expand_insn (code_for_aarch64_pred_mov (mode), 3, ops);
> > +  if ((MEM_P (dest) || MEM_P (src))
> > +      && known_eq (BYTES_PER_SVE_VECTOR, 16)
> > +      && known_eq (GET_MODE_SIZE (mode), 16)
> > +      && !BYTES_BIG_ENDIAN)
> > +    {
> > +      rtx tmp = gen_reg_rtx (V16QImode);
> > +      emit_move_insn (tmp, lowpart_subreg (V16QImode, src, mode));
> > +      if (MEM_P (src))
> > +	emit_move_insn (dest, lowpart_subreg (mode, tmp, V16QImode));
> > +      else
> > +	emit_move_insn (adjust_address (dest, V16QImode, 0), tmp);
>
> We shouldn't usually need a temporary register for the store case.
>
> Also, using lowpart_subreg for a source memory leads to the best-avoided
> subregs of mems when the mem is volatile, due to:
>
> 	  /* Allow splitting of volatile memory references in case we don't
> 	     have instruction to move the whole thing.  */
> 	  && (! MEM_VOLATILE_P (op)
> 	      || ! have_insn_for (SET, innermode))
>
> in simplify_subreg.  So how about:
>
>   if (MEM_P (src))
>     {
>       rtx tmp = force_reg (V16QImode, adjust_address (src, V16QImode, 0));
>       emit_move_insn (dest, lowpart_subreg (mode, tmp, V16QImode));
>     }
>   else
>     emit_move_insn (adjust_address (dest, V16QImode, 0),
> 		    force_lowpart_subreg (V16QImode, src, mode));
>
> It might be good to test the volatile case too.  That case does work
> with your patch, since the subreg gets ironed out later.  It's just for
> completeness.
>
> Thanks,
> Richard
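On the volatile testing point: a volatile variant could be something
like the below (untested sketch; function names are made up).  It uses
the fixed-length SVE types so that plain dereferences produce volatile
memory accesses of SVE modes, assuming -msve-vector-bits=128:

/* { dg-do compile } */
/* { dg-options "-O2 -msve-vector-bits=128" } */

#include <arm_sve.h>

typedef svint8_t fixed_int8_t __attribute__ ((arm_sve_vector_bits (128)));

fixed_int8_t
volatile_load (volatile fixed_int8_t *ptr)
{
  /* Expected to fold to a single "ldr q0, [x0]" at VL128, without
     taking a subreg of the volatile mem.  */
  return *ptr;
}

void
volatile_store (volatile fixed_int8_t *ptr, fixed_int8_t data)
{
  /* Likewise a single "str q0, [x0]".  */
  *ptr = data;
}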
> > +    }
> > +  else
> > +    {
> > +      expand_operand ops[3];
> > +      create_output_operand (&ops[0], dest, mode);
> > +      create_input_operand (&ops[1], pred, GET_MODE(pred));
> > +      create_input_operand (&ops[2], src, mode);
> > +      temporary_volatile_ok v (true);
> > +      expand_insn (code_for_aarch64_pred_mov (mode), 3, ops);
> > +    }
> >  }
> >
> >  /* Expand a pre-RA SVE data move from SRC to DEST in which at least one
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_arith_6.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_arith_6.c
> > index 4085ab12444..d5a12f1df07 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_arith_6.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_arith_6.c
> > @@ -8,7 +8,8 @@ f (float *x)
> >        x[i] -= 1.0f;
> >  }
> >
> > -/* { dg-final { scan-assembler {\tld1w\tz} } } */
> > +/* { dg-final { scan-assembler {\tld1w\tz} { target aarch64_big_endian } } } */
> > +/* { dg-final { scan-assembler {\tldr\tq} { target aarch64_little_endian } } } */
> >  /* { dg-final { scan-assembler {\tfcmgt\tp} } } */
> >  /* { dg-final { scan-assembler {\tfsub\tz} } } */
> >  /* { dg-final { scan-assembler {\tst1w\tz} } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/ldst_ptrue_128_to_neon.c b/gcc/testsuite/gcc.target/aarch64/sve/ldst_ptrue_128_to_neon.c
> > new file mode 100644
> > index 00000000000..69f42b121ad
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/ldst_ptrue_128_to_neon.c
> > @@ -0,0 +1,36 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -msve-vector-bits=128" } */
> > +/* { dg-require-effective-target aarch64_little_endian } */
> > +
> > +#include <arm_sve.h>
> > +
> > +#define TEST(TYPE, TY, B)			\
> > +  sv##TYPE					\
> > +  ld1_##TY##B (TYPE *x)				\
> > +  {						\
> > +    svbool_t pg = svptrue_b##B ();		\
> > +    return svld1_##TY##B (pg, x);		\
> > +  }						\
> > +						\
> > +  void						\
> > +  st1_##TY##B (TYPE *x, sv##TYPE data)		\
> > +  {						\
> > +    svbool_t pg = svptrue_b##B ();		\
> > +    return svst1_##TY##B (pg, x, data);		\
> > +  }						\
> > +
> > +TEST (bfloat16_t, bf, 16)
> > +TEST (float16_t, f, 16)
> > +TEST (float32_t, f, 32)
> > +TEST (float64_t, f, 64)
> > +TEST (int8_t, s, 8)
> > +TEST (int16_t, s, 16)
> > +TEST (int32_t, s, 32)
> > +TEST (int64_t, s, 64)
> > +TEST (uint8_t, u, 8)
> > +TEST (uint16_t, u, 16)
> > +TEST (uint32_t, u, 32)
> > +TEST (uint64_t, u, 64)
> > +
> > +/* { dg-final { scan-assembler-times {\tldr\tq0, \[x0\]} 12 } } */
> > +/* { dg-final { scan-assembler-times {\tstr\tq0, \[x0\]} 12 } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_128.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_128.c
> > index 87d528c84cd..ac5f981490a 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_128.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_128.c
> > @@ -11,104 +11,91 @@
> >
> >  /*
> >  ** callee_s8:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1b	z0\.b, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (s8, __SVInt8_t)
> >
> >  /*
> >  ** callee_u8:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1b	z0\.b, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (u8, __SVUint8_t)
> >
> >  /*
> >  ** callee_mf8:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1b	z0\.b, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (mf8, __SVMfloat8_t)
> >
> >  /*
> >  ** callee_s16:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1h	z0\.h, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (s16, __SVInt16_t)
> >
> >  /*
> >  ** callee_u16:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1h	z0\.h, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (u16, __SVUint16_t)
> >
> >  /*
> >  ** callee_f16:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1h	z0\.h, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (f16, __SVFloat16_t)
> >
> >  /*
> >  ** callee_bf16:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1h	z0\.h, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (bf16, __SVBfloat16_t)
> >
> >  /*
> >  ** callee_s32:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1w	z0\.s, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (s32, __SVInt32_t)
> >
> >  /*
> >  ** callee_u32:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1w	z0\.s, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (u32, __SVUint32_t)
> >
> >  /*
> >  ** callee_f32:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1w	z0\.s, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (f32, __SVFloat32_t)
> >
> >  /*
> >  ** callee_s64:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1d	z0\.d, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (s64, __SVInt64_t)
> >
> >  /*
> >  ** callee_u64:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1d	z0\.d, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (u64, __SVUint64_t)
> >
> >  /*
> >  ** callee_f64:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1d	z0\.d, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (f64, __SVFloat64_t)
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_128.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_128.c
> > index 347a16c1367..2fab6feb41c 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_128.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_128.c
> > @@ -13,104 +13,91 @@
> >
> >  /*
> >  ** callee_s8:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1b	z0\.b, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (s8, svint8_t)
> >
> >  /*
> >  ** callee_u8:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1b	z0\.b, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (u8, svuint8_t)
> >
> >  /*
> >  ** callee_mf8:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1b	z0\.b, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (mf8, svmfloat8_t)
> >
> >  /*
> >  ** callee_s16:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1h	z0\.h, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (s16, svint16_t)
> >
> >  /*
> >  ** callee_u16:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1h	z0\.h, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (u16, svuint16_t)
> >
> >  /*
> >  ** callee_f16:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1h	z0\.h, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (f16, svfloat16_t)
> >
> >  /*
> >  ** callee_bf16:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1h	z0\.h, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (bf16, svbfloat16_t)
> >
> >  /*
> >  ** callee_s32:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1w	z0\.s, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (s32, svint32_t)
> >
> >  /*
> >  ** callee_u32:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1w	z0\.s, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (u32, svuint32_t)
> >
> >  /*
> >  ** callee_f32:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1w	z0\.s, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (f32, svfloat32_t)
> >
> >  /*
> >  ** callee_s64:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1d	z0\.d, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (s64, svint64_t)
> >
> >  /*
> >  ** callee_u64:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1d	z0\.d, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (u64, svuint64_t)
> >
> >  /*
> >  ** callee_f64:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1d	z0\.d, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (f64, svfloat64_t)
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/struct_3_128.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/struct_3_128.c
> > index d99ce1202a9..370bd9e3bfe 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/struct_3_128.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/struct_3_128.c
> > @@ -473,17 +473,11 @@ SEL2 (struct, pst_uniform4)
> >  **	sub	sp, sp, #144
> >  **	add	(x[0-9]+), sp, #?31
> >  **	and	x7, \1, #?(?:-32|4294967264)
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	st1w	z0\.s, \2, \[x7\]
> > -**	add	(x[0-9]+), x7, #?32
> > -**	(
> > -**	str	z1, \[\3\]
> > -**	str	z2, \[\3, #1, mul vl\]
> > -**	|
> > -**	stp	q1, q2, \[\3\]
> > -**	)
> > -**	str	z3, \[\3, #2, mul vl\]
> > -**	st1w	z4\.s, \2, \[x7, #6, mul vl\]
> > +**	mov	x0, x7
> > +**	str	q0, \[x0\], 32
> > +**	stp	q1, q2, \[x0\]
> > +**	str	z3, \[x0, #2, mul vl\]
> > +**	str	q4, \[x7, 96\]
> >  **	add	sp, sp, #?144
> >  **	ret
> >  */
> > @@ -516,20 +510,12 @@ SEL2 (struct, pst_mixed1)
> >  ** test_pst_mixed1:
> >  **	sub	sp, sp, #176
> >  **	str	p0, \[sp\]
> > -**	ptrue	p0\.b, vl16
> > -**	st1h	z0\.h, p0, \[sp, #1, mul vl\]
> > -**	st1h	z1\.h, p0, \[sp, #2, mul vl\]
> > -**	st1w	z2\.s, p0, \[sp, #3, mul vl\]
> > -**	st1d	z3\.d, p0, \[sp, #4, mul vl\]
> > +**	stp	q0, q1, \[sp, 16\]
> > +**	stp	q2, q3, \[sp, 48\]
> >  **	str	p1, \[sp, #40, mul vl\]
> >  **	str	p2, \[sp, #41, mul vl\]
> > -**	st1b	z4\.b, p0, \[sp, #6, mul vl\]
> > -**	st1h	z5\.h, p0, \[sp, #7, mul vl\]
> > -**	...
> > -**	st1w	z6\.s, p0, [^\n]*
> > -**	...
> > -**	st1d	z7\.d, p0, [^\n]*
> > -**	...
> > +**	stp	q4, q5, \[sp, 96\]
> > +**	stp	q6, q7, \[sp, 128\]
> >  **	str	p3, \[sp, #80, mul vl\]
> >  **	mov	(x7, sp|w7, wsp)
> >  **	add	sp, sp, #?176
> > @@ -557,24 +543,13 @@ SEL2 (struct, pst_mixed2)
> >  ** test_pst_mixed2:
> >  **	sub	sp, sp, #128
> >  **	str	p0, \[sp\]
> > -**	ptrue	(p[03])\.b, vl16
> > -**	add	(x[0-9]+), sp, #?2
> > -**	st1b	z0\.b, \1, \[\2\]
> > +**	str	q0, \[sp, 2\]
> >  **	str	p1, \[sp, #9, mul vl\]
> > -**	add	(x[0-9]+), sp, #?20
> > -**	st1b	z1\.b, \1, \[\3\]
> > +**	str	q1, \[sp, 20\]
> >  **	str	p2, \[sp, #18, mul vl\]
> > -**	add	(x[0-9]+), sp, #?38
> > -**	st1b	z2\.b, \1, \[\4\]
> > -**	(
> > -**	str	z3, \[sp, #4, mul vl\]
> > -**	str	z4, \[sp, #5, mul vl\]
> > -**	str	z5, \[sp, #6, mul vl\]
> > -**	str	z6, \[sp, #7, mul vl\]
> > -**	|
> > +**	str	q2, \[sp, 38\]
> >  **	stp	q3, q4, \[sp, 64\]
> >  **	stp	q5, q6, \[sp, 96\]
> > -**	)
> >  **	mov	(x7, sp|w7, wsp)
> >  **	add	sp, sp, #?128
> >  **	ret
> > @@ -595,8 +570,7 @@ SEL2 (struct, pst_big1)
> >
> >  /*
> >  ** test_pst_big1_a: { target lp64 }
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1b	z0\.b, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  /*
> > @@ -760,8 +734,7 @@ test_pst_big3_d (struct pst_big3 x)
> >
> >  /*
> >  ** test_pst_big3_e: { target lp64 }
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1b	z0\.b, \1/z, \[x0, #1, mul vl\]
> > +**	ldr	q0, \[x0, 16\]
> >  **	ret
> >  */
> >  /*
> > @@ -780,8 +753,7 @@ test_pst_big3_e (struct pst_big3 x)
> >
> >  /*
> >  ** test_pst_big3_f: { target lp64 }
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1b	z0\.b, \1/z, \[x0, #5, mul vl\]
> > +**	ldr	q0, \[x0, 80\]
> >  **	ret
> >  */
> >  /*
> > @@ -1035,8 +1007,7 @@ SEL2 (struct, nonpst6)
> >
> >  /*
> >  ** test_nonpst6: { target lp64 }
> > -**	ptrue	(p[0-3])\.b, vl16
> > -**	ld1d	z0\.d, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  /*
> > @@ -1063,8 +1034,7 @@ SEL2 (struct, nonpst7)
> >
> >  /*
> >  ** test_nonpst7: { target lp64 }
> > -**	ptrue	(p[0-3])\.b, vl16
> > -**	ld1d	z0\.d, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  /*