> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Friday, April 25, 2025 6:55 PM
> To: Jennifer Schmitz <jschm...@nvidia.com>
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] AArch64: Fold LD1/ST1 with ptrue to LDR/STR for 128-bit VLS
>
> Jennifer Schmitz <jschm...@nvidia.com> writes:
> > If -msve-vector-bits=128, SVE loads and stores (LD1 and ST1) with a
> > ptrue predicate can be replaced by Neon instructions (LDR and STR),
> > thus avoiding the predicate altogether.  This also enables formation
> > of LDP/STP pairs.
> >
> > For example, the test cases
> >
> > svfloat64_t
> > ptrue_load (float64_t *x)
> > {
> >   svbool_t pg = svptrue_b64 ();
> >   return svld1_f64 (pg, x);
> > }
> >
> > void
> > ptrue_store (float64_t *x, svfloat64_t data)
> > {
> >   svbool_t pg = svptrue_b64 ();
> >   return svst1_f64 (pg, x, data);
> > }
> >
> > were previously compiled to
> > (with -O2 -march=armv8.2-a+sve -msve-vector-bits=128):
> >
> > ptrue_load:
> >         ptrue   p3.b, vl16
> >         ld1d    z0.d, p3/z, [x0]
> >         ret
> >
> > ptrue_store:
> >         ptrue   p3.b, vl16
> >         st1d    z0.d, p3, [x0]
> >         ret
> >
> > Now they are compiled to:
> >
> > ptrue_load:
> >         ldr     q0, [x0]
> >         ret
> >
> > ptrue_store:
> >         str     q0, [x0]
> >         ret
> >
> > The implementation includes the if-statement
> >
> >   if (known_eq (BYTES_PER_SVE_VECTOR, 16)
> >       && known_eq (GET_MODE_SIZE (mode), 16))
> >
> > which checks for 128-bit VLS and excludes partial modes with a
> > mode size < 128 bits (e.g. VNx2QI).
>
> I think it would be better to use:
>
>   if (known_eq (GET_MODE_SIZE (mode), 16)
>       && aarch64_classify_vector_mode (mode) == VEC_SVE_DATA)
>
> to defend against any partial structure modes that might be added in
> future.
>
Hi Both,

Just a suggestion so feel free to ignore, but I do wonder if this
optimization shouldn't look at the predicate bits rather than the mode
size, since this is valid for any load where the predicate uses the
lower N bits, where N corresponds to an Adv. SIMD register size.

E.g. it should be valid for:

#include <arm_sve.h>

svfloat64_t
ptrue_load (float64_t *x)
{
  svbool_t pg = svptrue_pat_b8 (SV_VL16);
  return svld1_f64 (pg, x);
}

void
ptrue_store (float64_t *x, svfloat64_t data)
{
  svbool_t pg = svptrue_pat_b8 (SV_VL16);
  return svst1_f64 (pg, x, data);
}

and, in general, also for:

#include <arm_sve.h>

svfloat64_t
ptrue_load (float64_t *x)
{
  svbool_t pg = svptrue_pat_b8 (SV_VL8);
  return svld1_f64 (pg, x);
}

void
ptrue_store (float64_t *x, svfloat64_t data)
{
  svbool_t pg = svptrue_pat_b8 (SV_VL8);
  return svst1_f64 (pg, x, data);
}

It just so happens that at VL128, SV_VL16 == SV_ALL.  Looking at the
predicate bits instead would help optimize all codegen.
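For the SV_VL8 case at VL128 I'd expect the 64-bit Adv. SIMD forms,
i.e. ldr/str of a D register (which for the load also zeroes the
inactive lanes, matching the ld1d semantics).  A hypothetical companion
test could look something like the below (untested sketch; function
names made up, and the scan patterns are my assumption of the expected
codegen):

/* { dg-do compile } */
/* { dg-options "-O2 -msve-vector-bits=128" } */

#include <arm_sve.h>

svfloat64_t
vl8_load (float64_t *x)
{
  /* Only the low 64 bits are active, so the ld1d can use an Adv. SIMD
     ldr of a D register: the inactive element is zeroed either way.  */
  svbool_t pg = svptrue_pat_b8 (SV_VL8);
  return svld1_f64 (pg, x);
}

void
vl8_store (float64_t *x, svfloat64_t data)
{
  /* Likewise, only the low 64 bits are stored.  */
  svbool_t pg = svptrue_pat_b8 (SV_VL8);
  svst1_f64 (pg, x, data);
}

/* { dg-final { scan-assembler {\tldr\td0, \[x0\]} } } */
/* { dg-final { scan-assembler {\tstr\td0, \[x0\]} } } */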
Thanks,
Tamar

> >
> > The patch was bootstrapped and tested on aarch64-linux-gnu, no
> > regression.
> > OK for mainline?
> >
> > Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
> >
> > gcc/
> > 	* config/aarch64/aarch64.cc (aarch64_emit_sve_pred_move):
> > 	Fold LD1/ST1 with ptrue to LDR/STR for 128-bit VLS.
> >
> > gcc/testsuite/
> > 	* gcc.target/aarch64/sve/ldst_ptrue_128_to_neon.c: New test.
> > 	* gcc.target/aarch64/sve/cond_arith_6.c: Adjust expected outcome.
> > 	* gcc.target/aarch64/sve/pcs/return_4_128.c: Likewise.
> > 	* gcc.target/aarch64/sve/pcs/return_5_128.c: Likewise.
> > 	* gcc.target/aarch64/sve/pcs/struct_3_128.c: Likewise.
> > ---
> >  gcc/config/aarch64/aarch64.cc                 | 27 ++++++--
> >  .../gcc.target/aarch64/sve/cond_arith_6.c     |  3 +-
> >  .../aarch64/sve/ldst_ptrue_128_to_neon.c      | 36 +++++++++++
> >  .../gcc.target/aarch64/sve/pcs/return_4_128.c | 39 ++++-------
> >  .../gcc.target/aarch64/sve/pcs/return_5_128.c | 39 ++++-------
> >  .../gcc.target/aarch64/sve/pcs/struct_3_128.c | 64 +++++--------------
> >  6 files changed, 102 insertions(+), 106 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/ldst_ptrue_128_to_neon.c
> >
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index f7bccf532f8..ac01149276b 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -6416,13 +6416,28 @@ aarch64_stack_protect_canary_mem (machine_mode mode, rtx decl_rtl,
> >  void
> >  aarch64_emit_sve_pred_move (rtx dest, rtx pred, rtx src)
> >  {
> > -  expand_operand ops[3];
> >    machine_mode mode = GET_MODE (dest);
> > -  create_output_operand (&ops[0], dest, mode);
> > -  create_input_operand (&ops[1], pred, GET_MODE(pred));
> > -  create_input_operand (&ops[2], src, mode);
> > -  temporary_volatile_ok v (true);
> > -  expand_insn (code_for_aarch64_pred_mov (mode), 3, ops);
> > +  if ((MEM_P (dest) || MEM_P (src))
> > +      && known_eq (BYTES_PER_SVE_VECTOR, 16)
> > +      && known_eq (GET_MODE_SIZE (mode), 16)
> > +      && !BYTES_BIG_ENDIAN)
> > +    {
> > +      rtx tmp = gen_reg_rtx (V16QImode);
> > +      emit_move_insn (tmp, lowpart_subreg (V16QImode, src, mode));
> > +      if (MEM_P (src))
> > +	emit_move_insn (dest, lowpart_subreg (mode, tmp, V16QImode));
> > +      else
> > +	emit_move_insn (adjust_address (dest, V16QImode, 0), tmp);
>
> We shouldn't usually need a temporary register for the store case.
>
> Also, using lowpart_subreg for a source memory leads to the best-avoided
> subregs of mems when the mem is volatile, due to:
>
> 	  /* Allow splitting of volatile memory references in case we don't
> 	     have instruction to move the whole thing.  */
> 	  && (! MEM_VOLATILE_P (op)
> 	      || ! have_insn_for (SET, innermode))
>
> in simplify_subreg.  So how about:
>
>   if (MEM_P (src))
>     {
>       rtx tmp = force_reg (V16QImode, adjust_address (src, V16QImode, 0));
>       emit_move_insn (dest, lowpart_subreg (mode, tmp, V16QImode));
>     }
>   else
>     emit_move_insn (adjust_address (dest, V16QImode, 0),
> 		    force_lowpart_subreg (V16QImode, src, mode));
>
> It might be good to test the volatile case too.  That case does work
> with your patch, since the subreg gets ironed out later.  It's just for
> completeness.
>
> Thanks,
> Richard
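On the volatile testing point: a volatile variant could be something
like the below (untested sketch; function names are made up).  It uses
the fixed-length SVE types so that plain dereferences produce volatile
memory accesses of SVE modes, assuming -msve-vector-bits=128:

/* { dg-do compile } */
/* { dg-options "-O2 -msve-vector-bits=128" } */

#include <arm_sve.h>

typedef svint8_t fixed_int8_t __attribute__ ((arm_sve_vector_bits (128)));

fixed_int8_t
volatile_load (volatile fixed_int8_t *ptr)
{
  /* Expected to fold to a single "ldr q0, [x0]" at VL128, without
     taking a subreg of the volatile mem.  */
  return *ptr;
}

void
volatile_store (volatile fixed_int8_t *ptr, fixed_int8_t data)
{
  /* Likewise a single "str q0, [x0]".  */
  *ptr = data;
}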
> > +    }
> > +  else
> > +    {
> > +      expand_operand ops[3];
> > +      create_output_operand (&ops[0], dest, mode);
> > +      create_input_operand (&ops[1], pred, GET_MODE(pred));
> > +      create_input_operand (&ops[2], src, mode);
> > +      temporary_volatile_ok v (true);
> > +      expand_insn (code_for_aarch64_pred_mov (mode), 3, ops);
> > +    }
> >  }
> >
> >  /* Expand a pre-RA SVE data move from SRC to DEST in which at least one
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_arith_6.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_arith_6.c
> > index 4085ab12444..d5a12f1df07 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_arith_6.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_arith_6.c
> > @@ -8,7 +8,8 @@ f (float *x)
> >        x[i] -= 1.0f;
> >  }
> >
> > -/* { dg-final { scan-assembler {\tld1w\tz} } } */
> > +/* { dg-final { scan-assembler {\tld1w\tz} { target aarch64_big_endian } } } */
> > +/* { dg-final { scan-assembler {\tldr\tq} { target aarch64_little_endian } } } */
> >  /* { dg-final { scan-assembler {\tfcmgt\tp} } } */
> >  /* { dg-final { scan-assembler {\tfsub\tz} } } */
> >  /* { dg-final { scan-assembler {\tst1w\tz} } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/ldst_ptrue_128_to_neon.c b/gcc/testsuite/gcc.target/aarch64/sve/ldst_ptrue_128_to_neon.c
> > new file mode 100644
> > index 00000000000..69f42b121ad
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/ldst_ptrue_128_to_neon.c
> > @@ -0,0 +1,36 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -msve-vector-bits=128" } */
> > +/* { dg-require-effective-target aarch64_little_endian } */
> > +
> > +#include <arm_sve.h>
> > +
> > +#define TEST(TYPE, TY, B)			\
> > +  sv##TYPE					\
> > +  ld1_##TY##B (TYPE *x)				\
> > +  {						\
> > +    svbool_t pg = svptrue_b##B ();		\
> > +    return svld1_##TY##B (pg, x);		\
> > +  }						\
> > +						\
> > +  void						\
> > +  st1_##TY##B (TYPE *x, sv##TYPE data)		\
> > +  {						\
> > +    svbool_t pg = svptrue_b##B ();		\
> > +    return svst1_##TY##B (pg, x, data);		\
> > +  }						\
> > +
> > +TEST (bfloat16_t, bf, 16)
> > +TEST (float16_t, f, 16)
> > +TEST (float32_t, f, 32)
> > +TEST (float64_t, f, 64)
> > +TEST (int8_t, s, 8)
> > +TEST (int16_t, s, 16)
> > +TEST (int32_t, s, 32)
> > +TEST (int64_t, s, 64)
> > +TEST (uint8_t, u, 8)
> > +TEST (uint16_t, u, 16)
> > +TEST (uint32_t, u, 32)
> > +TEST (uint64_t, u, 64)
> > +
> > +/* { dg-final { scan-assembler-times {\tldr\tq0, \[x0\]} 12 } } */
> > +/* { dg-final { scan-assembler-times {\tstr\tq0, \[x0\]} 12 } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_128.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_128.c
> > index 87d528c84cd..ac5f981490a 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_128.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_128.c
> > @@ -11,104 +11,91 @@
> >
> >  /*
> >  ** callee_s8:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1b	z0\.b, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (s8, __SVInt8_t)
> >
> >  /*
> >  ** callee_u8:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1b	z0\.b, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (u8, __SVUint8_t)
> >
> >  /*
> >  ** callee_mf8:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1b	z0\.b, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (mf8, __SVMfloat8_t)
> >
> >  /*
> >  ** callee_s16:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1h	z0\.h, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (s16, __SVInt16_t)
> >
> >  /*
> >  ** callee_u16:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1h	z0\.h, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (u16, __SVUint16_t)
> >
> >  /*
> >  ** callee_f16:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1h	z0\.h, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (f16, __SVFloat16_t)
> >
> >  /*
> >  ** callee_bf16:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1h	z0\.h, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (bf16, __SVBfloat16_t)
> >
> >  /*
> >  ** callee_s32:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1w	z0\.s, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (s32, __SVInt32_t)
> >
> >  /*
> >  ** callee_u32:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1w	z0\.s, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (u32, __SVUint32_t)
> >
> >  /*
> >  ** callee_f32:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1w	z0\.s, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (f32, __SVFloat32_t)
> >
> >  /*
> >  ** callee_s64:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1d	z0\.d, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (s64, __SVInt64_t)
> >
> >  /*
> >  ** callee_u64:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1d	z0\.d, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (u64, __SVUint64_t)
> >
> >  /*
> >  ** callee_f64:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1d	z0\.d, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (f64, __SVFloat64_t)
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_128.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_128.c
> > index 347a16c1367..2fab6feb41c 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_128.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_128.c
> > @@ -13,104 +13,91 @@
> >
> >  /*
> >  ** callee_s8:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1b	z0\.b, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (s8, svint8_t)
> >
> >  /*
> >  ** callee_u8:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1b	z0\.b, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (u8, svuint8_t)
> >
> >  /*
> >  ** callee_mf8:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1b	z0\.b, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (mf8, svmfloat8_t)
> >
> >  /*
> >  ** callee_s16:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1h	z0\.h, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (s16, svint16_t)
> >
> >  /*
> >  ** callee_u16:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1h	z0\.h, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (u16, svuint16_t)
> >
> >  /*
> >  ** callee_f16:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1h	z0\.h, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (f16, svfloat16_t)
> >
> >  /*
> >  ** callee_bf16:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1h	z0\.h, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (bf16, svbfloat16_t)
> >
> >  /*
> >  ** callee_s32:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1w	z0\.s, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (s32, svint32_t)
> >
> >  /*
> >  ** callee_u32:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1w	z0\.s, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (u32, svuint32_t)
> >
> >  /*
> >  ** callee_f32:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1w	z0\.s, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (f32, svfloat32_t)
> >
> >  /*
> >  ** callee_s64:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1d	z0\.d, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (s64, svint64_t)
> >
> >  /*
> >  ** callee_u64:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1d	z0\.d, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (u64, svuint64_t)
> >
> >  /*
> >  ** callee_f64:
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1d	z0\.d, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  CALLEE (f64, svfloat64_t)
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/struct_3_128.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/struct_3_128.c
> > index d99ce1202a9..370bd9e3bfe 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/struct_3_128.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/struct_3_128.c
> > @@ -473,17 +473,11 @@ SEL2 (struct, pst_uniform4)
> >  **	sub	sp, sp, #144
> >  **	add	(x[0-9]+), sp, #?31
> >  **	and	x7, \1, #?(?:-32|4294967264)
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	st1w	z0\.s, \2, \[x7\]
> > -**	add	(x[0-9]+), x7, #?32
> > -**	(
> > -**	str	z1, \[\3\]
> > -**	str	z2, \[\3, #1, mul vl\]
> > -**	|
> > -**	stp	q1, q2, \[\3\]
> > -**	)
> > -**	str	z3, \[\3, #2, mul vl\]
> > -**	st1w	z4\.s, \2, \[x7, #6, mul vl\]
> > +**	mov	x0, x7
> > +**	str	q0, \[x0\], 32
> > +**	stp	q1, q2, \[x0\]
> > +**	str	z3, \[x0, #2, mul vl\]
> > +**	str	q4, \[x7, 96\]
> >  **	add	sp, sp, #?144
> >  **	ret
> >  */
> > @@ -516,20 +510,12 @@ SEL2 (struct, pst_mixed1)
> >  ** test_pst_mixed1:
> >  **	sub	sp, sp, #176
> >  **	str	p0, \[sp\]
> > -**	ptrue	p0\.b, vl16
> > -**	st1h	z0\.h, p0, \[sp, #1, mul vl\]
> > -**	st1h	z1\.h, p0, \[sp, #2, mul vl\]
> > -**	st1w	z2\.s, p0, \[sp, #3, mul vl\]
> > -**	st1d	z3\.d, p0, \[sp, #4, mul vl\]
> > +**	stp	q0, q1, \[sp, 16\]
> > +**	stp	q2, q3, \[sp, 48\]
> >  **	str	p1, \[sp, #40, mul vl\]
> >  **	str	p2, \[sp, #41, mul vl\]
> > -**	st1b	z4\.b, p0, \[sp, #6, mul vl\]
> > -**	st1h	z5\.h, p0, \[sp, #7, mul vl\]
> > -**	...
> > -**	st1w	z6\.s, p0, [^\n]*
> > -**	...
> > -**	st1d	z7\.d, p0, [^\n]*
> > -**	...
> > +**	stp	q4, q5, \[sp, 96\]
> > +**	stp	q6, q7, \[sp, 128\]
> >  **	str	p3, \[sp, #80, mul vl\]
> >  **	mov	(x7, sp|w7, wsp)
> >  **	add	sp, sp, #?176
> > @@ -557,24 +543,13 @@ SEL2 (struct, pst_mixed2)
> >  ** test_pst_mixed2:
> >  **	sub	sp, sp, #128
> >  **	str	p0, \[sp\]
> > -**	ptrue	(p[03])\.b, vl16
> > -**	add	(x[0-9]+), sp, #?2
> > -**	st1b	z0\.b, \1, \[\2\]
> > +**	str	q0, \[sp, 2\]
> >  **	str	p1, \[sp, #9, mul vl\]
> > -**	add	(x[0-9]+), sp, #?20
> > -**	st1b	z1\.b, \1, \[\3\]
> > +**	str	q1, \[sp, 20\]
> >  **	str	p2, \[sp, #18, mul vl\]
> > -**	add	(x[0-9]+), sp, #?38
> > -**	st1b	z2\.b, \1, \[\4\]
> > -**	(
> > -**	str	z3, \[sp, #4, mul vl\]
> > -**	str	z4, \[sp, #5, mul vl\]
> > -**	str	z5, \[sp, #6, mul vl\]
> > -**	str	z6, \[sp, #7, mul vl\]
> > -**	|
> > +**	str	q2, \[sp, 38\]
> >  **	stp	q3, q4, \[sp, 64\]
> >  **	stp	q5, q6, \[sp, 96\]
> > -**	)
> >  **	mov	(x7, sp|w7, wsp)
> >  **	add	sp, sp, #?128
> >  **	ret
> > @@ -595,8 +570,7 @@ SEL2 (struct, pst_big1)
> >
> >  /*
> >  ** test_pst_big1_a: { target lp64 }
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1b	z0\.b, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  /*
> > @@ -760,8 +734,7 @@ test_pst_big3_d (struct pst_big3 x)
> >
> >  /*
> >  ** test_pst_big3_e: { target lp64 }
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1b	z0\.b, \1/z, \[x0, #1, mul vl\]
> > +**	ldr	q0, \[x0, 16\]
> >  **	ret
> >  */
> >  /*
> > @@ -780,8 +753,7 @@ test_pst_big3_e (struct pst_big3 x)
> >
> >  /*
> >  ** test_pst_big3_f: { target lp64 }
> > -**	ptrue	(p[0-7])\.b, vl16
> > -**	ld1b	z0\.b, \1/z, \[x0, #5, mul vl\]
> > +**	ldr	q0, \[x0, 80\]
> >  **	ret
> >  */
> >  /*
> > @@ -1035,8 +1007,7 @@ SEL2 (struct, nonpst6)
> >
> >  /*
> >  ** test_nonpst6: { target lp64 }
> > -**	ptrue	(p[0-3])\.b, vl16
> > -**	ld1d	z0\.d, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  /*
> > @@ -1063,8 +1034,7 @@ SEL2 (struct, nonpst7)
> >
> >  /*
> >  ** test_nonpst7: { target lp64 }
> > -**	ptrue	(p[0-3])\.b, vl16
> > -**	ld1d	z0\.d, \1/z, \[x0\]
> > +**	ldr	q0, \[x0\]
> >  **	ret
> >  */
> >  /*