Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>> On 25 Apr 2025, at 19:55, Richard Sandiford <richard.sandif...@arm.com> wrote:
>>
>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>> If -msve-vector-bits=128, SVE loads and stores (LD1 and ST1) with a
>>> ptrue predicate can be replaced by neon instructions (LDR and STR),
>>> thus avoiding the predicate altogether.  This also enables formation of
>>> LDP/STP pairs.
>>>
>>> For example, the test cases
>>>
>>> svfloat64_t
>>> ptrue_load (float64_t *x)
>>> {
>>>   svbool_t pg = svptrue_b64 ();
>>>   return svld1_f64 (pg, x);
>>> }
>>> void
>>> ptrue_store (float64_t *x, svfloat64_t data)
>>> {
>>>   svbool_t pg = svptrue_b64 ();
>>>   return svst1_f64 (pg, x, data);
>>> }
>>>
>>> were previously compiled to
>>> (with -O2 -march=armv8.2-a+sve -msve-vector-bits=128):
>>>
>>> ptrue_load:
>>>         ptrue   p3.b, vl16
>>>         ld1d    z0.d, p3/z, [x0]
>>>         ret
>>> ptrue_store:
>>>         ptrue   p3.b, vl16
>>>         st1d    z0.d, p3, [x0]
>>>         ret
>>>
>>> Now they are compiled to:
>>>
>>> ptrue_load:
>>>         ldr     q0, [x0]
>>>         ret
>>> ptrue_store:
>>>         str     q0, [x0]
>>>         ret
>>>
>>> The implementation includes the if-statement
>>>
>>>   if (known_eq (BYTES_PER_SVE_VECTOR, 16)
>>>       && known_eq (GET_MODE_SIZE (mode), 16))
>>>
>>> which checks for 128-bit VLS and excludes partial modes with a
>>> mode size < 128 (e.g. VNx2QI).
>>
>> I think it would be better to use:
>>
>>   if (known_eq (GET_MODE_SIZE (mode), 16)
>>       && aarch64_classify_vector_mode (mode) == VEC_SVE_DATA)
>>
>> to defend against any partial structure modes that might be added in future.
>>
>>>
>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no regression.
>>> OK for mainline?
>>>
>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>>>
>>> gcc/
>>>         * config/aarch64/aarch64.cc (aarch64_emit_sve_pred_move):
>>>         Fold LD1/ST1 with ptrue to LDR/STR for 128-bit VLS.
>>>
>>> gcc/testsuite/
>>>         * gcc.target/aarch64/sve/ldst_ptrue_128_to_neon.c: New test.
>>>         * gcc.target/aarch64/sve/cond_arith_6.c: Adjust expected outcome.
>>>         * gcc.target/aarch64/sve/pcs/return_4_128.c: Likewise.
>>>         * gcc.target/aarch64/sve/pcs/return_5_128.c: Likewise.
>>>         * gcc.target/aarch64/sve/pcs/struct_3_128.c: Likewise.
>>> ---
>>>  gcc/config/aarch64/aarch64.cc                 | 27 ++++++--
>>>  .../gcc.target/aarch64/sve/cond_arith_6.c     |  3 +-
>>>  .../aarch64/sve/ldst_ptrue_128_to_neon.c      | 36 +++++++++++
>>>  .../gcc.target/aarch64/sve/pcs/return_4_128.c | 39 ++++-------
>>>  .../gcc.target/aarch64/sve/pcs/return_5_128.c | 39 ++++-------
>>>  .../gcc.target/aarch64/sve/pcs/struct_3_128.c | 64 +++++--------
>>>  6 files changed, 102 insertions(+), 106 deletions(-)
>>>  create mode 100644
>>> gcc/testsuite/gcc.target/aarch64/sve/ldst_ptrue_128_to_neon.c
>>>
>>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>>> index f7bccf532f8..ac01149276b 100644
>>> --- a/gcc/config/aarch64/aarch64.cc
>>> +++ b/gcc/config/aarch64/aarch64.cc
>>> @@ -6416,13 +6416,28 @@ aarch64_stack_protect_canary_mem (machine_mode
>>> mode, rtx decl_rtl,
>>>  void
>>>  aarch64_emit_sve_pred_move (rtx dest, rtx pred, rtx src)
>>>  {
>>> -  expand_operand ops[3];
>>>    machine_mode mode = GET_MODE (dest);
>>> -  create_output_operand (&ops[0], dest, mode);
>>> -  create_input_operand (&ops[1], pred, GET_MODE(pred));
>>> -  create_input_operand (&ops[2], src, mode);
>>> -  temporary_volatile_ok v (true);
>>> -  expand_insn (code_for_aarch64_pred_mov (mode), 3, ops);
>>> +  if ((MEM_P (dest) || MEM_P (src))
>>> +      && known_eq (BYTES_PER_SVE_VECTOR, 16)
>>> +      && known_eq (GET_MODE_SIZE (mode), 16)
>>> +      && !BYTES_BIG_ENDIAN)
>>> +    {
>>> +      rtx tmp = gen_reg_rtx (V16QImode);
>>> +      emit_move_insn (tmp, lowpart_subreg (V16QImode, src, mode));
>>> +      if (MEM_P (src))
>>> +        emit_move_insn (dest, lowpart_subreg (mode, tmp, V16QImode));
>>> +      else
>>> +        emit_move_insn (adjust_address (dest, V16QImode, 0), tmp);
>>
>> We shouldn't usually need a temporary register for the store case.
>> Also, using lowpart_subreg for a source memory leads to the best-avoided
>> subregs of mems when the mem is volatile, due to:
>>
>>   /* Allow splitting of volatile memory references in case we don't
>>      have instruction to move the whole thing.  */
>>   && (! MEM_VOLATILE_P (op)
>>       || ! have_insn_for (SET, innermode))
>>
>> in simplify_subreg.  So how about:
>>
>>   if (MEM_P (src))
>>     {
>>       rtx tmp = force_reg (V16QImode, adjust_address (src, V16QImode, 0));
>>       emit_move_insn (dest, lowpart_subreg (mode, tmp, V16QImode));
>>     }
>>   else
>>     emit_move_insn (adjust_address (dest, V16QImode, 0),
>>                     force_lowpart_subreg (V16QImode, src, mode));
>>
>> It might be good to test the volatile case too.  That case does work
>> with your patch, since the subreg gets ironed out later.  It's just for
>> completeness.
>
> I don’t disagree with the suggestion here, just an observation on testing.
>
> Out of interest I tried adding volatile in godbolt and got a warning about
> discarded volatile qualifiers in C:
> https://godbolt.org/z/9bj4W39MP
> And an error in C++ about invalid conversion:
> https://godbolt.org/z/7vf3r58ja
>
> How would you recommend a test is written here?
When doing the review, I just used pointer dereferencing:

  svint8_t f1 (volatile svint8_t *ptr) { return *ptr; }
  void f2 (volatile svint8_t *ptr, svint8_t x) { *ptr = x; }

Thanks,
Richard
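
For reference, a minimal sketch of how the f1/f2 functions above might be
wrapped into a dejagnu test for the 128-bit VLS case.  The dg-options and the
scan-assembler patterns are assumptions, not taken from the patch (the actual
ldst_ptrue_128_to_neon.c test may use different directives); only the two
functions come from Richard's example:

  /* { dg-do compile } */
  /* { dg-options "-O2 -march=armv8.2-a+sve -msve-vector-bits=128" } */

  #include <arm_sve.h>

  /* Volatile load: with the patch this is expected to become a plain LDR
     rather than a ptrue + LD1D pair.  */
  svint8_t f1 (volatile svint8_t *ptr) { return *ptr; }

  /* Volatile store: likewise expected to become a plain STR.  */
  void f2 (volatile svint8_t *ptr, svint8_t x) { *ptr = x; }

  /* { dg-final { scan-assembler {\tldr\tq0, \[x0\]} } } */
  /* { dg-final { scan-assembler {\tstr\tq0, \[x0\]} } } */

The q0/x0 registers in the scan patterns follow from the SVE PCS for these
signatures; a check-function-bodies style test would also work and would make
the absence of the ptrue explicit.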