Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>> On 25 Apr 2025, at 19:55, Richard Sandiford <richard.sandif...@arm.com> wrote:
>> 
>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>> If -msve-vector-bits=128, SVE loads and stores (LD1 and ST1) with a
>>> ptrue predicate can be replaced by Neon instructions (LDR and STR),
>>> thus avoiding the predicate altogether. This also enables formation of
>>> LDP/STP pairs.
>>> 
>>> For example, the test cases
>>> 
>>> svfloat64_t
>>> ptrue_load (float64_t *x)
>>> {
>>>  svbool_t pg = svptrue_b64 ();
>>>  return svld1_f64 (pg, x);
>>> }
>>> void
>>> ptrue_store (float64_t *x, svfloat64_t data)
>>> {
>>>  svbool_t pg = svptrue_b64 ();
>>>  return svst1_f64 (pg, x, data);
>>> }
>>> 
>>> were previously compiled to
>>> (with -O2 -march=armv8.2-a+sve -msve-vector-bits=128):
>>> 
>>> ptrue_load:
>>>        ptrue   p3.b, vl16
>>>        ld1d    z0.d, p3/z, [x0]
>>>        ret
>>> ptrue_store:
>>>        ptrue   p3.b, vl16
>>>        st1d    z0.d, p3, [x0]
>>>        ret
>>> 
>>> Now they are compiled to:
>>> 
>>> ptrue_load:
>>>        ldr     q0, [x0]
>>>        ret
>>> ptrue_store:
>>>        str     q0, [x0]
>>>        ret
>>> 
>>> The implementation includes the if-statement
>>> if (known_eq (BYTES_PER_SVE_VECTOR, 16)
>>>    && known_eq (GET_MODE_SIZE (mode), 16))
>>> 
>>> which checks for 128-bit VLS and excludes partial modes with a mode
>>> size of less than 128 bits (e.g. VNx2QI).
>> 
>> I think it would be better to use:
>> 
>> if (known_eq (GET_MODE_SIZE (mode), 16)
>>     && aarch64_classify_vector_mode (mode) == VEC_SVE_DATA)
>> 
>> to defend against any partial structure modes that might be added in future.
>> 
>>> 
>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no regression.
>>> OK for mainline?
>>> 
>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>>> 
>>> gcc/
>>> * config/aarch64/aarch64.cc (aarch64_emit_sve_pred_move):
>>> Fold LD1/ST1 with ptrue to LDR/STR for 128-bit VLS.
>>> 
>>> gcc/testsuite/
>>> * gcc.target/aarch64/sve/ldst_ptrue_128_to_neon.c: New test.
>>> * gcc.target/aarch64/sve/cond_arith_6.c: Adjust expected outcome.
>>> * gcc.target/aarch64/sve/pcs/return_4_128.c: Likewise.
>>> * gcc.target/aarch64/sve/pcs/return_5_128.c: Likewise.
>>> * gcc.target/aarch64/sve/pcs/struct_3_128.c: Likewise.
>>> ---
>>> gcc/config/aarch64/aarch64.cc                 | 27 ++++++--
>>> .../gcc.target/aarch64/sve/cond_arith_6.c     |  3 +-
>>> .../aarch64/sve/ldst_ptrue_128_to_neon.c      | 36 +++++++++++
>>> .../gcc.target/aarch64/sve/pcs/return_4_128.c | 39 ++++-------
>>> .../gcc.target/aarch64/sve/pcs/return_5_128.c | 39 ++++-------
>>> .../gcc.target/aarch64/sve/pcs/struct_3_128.c | 64 +++++--------------
>>> 6 files changed, 102 insertions(+), 106 deletions(-)
>>> create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/ldst_ptrue_128_to_neon.c
>>> 
>>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>>> index f7bccf532f8..ac01149276b 100644
>>> --- a/gcc/config/aarch64/aarch64.cc
>>> +++ b/gcc/config/aarch64/aarch64.cc
>>> @@ -6416,13 +6416,28 @@ aarch64_stack_protect_canary_mem (machine_mode mode, rtx decl_rtl,
>>> void
>>> aarch64_emit_sve_pred_move (rtx dest, rtx pred, rtx src)
>>> {
>>> -  expand_operand ops[3];
>>>   machine_mode mode = GET_MODE (dest);
>>> -  create_output_operand (&ops[0], dest, mode);
>>> -  create_input_operand (&ops[1], pred, GET_MODE(pred));
>>> -  create_input_operand (&ops[2], src, mode);
>>> -  temporary_volatile_ok v (true);
>>> -  expand_insn (code_for_aarch64_pred_mov (mode), 3, ops);
>>> +  if ((MEM_P (dest) || MEM_P (src))
>>> +      && known_eq (BYTES_PER_SVE_VECTOR, 16)
>>> +      && known_eq (GET_MODE_SIZE (mode), 16)
>>> +      && !BYTES_BIG_ENDIAN)
>>> +    {
>>> +      rtx tmp = gen_reg_rtx (V16QImode);
>>> +      emit_move_insn (tmp, lowpart_subreg (V16QImode, src, mode));
>>> +      if (MEM_P (src))
>>> +        emit_move_insn (dest, lowpart_subreg (mode, tmp, V16QImode));
>>> +      else
>>> +        emit_move_insn (adjust_address (dest, V16QImode, 0), tmp);
>> 
>> We shouldn't usually need a temporary register for the store case.
>> Also, using lowpart_subreg for a source memory leads to the best-avoided
>> subregs of mems when the mem is volatile, due to:
>> 
>>      /* Allow splitting of volatile memory references in case we don't
>>         have instruction to move the whole thing.  */
>>      && (! MEM_VOLATILE_P (op)
>>          || ! have_insn_for (SET, innermode))
>> 
>> in simplify_subreg.  So how about:
>> 
>>       if (MEM_P (src))
>>         {
>>           rtx tmp = force_reg (V16QImode, adjust_address (src, V16QImode, 0));
>>           emit_move_insn (dest, lowpart_subreg (mode, tmp, V16QImode));
>>         }
>>       else
>>         emit_move_insn (adjust_address (dest, V16QImode, 0),
>>                         force_lowpart_subreg (V16QImode, src, mode));
>> 
>> It might be good to test the volatile case too.  That case does work
>> with your patch, since the subreg gets ironed out later.  It's just for
>> completeness.
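>> 
>> For reference, folding both of the suggestions above together, the whole
>> function might end up looking roughly like this (an untested sketch of the
>> overall shape rather than the final patch; the tail is just the existing
>> predicated expansion, kept as a fallback):
>> 
>> void
>> aarch64_emit_sve_pred_move (rtx dest, rtx pred, rtx src)
>> {
>>   machine_mode mode = GET_MODE (dest);
>>   if ((MEM_P (dest) || MEM_P (src))
>>       && known_eq (GET_MODE_SIZE (mode), 16)
>>       && aarch64_classify_vector_mode (mode) == VEC_SVE_DATA
>>       && !BYTES_BIG_ENDIAN)
>>     {
>>       /* A full 128-bit SVE data mode on a 128-bit VLS target: use an
>>          Advanced SIMD (V16QI) move so that no predicate is needed and
>>          the load/store can later form LDP/STP pairs.  */
>>       if (MEM_P (src))
>>         {
>>           rtx tmp = force_reg (V16QImode, adjust_address (src, V16QImode, 0));
>>           emit_move_insn (dest, lowpart_subreg (mode, tmp, V16QImode));
>>         }
>>       else
>>         emit_move_insn (adjust_address (dest, V16QImode, 0),
>>                         force_lowpart_subreg (V16QImode, src, mode));
>>       return;
>>     }
>>   /* Otherwise fall back to the existing predicated SVE move.  */
>>   expand_operand ops[3];
>>   create_output_operand (&ops[0], dest, mode);
>>   create_input_operand (&ops[1], pred, GET_MODE (pred));
>>   create_input_operand (&ops[2], src, mode);
>>   temporary_volatile_ok v (true);
>>   expand_insn (code_for_aarch64_pred_mov (mode), 3, ops);
>> }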
>
> I don’t disagree with the suggestion here, just an observation on testing.
>
> Out of interest I tried adding volatile in godbolt and got a warning about 
> discarded volatile qualifiers in C:
> https://godbolt.org/z/9bj4W39MP
> And an error in C++ about invalid conversion
> https://godbolt.org/z/7vf3r58ja
>
> How would you recommend a test is written here?

When doing the review, I just used pointer dereferencing:

  svint8_t f1(volatile svint8_t *ptr) { return *ptr; }
  void f2(volatile svint8_t *ptr, svint8_t x) { *ptr = x; }
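
If it helps, a complete testcase along those lines could look roughly like the
sketch below.  This is untested and purely illustrative: the function names are
invented, and the dg directives and scan patterns are modelled on the expected
LDR/STR output quoted earlier in the thread rather than taken from the patch.

  /* { dg-do compile } */
  /* { dg-options "-O2 -march=armv8.2-a+sve -msve-vector-bits=128" } */

  #include <arm_sve.h>

  svint8_t
  volatile_load (volatile svint8_t *ptr)
  {
    return *ptr;
  }

  void
  volatile_store (volatile svint8_t *ptr, svint8_t x)
  {
    *ptr = x;
  }

  /* Illustrative expectations, based on the ptrue_load/ptrue_store
     output shown earlier.  */
  /* { dg-final { scan-assembler {\tldr\tq0, \[x0\]} } } */
  /* { dg-final { scan-assembler {\tstr\tq0, \[x0\]} } } */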

Thanks,
Richard
