Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandif...@arm.com>
>> Sent: Friday, April 25, 2025 6:55 PM
>> To: Jennifer Schmitz <jschm...@nvidia.com>
>> Cc: gcc-patches@gcc.gnu.org
>> Subject: Re: [PATCH] AArch64: Fold LD1/ST1 with ptrue to LDR/STR for 128-bit VLS
>> 
>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>> > If -msve-vector-bits=128, SVE loads and stores (LD1 and ST1) with a
>> > ptrue predicate can be replaced by Neon instructions (LDR and STR),
>> > thus avoiding the predicate altogether. This also enables formation of
>> > LDP/STP pairs.
>> >
>> > For example, the test cases
>> >
>> > svfloat64_t
>> > ptrue_load (float64_t *x)
>> > {
>> >   svbool_t pg = svptrue_b64 ();
>> >   return svld1_f64 (pg, x);
>> > }
>> > void
>> > ptrue_store (float64_t *x, svfloat64_t data)
>> > {
>> >   svbool_t pg = svptrue_b64 ();
>> >   return svst1_f64 (pg, x, data);
>> > }
>> >
>> > were previously compiled to
>> > (with -O2 -march=armv8.2-a+sve -msve-vector-bits=128):
>> >
>> > ptrue_load:
>> >         ptrue   p3.b, vl16
>> >         ld1d    z0.d, p3/z, [x0]
>> >         ret
>> > ptrue_store:
>> >         ptrue   p3.b, vl16
>> >         st1d    z0.d, p3, [x0]
>> >         ret
>> >
>> > Now they are compiled to:
>> >
>> > ptrue_load:
>> >         ldr     q0, [x0]
>> >         ret
>> > ptrue_store:
>> >         str     q0, [x0]
>> >         ret
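>> >
>> > With the predicate gone, adjacent accesses can also be paired; for
>> > example, two consecutive 128-bit loads from x0 and x0 + 16 (a
>> > hypothetical variant of the tests above, not one of the included
>> > test cases) could combine to:
>> >
>> > load_pair:
>> >         ldp     q0, q1, [x0]
>> >         ret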
>> >
>> > The implementation includes the if-statement
>> > if (known_eq (BYTES_PER_SVE_VECTOR, 16)
>> >     && known_eq (GET_MODE_SIZE (mode), 16))
>> >
>> > which checks for 128-bit VLS and excludes partial modes with a
>> > mode size smaller than 128 bits (e.g. VNx2QI).
>> 
>> I think it would be better to use:
>> 
>> if (known_eq (GET_MODE_SIZE (mode), 16)
>>     && aarch64_classify_vector_mode (mode) == VEC_SVE_DATA)
>> 
>> to defend against any partial structure modes that might be added in future.
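>> 
>> A minimal sketch of how that guard might look in context (the
>> surrounding code is assumed rather than taken from the patch):
>> 
>>   if (known_eq (GET_MODE_SIZE (mode), 16)
>>       && aarch64_classify_vector_mode (mode) == VEC_SVE_DATA)
>>     {
>>       /* A full (non-partial) SVE data vector of exactly 128 bits:
>>          safe to use an Advanced SIMD LDR/STR instead of LD1/ST1.  */
>>       ...
>>     }
>> 
>> Partial modes such as VNx2QI classify as VEC_SVE_DATA | VEC_PARTIAL,
>> so the equality comparison rejects them.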
>> 
>
> Hi Both,
>
> Just a suggestion, so feel free to ignore, but I do wonder whether this
> optimization shouldn't look at the predicate bits rather than the mode
> size, since this is valid for any load where the predicate uses the
> lower N bits, where N corresponds to an Adv. SIMD register size.
>
> e.g. it should be valid for:
>
> #include <arm_sve.h>
>
> svfloat64_t
> ptrue_load (float64_t *x)
> {
>   svbool_t pg = svptrue_pat_b8 (SV_VL16);
>   return svld1_f64 (pg, x);
> }
>
> void
> ptrue_store (float64_t *x, svfloat64_t data)
> {
>   svbool_t pg = svptrue_pat_b8 (SV_VL16);
>   return svst1_f64 (pg, x, data);
> }
>
> and, more generally, also for:
>
> #include <arm_sve.h>
>
> svfloat64_t
> ptrue_load (float64_t *x)
> {
>   svbool_t pg = svptrue_pat_b8 (SV_VL8);
>   return svld1_f64 (pg, x);
> }
>
> void
> ptrue_store (float64_t *x, svfloat64_t data)
> {
>   svbool_t pg = svptrue_pat_b8 (SV_VL8);
>   return svst1_f64 (pg, x, data);
> }
>
> It just so happens that at VL128, SV_VL16 == SV_ALL.  Looking at the
> predicate bits instead would help optimize all such codegen.
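>
> For the SV_VL8 variant, matching on the predicate bits could in
> principle give something like (a sketch, not what we generate today):
>
> ptrue_load:
>         ldr     d0, [x0]        // active low 64 bits only
>         ret
> ptrue_store:
>         str     d0, [x0]
>         ret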

Yeah, I agree we should do that at some point too.  But
aarch64_emit_sve_pred_move isn't as general as its name suggests,
in that the provided predicate is always a ptrue:

/* Emit an SVE predicated move from SRC to DEST.  PRED is a predicate
   that is known to contain PTRUE.  */

That is, this function is used by code that wants to emit a full vector
move.  The predicate is passed in rather than generated internally
because of aarch64_sve_reload_mem and the post-reload splitters.
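
For reference, a typical caller materialises the ptrue itself and
passes it in, along the lines of (a sketch; aarch64_ptrue_reg is the
existing helper in aarch64.cc):

  rtx pred = aarch64_ptrue_reg (VNx16BImode);
  aarch64_emit_sve_pred_move (dest, pred, src);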

Thanks,
Richard
