Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandif...@arm.com>
>> Sent: Friday, April 25, 2025 6:55 PM
>> To: Jennifer Schmitz <jschm...@nvidia.com>
>> Cc: gcc-patches@gcc.gnu.org
>> Subject: Re: [PATCH] AArch64: Fold LD1/ST1 with ptrue to LDR/STR for
>> 128-bit VLS
>>
>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>> > If -msve-vector-bits=128, SVE loads and stores (LD1 and ST1) with a
>> > ptrue predicate can be replaced by Neon instructions (LDR and STR),
>> > thus avoiding the predicate altogether.  This also enables formation
>> > of LDP/STP pairs.
>> >
>> > For example, the test cases
>> >
>> > svfloat64_t
>> > ptrue_load (float64_t *x)
>> > {
>> >   svbool_t pg = svptrue_b64 ();
>> >   return svld1_f64 (pg, x);
>> > }
>> >
>> > void
>> > ptrue_store (float64_t *x, svfloat64_t data)
>> > {
>> >   svbool_t pg = svptrue_b64 ();
>> >   svst1_f64 (pg, x, data);
>> > }
>> >
>> > were previously compiled to
>> > (with -O2 -march=armv8.2-a+sve -msve-vector-bits=128):
>> >
>> > ptrue_load:
>> >         ptrue   p3.b, vl16
>> >         ld1d    z0.d, p3/z, [x0]
>> >         ret
>> > ptrue_store:
>> >         ptrue   p3.b, vl16
>> >         st1d    z0.d, p3, [x0]
>> >         ret
>> >
>> > Now they are compiled to:
>> >
>> > ptrue_load:
>> >         ldr     q0, [x0]
>> >         ret
>> > ptrue_store:
>> >         str     q0, [x0]
>> >         ret
>> >
>> > The implementation includes the if-statement
>> >
>> >   if (known_eq (BYTES_PER_SVE_VECTOR, 16)
>> >       && known_eq (GET_MODE_SIZE (mode), 16))
>> >
>> > which checks for 128-bit VLS and excludes partial modes with a
>> > mode size < 128 bits (e.g. VNx2QI).
>>
>> I think it would be better to use:
>>
>>   if (known_eq (GET_MODE_SIZE (mode), 16)
>>       && aarch64_classify_vector_mode (mode) == VEC_SVE_DATA)
>>
>> to defend against any partial structure modes that might be added in
>> future.
>>
>
> Hi Both,
>
> Just a suggestion, so feel free to ignore, but I do wonder whether this
> optimization shouldn't look at the predicate bits rather than the mode
> size.  The fold is valid for any load or store whose predicate enables
> exactly the lower N bits, where N corresponds to an Adv. SIMD register
> size.
>
> e.g. it should be valid for:
>
> #include <arm_sve.h>
>
> svfloat64_t
> ptrue_load (float64_t *x)
> {
>   svbool_t pg = svptrue_pat_b8 (SV_VL16);
>   return svld1_f64 (pg, x);
> }
>
> void
> ptrue_store (float64_t *x, svfloat64_t data)
> {
>   svbool_t pg = svptrue_pat_b8 (SV_VL16);
>   svst1_f64 (pg, x, data);
> }
>
> and, more generally, also for:
>
> #include <arm_sve.h>
>
> svfloat64_t
> ptrue_load (float64_t *x)
> {
>   svbool_t pg = svptrue_pat_b8 (SV_VL8);
>   return svld1_f64 (pg, x);
> }
>
> void
> ptrue_store (float64_t *x, svfloat64_t data)
> {
>   svbool_t pg = svptrue_pat_b8 (SV_VL8);
>   svst1_f64 (pg, x, data);
> }
>
> It just so happens that at VL128, SV_VL16 == SV_ALL.  Looking at the
> predicate bits instead would help optimize all such codegen.
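As an aside on the SV_VL8 case: at -msve-vector-bits=128 that predicate
enables only the low 64 bits, so a predicate-aware version of the fold
could plausibly use a D register rather than a Q register.  The expected
codegen would be something like the following (a hypothetical
illustration of the suggestion, not what any current compiler emits):

ptrue_load:
        ldr     d0, [x0]
        ret
ptrue_store:
        str     d0, [x0]
        ret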
Yeah, I agree we should do that at some point too.  But
aarch64_emit_sve_pred_move isn't as general as its name suggests, in
that the provided predicate is always a ptrue:

/* Emit an SVE predicated move from SRC to DEST.  PRED is a predicate
   that is known to contain PTRUE.  */

That is, this function is used by code that wants to emit a full vector
move.  The reason that the predicate is passed in rather than being
generated internally is because of aarch64_sve_reload_mem and the
post-reload splitters.

Thanks,
Richard
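For concreteness, here is a minimal sketch of how the suggested guard
could sit in aarch64_emit_sve_pred_move, assuming a little-endian-only
fold and an illustrative V16QImode retyping of the operands (a sketch of
the idea under review, not the committed patch):

  /* Sketch only: when the predicate is known to contain PTRUE (per the
     function comment above) and the mode is a full 128-bit SVE data
     vector, emit a plain Adv. SIMD move instead of a predicated
     LD1/ST1.  Checking VEC_SVE_DATA excludes partial modes and any
     partial structure modes added in future.  */
  machine_mode mode = GET_MODE (dest);
  if (known_eq (GET_MODE_SIZE (mode), 16)
      && aarch64_classify_vector_mode (mode) == VEC_SVE_DATA
      && !BYTES_BIG_ENDIAN)
    {
      /* Retype each operand as V16QI: adjust the address for MEMs,
         take the low part for registers.  The resulting move becomes
         LDR/STR and is eligible for LDP/STP formation.  */
      rtx d = MEM_P (dest) ? adjust_address (dest, V16QImode, 0)
                           : gen_lowpart (V16QImode, dest);
      rtx s = MEM_P (src) ? adjust_address (src, V16QImode, 0)
                          : gen_lowpart (V16QImode, src);
      emit_move_insn (d, s);
      return;
    }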