On Thu, Jul 10, 2025 at 6:22 AM Richard Sandiford <richard.sandif...@arm.com> wrote:
>
> LD1Q gathers and ST1Q scatters are unusual in that they operate
> on 128-bit blocks (effectively VNx1TI). However, we don't have
> modes or ACLE types for 128-bit integers, and 128-bit integers
> are not the intended use case. Instead, the instructions are
> intended to be used in "hybrid VLA" operations, where each 128-bit
> block is an Advanced SIMD vector.
>
> The normal SVE modes therefore capture the intended use case better
> than VNx1TI would. For example, VNx2DI is effectively N copies
> of V2DI, VNx4SI N copies of V4SI, etc.
>
> Since there is only one LD1Q instruction and one ST1Q instruction,
> the ACLE support used a single pattern for each, with the loaded or
> stored data having mode VNx2DI. The ST1Q pattern was generated by:
>
>     rtx data = e.args.last ();
>     e.args.last () = force_lowpart_subreg (VNx2DImode, data, GET_MODE (data));
>     e.prepare_gather_address_operands (1, false);
>     return e.use_exact_insn (CODE_FOR_aarch64_scatter_st1q);
>
> where the force_lowpart_subreg bitcast the stored data to VNx2DI.
> But such subregs require an element reverse on big-endian targets
> (see the comment at the head of aarch64-sve.md), which wasn't the
> intention. The code should have used aarch64_sve_reinterpret instead.
>
> The LD1Q pattern was used as follows:
>
>     e.prepare_gather_address_operands (1, false);
>     return e.use_exact_insn (CODE_FOR_aarch64_gather_ld1q);
>
> which always returns a VNx2DI value, leaving the caller to bitcast
> that to the correct mode. That bitcast again uses subregs and has
> the same problem as above.
>
> However, for the reasons explained in the comment, using
> aarch64_sve_reinterpret does not work well for LD1Q. The patch
> instead parameterises the LD1Q based on the required data mode.
>
> Tested on aarch64-linux-gnu and aarch64_be-elf. OK to install?

Ok.

>
> Richard
>
>
> gcc/
> 	* config/aarch64/aarch64-sve2.md (aarch64_gather_ld1q): Replace
> 	with...
> 	(@aarch64_gather_ld1q<mode>): ...this, parameterizing based on mode.
> 	* config/aarch64/aarch64-sve-builtins-sve2.cc
> 	(svld1q_gather_impl::expand): Update accordingly.
> 	(svst1q_scatter_impl::expand): Use aarch64_sve_reinterpret
> 	instead of force_lowpart_subreg.
> ---
>  .../aarch64/aarch64-sve-builtins-sve2.cc      |  5 +++--
>  gcc/config/aarch64/aarch64-sve2.md            | 21 +++++++++++++------
>  2 files changed, 18 insertions(+), 8 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
> b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
> index d9922de7ca5..abe21a8b61c 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
> @@ -316,7 +316,8 @@ public:
>    expand (function_expander &e) const override
>    {
>      e.prepare_gather_address_operands (1, false);
> -    return e.use_exact_insn (CODE_FOR_aarch64_gather_ld1q);
> +    auto icode = code_for_aarch64_gather_ld1q (e.tuple_mode (0));
> +    return e.use_exact_insn (icode);
>    }
>  };
>
> @@ -722,7 +723,7 @@ public:
>    expand (function_expander &e) const override
>    {
>      rtx data = e.args.last ();
> -    e.args.last () = force_lowpart_subreg (VNx2DImode, data, GET_MODE
> (data));
> +    e.args.last () = aarch64_sve_reinterpret (VNx2DImode, data);
>      e.prepare_gather_address_operands (1, false);
>      return e.use_exact_insn (CODE_FOR_aarch64_scatter_st1q);
>    }
> diff --git a/gcc/config/aarch64/aarch64-sve2.md
> b/gcc/config/aarch64/aarch64-sve2.md
> index 62524f36de6..f39a0a964f2 100644
> --- a/gcc/config/aarch64/aarch64-sve2.md
> +++ b/gcc/config/aarch64/aarch64-sve2.md
> @@ -334,12 +334,21 @@ (define_insn "@aarch64_<optab><mode>_strided4"
>  ;; - LD1Q (SVE2p1)
>  ;; -------------------------------------------------------------------------
>
> -;; Model this as operating on the largest valid element size, which is DI.
> -;; This avoids having to define move patterns & more for VNx1TI, which would
> -;; be difficult without a non-gather form of LD1Q.
> -(define_insn "aarch64_gather_ld1q"
> -  [(set (match_operand:VNx2DI 0 "register_operand")
> -	(unspec:VNx2DI
> +;; For little-endian targets, it would be enough to use a single pattern,
> +;; with a subreg to bitcast the result to whatever mode is needed.
> +;; However, on big-endian targets, the bitcast would need to be an
> +;; aarch64_sve_reinterpret instruction. That would interact badly
> +;; with the "&" and "?" constraints in this pattern: if the result
> +;; of the reinterpret needs to be in the same register as the index,
> +;; the RA would tend to prefer to allocate a separate register for the
> +;; intermediate (uncast) result, even if the reinterpret prefers tying.
> +;;
> +;; The index is logically VNx1DI rather than VNx2DI, but introducing
> +;; and using VNx1DI would just create more bitcasting. The ACLE intrinsic
> +;; uses svuint64_t, which corresponds to VNx2DI.
> +(define_insn "@aarch64_gather_ld1q<mode>"
> +  [(set (match_operand:SVE_FULL 0 "register_operand")
> +	(unspec:SVE_FULL
> 	  [(match_operand:VNx2BI 1 "register_operand")
> 	   (match_operand:DI 2 "aarch64_reg_or_zero")
> 	   (match_operand:VNx2DI 3 "register_operand")
> --
> 2.43.0
>
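
As a rough mental model of the LD1Q gather discussed above: each active
128-bit block is loaded from a per-block 64-bit address plus an optional
scalar offset, which matches the operand list of the new pattern (VNx2BI
predicate, DI offset, VNx2DI addresses). Below is a minimal reference
sketch in plain C; all names are hypothetical, and it is neither the ACLE
intrinsic interface nor the GCC expander, just an illustration of the
per-block semantics.

    /* Reference model only: for each active 128-bit block i, copy 16 bytes
       from bases[i] + offset into the destination; inactive blocks are
       zeroed, as for a predicated SVE gather load.  Hypothetical names.  */
    #include <stdint.h>
    #include <string.h>

    static void
    ld1q_gather_ref (uint8_t *dst, const uint8_t *pred, const uint64_t *bases,
                     uint64_t offset, unsigned nblocks)
    {
      for (unsigned i = 0; i < nblocks; i++)
        {
          if (pred[i])
            memcpy (dst + 16 * i,
                    (const void *) (uintptr_t) (bases[i] + offset), 16);
          else
            memset (dst + 16 * i, 0, 16);
        }
    }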