https://gcc.gnu.org/g:e7f049471c6caf22c65ac48773d864fca7a4cdc4
commit r16-2178-ge7f049471c6caf22c65ac48773d864fca7a4cdc4
Author: Richard Sandiford <richard.sandif...@arm.com>
Date:   Thu Jul 10 16:54:45 2025 +0100

    aarch64: Fix LD1Q and ST1Q failures for big-endian

    LD1Q gathers and ST1Q scatters are unusual in that they operate on
    128-bit blocks (effectively VNx1TI).  However, we don't have modes or
    ACLE types for 128-bit integers, and 128-bit integers are not the
    intended use case.  Instead, the instructions are intended to be used
    in "hybrid VLA" operations, where each 128-bit block is an Advanced
    SIMD vector.  The normal SVE modes therefore capture the intended use
    case better than VNx1TI would.  For example, VNx2DI is effectively
    N copies of V2DI, VNx4SI N copies of V4SI, etc.

    Since there is only one LD1Q instruction and one ST1Q instruction,
    the ACLE support used a single pattern for each, with the loaded or
    stored data having mode VNx2DI.  The ST1Q pattern was generated by:

        rtx data = e.args.last ();
        e.args.last () = force_lowpart_subreg (VNx2DImode, data,
                                                GET_MODE (data));
        e.prepare_gather_address_operands (1, false);
        return e.use_exact_insn (CODE_FOR_aarch64_scatter_st1q);

    where the force_lowpart_subreg bitcast the stored data to VNx2DI.
    But such subregs require an element reverse on big-endian targets
    (see the comment at the head of aarch64-sve.md), which wasn't the
    intention.  The code should have used aarch64_sve_reinterpret instead.

    The LD1Q pattern was used as follows:

        e.prepare_gather_address_operands (1, false);
        return e.use_exact_insn (CODE_FOR_aarch64_gather_ld1q);

    which always returns a VNx2DI value, leaving the caller to bitcast
    that to the correct mode.  That bitcast again uses subregs and has
    the same problem as above.

    However, for the reasons explained in the comment, using
    aarch64_sve_reinterpret does not work well for LD1Q.  The patch
    instead parameterises the LD1Q based on the required data mode.

    gcc/
            * config/aarch64/aarch64-sve2.md (aarch64_gather_ld1q): Replace
            with...
            (@aarch64_gather_ld1q<mode>): ...this, parameterizing based on
            mode.
            * config/aarch64/aarch64-sve-builtins-sve2.cc
            (svld1q_gather_impl::expand): Update accordingly.
            (svst1q_scatter_impl::expand): Use aarch64_sve_reinterpret
            instead of force_lowpart_subreg.
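For context, a rough sketch of the intended "hybrid VLA" usage from C is
below.  The intrinsic names svld1q_gather_u64base_u64 and
svst1q_scatter_u64base_u64 are assumed SVE2.1 ACLE spellings rather than
anything defined by this patch; check arm_sve.h on an SVE2.1-enabled
toolchain (e.g. -march=armv9-a+sve2p1) for the exact names and signatures.

    /* Sketch only: gather and scatter whole 128-bit blocks.  The data is
       held as svuint64_t (VNx2DI), i.e. N copies of a V2DI-sized block,
       even though each block is conceptually an Advanced SIMD vector.
       Intrinsic names are assumed ACLE spellings, not part of this patch.  */
    #include <arm_sve.h>

    svuint64_t
    copy_blocks (svbool_t pg, svuint64_t src_addrs, svuint64_t dst_addrs)
    {
      /* Gather 128-bit blocks from the addresses in src_addrs.  */
      svuint64_t blocks = svld1q_gather_u64base_u64 (pg, src_addrs);
      /* Scatter the same blocks back out to the addresses in dst_addrs.  */
      svst1q_scatter_u64base_u64 (pg, dst_addrs, blocks);
      return blocks;
    }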
Diff:
---
 gcc/config/aarch64/aarch64-sve-builtins-sve2.cc |  5 +++--
 gcc/config/aarch64/aarch64-sve2.md              | 21 +++++++++++++++------
 2 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
index d9922de7ca5a..abe21a8b61c6 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
@@ -316,7 +316,8 @@ public:
   expand (function_expander &e) const override
   {
     e.prepare_gather_address_operands (1, false);
-    return e.use_exact_insn (CODE_FOR_aarch64_gather_ld1q);
+    auto icode = code_for_aarch64_gather_ld1q (e.tuple_mode (0));
+    return e.use_exact_insn (icode);
   }
 };
 
@@ -722,7 +723,7 @@ public:
   expand (function_expander &e) const override
   {
     rtx data = e.args.last ();
-    e.args.last () = force_lowpart_subreg (VNx2DImode, data, GET_MODE (data));
+    e.args.last () = aarch64_sve_reinterpret (VNx2DImode, data);
     e.prepare_gather_address_operands (1, false);
     return e.use_exact_insn (CODE_FOR_aarch64_scatter_st1q);
   }
diff --git a/gcc/config/aarch64/aarch64-sve2.md b/gcc/config/aarch64/aarch64-sve2.md
index 789ec0dd1a3c..660901d4b3f1 100644
--- a/gcc/config/aarch64/aarch64-sve2.md
+++ b/gcc/config/aarch64/aarch64-sve2.md
@@ -334,12 +334,21 @@
 ;; - LD1Q (SVE2p1)
 ;; -------------------------------------------------------------------------
 
-;; Model this as operating on the largest valid element size, which is DI.
-;; This avoids having to define move patterns & more for VNx1TI, which would
-;; be difficult without a non-gather form of LD1Q.
-(define_insn "aarch64_gather_ld1q"
-  [(set (match_operand:VNx2DI 0 "register_operand")
-	(unspec:VNx2DI
+;; For little-endian targets, it would be enough to use a single pattern,
+;; with a subreg to bitcast the result to whatever mode is needed.
+;; However, on big-endian targets, the bitcast would need to be an
+;; aarch64_sve_reinterpret instruction.  That would interact badly
+;; with the "&" and "?" constraints in this pattern: if the result
+;; of the reinterpret needs to be in the same register as the index,
+;; the RA would tend to prefer to allocate a separate register for the
+;; intermediate (uncast) result, even if the reinterpret prefers tying.
+;;
+;; The index is logically VNx1DI rather than VNx2DI, but introducing
+;; and using VNx1DI would just create more bitcasting.  The ACLE intrinsic
+;; uses svuint64_t, which corresponds to VNx2DI.
+(define_insn "@aarch64_gather_ld1q<mode>"
+  [(set (match_operand:SVE_FULL 0 "register_operand")
+	(unspec:SVE_FULL
 	  [(match_operand:VNx2BI 1 "register_operand")
 	   (match_operand:DI 2 "aarch64_reg_or_zero")
 	   (match_operand:VNx2DI 3 "register_operand")
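As a loose, self-contained illustration of why the choice of bitcast matters
(plain C, no SVE, not part of the patch): viewing the same 16 bytes of memory
as two 64-bit elements rather than four 32-bit elements pairs the values
differently on big-endian and little-endian.  A subreg is defined in terms of
that memory image, which is why, per the commit message, such subregs need an
element reverse on big-endian, while aarch64_sve_reinterpret keeps the
register lane contents unchanged.

    /* Loose analogy only: the same memory image, reinterpreted at a wider
       element size, associates values differently depending on byte order.  */
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int
    main (void)
    {
      uint32_t s[4] = { 0x11111111u, 0x22222222u, 0x33333333u, 0x44444444u };
      uint64_t d[2];
      memcpy (d, s, sizeof d);  /* bitcast through the memory image */
      /* Little-endian: d[0] == 0x2222222211111111.
         Big-endian:    d[0] == 0x1111111122222222.  */
      printf ("d[0] = 0x%016" PRIx64 "\n", d[0]);
      return 0;
    }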