https://gcc.gnu.org/g:e7f049471c6caf22c65ac48773d864fca7a4cdc4
commit r16-2178-ge7f049471c6caf22c65ac48773d864fca7a4cdc4
Author: Richard Sandiford <richard.sandif...@arm.com>
Date:   Thu Jul 10 16:54:45 2025 +0100

    aarch64: Fix LD1Q and ST1Q failures for big-endian

    LD1Q gathers and ST1Q scatters are unusual in that they operate on
    128-bit blocks (effectively VNx1TI).  However, we don't have modes or
    ACLE types for 128-bit integers, and 128-bit integers are not the
    intended use case.  Instead, the instructions are intended to be used
    in "hybrid VLA" operations, where each 128-bit block is an Advanced
    SIMD vector.  The normal SVE modes therefore capture the intended use
    case better than VNx1TI would.  For example, VNx2DI is effectively
    N copies of V2DI, VNx4SI N copies of V4SI, etc.

    Since there is only one LD1Q instruction and one ST1Q instruction,
    the ACLE support used a single pattern for each, with the loaded or
    stored data having mode VNx2DI.  The ST1Q pattern was generated by:

        rtx data = e.args.last ();
        e.args.last () = force_lowpart_subreg (VNx2DImode, data,
                                                GET_MODE (data));
        e.prepare_gather_address_operands (1, false);
        return e.use_exact_insn (CODE_FOR_aarch64_scatter_st1q);

    where the force_lowpart_subreg bitcast the stored data to VNx2DI.
    But such subregs require an element reverse on big-endian targets
    (see the comment at the head of aarch64-sve.md), which wasn't the
    intention.  The code should have used aarch64_sve_reinterpret instead.

    The LD1Q pattern was used as follows:

        e.prepare_gather_address_operands (1, false);
        return e.use_exact_insn (CODE_FOR_aarch64_gather_ld1q);

    which always returns a VNx2DI value, leaving the caller to bitcast
    that to the correct mode.  That bitcast again uses subregs and has
    the same problem as above.

    However, for the reasons explained in the comment, using
    aarch64_sve_reinterpret does not work well for LD1Q.  The patch
    instead parameterises the LD1Q based on the required data mode.

    gcc/
            * config/aarch64/aarch64-sve2.md (aarch64_gather_ld1q): Replace
            with...
            (@aarch64_gather_ld1q<mode>): ...this, parameterizing based on
            mode.
            * config/aarch64/aarch64-sve-builtins-sve2.cc
            (svld1q_gather_impl::expand): Update accordingly.
            (svst1q_scatter_impl::expand): Use aarch64_sve_reinterpret
            instead of force_lowpart_subreg.
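For context, a rough sketch of the intended "hybrid VLA" usage from C is
below.  The intrinsic names svld1q_gather_u64base_u64 and
svst1q_scatter_u64base_u64 are assumed SVE2.1 ACLE spellings rather than
anything defined by this patch; check arm_sve.h on an SVE2.1-enabled
toolchain (e.g. -march=armv9-a+sve2p1) for the exact names and signatures.

    /* Sketch only: gather and scatter whole 128-bit blocks.  The data is
       held as svuint64_t (VNx2DI), i.e. N copies of a V2DI-sized block,
       even though each block is conceptually an Advanced SIMD vector.
       Intrinsic names are assumed ACLE spellings, not part of this patch.  */
    #include <arm_sve.h>

    svuint64_t
    copy_blocks (svbool_t pg, svuint64_t src_addrs, svuint64_t dst_addrs)
    {
      /* Gather 128-bit blocks from the addresses in src_addrs.  */
      svuint64_t blocks = svld1q_gather_u64base_u64 (pg, src_addrs);
      /* Scatter the same blocks back out to the addresses in dst_addrs.  */
      svst1q_scatter_u64base_u64 (pg, dst_addrs, blocks);
      return blocks;
    }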
Diff:
---
 gcc/config/aarch64/aarch64-sve-builtins-sve2.cc |  5 +++--
 gcc/config/aarch64/aarch64-sve2.md              | 21 +++++++++++++++------
 2 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
index d9922de7ca5a..abe21a8b61c6 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
@@ -316,7 +316,8 @@ public:
   expand (function_expander &e) const override
   {
     e.prepare_gather_address_operands (1, false);
-    return e.use_exact_insn (CODE_FOR_aarch64_gather_ld1q);
+    auto icode = code_for_aarch64_gather_ld1q (e.tuple_mode (0));
+    return e.use_exact_insn (icode);
   }
 };
 
@@ -722,7 +723,7 @@ public:
   expand (function_expander &e) const override
   {
     rtx data = e.args.last ();
-    e.args.last () = force_lowpart_subreg (VNx2DImode, data, GET_MODE (data));
+    e.args.last () = aarch64_sve_reinterpret (VNx2DImode, data);
     e.prepare_gather_address_operands (1, false);
     return e.use_exact_insn (CODE_FOR_aarch64_scatter_st1q);
   }
diff --git a/gcc/config/aarch64/aarch64-sve2.md b/gcc/config/aarch64/aarch64-sve2.md
index 789ec0dd1a3c..660901d4b3f1 100644
--- a/gcc/config/aarch64/aarch64-sve2.md
+++ b/gcc/config/aarch64/aarch64-sve2.md
@@ -334,12 +334,21 @@
 ;; - LD1Q (SVE2p1)
 ;; -------------------------------------------------------------------------
 
-;; Model this as operating on the largest valid element size, which is DI.
-;; This avoids having to define move patterns & more for VNx1TI, which would
-;; be difficult without a non-gather form of LD1Q.
-(define_insn "aarch64_gather_ld1q"
-  [(set (match_operand:VNx2DI 0 "register_operand")
-	(unspec:VNx2DI
+;; For little-endian targets, it would be enough to use a single pattern,
+;; with a subreg to bitcast the result to whatever mode is needed.
+;; However, on big-endian targets, the bitcast would need to be an
+;; aarch64_sve_reinterpret instruction.  That would interact badly
+;; with the "&" and "?" constraints in this pattern: if the result
+;; of the reinterpret needs to be in the same register as the index,
+;; the RA would tend to prefer to allocate a separate register for the
+;; intermediate (uncast) result, even if the reinterpret prefers tying.
+;;
+;; The index is logically VNx1DI rather than VNx2DI, but introducing
+;; and using VNx1DI would just create more bitcasting.  The ACLE intrinsic
+;; uses svuint64_t, which corresponds to VNx2DI.
+(define_insn "@aarch64_gather_ld1q<mode>"
+  [(set (match_operand:SVE_FULL 0 "register_operand")
+	(unspec:SVE_FULL
 	  [(match_operand:VNx2BI 1 "register_operand")
 	   (match_operand:DI 2 "aarch64_reg_or_zero")
 	   (match_operand:VNx2DI 3 "register_operand")
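As a loose, self-contained illustration of why the choice of bitcast matters
(plain C, no SVE, not part of the patch): viewing the same 16 bytes of memory
as two 64-bit elements rather than four 32-bit elements pairs the values
differently on big-endian and little-endian.  A subreg is defined in terms of
that memory image, which is why, per the commit message, such subregs need an
element reverse on big-endian, while aarch64_sve_reinterpret keeps the
register lane contents unchanged.

    /* Loose analogy only: the same memory image, reinterpreted at a wider
       element size, associates values differently depending on byte order.  */
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int
    main (void)
    {
      uint32_t s[4] = { 0x11111111u, 0x22222222u, 0x33333333u, 0x44444444u };
      uint64_t d[2];
      memcpy (d, s, sizeof d);  /* bitcast through the memory image */
      /* Little-endian: d[0] == 0x2222222211111111.
         Big-endian:    d[0] == 0x1111111122222222.  */
      printf ("d[0] = 0x%016" PRIx64 "\n", d[0]);
      return 0;
    }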