On Thu, Jul 10, 2025 at 6:22 AM Richard Sandiford
<richard.sandif...@arm.com> wrote:
>
> LD1Q gathers and ST1Q scatters are unusual in that they operate
> on 128-bit blocks (effectively VNx1TI).  However, we don't have
> modes or ACLE types for 128-bit integers, and 128-bit integers
> are not the intended use case.  Instead, the instructions are
> intended to be used in "hybrid VLA" operations, where each 128-bit
> block is an Advanced SIMD vector.
>
> The normal SVE modes therefore capture the intended use case better
> than VNx1TI would.  For example, VNx2DI is effectively N copies
> of V2DI, VNx4SI N copies of V4SI, etc.
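>
> As a rough illustration of that intended usage (the intrinsic name
> and compile flags here are my best-guess ACLE spellings, given only
> for illustration, not something this patch adds):
>
>     #include <arm_sve.h>
>
>     /* Compile for an SVE2.1 target, e.g. -march=armv9-a+sve2p1.
>        Each active lane of BASES points to one 128-bit block; LD1Q
>        gathers those blocks, so the VNx4SI result behaves like N
>        back-to-back copies of a V4SI Advanced SIMD vector.  */
>     svuint32_t
>     gather_blocks (svbool_t pg, svuint64_t bases)
>     {
>       return svld1q_gather_u64base_u32 (pg, bases);
>     }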
>
> Since there is only one LD1Q instruction and one ST1Q instruction,
> the ACLE support used a single pattern for each, with the loaded or
> stored data having mode VNx2DI.  The ST1Q pattern was generated by:
>
>     rtx data = e.args.last ();
>     e.args.last () = force_lowpart_subreg (VNx2DImode, data, GET_MODE (data));
>     e.prepare_gather_address_operands (1, false);
>     return e.use_exact_insn (CODE_FOR_aarch64_scatter_st1q);
>
> where the call to force_lowpart_subreg bitcast the stored data to VNx2DI.
> But such subregs require an element reverse on big-endian targets
> (see the comment at the head of aarch64-sve.md), which wasn't the
> intention.  The code should have used aarch64_sve_reinterpret instead.
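>
> For reference, the fixed expansion (the same code as in the first
> hunk below, with explanatory comments added here) is:
>
>     rtx data = e.args.last ();
>     /* A reinterpret, unlike a subreg, does not imply an element
>        reverse on big-endian targets.  */
>     e.args.last () = aarch64_sve_reinterpret (VNx2DImode, data);
>     e.prepare_gather_address_operands (1, false);
>     return e.use_exact_insn (CODE_FOR_aarch64_scatter_st1q);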
>
> The LD1Q pattern was used as follows:
>
>     e.prepare_gather_address_operands (1, false);
>     return e.use_exact_insn (CODE_FOR_aarch64_gather_ld1q);
>
> which always returns a VNx2DI value, leaving the caller to bitcast
> that to the correct mode.  That bitcast again uses subregs and has
> the same problem as above.
>
> However, for the reasons explained in the comment above the new
> pattern, using aarch64_sve_reinterpret does not work well for LD1Q.
> The patch instead parameterises the LD1Q pattern based on the
> required data mode.
>
> Tested on aarch64-linux-gnu and aarch64_be-elf.  OK to install?

Ok.

>
> Richard
>
>
> gcc/
>         * config/aarch64/aarch64-sve2.md (aarch64_gather_ld1q): Replace
>         with...
>         (@aarch64_gather_ld1q<mode>): ...this, parameterizing based on mode.
>         * config/aarch64/aarch64-sve-builtins-sve2.cc
>         (svld1q_gather_impl::expand): Update accordingly.
>         (svst1q_scatter_impl::expand): Use aarch64_sve_reinterpret
>         instead of force_lowpart_subreg.
> ---
>  .../aarch64/aarch64-sve-builtins-sve2.cc      |  5 +++--
>  gcc/config/aarch64/aarch64-sve2.md            | 21 +++++++++++++------
>  2 files changed, 18 insertions(+), 8 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
> index d9922de7ca5..abe21a8b61c 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
> @@ -316,7 +316,8 @@ public:
>    expand (function_expander &e) const override
>    {
>      e.prepare_gather_address_operands (1, false);
> -    return e.use_exact_insn (CODE_FOR_aarch64_gather_ld1q);
> +    auto icode = code_for_aarch64_gather_ld1q (e.tuple_mode (0));
> +    return e.use_exact_insn (icode);
>    }
>  };
>
> @@ -722,7 +723,7 @@ public:
>    expand (function_expander &e) const override
>    {
>      rtx data = e.args.last ();
> -    e.args.last () = force_lowpart_subreg (VNx2DImode, data, GET_MODE (data));
> +    e.args.last () = aarch64_sve_reinterpret (VNx2DImode, data);
>      e.prepare_gather_address_operands (1, false);
>      return e.use_exact_insn (CODE_FOR_aarch64_scatter_st1q);
>    }
> diff --git a/gcc/config/aarch64/aarch64-sve2.md b/gcc/config/aarch64/aarch64-sve2.md
> index 62524f36de6..f39a0a964f2 100644
> --- a/gcc/config/aarch64/aarch64-sve2.md
> +++ b/gcc/config/aarch64/aarch64-sve2.md
> @@ -334,12 +334,21 @@ (define_insn "@aarch64_<optab><mode>_strided4"
>  ;; - LD1Q (SVE2p1)
>  ;; -------------------------------------------------------------------------
>
> -;; Model this as operating on the largest valid element size, which is DI.
> -;; This avoids having to define move patterns & more for VNx1TI, which would
> -;; be difficult without a non-gather form of LD1Q.
> -(define_insn "aarch64_gather_ld1q"
> -  [(set (match_operand:VNx2DI 0 "register_operand")
> -       (unspec:VNx2DI
> +;; For little-endian targets, it would be enough to use a single pattern,
> +;; with a subreg to bitcast the result to whatever mode is needed.
> +;; However, on big-endian targets, the bitcast would need to be an
> +;; aarch64_sve_reinterpret instruction.  That would interact badly
> +;; with the "&" and "?" constraints in this pattern: if the result
> +;; of the reinterpret needs to be in the same register as the index,
> +;; the RA would tend to prefer to allocate a separate register for the
> +;; intermediate (uncast) result, even if the reinterpret prefers tying.
> +;;
> +;; The index is logically VNx1DI rather than VNx2DI, but introducing
> +;; and using VNx1DI would just create more bitcasting.  The ACLE intrinsic
> +;; uses svuint64_t, which corresponds to VNx2DI.
> +(define_insn "@aarch64_gather_ld1q<mode>"
> +  [(set (match_operand:SVE_FULL 0 "register_operand")
> +       (unspec:SVE_FULL
>           [(match_operand:VNx2BI 1 "register_operand")
>            (match_operand:DI 2 "aarch64_reg_or_zero")
>            (match_operand:VNx2DI 3 "register_operand")
> --
> 2.43.0
>
