[PATCH v12 8/8] aarch64: improve code generation in aarch64_expand_sve_const_vector

Christopher Bazley Fri, 26 Jun 2026 14:50:08 -0700

Prior to this change, the AArch64 code that GCC generated through the
basic block SLP vectorizer (once it had been extended with support for
predicated tails) was sometimes noticeably less efficient than the
equivalent code generated by loop vectorization.  This change helps to
close the gap.


For example, the following AArch64 assembly language was generated
for an eight-byte constant with trailing zeros, {1,2,1,2,1,2,0,0}:

        ptrue   p6.b, all
        ptrue   p15.b, vl8
        adrp    x1, .LC0
        movi    d29, #0                  ; zero-initialise z29
        add     x1, x1, :lo12:.LC0
        ld1rd   z30.d, p6/z, [x1]        ; load 8-byte constant and
                                         ; duplicate it across z30
        sel     z29.b, p15, z30.b, z29.b ; copy first 8 bytes of
                                         ; z30, zero later lanes

when a simple load would have sufficed (because ASIMD register d30
overlaps the lower bits of SVE register z30):

        adrp    x1, .LANCHOR0
        ldr     d30, [x1, #:lo12:.LANCHOR0] ; load 8-byte constant,
                                            ; zero later lanes

The cause was inappropriate use of the aarch64_expand_sve_const_vector_sel
function, which builds a predicate mask and uses it to control whether
the value of each lane of the expanded SVE CONST_VECTOR is copied from the
first or second element of each pattern.

Internally the vector constant from the above example is encoded by GCC as
eight patterns, each of which has two elements: {1,0}, {2,0}, {1,0}, {2,0},
{1,0}, {2,0}, {0,0}, {0,0}.
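
As an illustration only, the relationship between that encoding and the
expanded vector can be sketched in standalone C++.  The encoded_vector
type and both function names below are hypothetical stand-ins rather
than GCC code, and the sketch assumes at most two elements per pattern
(so that the last encoded element of each pattern simply repeats):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for an encoded CONST_VECTOR (not a GCC type).
// ENCODED holds the first element of every pattern, followed by the
// second element of every pattern.
struct encoded_vector
{
  unsigned int npatterns;
  unsigned int nelts_per_pattern;
  std::vector<int> encoded;
};

// Return element I of the expanded vector.  Pattern P supplies elements
// P, P + NPATTERNS, P + 2 * NPATTERNS and so on.  Assuming one or two
// elements per pattern, the last encoded element of a pattern repeats.
int
expanded_elt (const encoded_vector &v, unsigned int i)
{
  unsigned int pattern = i % v.npatterns;
  unsigned int index = std::min (i / v.npatterns, v.nelts_per_pattern - 1);
  return v.encoded[index * v.npatterns + pattern];
}

// Expand the first NELTS elements of V.
std::vector<int>
expand (const encoded_vector &v, unsigned int nelts)
{
  std::vector<int> result;
  for (unsigned int i = 0; i < nelts; ++i)
    result.push_back (expanded_elt (v, i));
  return result;
}
```

With eight patterns of {1,0}, {2,0}, ... as above, every element after
the first eight comes from the second ("background") element of some
pattern, which is why the constant has zeros in all later lanes.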

A new function has been created which checks whether the second element
of every pattern is zero.  If so, and if the number of vector register bits
required to store one element of every pattern is not greater than 128,
then the new function builds a non-scalable vector from only the first
element of each pattern.  For example, a vector with mode V8QI is used to
move an eight-byte constant into a scalable vector that has mode VNx16QI.
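
The eligibility check described above can be sketched as follows.  This
is a simplified model, not the code in the patch: lowpart_lanes is a
hypothetical name, elements are plain integers, and partial vector
modes (where elements sit in wider containers) are assumed not to
occur:

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical model of the check.  ENCODED holds the first element of
// every pattern followed by the second element of every pattern.
// Return the lanes to load into the low part of the register, or
// nothing if the constant does not qualify.
std::optional<std::vector<int>>
lowpart_lanes (const std::vector<int> &encoded, unsigned int npatterns,
               unsigned int elt_bits)
{
  // Exactly two elements per pattern are expected.
  if (npatterns == 0 || encoded.size () != 2 * std::size_t (npatterns))
    return std::nullopt;

  // One element of every pattern must fit in 128 bits.
  if (npatterns * elt_bits > 128)
    return std::nullopt;

  // The second ("background") element of every pattern must be zero.
  for (unsigned int i = 0; i < npatterns; ++i)
    if (encoded[npatterns + i] != 0)
      return std::nullopt;

  // Keep only the first ("foreground") element of each pattern.
  return std::vector<int> (encoded.begin (), encoded.begin () + npatterns);
}
```

For the example constant, eight 8-bit patterns give a width of 64 bits,
so the sketch returns the eight foreground values, which corresponds to
the V8QI case mentioned above.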

gcc/ChangeLog:

        * config/aarch64/aarch64.cc (aarch64_expand_sve_const_vector_lowpart):
        New function that tries to expand a CONST_VECTOR that is encoded as
        N patterns, each pattern comprising a pair of elements in which the
        second element is zero.
        (aarch64_expand_sve_const_vector):
        Try calling aarch64_expand_sve_const_vector_lowpart if each
        pattern comprises a pair of elements, and only use the existing
        aarch64_expand_sve_const_vector_sel function as a fallback.
        (aarch64_v32_mode): New function.
        (aarch64_simd_container_mode): Handle a width of 32 bits.

gcc/testsuite/ChangeLog:

        * gcc.target/aarch64/sve/slp_pred_1.c:
        Modify an existing test for SLP vectorisation to be stricter
        about which instructions are used (and not used) to put a vector
        constant into an SVE register.
        * gcc.target/aarch64/sve/slp_pred_10.c: New test.
        * gcc.target/aarch64/sve/slp_pred_11.c: New test.
        * gcc.target/aarch64/sve/slp_pred_12.c: New test.
        * gcc.target/aarch64/sve/slp_pred_13.c: New test.
        * gcc.target/aarch64/sve/slp_pred_14.c: New test.
        * gcc.target/aarch64/sve/slp_pred_15.c: New test.
        * gcc.target/aarch64/sve/slp_pred_2.c: Update test.
        * gcc.target/aarch64/sve/slp_pred_3.c: Update test.
        * gcc.target/aarch64/sve/slp_pred_4.c: Update test.
        * gcc.target/aarch64/sve/slp_pred_6.c: Update test.
        * gcc.target/aarch64/sve/slp_pred_7.c: Update test.
        * gcc.target/aarch64/sve/slp_pred_9.c: New test.
---
 gcc/config/aarch64/aarch64.cc                 | 135 +++++++++++++++++-
 .../gcc.target/aarch64/sve/slp_pred_1.c       |  14 +-
 .../gcc.target/aarch64/sve/slp_pred_10.c      |  31 ++++
 .../gcc.target/aarch64/sve/slp_pred_11.c      |  31 ++++
 .../gcc.target/aarch64/sve/slp_pred_12.c      |  31 ++++
 .../gcc.target/aarch64/sve/slp_pred_13.c      |  35 +++++
 .../gcc.target/aarch64/sve/slp_pred_14.c      |  29 ++++
 .../gcc.target/aarch64/sve/slp_pred_15.c      |  29 ++++
 .../gcc.target/aarch64/sve/slp_pred_2.c       |   4 +-
 .../gcc.target/aarch64/sve/slp_pred_3.c       |   4 +-
 .../gcc.target/aarch64/sve/slp_pred_4.c       |   4 +-
 .../gcc.target/aarch64/sve/slp_pred_6.c       |   8 +-
 .../gcc.target/aarch64/sve/slp_pred_7.c       |   8 +-
 .../gcc.target/aarch64/sve/slp_pred_9.c       |  35 +++++
 14 files changed, 377 insertions(+), 21 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_10.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_11.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_12.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_13.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_14.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_15.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_9.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 8dc3c43c812..489579cf304 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -6208,6 +6208,104 @@ aarch64_expand_sve_ld1rq (rtx dest, rtx src)
   return true;
 }
 
+/* SRC is an SVE CONST_VECTOR that contains N "foreground" values followed by N
+   "background" values, where N is the number of patterns and the number of
+   elements per pattern is 2.  WIDTH is the number of vector register bits that
+   a complete set of N values would occupy in SRC.  Try to move SRC into TARGET
+   using:
+
+      MOV TARGET.<T>, #<foreground>
+
+   This requires the "background" values to be zero (as they often are) and
+   WIDTH to be small enough such that the "foreground" values can be moved
+   directly into the lower lanes of TARGET.  If this is successful then
+   remaining lanes of TARGET are implicitly zero.
+
+   Return the target on success, otherwise return null.  */
+
+static rtx
+aarch64_expand_sve_const_vector_lowpart (rtx target, rtx src,
+                                        unsigned int width)
+{
+  gcc_assert (CONST_VECTOR_NELTS_PER_PATTERN (src) == 2);
+
+  machine_mode mode = GET_MODE (src);
+  scalar_mode elt_mode = GET_MODE_INNER (mode);
+  unsigned int npatterns = CONST_VECTOR_NPATTERNS (src);
+  unsigned int container_bits = aarch64_sve_container_bits (mode);
+
+  gcc_checking_assert (container_bits * npatterns == width);
+
+  if (width > 128)
+    return NULL_RTX;
+
+  for (unsigned int i = 0; i < npatterns; ++i)
+    if (CONST_VECTOR_ENCODED_ELT (src, i + npatterns) != CONST0_RTX (elt_mode))
+      return NULL_RTX;
+
+  /* Handle partial vector modes.  For example, we can build FOREGROUND_VALUES
+     with V16QImode (16 x 8 bit values) in order to move a vector constant into
+     the LOWPART of a register that has FULL_SVE_MODE == VNx16QImode, then
+     reinterpret that register as having MODE == VNx4QImode (4 x 8 bit values in
+     32 bit containers).  Only one in four ELTs would be non-zero.  */
+  unsigned int elt_bits = GET_MODE_BITSIZE (elt_mode);
+  gcc_checking_assert (container_bits % elt_bits == 0);
+  unsigned int nelts_per_container = container_bits / elt_bits;
+
+  machine_mode lowpart_mode;
+  if (container_bits == elt_bits)
+    lowpart_mode = aarch64_simd_container_mode (elt_mode, width);
+  else
+    lowpart_mode = aarch64_v128_mode (elt_mode).require ();
+
+  if (!VECTOR_MODE_P (lowpart_mode))
+    return NULL_RTX;
+
+  unsigned int lowpart_nunits = GET_MODE_NUNITS (lowpart_mode).to_constant ();
+  if (npatterns * nelts_per_container > lowpart_nunits)
+    return NULL_RTX;
+
+  rtx_vector_builder builder (lowpart_mode, lowpart_nunits, 1);
+  for (unsigned int i = 0; i < npatterns; ++i)
+    {
+      rtx elt = CONST_VECTOR_ENCODED_ELT (src, i);
+      builder.quick_push (elt);
+      for (unsigned int p = 1; p < nelts_per_container; ++p)
+       {
+         builder.quick_push (CONST0_RTX (elt_mode));
+       }
+    }
+
+  for (unsigned int i = builder.length (); i < lowpart_nunits; ++i)
+    builder.quick_push (CONST0_RTX (elt_mode));
+
+  rtx foreground_values = builder.build ();
+  rtx val = force_const_mem (lowpart_mode, foreground_values);
+  if (!val)
+    val = foreground_values;
+
+  /* Create a zero-initialized temporary, FULL_SVE, that has the same ELT_MODE
+     as MODE but is definitely not a partial vector mode (e.g. VNx16QImode
+     instead of VNx4QImode).  */
+  machine_mode full_sve_mode = aarch64_full_sve_mode (elt_mode).require ();
+  rtx full_sve = gen_reg_rtx (full_sve_mode);
+  emit_move_insn (full_sve, CONST0_RTX (full_sve_mode));
+
+  /* Move VAL into some low-order bits of FULL_SVE, where the number of
+     low-order bits to replace is given by LOWPART_MODE (e.g. V4QImode).  */
+  rtx lowpart = gen_lowpart_common (lowpart_mode, full_sve);
+  if (!lowpart)
+    return NULL_RTX;
+
+  emit_move_insn (lowpart, val);
+
+  /* Reinterpret the temporary that has FULL_SVE_MODE (e.g. VNx16QImode) as if
+     it instead had MODE (e.g. VNx4QImode).  */
+  target = aarch64_target_reg (target, mode);
+  emit_insn (gen_aarch64_sve_reinterpret (mode, target, full_sve));
+  return target;
+}
+
 /* SRC is an SVE CONST_VECTOR that contains N "foreground" values followed
    by N "background" values.  Try to move it into TARGET using:
 
@@ -6394,8 +6492,14 @@ aarch64_expand_sve_const_vector (rtx target, rtx src)
     return NULL_RTX;
 
   if (nelts_per_pattern == 2)
-    if (rtx res = aarch64_expand_sve_const_vector_sel (target, src))
-      return res;
+    {
+      if (rtx res = aarch64_expand_sve_const_vector_lowpart (
+           target, src, encoded_bits / nelts_per_pattern))
+       return res;
+
+      if (rtx res = aarch64_expand_sve_const_vector_sel (target, src))
+       return res;
+    }
 
   /* Expand each pattern individually.  */
   gcc_assert (npatterns > 1);
@@ -23637,6 +23741,26 @@ aarch64_full_sve_mode (scalar_mode mode)
     }
 }
 
+/* Return the 32-bit Advanced SIMD vector mode for element mode MODE,
+   if it exists.  */
+opt_machine_mode
+aarch64_v32_mode (scalar_mode mode)
+{
+  switch (mode)
+    {
+    case E_HFmode:
+      return V2HFmode;
+    case E_BFmode:
+      return V2BFmode;
+    case E_HImode:
+      return V2HImode;
+    case E_QImode:
+      return V4QImode;
+    default:
+      return {};
+    }
+}
+
 /* Return the 64-bit Advanced SIMD vector mode for element mode MODE,
    if it exists.  */
 opt_machine_mode
@@ -23699,13 +23823,16 @@ aarch64_simd_container_mode (scalar_mode mode, poly_int64 width)
       && known_eq (width, BITS_PER_SVE_VECTOR))
     return aarch64_full_sve_mode (mode).else_mode (word_mode);
 
-  gcc_assert (known_eq (width, 64) || known_eq (width, 128));
+  gcc_assert (known_eq (width, 32) || known_eq (width, 64)
+             || known_eq (width, 128));
   if (TARGET_BASE_SIMD)
     {
       if (known_eq (width, 128))
        return aarch64_v128_mode (mode).else_mode (word_mode);
-      else
+      if (known_eq (width, 64))
        return aarch64_v64_mode (mode).else_mode (word_mode);
+      if (known_eq (width, 32))
+       return aarch64_v32_mode (mode).else_mode (word_mode);
     }
   return word_mode;
 }
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_1.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_1.c
index cfb732abd55..e8ecba6ff06 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_1.c
@@ -27,7 +27,15 @@ f (uint8_t *x)
   x[14] += 1; // one less than the minimum vector length
 }
 
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7].b, xzr, x[0-9]\n} 1 } } */
-/* { dg-final { scan-assembler-times {\tld1b\tz[0-9]+\.b, p[0-7]/z, \[x[0-9]\]\n} 1 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7].b, xzr, x[0-9]+\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tld1b\tz[0-9]+\.b, p[0-7]/z, \[x[0-9]+\]\n} 1 } } */
 /* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b\n} 1 } } */
-/* { dg-final { scan-assembler-times {\tst1b\tz[0-9]+\.b, p[0-7], \[x[0-9]\]\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tst1b\tz[0-9]+\.b, p[0-7], \[x[0-9]+\]\n} 1 } } */
+
+/* Expect to load addends directly instead of loading and broadcasting the
+   values before conditionally selecting only the lower lanes, which is what
+   happens if the SVE CONST_VECTOR is expanded poorly.  */
+
+/* { dg-final { scan-assembler-times {\tldr\tq[0-9]+, \[x[0-9]+, #:lo12:\.LANCHOR[0-9]+\]\n} 1 } } */
+/* { dg-final { scan-assembler-not {\tld1rqb\tz[0-9]+\.b, p[0-9]+/z, \[x[0-9]+\]\n} } } */
+/* { dg-final { scan-assembler-not {\tsel\tz[0-9]+\.b, p[0-9]+, z[0-9]+\.b, z[0-9]+\.b\n} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_10.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_10.c
new file mode 100644
index 00000000000..d40fcacfc4e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_10.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize --param=aarch64-autovec-preference=sve-only -msve-vector-bits=scalable" } */
+
+#include <stdint.h>
+
+/* Test that we can vectorize with SVE predication when generating vector-length
+   agnostic code if the minimum possible vector length (of 16 bytes) is larger
+   than the number of elements to be processed and the addends can be loaded
+   using an Advanced SIMD word load if the vector constant is padded with
+   zeros.  */
+
+void
+f (uint8_t *x)
+{
+  x[0] += 1;
+  x[1] += 2;
+  x[2] += 1;
+}
+
+/* { dg-final { scan-assembler-times {\tptrue\tp[0-9]+.b, vl3\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tld1b\tz[0-9]+\.b, p[0-7]/z, \[x[0-9]+\]\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tst1b\tz[0-9]+\.b, p[0-7], \[x[0-9]+\]\n} 1 } } */
+
+/* Expect to load addends directly instead of loading and broadcasting the
+   values before conditionally selecting only the lower lanes, which is what
+   happens if the SVE CONST_VECTOR is expanded poorly.  */
+
+/* { dg-final { scan-assembler-times {\tldr\ts[0-9]+, \[x[0-9]+, #:lo12:\.LANCHOR[0-9]+\]\n} 1 } } */
+/* { dg-final { scan-assembler-not {\tld1rw\tz[0-9]+\.s, p[0-9]+/z, \[x[0-9]+\]\n} } } */
+/* { dg-final { scan-assembler-not {\tsel\tz[0-9]+\.b, p[0-9]+, z[0-9]+\.b, z[0-9]+\.b\n} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_11.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_11.c
new file mode 100644
index 00000000000..4f78f953cf9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_11.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize --param=aarch64-autovec-preference=sve-only -msve-vector-bits=scalable" } */
+
+#include <stdint.h>
+
+/* Test that we can vectorize with SVE predication when generating vector-length
+   agnostic code if the minimum possible vector length (of 16 bytes) is larger
+   than the number of elements to be processed and the addends can be loaded
+   using an Advanced SIMD doubleword load if the vector constant is padded with
+   zeros.  */
+
+void
+f (uint16_t *x)
+{
+  x[0] += 1;
+  x[1] += 2;
+  x[2] += 1;
+}
+
+/* { dg-final { scan-assembler-times {\tptrue\tp[0-9]+.h, vl3\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tld1h\tz[0-9]+\.h, p[0-7]/z, \[x[0-9]+\]\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.h, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tst1h\tz[0-9]+\.h, p[0-7], \[x[0-9]+\]\n} 1 } } */
+
+/* Expect to load addends directly instead of loading and broadcasting the
+   values before conditionally selecting only the lower lanes, which is what
+   happens if the SVE CONST_VECTOR is expanded poorly.  */
+
+/* { dg-final { scan-assembler-times {\tldr\td[0-9]+, \[x[0-9]+, #:lo12:\.LANCHOR[0-9]+\]\n} 1 } } */
+/* { dg-final { scan-assembler-not {\tld1rd\tz[0-9]+\.d, p[0-9]+/z, \[x[0-9]+\]\n} } } */
+/* { dg-final { scan-assembler-not {\tsel\tz[0-9]+\.h, p[0-9]+, z[0-9]+\.h, z[0-9]+\.h\n} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_12.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_12.c
new file mode 100644
index 00000000000..a60d6435de8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_12.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize --param=aarch64-autovec-preference=sve-only -msve-vector-bits=scalable" } */
+
+#include <stdint.h>
+
+/* Test that we can vectorize with SVE predication when generating vector-length
+   agnostic code if the minimum possible vector length (of 16 bytes) is larger
+   than the number of elements to be processed and the addends can be loaded
+   using an Advanced SIMD quadword load if the vector constant is padded with
+   zeros.  */
+
+void
+f (uint32_t *x)
+{
+  x[0] += 1;
+  x[1] += 2;
+  x[2] += 1;
+}
+
+/* { dg-final { scan-assembler-times {\tptrue\tp[0-9]+.s, vl3\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, \[x[0-9]+\]\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.s, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], \[x[0-9]+\]\n} 1 } } */
+
+/* Expect to load addends directly instead of loading and broadcasting the
+   values before conditionally selecting only the lower lanes, which is what
+   happens if the SVE CONST_VECTOR is expanded poorly.  */
+
+/* { dg-final { scan-assembler-times {\tldr\tq[0-9]+, \[x[0-9]+, #:lo12:\.LANCHOR[0-9]+\]\n} 1 } } */
+/* { dg-final { scan-assembler-not {\tld1rqw\tz[0-9]+\.s, p[0-9]+/z, \[x[0-9]+\]\n} } } */
+/* { dg-final { scan-assembler-not {\tsel\tz[0-9]+\.s, p[0-9]+, z[0-9]+\.s, z[0-9]+\.s\n} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_13.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_13.c
new file mode 100644
index 00000000000..8a7fafd52b8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_13.c
@@ -0,0 +1,35 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize --param=aarch64-autovec-preference=sve-only -msve-vector-bits=scalable" } */
+
+#include <stdint.h>
+
+/* Test that we can vectorize with SVE predication when generating vector-length
+   agnostic code if the minimum possible vector length (of 16 bytes) is larger
+   than the number of elements to be processed and the addends can be loaded
+   using an Advanced SIMD quadword load if the vector constant is padded with
+   zeros.  */
+
+void
+f (uint16_t *x)
+{
+  x[0] += 1;
+  x[1] += 2;
+  x[2] += 1;
+  x[3] += 2;
+  x[4] += 1;
+  x[5] += 2;
+  x[6] += 1;
+}
+
+/* { dg-final { scan-assembler-times {\tptrue\tp[0-9]+.h, vl7\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tld1h\tz[0-9]+\.h, p[0-7]/z, \[x[0-9]+\]\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.h, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tst1h\tz[0-9]+\.h, p[0-7], \[x[0-9]+\]\n} 1 } } */
+
+/* Expect to load addends directly instead of loading and broadcasting the
+   values before conditionally selecting only the lower lanes, which is what
+   happens if the SVE CONST_VECTOR is expanded poorly.  */
+
+/* { dg-final { scan-assembler-times {\tldr\tq[0-9]+, \[x[0-9]+, #:lo12:\.LANCHOR[0-9]+\]\n} 1 } } */
+/* { dg-final { scan-assembler-not {\tld1rqh\tz[0-9]+\.h, p[0-9]+/z, \[x[0-9]+\]\n} } } */
+/* { dg-final { scan-assembler-not {\tsel\tz[0-9]+\.h, p[0-9]+, z[0-9]+\.h, z[0-9]+\.h\n} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_14.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_14.c
new file mode 100644
index 00000000000..30a293867e9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_14.c
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize --param=aarch64-autovec-preference=sve-only -msve-vector-bits=scalable" } */
+
+/* Test that we can vectorize with SVE predication when generating vector-length
+   agnostic code if the minimum possible vector length (of 16 bytes) is larger
+   than the number of elements to be processed and the addends can be loaded
+   using an Advanced SIMD quadword load if the vector constant is padded with
+   zeros.  */
+
+void
+f (float *x)
+{
+  x[0] += 1;
+  x[1] += 2;
+  x[2] += 1;
+}
+
+/* { dg-final { scan-assembler-times {\tptrue\tp[0-9]+.s, vl3\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, \[x[0-9]+\]\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tfadd\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], \[x[0-9]+\]\n} 1 } } */
+
+/* Expect to load addends directly instead of loading and broadcasting the
+   values before conditionally selecting only the lower lanes, which is what
+   happens if the SVE CONST_VECTOR is expanded poorly.  */
+
+/* { dg-final { scan-assembler-times {\tldr\tq[0-9]+, \[x[0-9]+, #:lo12:\.LANCHOR[0-9]+\]\n} 1 } } */
+/* { dg-final { scan-assembler-not {\tld1rqw\tz[0-9]+\.s, p[0-9]+/z, \[x[0-9]+\]\n} } } */
+/* { dg-final { scan-assembler-not {\tsel\tz[0-9]+\.s, p[0-9]+, z[0-9]+\.s, z[0-9]+\.s\n} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_15.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_15.c
new file mode 100644
index 00000000000..1912bc29301
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_15.c
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize --param=aarch64-autovec-preference=sve-only -msve-vector-bits=scalable" } */
+
+/* Test that we can vectorize with SVE predication when generating vector-length
+   agnostic code if the minimum possible vector length (of 16 bytes) is larger
+   than the number of elements to be processed and the addends can be loaded
+   using an Advanced SIMD doubleword load if the vector constant is padded with
+   zeros.  */
+
+void
+f (_Float16 *x)
+{
+  x[0] += 1;
+  x[1] += 2;
+  x[2] += 3;
+}
+
+/* { dg-final { scan-assembler-times {\tptrue\tp[0-9]+.h, vl3\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tld1h\tz[0-9]+\.h, p[0-7]/z, \[x[0-9]+\]\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tfadd\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tst1h\tz[0-9]+\.h, p[0-7], \[x[0-9]+\]\n} 1 } } */
+
+/* Expect to load addends directly instead of loading and broadcasting the
+   values before conditionally selecting only the lower lanes, which is what
+   happens if the SVE CONST_VECTOR is expanded poorly.  */
+
+/* { dg-final { scan-assembler-times {\tldr\td[0-9]+, \[x[0-9]+, #:lo12:\.LANCHOR[0-9]+\]\n} 1 } } */
+/* { dg-final { scan-assembler-not {\tld1rd\tz[0-9]+\.d, p[0-9]+/z, \[x[0-9]+\]\n} } } */
+/* { dg-final { scan-assembler-not {\tsel\tz[0-9]+\.h, p[0-9]+, z[0-9]+\.h, z[0-9]+\.h\n} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_2.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_2.c
index 464f251f955..941d3773143 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_2.c
@@ -28,6 +28,6 @@ f (uint8_t *x)
 }
 
 /* { dg-final { scan-assembler-times {\tptrue\tp[0-7].b, mul3\n} 1 } } */
-/* { dg-final { scan-assembler-times {\tld1b\tz[0-9]+\.b, p[0-7]/z, \[x[0-9]\]\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tld1b\tz[0-9]+\.b, p[0-7]/z, \[x[0-9]+\]\n} 1 } } */
 /* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b\n} 1 } } */
-/* { dg-final { scan-assembler-times {\tst1b\tz[0-9]+\.b, p[0-7], \[x[0-9]\]\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tst1b\tz[0-9]+\.b, p[0-7], \[x[0-9]+\]\n} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_3.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_3.c
index 7d16fdcad3e..cd202056706 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_3.c
@@ -28,6 +28,6 @@ f (uint8_t *x)
   x[15] += 2; // exactly fits the minimum vector length
 }
 
-/* { dg-final { scan-assembler-times {\tldr\tq[0-9]+, \[x[0-9]\]\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tldr\tq[0-9]+, \[x[0-9]+\]\n} 1 } } */
 /* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b\n} 1 } } */
-/* { dg-final { scan-assembler-times {\tstr\tq[0-9]+, \[x[0-9]\]\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tstr\tq[0-9]+, \[x[0-9]+\]\n} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_4.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_4.c
index 915ddb74fd1..71d7f06d77a 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_4.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_4.c
@@ -28,6 +28,6 @@ f (uint8_t *x)
   x[15] += 2; // exactly fits the configured vector length
 }
 
-/* { dg-final { scan-assembler-times {\tldr\tq[0-9]+, \[x[0-9]\]\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tldr\tq[0-9]+, \[x[0-9]+\]\n} 1 } } */
 /* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b\n} 1 } } */
-/* { dg-final { scan-assembler-times {\tstr\tq[0-9]+, \[x[0-9]\]\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tstr\tq[0-9]+, \[x[0-9]+\]\n} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_6.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_6.c
index 98e91b1af9a..5d165b9ec45 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_6.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_6.c
@@ -30,10 +30,10 @@ f (uint8_t *x)
   x[16] += 1; // one more than the minimum vector length
 }
 
-/* { dg-final { scan-assembler-times {\tldr\tq[0-9]+, \[x[0-9]\]\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tldr\tq[0-9]+, \[x[0-9]+\]\n} 1 } } */
 /* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b\n} 1 } } */
-/* { dg-final { scan-assembler-times {\tstr\tq[0-9]+, \[x[0-9]\]\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tstr\tq[0-9]+, \[x[0-9]+\]\n} 1 } } */
 
-/* { dg-final { scan-assembler-times {\tldrb\tw[0-9]+, \[x[0-9], 16\]\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tldrb\tw[0-9]+, \[x[0-9]+, 16\]\n} 1 } } */
 /* { dg-final { scan-assembler-times {\tadd\tw[0-9]+, w[0-9]+, 1\n} 1 } } */
-/* { dg-final { scan-assembler-times {\tstrb\tw[0-9]+, \[x[0-9], 16\]\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tstrb\tw[0-9]+, \[x[0-9]+, 16\]\n} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_7.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_7.c
index ac9508301a3..b82ed25f9f1 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_7.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_7.c
@@ -29,10 +29,10 @@ f (uint8_t *x)
   x[16] += 1; // one more than the configured vector length
 }
 
-/* { dg-final { scan-assembler-times {\tldr\tq[0-9]+, \[x[0-9]\]\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tldr\tq[0-9]+, \[x[0-9]+\]\n} 1 } } */
 /* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b\n} 1 } } */
-/* { dg-final { scan-assembler-times {\tstr\tq[0-9]+, \[x[0-9]\]\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tstr\tq[0-9]+, \[x[0-9]+\]\n} 1 } } */
 
-/* { dg-final { scan-assembler-times {\tldrb\tw[0-9]+, \[x[0-9], 16\]\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tldrb\tw[0-9]+, \[x[0-9]+, 16\]\n} 1 } } */
 /* { dg-final { scan-assembler-times {\tadd\tw[0-9]+, w[0-9]+, 1\n} 1 } } */
-/* { dg-final { scan-assembler-times {\tstrb\tw[0-9]+, \[x[0-9], 16\]\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tstrb\tw[0-9]+, \[x[0-9]+, 16\]\n} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_9.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_9.c
new file mode 100644
index 00000000000..b83f4539096
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_pred_9.c
@@ -0,0 +1,35 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize --param=aarch64-autovec-preference=sve-only -msve-vector-bits=scalable" } */
+
+#include <stdint.h>
+
+/* Test that we can vectorize with SVE predication when generating vector-length
+   agnostic code if the minimum possible vector length (of 16 bytes) is larger
+   than the number of elements to be processed and the addends can be loaded
+   using an Advanced SIMD doubleword load if the vector constant is padded with
+   zeros.  */
+
+void
+f (uint8_t *x)
+{
+  x[0] += 1;
+  x[1] += 2;
+  x[2] += 1;
+  x[3] += 2;
+  x[4] += 1;
+  x[5] += 2;
+  x[6] += 1;
+}
+
+/* { dg-final { scan-assembler-times {\tptrue\tp[0-9]+.b, vl7\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tld1b\tz[0-9]+\.b, p[0-7]/z, \[x[0-9]+\]\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tst1b\tz[0-9]+\.b, p[0-7], \[x[0-9]+\]\n} 1 } } */
+
+/* Expect to load addends directly instead of loading and broadcasting the
+   values before conditionally selecting only the lower lanes, which is what
+   happens if the SVE CONST_VECTOR is expanded poorly.  */
+
+/* { dg-final { scan-assembler-times {\tldr\td[0-9]+, \[x[0-9]+, #:lo12:\.LANCHOR[0-9]+\]\n} 1 } } */
+/* { dg-final { scan-assembler-not {\tld1rd\tz[0-9]+\.d, p[0-9]+/z, \[x[0-9]+\]\n} } } */
+/* { dg-final { scan-assembler-not {\tsel\tz[0-9]+\.b, p[0-9]+, z[0-9]+\.b, z[0-9]+\.b\n} } } */
-- 
2.43.0
