On 08/05/25 18:43, Richard Sandiford wrote:

Dhruv Chawla <dhr...@nvidia.com> writes:
This patch modifies Advanced SIMD assembly generation to emit an LDR
instruction when a vector is created by loading a value into the first
element and zeroing the remaining elements.

This is similar to what *aarch64_combinez<mode> already does.

Example:

uint8x16_t foo(uint8_t *x) {
    uint8x16_t r = vdupq_n_u8(0);
    r = vsetq_lane_u8(*x, r, 0);
    return r;
}

Currently, this generates:

foo:
       movi    v0.4s, 0
       ld1     {v0.b}[0], [x0]
       ret

After applying the patch, this generates:

foo:
       ldr     b0, [x0]
       ret

Bootstrapped and regtested on aarch64-linux-gnu. Tested on
aarch64_be-unknown-linux-gnu as well.

Signed-off-by: Dhruv Chawla <dhr...@nvidia.com>

gcc/ChangeLog:

       * config/aarch64/aarch64-simd.md
       (*aarch64_simd_vec_set_low<mode>): New pattern.

gcc/testsuite/ChangeLog:

       * gcc.target/aarch64/simd/ldr_first_le.c: New test.
       * gcc.target/aarch64/simd/ldr_first_be.c: Likewise.
---
   gcc/config/aarch64/aarch64-simd.md            |  12 ++
   .../gcc.target/aarch64/simd/ldr_first_be.c    | 140 ++++++++++++++++++
   .../gcc.target/aarch64/simd/ldr_first_le.c    | 139 +++++++++++++++++
   3 files changed, 291 insertions(+)
   create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/ldr_first_be.c
   create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/ldr_first_le.c

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index e2afe87e513..7be1c685fcf 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1164,6 +1164,18 @@
     [(set_attr "type" "neon_logic<q>")]
   )

+(define_insn "*aarch64_simd_vec_set_low<mode>"
+  [(set (match_operand:VALL_F16 0 "register_operand" "=w")
+     (vec_merge:VALL_F16
+         (vec_duplicate:VALL_F16
+             (match_operand:<VEL> 1 "aarch64_simd_nonimmediate_operand" "m"))

The constraint should be "Utv" rather than "m", since the operand doesn't
accept all addresses that are valid for <VEL>.  E.g. a normal SImode
memory would allow [reg, #imm], whereas this address doesn't.

+         (match_operand:VALL_F16 3 "aarch64_simd_imm_zero" "i")
+         (match_operand:SI 2 "immediate_operand" "i")))]

I think we should drop the two "i"s here, since the pattern doesn't
accept all immediates.  The predicate on the final operand should be
const_int_operand rather than immediate_operand.

Otherwise it looks good.  But I think we should think about how we
plan to integrate the related optimisation for register inputs.  E.g.:

int32x4_t foo(int32_t x) {
     return vsetq_lane_s32(x, vdupq_n_s32(0), 0);
}

generates:

foo:
         movi    v0.4s, 0
         ins     v0.s[0], w0
         ret

rather than a single FMOV.  Same idea when the input is in an FPR rather
than a GPR, again using FMOV (FPR-to-FPR rather than GPR-to-FPR).

Conventionally, the register and memory forms should be listed as
alternatives in a single pattern, but that's somewhat complex because of
the different instruction availability for 64-bit+32-bit, 16-bit, and
8-bit register operations.

My worry is that if we handle the register case as an entirely separate
patch, it would have to rewrite this one.

I have been experimenting with this, and yes, it gets quite messy when
trying to handle the memory and register cases together. Would it be okay
to enable the register case only for 64-bit and 32-bit element sizes? That
complicates the code only slightly and can still be done with a single
pattern. I've attached a patch that does this.

-- >8 --

[PATCH] aarch64: Use LDR/FMOV for first-element loads/writes for Advanced SIMD

This patch modifies Advanced SIMD assembly generation to emit either an
LDR or an FMOV instruction when the first element of a vector is set
from memory or a register and the remaining elements are zero.

The register move case is only enabled for 32-bit or 64-bit element sizes, as
FMOV has no 8-bit mode and 16-bit mode requires FEAT_FP16.

This is similar to what *aarch64_combinez<mode> already does.

Example:

uint8x16_t foo(uint8_t *x) {
  uint8x16_t r = vdupq_n_u8(0);
  r = vsetq_lane_u8(*x, r, 0);
  return r;
}

Currently, this generates:

foo:
        movi    v0.4s, 0
        ld1     {v0.b}[0], [x0]
        ret

After applying the patch, this generates:

foo:
        ldr     b0, [x0]
        ret

Bootstrapped and regtested on aarch64-linux-gnu. Tested on
an aarch64_be-unknown-linux-gnu cross-build as well.

Signed-off-by: Dhruv Chawla <dhr...@nvidia.com>

gcc/ChangeLog:

        * config/aarch64/aarch64-simd.md
        (*aarch64_simd_vec_set_low<mode>): New pattern.

gcc/testsuite/ChangeLog:

        * gcc.target/aarch64/pr109072_1.c (s32x4_2): Remove XFAIL.
        * gcc.target/aarch64/simd/ldr_first_le.c: New test.
        * gcc.target/aarch64/simd/ldr_first_be.c: Likewise.
        * gcc.target/aarch64/simd/ins_first_le.c: Likewise.
        * gcc.target/aarch64/simd/ins_first_be.c: Likewise.
---
 gcc/config/aarch64/aarch64-simd.md            |  18 +++
 gcc/testsuite/gcc.target/aarch64/pr109072_1.c |   2 +-
 .../gcc.target/aarch64/simd/ins_first_be.c    |  85 +++++++++++
 .../gcc.target/aarch64/simd/ins_first_le.c    |  84 +++++++++++
 .../gcc.target/aarch64/simd/ldr_first_be.c    | 140 ++++++++++++++++++
 .../gcc.target/aarch64/simd/ldr_first_le.c    | 139 +++++++++++++++++
 6 files changed, 467 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/ins_first_be.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/ins_first_le.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/ldr_first_be.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/ldr_first_le.c

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 6e30dc48934..5368b7f21fe 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1164,6 +1164,24 @@
   [(set_attr "type" "neon_logic<q>")]
 )
 
+(define_insn "*aarch64_simd_vec_set_low<mode>"
+  [(set (match_operand:VALL_F16 0 "register_operand")
+       (vec_merge:VALL_F16
+         (vec_duplicate:VALL_F16
+           (match_operand:<VEL> 1 "aarch64_simd_nonimmediate_operand"))
+         (match_operand:VALL_F16 3 "aarch64_simd_imm_zero")
+         (match_operand:SI 2 "const_int_operand")))]
+  "TARGET_SIMD
+   && ENDIAN_LANE_N (<nunits>, exact_log2 (INTVAL (operands[2]))) == 0
+   && (aarch64_simd_mem_operand_p (operands[1])
+       || GET_MODE_UNIT_BITSIZE (<MODE>mode) >= 32)"
+  {@ [ cons: =0 , 1   ; attrs: type  ]
+     [ w        , w   ; neon_move<q> ] fmov\t%<Vetype>0, %<Vetype>1
+     [ w        , r   ; neon_from_gp ] fmov\t%<Vetype>0, %<vwcore>1
+     [ w        , Utv ; f_loads      ] ldr\t%<Vetype>0, %1
+  }
+)
+
 (define_insn "@aarch64_simd_vec_set<mode>"
   [(set (match_operand:VALL_F16 0 "register_operand" "=w,w,w")
        (vec_merge:VALL_F16
diff --git a/gcc/testsuite/gcc.target/aarch64/pr109072_1.c b/gcc/testsuite/gcc.target/aarch64/pr109072_1.c
index 39d80222142..1af957de0bc 100644
--- a/gcc/testsuite/gcc.target/aarch64/pr109072_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/pr109072_1.c
@@ -189,7 +189,7 @@ s32x4_1 (int32_t x)
 }
 
 /*
-** s32x4_2: { xfail *-*-* }
+** s32x4_2:
 **     fmov    s0, w0
 **     ret
 */
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/ins_first_be.c b/gcc/testsuite/gcc.target/aarch64/simd/ins_first_be.c
new file mode 100644
index 00000000000..c481f9b3d99
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/simd/ins_first_be.c
@@ -0,0 +1,85 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -mbig-endian -march=armv8-a" } */
+/* { dg-require-effective-target stdint_types_mbig_endian } */
+
+/* Tests using ACLE intrinsics.  */
+
+#include <arm_neon.h>
+
+#define INS_ACLE(S, T, U)                                                     \
+  __attribute__ ((noinline)) S acle_##S##T##U (T x)                           \
+  {                                                                           \
+    return vsetq_lane_##U (x, vdupq_n_##U (0), 0);                            \
+  }
+
+INS_ACLE (int32x4_t, int32_t, s32)
+INS_ACLE (int64x2_t, int64_t, s64)
+INS_ACLE (uint32x4_t, uint32_t, u32)
+INS_ACLE (uint64x2_t, uint64_t, u64)
+INS_ACLE (float32x4_t, float32_t, f32)
+INS_ACLE (float64x2_t, float64_t, f64)
+
+#define INS_ACLE_NARROW(S, T, U)                                              \
+  __attribute__ ((noinline)) S acle_##S##T##U (T x)                           \
+  {                                                                           \
+    S r = vdup_n_##U (0);                                                     \
+    r = vset_lane_##U (x, r, 0);                                              \
+    return r;                                                                 \
+  }
+
+INS_ACLE_NARROW (int32x2_t, int32_t, s32)
+INS_ACLE_NARROW (int64x1_t, int64_t, s64)
+INS_ACLE_NARROW (uint32x2_t, uint32_t, u32)
+INS_ACLE_NARROW (uint64x1_t, uint64_t, u64)
+INS_ACLE_NARROW (float32x2_t, float32_t, f32)
+INS_ACLE_NARROW (float64x1_t, float64_t, f64)
+
+/* Tests using GCC vector types.  */
+
+typedef int32_t v4i32 __attribute__ ((vector_size (16)));
+typedef int64_t v2i64 __attribute__ ((vector_size (16)));
+typedef uint32_t v4u32 __attribute__ ((vector_size (16)));
+typedef uint64_t v2u64 __attribute__ ((vector_size (16)));
+typedef float32_t v4f32 __attribute__ ((vector_size (16)));
+typedef float64_t v2f64 __attribute__ ((vector_size (16)));
+
+#define INS_GCC(S, T, U)                                                      \
+  __attribute__ ((noinline)) S gcc_##S##T##U (T x)                            \
+  {                                                                           \
+    S r = {0};                                                                \
+    r[sizeof (S) / sizeof (T) - 1] = x;                                       \
+    return r;                                                                 \
+  }
+
+INS_GCC (v4i32, int32_t, s32)
+INS_GCC (v2i64, int64_t, s64)
+INS_GCC (v4u32, uint32_t, u32)
+INS_GCC (v2u64, uint64_t, u64)
+INS_GCC (v4f32, float32_t, f32)
+INS_GCC (v2f64, float64_t, f64)
+
+typedef int32_t v2i32 __attribute__ ((vector_size (8)));
+typedef int64_t v1i64 __attribute__ ((vector_size (8)));
+typedef uint32_t v2u32 __attribute__ ((vector_size (8)));
+typedef uint64_t v1u64 __attribute__ ((vector_size (8)));
+typedef float32_t v2f32 __attribute__ ((vector_size (8)));
+typedef float64_t v1f64 __attribute__ ((vector_size (8)));
+
+#define INS_GCC_NARROW(S, T, U)                                               \
+  __attribute__ ((noinline)) S gcc_##S##T##U (T x)                            \
+  {                                                                           \
+    S r = {0};                                                                \
+    r[sizeof (S) / sizeof (T) - 1] = x;                                       \
+    return r;                                                                 \
+  }
+
+INS_GCC_NARROW (v2i32, int32_t, s32)
+INS_GCC_NARROW (v1i64, int64_t, s64)
+INS_GCC_NARROW (v2u32, uint32_t, u32)
+INS_GCC_NARROW (v1u64, uint64_t, u64)
+INS_GCC_NARROW (v2f32, float32_t, f32)
+INS_GCC_NARROW (v1f64, float64_t, f64)
+
+/* Both float64x1_t and v1f64 are optimized to a single ret.  */
+/* { dg-final { scan-assembler-times "\tfmov\t" 22 } } */
+/* { dg-final { scan-assembler-not "\tmov\t" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/ins_first_le.c b/gcc/testsuite/gcc.target/aarch64/simd/ins_first_le.c
new file mode 100644
index 00000000000..9e434bf1f46
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/simd/ins_first_le.c
@@ -0,0 +1,84 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -mlittle-endian -march=armv8-a" } */
+
+/* Tests using ACLE intrinsics.  */
+
+#include <arm_neon.h>
+
+#define INS_ACLE(S, T, U)                                                     \
+  __attribute__ ((noinline)) S acle_##S##T##U (T x)                           \
+  {                                                                           \
+    return vsetq_lane_##U (x, vdupq_n_##U (0), 0);                            \
+  }
+
+INS_ACLE (int32x4_t, int32_t, s32)
+INS_ACLE (int64x2_t, int64_t, s64)
+INS_ACLE (uint32x4_t, uint32_t, u32)
+INS_ACLE (uint64x2_t, uint64_t, u64)
+INS_ACLE (float32x4_t, float32_t, f32)
+INS_ACLE (float64x2_t, float64_t, f64)
+
+#define INS_ACLE_NARROW(S, T, U)                                              \
+  __attribute__ ((noinline)) S acle_##S##T##U (T x)                           \
+  {                                                                           \
+    S r = vdup_n_##U (0);                                                     \
+    r = vset_lane_##U (x, r, 0);                                              \
+    return r;                                                                 \
+  }
+
+INS_ACLE_NARROW (int32x2_t, int32_t, s32)
+INS_ACLE_NARROW (int64x1_t, int64_t, s64)
+INS_ACLE_NARROW (uint32x2_t, uint32_t, u32)
+INS_ACLE_NARROW (uint64x1_t, uint64_t, u64)
+INS_ACLE_NARROW (float32x2_t, float32_t, f32)
+INS_ACLE_NARROW (float64x1_t, float64_t, f64)
+
+/* Tests using GCC vector types.  */
+
+typedef int32_t v4i32 __attribute__ ((vector_size (16)));
+typedef int64_t v2i64 __attribute__ ((vector_size (16)));
+typedef uint32_t v4u32 __attribute__ ((vector_size (16)));
+typedef uint64_t v2u64 __attribute__ ((vector_size (16)));
+typedef float32_t v4f32 __attribute__ ((vector_size (16)));
+typedef float64_t v2f64 __attribute__ ((vector_size (16)));
+
+#define INS_GCC(S, T, U)                                                      \
+  __attribute__ ((noinline)) S gcc_##S##T##U (T x)                            \
+  {                                                                           \
+    S r = {0};                                                                \
+    r[0] = x;                                                                 \
+    return r;                                                                 \
+  }
+
+INS_GCC (v4i32, int32_t, s32)
+INS_GCC (v2i64, int64_t, s64)
+INS_GCC (v4u32, uint32_t, u32)
+INS_GCC (v2u64, uint64_t, u64)
+INS_GCC (v4f32, float32_t, f32)
+INS_GCC (v2f64, float64_t, f64)
+
+typedef int32_t v2i32 __attribute__ ((vector_size (8)));
+typedef int64_t v1i64 __attribute__ ((vector_size (8)));
+typedef uint32_t v2u32 __attribute__ ((vector_size (8)));
+typedef uint64_t v1u64 __attribute__ ((vector_size (8)));
+typedef float32_t v2f32 __attribute__ ((vector_size (8)));
+typedef float64_t v1f64 __attribute__ ((vector_size (8)));
+
+#define INS_GCC_NARROW(S, T, U)                                               \
+  __attribute__ ((noinline)) S gcc_##S##T##U (T x)                            \
+  {                                                                           \
+    S r = {0};                                                                \
+    r[0] = x;                                                                 \
+    return r;                                                                 \
+  }
+
+INS_GCC_NARROW (v2i32, int32_t, s32)
+INS_GCC_NARROW (v1i64, int64_t, s64)
+INS_GCC_NARROW (v2u32, uint32_t, u32)
+INS_GCC_NARROW (v1u64, uint64_t, u64)
+INS_GCC_NARROW (v2f32, float32_t, f32)
+INS_GCC_NARROW (v1f64, float64_t, f64)
+
+/* Both float64x1_t and v1f64 are optimized to a single ret.  */
+/* { dg-final { scan-assembler-times "\tfmov\t" 22 } } */
+/* { dg-final { scan-assembler-not "\tmov\t" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/ldr_first_be.c b/gcc/testsuite/gcc.target/aarch64/simd/ldr_first_be.c
new file mode 100644
index 00000000000..12dd01594a8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/simd/ldr_first_be.c
@@ -0,0 +1,140 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -mbig-endian -march=armv8-a+bf16" } */
+/* { dg-require-effective-target stdint_types_mbig_endian } */
+
+/* Tests using ACLE intrinsics.  */
+
+#include <arm_neon.h>
+
+#define LDR_ACLE(S, T, U)                                                     \
+  __attribute__ ((noinline)) S acle_##S##T##U (T *x)                          \
+  {                                                                           \
+    S r = vdupq_n_##U (0);                                                    \
+    r = vsetq_lane_##U (*x, r, 0);                                            \
+    return r;                                                                 \
+  }
+
+LDR_ACLE (int8x16_t, int8_t, s8)
+LDR_ACLE (int16x8_t, int16_t, s16)
+LDR_ACLE (int32x4_t, int32_t, s32)
+LDR_ACLE (int64x2_t, int64_t, s64)
+
+LDR_ACLE (uint8x16_t, uint8_t, u8)
+LDR_ACLE (uint16x8_t, uint16_t, u16)
+LDR_ACLE (uint32x4_t, uint32_t, u32)
+LDR_ACLE (uint64x2_t, uint64_t, u64)
+
+LDR_ACLE (float16x8_t, float16_t, f16)
+LDR_ACLE (float32x4_t, float32_t, f32)
+LDR_ACLE (float64x2_t, float64_t, f64)
+
+LDR_ACLE (bfloat16x8_t, bfloat16_t, bf16)
+
+#define LDR_ACLE_NARROW(S, T, U)                                              \
+  __attribute__ ((noinline)) S acle_##S##T##U (T *x)                          \
+  {                                                                           \
+    S r = vdup_n_##U (0);                                                     \
+    r = vset_lane_##U (*x, r, 0);                                             \
+    return r;                                                                 \
+  }
+
+LDR_ACLE_NARROW (int8x8_t, int8_t, s8)
+LDR_ACLE_NARROW (int16x4_t, int16_t, s16)
+LDR_ACLE_NARROW (int32x2_t, int32_t, s32)
+LDR_ACLE_NARROW (int64x1_t, int64_t, s64)
+
+LDR_ACLE_NARROW (uint8x8_t, uint8_t, u8)
+LDR_ACLE_NARROW (uint16x4_t, uint16_t, u16)
+LDR_ACLE_NARROW (uint32x2_t, uint32_t, u32)
+LDR_ACLE_NARROW (uint64x1_t, uint64_t, u64)
+
+LDR_ACLE_NARROW (float16x4_t, float16_t, f16)
+LDR_ACLE_NARROW (float32x2_t, float32_t, f32)
+LDR_ACLE_NARROW (float64x1_t, float64_t, f64)
+
+LDR_ACLE_NARROW (bfloat16x4_t, bfloat16_t, bf16)
+
+/* Tests using GCC vector types.  */
+
+typedef int8_t v16i8 __attribute__ ((vector_size (16)));
+typedef int16_t v8i16 __attribute__ ((vector_size (16)));
+typedef int32_t v4i32 __attribute__ ((vector_size (16)));
+typedef int64_t v2i64 __attribute__ ((vector_size (16)));
+
+typedef uint8_t v16u8 __attribute__ ((vector_size (16)));
+typedef uint16_t v8u16 __attribute__ ((vector_size (16)));
+typedef uint32_t v4u32 __attribute__ ((vector_size (16)));
+typedef uint64_t v2u64 __attribute__ ((vector_size (16)));
+
+typedef float16_t v8f16 __attribute__ ((vector_size (16)));
+typedef float32_t v4f32 __attribute__ ((vector_size (16)));
+typedef float64_t v2f64 __attribute__ ((vector_size (16)));
+
+typedef bfloat16_t v8bf16 __attribute__ ((vector_size (16)));
+
+#define LDR_GCC(S, T, U)                                                      \
+  __attribute__ ((noinline)) S gcc_##S##T##U (T *x)                           \
+  {                                                                           \
+    S r = {0};                                                                \
+    r[sizeof (S) / sizeof (T) - 1] = *x;                                      \
+    return r;                                                                 \
+  }
+
+LDR_GCC (v16i8, int8_t, s8)
+LDR_GCC (v8i16, int16_t, s16)
+LDR_GCC (v4i32, int32_t, s32)
+LDR_GCC (v2i64, int64_t, s64)
+
+LDR_GCC (v16u8, uint8_t, u8)
+LDR_GCC (v8u16, uint16_t, u16)
+LDR_GCC (v4u32, uint32_t, u32)
+LDR_GCC (v2u64, uint64_t, u64)
+
+LDR_GCC (v8f16, float16_t, f16)
+LDR_GCC (v4f32, float32_t, f32)
+LDR_GCC (v2f64, float64_t, f64)
+
+LDR_GCC (v8bf16, bfloat16_t, bf16)
+
+typedef int8_t v8i8 __attribute__ ((vector_size (8)));
+typedef int16_t v4i16 __attribute__ ((vector_size (8)));
+typedef int32_t v2i32 __attribute__ ((vector_size (8)));
+typedef int64_t v1i64 __attribute__ ((vector_size (8)));
+
+typedef uint8_t v8u8 __attribute__ ((vector_size (8)));
+typedef uint16_t v4u16 __attribute__ ((vector_size (8)));
+typedef uint32_t v2u32 __attribute__ ((vector_size (8)));
+typedef uint64_t v1u64 __attribute__ ((vector_size (8)));
+
+typedef float16_t v4f16 __attribute__ ((vector_size (8)));
+typedef float32_t v2f32 __attribute__ ((vector_size (8)));
+typedef float64_t v1f64 __attribute__ ((vector_size (8)));
+
+typedef bfloat16_t v4bf16 __attribute__ ((vector_size (8)));
+
+#define LDR_GCC_NARROW(S, T, U)                                               \
+  __attribute__ ((noinline)) S gcc_##S##T##U (T *x)                           \
+  {                                                                           \
+    S r = {0};                                                                \
+    r[sizeof (S) / sizeof (T) - 1] = *x;                                      \
+    return r;                                                                 \
+  }
+
+LDR_GCC_NARROW (v8i8, int8_t, s8)
+LDR_GCC_NARROW (v4i16, int16_t, s16)
+LDR_GCC_NARROW (v2i32, int32_t, s32)
+LDR_GCC_NARROW (v1i64, int64_t, s64)
+
+LDR_GCC_NARROW (v8u8, uint8_t, u8)
+LDR_GCC_NARROW (v4u16, uint16_t, u16)
+LDR_GCC_NARROW (v2u32, uint32_t, u32)
+LDR_GCC_NARROW (v1u64, uint64_t, u64)
+
+LDR_GCC_NARROW (v4f16, float16_t, f16)
+LDR_GCC_NARROW (v2f32, float32_t, f32)
+LDR_GCC_NARROW (v1f64, float64_t, f64)
+
+LDR_GCC_NARROW (v4bf16, bfloat16_t, bf16)
+
+/* { dg-final { scan-assembler-times "\tldr\t" 48 } } */
+/* { dg-final { scan-assembler-not "\tmov\t" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/ldr_first_le.c b/gcc/testsuite/gcc.target/aarch64/simd/ldr_first_le.c
new file mode 100644
index 00000000000..3d69523c500
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/simd/ldr_first_le.c
@@ -0,0 +1,139 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -mlittle-endian -march=armv8-a+bf16" } */
+
+/* Tests using ACLE intrinsics.  */
+
+#include <arm_neon.h>
+
+#define LDR_ACLE(S, T, U)                                                     \
+  __attribute__ ((noinline)) S acle_##S##T##U (T *x)                          \
+  {                                                                           \
+    S r = vdupq_n_##U (0);                                                    \
+    r = vsetq_lane_##U (*x, r, 0);                                            \
+    return r;                                                                 \
+  }
+
+LDR_ACLE (int8x16_t, int8_t, s8)
+LDR_ACLE (int16x8_t, int16_t, s16)
+LDR_ACLE (int32x4_t, int32_t, s32)
+LDR_ACLE (int64x2_t, int64_t, s64)
+
+LDR_ACLE (uint8x16_t, uint8_t, u8)
+LDR_ACLE (uint16x8_t, uint16_t, u16)
+LDR_ACLE (uint32x4_t, uint32_t, u32)
+LDR_ACLE (uint64x2_t, uint64_t, u64)
+
+LDR_ACLE (float16x8_t, float16_t, f16)
+LDR_ACLE (float32x4_t, float32_t, f32)
+LDR_ACLE (float64x2_t, float64_t, f64)
+
+LDR_ACLE (bfloat16x8_t, bfloat16_t, bf16)
+
+#define LDR_ACLE_NARROW(S, T, U)                                              \
+  __attribute__ ((noinline)) S acle_##S##T##U (T *x)                          \
+  {                                                                           \
+    S r = vdup_n_##U (0);                                                     \
+    r = vset_lane_##U (*x, r, 0);                                             \
+    return r;                                                                 \
+  }
+
+LDR_ACLE_NARROW (int8x8_t, int8_t, s8)
+LDR_ACLE_NARROW (int16x4_t, int16_t, s16)
+LDR_ACLE_NARROW (int32x2_t, int32_t, s32)
+LDR_ACLE_NARROW (int64x1_t, int64_t, s64)
+
+LDR_ACLE_NARROW (uint8x8_t, uint8_t, u8)
+LDR_ACLE_NARROW (uint16x4_t, uint16_t, u16)
+LDR_ACLE_NARROW (uint32x2_t, uint32_t, u32)
+LDR_ACLE_NARROW (uint64x1_t, uint64_t, u64)
+
+LDR_ACLE_NARROW (float16x4_t, float16_t, f16)
+LDR_ACLE_NARROW (float32x2_t, float32_t, f32)
+LDR_ACLE_NARROW (float64x1_t, float64_t, f64)
+
+LDR_ACLE_NARROW (bfloat16x4_t, bfloat16_t, bf16)
+
+/* Tests using GCC vector types.  */
+
+typedef int8_t v16i8 __attribute__ ((vector_size (16)));
+typedef int16_t v8i16 __attribute__ ((vector_size (16)));
+typedef int32_t v4i32 __attribute__ ((vector_size (16)));
+typedef int64_t v2i64 __attribute__ ((vector_size (16)));
+
+typedef uint8_t v16u8 __attribute__ ((vector_size (16)));
+typedef uint16_t v8u16 __attribute__ ((vector_size (16)));
+typedef uint32_t v4u32 __attribute__ ((vector_size (16)));
+typedef uint64_t v2u64 __attribute__ ((vector_size (16)));
+
+typedef float16_t v8f16 __attribute__ ((vector_size (16)));
+typedef float32_t v4f32 __attribute__ ((vector_size (16)));
+typedef float64_t v2f64 __attribute__ ((vector_size (16)));
+
+typedef bfloat16_t v8bf16 __attribute__ ((vector_size (16)));
+
+#define LDR_GCC(S, T, U)                                                      \
+  __attribute__ ((noinline)) S gcc_##S##T##U (T *x)                           \
+  {                                                                           \
+    S r = {0};                                                                \
+    r[0] = *x;                                                                \
+    return r;                                                                 \
+  }
+
+LDR_GCC (v16i8, int8_t, s8)
+LDR_GCC (v8i16, int16_t, s16)
+LDR_GCC (v4i32, int32_t, s32)
+LDR_GCC (v2i64, int64_t, s64)
+
+LDR_GCC (v16u8, uint8_t, u8)
+LDR_GCC (v8u16, uint16_t, u16)
+LDR_GCC (v4u32, uint32_t, u32)
+LDR_GCC (v2u64, uint64_t, u64)
+
+LDR_GCC (v8f16, float16_t, f16)
+LDR_GCC (v4f32, float32_t, f32)
+LDR_GCC (v2f64, float64_t, f64)
+
+LDR_GCC (v8bf16, bfloat16_t, bf16)
+
+typedef int8_t v8i8 __attribute__ ((vector_size (8)));
+typedef int16_t v4i16 __attribute__ ((vector_size (8)));
+typedef int32_t v2i32 __attribute__ ((vector_size (8)));
+typedef int64_t v1i64 __attribute__ ((vector_size (8)));
+
+typedef uint8_t v8u8 __attribute__ ((vector_size (8)));
+typedef uint16_t v4u16 __attribute__ ((vector_size (8)));
+typedef uint32_t v2u32 __attribute__ ((vector_size (8)));
+typedef uint64_t v1u64 __attribute__ ((vector_size (8)));
+
+typedef float16_t v4f16 __attribute__ ((vector_size (8)));
+typedef float32_t v2f32 __attribute__ ((vector_size (8)));
+typedef float64_t v1f64 __attribute__ ((vector_size (8)));
+
+typedef bfloat16_t v4bf16 __attribute__ ((vector_size (8)));
+
+#define LDR_GCC_NARROW(S, T, U)                                               \
+  __attribute__ ((noinline)) S gcc_##S##T##U (T *x)                           \
+  {                                                                           \
+    S r = {0};                                                                \
+    r[0] = *x;                                                                \
+    return r;                                                                 \
+  }
+
+LDR_GCC_NARROW (v8i8, int8_t, s8)
+LDR_GCC_NARROW (v4i16, int16_t, s16)
+LDR_GCC_NARROW (v2i32, int32_t, s32)
+LDR_GCC_NARROW (v1i64, int64_t, s64)
+
+LDR_GCC_NARROW (v8u8, uint8_t, u8)
+LDR_GCC_NARROW (v4u16, uint16_t, u16)
+LDR_GCC_NARROW (v2u32, uint32_t, u32)
+LDR_GCC_NARROW (v1u64, uint64_t, u64)
+
+LDR_GCC_NARROW (v4f16, float16_t, f16)
+LDR_GCC_NARROW (v2f32, float32_t, f32)
+LDR_GCC_NARROW (v1f64, float64_t, f64)
+
+LDR_GCC_NARROW (v4bf16, bfloat16_t, bf16)
+
+/* { dg-final { scan-assembler-times "\tldr\t" 48 } } */
+/* { dg-final { scan-assembler-not "\tmov\t" } } */
--
2.44.0


The register case is somewhat related to Pengxuan's work on permutations.

Thanks,
Richard

--
Regards,
Dhruv
From 03531569b683fb9894b970fbdbfa3113d2d5202c Mon Sep 17 00:00:00 2001
From: Dhruv Chawla <dhr...@nvidia.com>
Date: Thu, 19 Dec 2024 19:56:23 -0800
Subject: [PATCH] aarch64: Use LDR/FMOV for first-element loads/writes for
 Advanced SIMD

This patch modifies Advanced SIMD assembly generation to emit either an
LDR or an FMOV instruction when a load/write to the first element of a
vector is done when the other elements are zero.

The register move case is only enabled for 32-bit or 64-bit element sizes, as
FMOV has no 8-bit mode and 16-bit mode requires FEAT_FP16.

This is similar to what *aarch64_combinez<mode> already does.

Example:

uint8x16_t foo(uint8_t *x) {
  uint8x16_t r = vdupq_n_u8(0);
  r = vsetq_lane_u8(*x, r, 0);
  return r;
}

Currently, this generates:

foo:
        movi    v0.4s, 0
        ld1     {v0.b}[0], [x0]
        ret

After applying the patch, this generates:

foo:
        ldr     b0, [x0]
        ret

Bootstrapped and regtested on aarch64-linux-gnu. Tested on
an aarch64_be-unknown-linux-gnu cross-build as well.

Signed-off-by: Dhruv Chawla <dhr...@nvidia.com>

gcc/ChangeLog:

        * config/aarch64/aarch64-simd.md
        (*aarch64_simd_vec_set_low<mode>): New pattern.

gcc/testsuite/ChangeLog:

        * gcc.target/aarch64/pr109072_1.c (s32x4_2): Remove XFAIL.
        * gcc.target/aarch64/simd/ldr_first_le.c: New test.
        * gcc.target/aarch64/simd/ldr_first_be.c: Likewise.
        * gcc.target/aarch64/simd/ins_first_le.c: Likewise.
        * gcc.target/aarch64/simd/ins_first_be.c: Likewise.
---
 gcc/config/aarch64/aarch64-simd.md            |  18 +++
 gcc/testsuite/gcc.target/aarch64/pr109072_1.c |   2 +-
 .../gcc.target/aarch64/simd/ins_first_be.c    |  85 +++++++++++
 .../gcc.target/aarch64/simd/ins_first_le.c    |  84 +++++++++++
 .../gcc.target/aarch64/simd/ldr_first_be.c    | 140 ++++++++++++++++++
 .../gcc.target/aarch64/simd/ldr_first_le.c    | 139 +++++++++++++++++
 6 files changed, 467 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/ins_first_be.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/ins_first_le.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/ldr_first_be.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/ldr_first_le.c

diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 6e30dc48934..5368b7f21fe 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1164,6 +1164,24 @@
   [(set_attr "type" "neon_logic<q>")]
 )
 
+(define_insn "*aarch64_simd_vec_set_low<mode>"
+  [(set (match_operand:VALL_F16 0 "register_operand")
+       (vec_merge:VALL_F16
+         (vec_duplicate:VALL_F16
+           (match_operand:<VEL> 1 "aarch64_simd_nonimmediate_operand"))
+         (match_operand:VALL_F16 3 "aarch64_simd_imm_zero")
+         (match_operand:SI 2 "const_int_operand")))]
+  "TARGET_SIMD
+   && ENDIAN_LANE_N (<nunits>, exact_log2 (INTVAL (operands[2]))) == 0
+   && (aarch64_simd_mem_operand_p (operands[1]) ||
+       GET_MODE_UNIT_BITSIZE (<MODE>mode) >= 32)"
+  {@ [ cons: =0 , 1   ; attrs: type  ]
+     [ w        , w   ; neon_move<q> ] fmov\t%<Vetype>0, %<Vetype>1
+     [ w        , r   ; neon_from_gp ] fmov\t%<Vetype>0, %<vwcore>1
+     [ w        , Utv ; f_loads      ] ldr\t%<Vetype>0, %1
+  }
+)
+
 (define_insn "@aarch64_simd_vec_set<mode>"
   [(set (match_operand:VALL_F16 0 "register_operand" "=w,w,w")
        (vec_merge:VALL_F16
diff --git a/gcc/testsuite/gcc.target/aarch64/pr109072_1.c 
b/gcc/testsuite/gcc.target/aarch64/pr109072_1.c
index 39d80222142..1af957de0bc 100644
--- a/gcc/testsuite/gcc.target/aarch64/pr109072_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/pr109072_1.c
@@ -189,7 +189,7 @@ s32x4_1 (int32_t x)
 }
 
 /*
-** s32x4_2: { xfail *-*-* }
+** s32x4_2:
 **     fmov    s0, w0
 **     ret
 */
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/ins_first_be.c 
b/gcc/testsuite/gcc.target/aarch64/simd/ins_first_be.c
new file mode 100644
index 00000000000..c481f9b3d99
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/simd/ins_first_be.c
@@ -0,0 +1,85 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -mbig-endian -march=armv8-a" } */
+/* { dg-require-effective-target stdint_types_mbig_endian } */
+
+/* Tests using ACLE intrinsics.  */
+
+#include <arm_neon.h>
+
+#define INS_ACLE(S, T, U)                                              \
+  __attribute__ ((noinline)) S acle_##S##T##U (T x)                    \
+  {                                                                    \
+    return vsetq_lane_##U (x, vdupq_n_##U (0), 0);                     \
+  }
+
+INS_ACLE (int32x4_t, int32_t, s32)
+INS_ACLE (int64x2_t, int64_t, s64)
+INS_ACLE (uint32x4_t, uint32_t, u32)
+INS_ACLE (uint64x2_t, uint64_t, u64)
+INS_ACLE (float32x4_t, float32_t, f32)
+INS_ACLE (float64x2_t, float64_t, f64)
+
+#define INS_ACLE_NARROW(S, T, U)                                       \
+  __attribute__ ((noinline)) S acle_##S##T##U (T x)                    \
+  {                                                                    \
+    S r = vdup_n_##U (0);                                              \
+    r = vset_lane_##U (x, r, 0);                                       \
+    return r;                                                          \
+  }
+
+INS_ACLE_NARROW (int32x2_t, int32_t, s32)
+INS_ACLE_NARROW (int64x1_t, int64_t, s64)
+INS_ACLE_NARROW (uint32x2_t, uint32_t, u32)
+INS_ACLE_NARROW (uint64x1_t, uint64_t, u64)
+INS_ACLE_NARROW (float32x2_t, float32_t, f32)
+INS_ACLE_NARROW (float64x1_t, float64_t, f64)
+
+/* Tests using GCC vector types.  */
+
+typedef int32_t v4i32 __attribute__ ((vector_size (16)));
+typedef int64_t v2i64 __attribute__ ((vector_size (16)));
+typedef uint32_t v4u32 __attribute__ ((vector_size (16)));
+typedef uint64_t v2u64 __attribute__ ((vector_size (16)));
+typedef float32_t v4f32 __attribute__ ((vector_size (16)));
+typedef float64_t v2f64 __attribute__ ((vector_size (16)));
+
+#define INS_GCC(S, T, U)                                               \
+  __attribute__ ((noinline)) S gcc_##S##T##U (T x)                     \
+  {                                                                    \
+    S r = {0};                                                         \
+    r[sizeof (S) / sizeof (T) - 1] = x;                                \
+    return r;                                                          \
+  }
+
+INS_GCC (v4i32, int32_t, s32)
+INS_GCC (v2i64, int64_t, s64)
+INS_GCC (v4u32, uint32_t, u32)
+INS_GCC (v2u64, uint64_t, u64)
+INS_GCC (v4f32, float32_t, f32)
+INS_GCC (v2f64, float64_t, f64)
+
+typedef int32_t v2i32 __attribute__ ((vector_size (8)));
+typedef int64_t v1i64 __attribute__ ((vector_size (8)));
+typedef uint32_t v2u32 __attribute__ ((vector_size (8)));
+typedef uint64_t v1u64 __attribute__ ((vector_size (8)));
+typedef float32_t v2f32 __attribute__ ((vector_size (8)));
+typedef float64_t v1f64 __attribute__ ((vector_size (8)));
+
+#define INS_GCC_NARROW(S, T, U)                                        \
+  __attribute__ ((noinline)) S gcc_##S##T##U (T x)                     \
+  {                                                                    \
+    S r = {0};                                                         \
+    r[sizeof (S) / sizeof (T) - 1] = x;                                \
+    return r;                                                          \
+  }
+
+INS_GCC_NARROW (v2i32, int32_t, s32)
+INS_GCC_NARROW (v1i64, int64_t, s64)
+INS_GCC_NARROW (v2u32, uint32_t, u32)
+INS_GCC_NARROW (v1u64, uint64_t, u64)
+INS_GCC_NARROW (v2f32, float32_t, f32)
+INS_GCC_NARROW (v1f64, float64_t, f64)
+
+/* Both float64x1_t and v1f64 are optimized to a single ret.  */
+/* { dg-final { scan-assembler-times "\tfmov\t" 22 } } */
+/* { dg-final { scan-assembler-not "\tmov\t" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/ins_first_le.c b/gcc/testsuite/gcc.target/aarch64/simd/ins_first_le.c
new file mode 100644
index 00000000000..9e434bf1f46
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/simd/ins_first_le.c
@@ -0,0 +1,84 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -mlittle-endian -march=armv8-a" } */
+
+/* Tests using ACLE intrinsics.  */
+
+#include <arm_neon.h>
+
+#define INS_ACLE(S, T, U)                                              \
+  __attribute__ ((noinline)) S acle_##S##T##U (T x)                    \
+  {                                                                    \
+    return vsetq_lane_##U (x, vdupq_n_##U (0), 0);                     \
+  }
+
+INS_ACLE (int32x4_t, int32_t, s32)
+INS_ACLE (int64x2_t, int64_t, s64)
+INS_ACLE (uint32x4_t, uint32_t, u32)
+INS_ACLE (uint64x2_t, uint64_t, u64)
+INS_ACLE (float32x4_t, float32_t, f32)
+INS_ACLE (float64x2_t, float64_t, f64)
+
+#define INS_ACLE_NARROW(S, T, U)                                       \
+  __attribute__ ((noinline)) S acle_##S##T##U (T x)                    \
+  {                                                                    \
+    S r = vdup_n_##U (0);                                              \
+    r = vset_lane_##U (x, r, 0);                                       \
+    return r;                                                          \
+  }
+
+INS_ACLE_NARROW (int32x2_t, int32_t, s32)
+INS_ACLE_NARROW (int64x1_t, int64_t, s64)
+INS_ACLE_NARROW (uint32x2_t, uint32_t, u32)
+INS_ACLE_NARROW (uint64x1_t, uint64_t, u64)
+INS_ACLE_NARROW (float32x2_t, float32_t, f32)
+INS_ACLE_NARROW (float64x1_t, float64_t, f64)
+
+/* Tests using GCC vector types.  */
+
+typedef int32_t v4i32 __attribute__ ((vector_size (16)));
+typedef int64_t v2i64 __attribute__ ((vector_size (16)));
+typedef uint32_t v4u32 __attribute__ ((vector_size (16)));
+typedef uint64_t v2u64 __attribute__ ((vector_size (16)));
+typedef float32_t v4f32 __attribute__ ((vector_size (16)));
+typedef float64_t v2f64 __attribute__ ((vector_size (16)));
+
+#define INS_GCC(S, T, U)                                               \
+  __attribute__ ((noinline)) S gcc_##S##T##U (T x)                     \
+  {                                                                    \
+    S r = {0};                                                         \
+    r[0] = x;                                                          \
+    return r;                                                          \
+  }
+
+INS_GCC (v4i32, int32_t, s32)
+INS_GCC (v2i64, int64_t, s64)
+INS_GCC (v4u32, uint32_t, u32)
+INS_GCC (v2u64, uint64_t, u64)
+INS_GCC (v4f32, float32_t, f32)
+INS_GCC (v2f64, float64_t, f64)
+
+typedef int32_t v2i32 __attribute__ ((vector_size (8)));
+typedef int64_t v1i64 __attribute__ ((vector_size (8)));
+typedef uint32_t v2u32 __attribute__ ((vector_size (8)));
+typedef uint64_t v1u64 __attribute__ ((vector_size (8)));
+typedef float32_t v2f32 __attribute__ ((vector_size (8)));
+typedef float64_t v1f64 __attribute__ ((vector_size (8)));
+
+#define INS_GCC_NARROW(S, T, U)                                        \
+  __attribute__ ((noinline)) S gcc_##S##T##U (T x)                     \
+  {                                                                    \
+    S r = {0};                                                         \
+    r[0] = x;                                                          \
+    return r;                                                          \
+  }
+
+INS_GCC_NARROW (v2i32, int32_t, s32)
+INS_GCC_NARROW (v1i64, int64_t, s64)
+INS_GCC_NARROW (v2u32, uint32_t, u32)
+INS_GCC_NARROW (v1u64, uint64_t, u64)
+INS_GCC_NARROW (v2f32, float32_t, f32)
+INS_GCC_NARROW (v1f64, float64_t, f64)
+
+/* Both float64x1_t and v1f64 are optimized to a single ret.  */
+/* { dg-final { scan-assembler-times "\tfmov\t" 22 } } */
+/* { dg-final { scan-assembler-not "\tmov\t" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/ldr_first_be.c b/gcc/testsuite/gcc.target/aarch64/simd/ldr_first_be.c
new file mode 100644
index 00000000000..12dd01594a8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/simd/ldr_first_be.c
@@ -0,0 +1,140 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -mbig-endian -march=armv8-a+bf16" } */
+/* { dg-require-effective-target stdint_types_mbig_endian } */
+
+/* Tests using ACLE intrinsics.  */
+
+#include <arm_neon.h>
+
+#define LDR_ACLE(S, T, U)                                              \
+  __attribute__ ((noinline)) S acle_##S##T##U (T *x)                   \
+  {                                                                    \
+    S r = vdupq_n_##U (0);                                             \
+    r = vsetq_lane_##U (*x, r, 0);                                     \
+    return r;                                                          \
+  }
+
+LDR_ACLE (int8x16_t, int8_t, s8)
+LDR_ACLE (int16x8_t, int16_t, s16)
+LDR_ACLE (int32x4_t, int32_t, s32)
+LDR_ACLE (int64x2_t, int64_t, s64)
+
+LDR_ACLE (uint8x16_t, uint8_t, u8)
+LDR_ACLE (uint16x8_t, uint16_t, u16)
+LDR_ACLE (uint32x4_t, uint32_t, u32)
+LDR_ACLE (uint64x2_t, uint64_t, u64)
+
+LDR_ACLE (float16x8_t, float16_t, f16)
+LDR_ACLE (float32x4_t, float32_t, f32)
+LDR_ACLE (float64x2_t, float64_t, f64)
+
+LDR_ACLE (bfloat16x8_t, bfloat16_t, bf16)
+
+#define LDR_ACLE_NARROW(S, T, U)                                       \
+  __attribute__ ((noinline)) S acle_##S##T##U (T *x)                   \
+  {                                                                    \
+    S r = vdup_n_##U (0);                                              \
+    r = vset_lane_##U (*x, r, 0);                                      \
+    return r;                                                          \
+  }
+
+LDR_ACLE_NARROW (int8x8_t, int8_t, s8)
+LDR_ACLE_NARROW (int16x4_t, int16_t, s16)
+LDR_ACLE_NARROW (int32x2_t, int32_t, s32)
+LDR_ACLE_NARROW (int64x1_t, int64_t, s64)
+
+LDR_ACLE_NARROW (uint8x8_t, uint8_t, u8)
+LDR_ACLE_NARROW (uint16x4_t, uint16_t, u16)
+LDR_ACLE_NARROW (uint32x2_t, uint32_t, u32)
+LDR_ACLE_NARROW (uint64x1_t, uint64_t, u64)
+
+LDR_ACLE_NARROW (float16x4_t, float16_t, f16)
+LDR_ACLE_NARROW (float32x2_t, float32_t, f32)
+LDR_ACLE_NARROW (float64x1_t, float64_t, f64)
+
+LDR_ACLE_NARROW (bfloat16x4_t, bfloat16_t, bf16)
+
+/* Tests using GCC vector types.  */
+
+typedef int8_t v16i8 __attribute__ ((vector_size (16)));
+typedef int16_t v8i16 __attribute__ ((vector_size (16)));
+typedef int32_t v4i32 __attribute__ ((vector_size (16)));
+typedef int64_t v2i64 __attribute__ ((vector_size (16)));
+
+typedef uint8_t v16u8 __attribute__ ((vector_size (16)));
+typedef uint16_t v8u16 __attribute__ ((vector_size (16)));
+typedef uint32_t v4u32 __attribute__ ((vector_size (16)));
+typedef uint64_t v2u64 __attribute__ ((vector_size (16)));
+
+typedef float16_t v8f16 __attribute__ ((vector_size (16)));
+typedef float32_t v4f32 __attribute__ ((vector_size (16)));
+typedef float64_t v2f64 __attribute__ ((vector_size (16)));
+
+typedef bfloat16_t v8bf16 __attribute__ ((vector_size (16)));
+
+#define LDR_GCC(S, T, U)                                               \
+  __attribute__ ((noinline)) S gcc_##S##T##U (T *x)                    \
+  {                                                                    \
+    S r = {0};                                                         \
+    r[sizeof (S) / sizeof (T) - 1] = *x;                               \
+    return r;                                                          \
+  }
+
+LDR_GCC (v16i8, int8_t, s8)
+LDR_GCC (v8i16, int16_t, s16)
+LDR_GCC (v4i32, int32_t, s32)
+LDR_GCC (v2i64, int64_t, s64)
+
+LDR_GCC (v16u8, uint8_t, u8)
+LDR_GCC (v8u16, uint16_t, u16)
+LDR_GCC (v4u32, uint32_t, u32)
+LDR_GCC (v2u64, uint64_t, u64)
+
+LDR_GCC (v8f16, float16_t, f16)
+LDR_GCC (v4f32, float32_t, f32)
+LDR_GCC (v2f64, float64_t, f64)
+
+LDR_GCC (v8bf16, bfloat16_t, bf16)
+
+typedef int8_t v8i8 __attribute__ ((vector_size (8)));
+typedef int16_t v4i16 __attribute__ ((vector_size (8)));
+typedef int32_t v2i32 __attribute__ ((vector_size (8)));
+typedef int64_t v1i64 __attribute__ ((vector_size (8)));
+
+typedef uint8_t v8u8 __attribute__ ((vector_size (8)));
+typedef uint16_t v4u16 __attribute__ ((vector_size (8)));
+typedef uint32_t v2u32 __attribute__ ((vector_size (8)));
+typedef uint64_t v1u64 __attribute__ ((vector_size (8)));
+
+typedef float16_t v4f16 __attribute__ ((vector_size (8)));
+typedef float32_t v2f32 __attribute__ ((vector_size (8)));
+typedef float64_t v1f64 __attribute__ ((vector_size (8)));
+
+typedef bfloat16_t v4bf16 __attribute__ ((vector_size (8)));
+
+#define LDR_GCC_NARROW(S, T, U)                                        \
+  __attribute__ ((noinline)) S gcc_##S##T##U (T *x)                    \
+  {                                                                    \
+    S r = {0};                                                         \
+    r[sizeof (S) / sizeof (T) - 1] = *x;                               \
+    return r;                                                          \
+  }
+
+LDR_GCC_NARROW (v8i8, int8_t, s8)
+LDR_GCC_NARROW (v4i16, int16_t, s16)
+LDR_GCC_NARROW (v2i32, int32_t, s32)
+LDR_GCC_NARROW (v1i64, int64_t, s64)
+
+LDR_GCC_NARROW (v8u8, uint8_t, u8)
+LDR_GCC_NARROW (v4u16, uint16_t, u16)
+LDR_GCC_NARROW (v2u32, uint32_t, u32)
+LDR_GCC_NARROW (v1u64, uint64_t, u64)
+
+LDR_GCC_NARROW (v4f16, float16_t, f16)
+LDR_GCC_NARROW (v2f32, float32_t, f32)
+LDR_GCC_NARROW (v1f64, float64_t, f64)
+
+LDR_GCC_NARROW (v4bf16, bfloat16_t, bf16)
+
+/* { dg-final { scan-assembler-times "\tldr\t" 48 } } */
+/* { dg-final { scan-assembler-not "\tmov\t" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/ldr_first_le.c b/gcc/testsuite/gcc.target/aarch64/simd/ldr_first_le.c
new file mode 100644
index 00000000000..3d69523c500
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/simd/ldr_first_le.c
@@ -0,0 +1,139 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -mlittle-endian -march=armv8-a+bf16" } */
+
+/* Tests using ACLE intrinsics.  */
+
+#include <arm_neon.h>
+
+#define LDR_ACLE(S, T, U)                                              \
+  __attribute__ ((noinline)) S acle_##S##T##U (T *x)                   \
+  {                                                                    \
+    S r = vdupq_n_##U (0);                                             \
+    r = vsetq_lane_##U (*x, r, 0);                                     \
+    return r;                                                          \
+  }
+
+LDR_ACLE (int8x16_t, int8_t, s8)
+LDR_ACLE (int16x8_t, int16_t, s16)
+LDR_ACLE (int32x4_t, int32_t, s32)
+LDR_ACLE (int64x2_t, int64_t, s64)
+
+LDR_ACLE (uint8x16_t, uint8_t, u8)
+LDR_ACLE (uint16x8_t, uint16_t, u16)
+LDR_ACLE (uint32x4_t, uint32_t, u32)
+LDR_ACLE (uint64x2_t, uint64_t, u64)
+
+LDR_ACLE (float16x8_t, float16_t, f16)
+LDR_ACLE (float32x4_t, float32_t, f32)
+LDR_ACLE (float64x2_t, float64_t, f64)
+
+LDR_ACLE (bfloat16x8_t, bfloat16_t, bf16)
+
+#define LDR_ACLE_NARROW(S, T, U)                                       \
+  __attribute__ ((noinline)) S acle_##S##T##U (T *x)                   \
+  {                                                                    \
+    S r = vdup_n_##U (0);                                              \
+    r = vset_lane_##U (*x, r, 0);                                      \
+    return r;                                                          \
+  }
+
+LDR_ACLE_NARROW (int8x8_t, int8_t, s8)
+LDR_ACLE_NARROW (int16x4_t, int16_t, s16)
+LDR_ACLE_NARROW (int32x2_t, int32_t, s32)
+LDR_ACLE_NARROW (int64x1_t, int64_t, s64)
+
+LDR_ACLE_NARROW (uint8x8_t, uint8_t, u8)
+LDR_ACLE_NARROW (uint16x4_t, uint16_t, u16)
+LDR_ACLE_NARROW (uint32x2_t, uint32_t, u32)
+LDR_ACLE_NARROW (uint64x1_t, uint64_t, u64)
+
+LDR_ACLE_NARROW (float16x4_t, float16_t, f16)
+LDR_ACLE_NARROW (float32x2_t, float32_t, f32)
+LDR_ACLE_NARROW (float64x1_t, float64_t, f64)
+
+LDR_ACLE_NARROW (bfloat16x4_t, bfloat16_t, bf16)
+
+/* Tests using GCC vector types.  */
+
+typedef int8_t v16i8 __attribute__ ((vector_size (16)));
+typedef int16_t v8i16 __attribute__ ((vector_size (16)));
+typedef int32_t v4i32 __attribute__ ((vector_size (16)));
+typedef int64_t v2i64 __attribute__ ((vector_size (16)));
+
+typedef uint8_t v16u8 __attribute__ ((vector_size (16)));
+typedef uint16_t v8u16 __attribute__ ((vector_size (16)));
+typedef uint32_t v4u32 __attribute__ ((vector_size (16)));
+typedef uint64_t v2u64 __attribute__ ((vector_size (16)));
+
+typedef float16_t v8f16 __attribute__ ((vector_size (16)));
+typedef float32_t v4f32 __attribute__ ((vector_size (16)));
+typedef float64_t v2f64 __attribute__ ((vector_size (16)));
+
+typedef bfloat16_t v8bf16 __attribute__ ((vector_size (16)));
+
+#define LDR_GCC(S, T, U)                                               \
+  __attribute__ ((noinline)) S gcc_##S##T##U (T *x)                    \
+  {                                                                    \
+    S r = {0};                                                         \
+    r[0] = *x;                                                         \
+    return r;                                                          \
+  }
+
+LDR_GCC (v16i8, int8_t, s8)
+LDR_GCC (v8i16, int16_t, s16)
+LDR_GCC (v4i32, int32_t, s32)
+LDR_GCC (v2i64, int64_t, s64)
+
+LDR_GCC (v16u8, uint8_t, u8)
+LDR_GCC (v8u16, uint16_t, u16)
+LDR_GCC (v4u32, uint32_t, u32)
+LDR_GCC (v2u64, uint64_t, u64)
+
+LDR_GCC (v8f16, float16_t, f16)
+LDR_GCC (v4f32, float32_t, f32)
+LDR_GCC (v2f64, float64_t, f64)
+
+LDR_GCC (v8bf16, bfloat16_t, bf16)
+
+typedef int8_t v8i8 __attribute__ ((vector_size (8)));
+typedef int16_t v4i16 __attribute__ ((vector_size (8)));
+typedef int32_t v2i32 __attribute__ ((vector_size (8)));
+typedef int64_t v1i64 __attribute__ ((vector_size (8)));
+
+typedef uint8_t v8u8 __attribute__ ((vector_size (8)));
+typedef uint16_t v4u16 __attribute__ ((vector_size (8)));
+typedef uint32_t v2u32 __attribute__ ((vector_size (8)));
+typedef uint64_t v1u64 __attribute__ ((vector_size (8)));
+
+typedef float16_t v4f16 __attribute__ ((vector_size (8)));
+typedef float32_t v2f32 __attribute__ ((vector_size (8)));
+typedef float64_t v1f64 __attribute__ ((vector_size (8)));
+
+typedef bfloat16_t v4bf16 __attribute__ ((vector_size (8)));
+
+#define LDR_GCC_NARROW(S, T, U)                                        \
+  __attribute__ ((noinline)) S gcc_##S##T##U (T *x)                    \
+  {                                                                    \
+    S r = {0};                                                         \
+    r[0] = *x;                                                         \
+    return r;                                                          \
+  }
+
+LDR_GCC_NARROW (v8i8, int8_t, s8)
+LDR_GCC_NARROW (v4i16, int16_t, s16)
+LDR_GCC_NARROW (v2i32, int32_t, s32)
+LDR_GCC_NARROW (v1i64, int64_t, s64)
+
+LDR_GCC_NARROW (v8u8, uint8_t, u8)
+LDR_GCC_NARROW (v4u16, uint16_t, u16)
+LDR_GCC_NARROW (v2u32, uint32_t, u32)
+LDR_GCC_NARROW (v1u64, uint64_t, u64)
+
+LDR_GCC_NARROW (v4f16, float16_t, f16)
+LDR_GCC_NARROW (v2f32, float32_t, f32)
+LDR_GCC_NARROW (v1f64, float64_t, f64)
+
+LDR_GCC_NARROW (v4bf16, bfloat16_t, bf16)
+
+/* { dg-final { scan-assembler-times "\tldr\t" 48 } } */
+/* { dg-final { scan-assembler-not "\tmov\t" } } */
-- 
2.44.0
