This patch introduces partial vectorization support for VEC_PERM_EXPR
in cases where the specified vector type is not supported by the target
architecture because of its size.

Take, for instance, this vector type:
typedef int32_t vnx32si __attribute__ ((vector_size (128)));

For -march=rv64gcv_zvl256b, GCC doesn't vectorize the following code:
__attribute__ ((noipa)) void permute_vnx32si (vnx32si values1,
                                              vnx32si values2,
                                              vnx32si *out) {
    vnx32si v =
        __builtin_shufflevector (values1, values2, 1, 1, 1, 1, 1, 1, 1, 1,
                                 1, 1, 1, 1, 1, 1, 1, 1,
                                 1, 1, 1, 1, 1, 1, 1, 1,
                                 1, 1, 1, 1, 1, 1, 1, 1);
    *(vnx32si *) out = v;
}

GCC produces one lw and 32 sw instructions, reading values1[1] once
and storing it 32 times using scalar stores.

With this patch applied the resulting assembly would look like:
permute_vnx32si:
        vsetivli        zero,8,e32,m1,ta,ma
        vle32.v v2,0(a0)
        addi    sp,sp,-128
        addi    a3,a2,32
        addi    a4,a2,64
        addi    a5,a2,96
        vrgather.vi     v1,v2,1
        vse32.v v1,0(a2)
        vse32.v v1,0(a3)
        vse32.v v1,0(a4)
        vse32.v v1,0(a5)
        addi    sp,sp,128
        jr      ra

With this patch, vectorization of VEC_PERM_EXPR is possible for indexes
which are bounded by split_type. split_type is the biggest vector type
which is supported according to can_vec_perm_var_p. In this case it is
8 elements * 32 bits per element = 256 bits.
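
Concretely, the search halves the vector type until the target accepts
the permutation. A minimal sketch of that loop, assuming a hypothetical
target_supports_perm () in place of the can_vec_perm_var_p check:

    unsigned elements = 32;            /* vnx32si lanes */
    unsigned split_elements = elements;
    unsigned divisor = 1;
    do
      {
        divisor <<= 1;                 /* 2, 4, ...  */
        split_elements >>= 1;          /* 16, 8, ...  */
      }
    while (!target_supports_perm (split_elements));
    /* With zvl256b this stops at divisor = 4, split_elements = 8:
       split_type = 8 x 32 bits = 256 bits.  */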

The optimization works by iterating through the mask and sorting the
index values into two groups: values in the [0, split_elements) range,
which can be vectorized by reading from the first split_type chunk of
the first vector, and values outside of that range. Since there are two
vectors to choose from, the valid values for the second vector are
[elements, elements + split_elements). The values which are not in
their respective range are handled using scalar instructions.
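
As an illustration, for the permute_vnx32si2 mask in the new test
(elements = 32, split_elements = 8) the classification would be:

    mask value  range                                  handling
    0..7        [0, split_elements)                    vector, lane of vec0
    8..31       [split_elements, elements)             scalar fix-up, lane zeroed
    32..39      [elements, elements + split_elements)  vector, remapped to 8..15 of vec1
    40..63      beyond elements + split_elements       scalar fix-up, lane zeroed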

Selector lanes whose index is out of range are set to 0, so their
destination is written to twice: first when the vector store writes the
lane selected by index 0, second when the scalar store corrects the
error.

There is a condition, however, when this approach produces poor code.
The worst case is when all indexes are out of range. This would produce
something like:
        vsetivli        zero,8,e32,m1,ta,ma # config
        vle32.v v2,0(a0)  # one load for the start of the vector
        addi    a3,a2,off # off = i * split_elements * element_size,
                          # i   = [0, elements/split_elements - 1)
                          # totaling div - 1 addi for the vse32 stores
        vrgather.vi v1,v2,1 # a single gather
        vse32.v v1,0(a[i])  # one store for each
                            # i = [0, elements/split_elements)
                            # totaling div vse32

    + a scalar load and a scalar store for each element.

Counting up the inserted vector instructions gives 2 * div + 2, if all
insns cost 1. This is the reasoning behind the arbitrary constraint:
    s_assignments.length () / 2 > elements - 2 * (div + 1)
for if there were a greater number of scalar assignments, the code
would produce redundant vector instructions.
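
For the vnx32si example above, div = 4, so the vector overhead is
2 * (4 + 1) = 10 instructions and split lowering is abandoned once more
than 32 - 10 = 22 of the 32 mask entries need a scalar fix-up.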

Tested on risc-v 64 and risc-v 32, no regressions.

PS. In the vector instruction example there are two addi instructions
adjusting the stack pointer register. I'm not quite sure about the
purpose of these instructions.

2024-31-11  Dusan Stojkovic  <dusan.stojko...@rt-rk.com>

      PR target/116425

gcc/ChangeLog:

      * tree-vect-generic.cc (split_lower_vec_perm): New function.
      (lower_vec_perm): Call it before lowering piecewise.

gcc/testsuite/ChangeLog:

      * gcc.target/riscv/rvv/autovec/pr116425-run.c: New test.
      * gcc.target/riscv/rvv/autovec/pr116425.c: New test.


---
 .../riscv/rvv/autovec/pr116425-run.c          |  53 +++++
 .../gcc.target/riscv/rvv/autovec/pr116425.c   |  54 +++++
 gcc/tree-vect-generic.cc                      | 196 ++++++++++++++++++
 3 files changed, 303 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr116425-run.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr116425.c

diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr116425-run.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr116425-run.c
new file mode 100644
index 00000000000..05a7f5a3824
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr116425-run.c
@@ -0,0 +1,53 @@
+/* { dg-do run } */
+/* { dg-require-effective-target rvv_zvl256b_ok }  */
+/* { dg-options "-std=c99 -O3 -march=rv64gcv_zvl256b -mabi=lp64d" } */
+
+#include "pr116425.c"
+
+#define comp(a, b, n)              \
+  for (unsigned i = 0; i < n; ++i) \
+    if ((a)[i] != (b)[i])          \
+      __builtin_abort ();
+
+int main ()
+{
+  a8 = (vnx8si) { 1, 2, 3, 4, 5, 6, 7, 8 };
+  b8 = (vnx8si) { 8, 7, 6, 5, 4, 3, 2, 1 };
+  a16 = (vnx16si) { 1, 2, 3, 4, 5, 6, 7, 8,
+                    9, 10, 11, 12, 13, 14, 15, 16 };
+  b16 = (vnx16si) { 16, 15, 14, 13, 12, 11, 10, 9,
+                    8, 7, 6, 5, 4, 3, 2, 1 };
+  a32 = (vnx32si) { 1, 2, 3, 4, 5, 6, 7, 8,
+                    9, 10, 11, 12, 13, 14, 15, 16,
+                    17, 18, 19, 20, 21, 22, 23, 24,
+                    25, 26, 27, 28, 29, 30, 31, 32 };
+  b32 = (vnx32si) { 32, 31, 30, 29, 28, 27, 26, 25,
+                    24, 23, 22, 21, 20, 19, 18, 17,
+                    16, 15, 14, 13, 12, 11, 10, 9,
+                    8, 7, 6, 5, 4, 3, 2, 1 };
+
+  permute_vnx8si (); 
+  comp (res8, ((vnx8si) { 2, 2, 2, 2, 2, 2, 2, 2 }), 8);
+
+  permute_vnx16si (); 
+  comp (res16, ((vnx16si) { 2, 2, 2, 2, 2, 2, 2, 2,
+                            2, 2, 2, 2, 2, 2, 2, 2 }), 16);
+
+  permute_vnx32si ();
+  comp (res32, ((vnx32si) { 2, 2, 2, 2, 2, 2, 2, 2,
+                            2, 2, 2, 2, 2, 2, 2, 2,
+                            2, 2, 2, 2, 2, 2, 2, 2,
+                            2, 2, 2, 2, 2, 2, 2, 2 }), 32);
+
+  permute_vnx32si1 ();
+  comp (res32, ((vnx32si) { 32, 32, 32, 32, 32, 32, 32, 32,
+                            32, 32, 32, 32, 32, 32, 32, 32,
+                            32, 32, 32, 32, 32, 32, 32, 32,
+                            32, 32, 32, 32, 32, 32, 32, 32 }), 32);
+
+  permute_vnx32si2 ();
+  comp (res32, ((vnx32si) { 32, 1, 31, 2, 30, 3, 29, 4,
+                            28, 5, 27, 6, 26, 7, 25, 8,
+                            24, 9, 23, 10, 22, 11, 21, 12,
+                            20, 13, 19, 14, 18, 15, 17, 16 }), 32);
+}
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr116425.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr116425.c
new file mode 100644
index 00000000000..f394c0e744b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr116425.c
@@ -0,0 +1,54 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=rv64gcv_zvl256b -ftree-vectorize" } */
+
+typedef int int32_t;
+
+typedef int32_t vnx2si __attribute__ ((vector_size (8)));
+typedef int32_t vnx4si __attribute__ ((vector_size (16)));
+typedef int32_t vnx8si __attribute__ ((vector_size (32)));
+typedef int32_t vnx16si __attribute__ ((vector_size (64)));
+typedef int32_t vnx32si __attribute__ ((vector_size (128)));
+
+vnx8si res8, a8, b8;
+vnx16si res16, a16, b16;
+vnx32si res32, a32, b32;
+
+__attribute__ ((noipa)) void permute_vnx8si () {
+    vnx8si v = 
+        __builtin_shufflevector (a8, b8, 1, 1, 1, 1, 1, 1, 1, 1);
+    res8 = v;
+} 
+__attribute__ ((noipa)) void permute_vnx16si () {
+    vnx16si v = 
+        __builtin_shufflevector (a16, b16, 1, 1, 1, 1, 1, 1, 1, 1,
+                                 1, 1, 1, 1, 1, 1, 1, 1);
+    res16 = v;
+} 
+__attribute__ ((noipa)) void permute_vnx32si () {
+    vnx32si v = 
+        __builtin_shufflevector (a32, b32, 
+                1, 1, 1, 1, 1, 1, 1, 1,
+                1, 1, 1, 1, 1, 1, 1, 1,
+                1, 1, 1, 1, 1, 1, 1, 1,
+                1, 1, 1, 1, 1, 1, 1, 1);
+    res32 = v;
+} 
+__attribute__ ((noipa)) void permute_vnx32si1 () {
+    vnx32si v = 
+        __builtin_shufflevector (a32, b32, 
+                32, 32, 32, 32, 32, 32, 32, 32, 
+                32, 32, 32, 32, 32, 32, 32, 32, 
+                32, 32, 32, 32, 32, 32, 32, 32, 
+                32, 32, 32, 32, 32, 32, 32, 32);
+    res32 = v;
+}
+__attribute__ ((noipa)) void permute_vnx32si2 () {
+    vnx32si v = 
+        __builtin_shufflevector (a32, b32, 
+                32, 0, 33, 1, 34, 2, 35, 3,
+                36, 4, 37, 5, 38, 6, 39, 7,
+                40, 8, 41, 9, 42, 10, 43, 11,
+                44, 12, 45, 13, 46, 14, 47, 15 );
+    res32 = v;
+}
+/* { dg-final { scan-assembler-times {vrgather\.vi} 4 } } */
+/* { dg-final { scan-assembler-times {vrgather\.vv} 4 } } */
diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc
index 78f6e552cc7..e1498ae7777 100644
--- a/gcc/tree-vect-generic.cc
+++ b/gcc/tree-vect-generic.cc
@@ -1381,6 +1381,200 @@ vector_element (gimple_stmt_iterator *gsi, tree vect, tree idx, tree *ptmpvec)
                  idx, NULL_TREE, NULL_TREE);
 }
 
+/* In cases where lower_vec_perm fails to vectorize a permutation this
+   function attempts to split the original vector type into a type
+   which can be vectorized.  It achieves this in multiple steps:
+    1.  Find the store statement associated with the VEC_PERM_EXPR;
+    2.  Find the optimal divisor, create split_type = orig_vec_type / divisor;
+    3.  Group the VEC_PERM_EXPR indexes into two categories:
+       a) values between 0 and split_elements, or between elements and
+       elements + split_elements, since these can be vectorized with the
+       new split_type vector;
+       b) values which don't fit in these bounds, which are emitted as
+       scalars (if too many values are to be emitted as scalar
+       loads/stores, abandon split_lower_vec_perm).
+    4.  Create new VEC_PERM_EXPRs and SSA names for sub-vector variables.
+    5.  Build new GIMPLE statements:
+       a) temporary BIT_FIELD_REF variables;
+       b) new VEC_PERM_EXPRs of split_type;
+       c) MEM_REF stores offset by split_type_t;
+       d) scalar loads/stores for indexes outside the ranges in 3a.
+    6.  Clean up old GIMPLE statements, insert new ones.  */
+
+static bool
+split_lower_vec_perm (gimple_stmt_iterator *gsi, tree orig_vec_type,
+               tree vec0, tree vec1,
+               tree mask, unsigned HOST_WIDE_INT elements)
+{
+  /* Try to find the related store stmt before going forward.  */
+  tree lhs_store = NULL_TREE;
+  gimple_stmt_iterator store;
+  tree store_ptr;
+  tree lhs = gimple_assign_lhs (gsi_stmt (*gsi));
+  use_operand_p use_p;
+  imm_use_iterator iterator;
+  FOR_EACH_IMM_USE_FAST (use_p, iterator, lhs)
+  {
+    gimple *use = USE_STMT (use_p);
+    if (gimple_code (use) == GIMPLE_ASSIGN)
+    {
+      tree tmp = gimple_assign_lhs (use);
+      lhs_store = tmp;
+      store = gsi_for_stmt (use);
+      if (TREE_CODE (tmp) == MEM_REF)
+      {
+        store_ptr = TREE_OPERAND_CHECK (lhs_store, 0);
+      }
+      else
+      {
+        store_ptr = build1 (ADDR_EXPR, build_pointer_type (orig_vec_type),
+                            lhs_store);
+      }
+    }
+  }
+  if (lhs_store == NULL_TREE)
+    return false;
+
+  /* Before lowering using scalar instructions, first try to find a type
+     which can be expanded using the SIMD extensions of the CPU.
+     Try to find a split_type which is a valid vector type given the
+     constraints of the HW; if no such type exists, give up on lowering
+     the VEC_PERM_EXPR this way.  */
+  tree split_type;
+  unsigned HOST_WIDE_INT split_elements = elements;
+  unsigned HOST_WIDE_INT divisor = 1;
+  do
+  {
+    divisor = divisor << 1;
+    split_elements = split_elements >> 1;
+    split_type = build_vector_type (TREE_TYPE (orig_vec_type),
+               exact_div (TYPE_VECTOR_SUBPARTS (orig_vec_type), divisor));
+    if (split_elements == 1)
+      return false;
+  } while (!can_vec_perm_var_p (TYPE_MODE (split_type)));
+
+  unsigned HOST_WIDE_INT i, j;
+  unsigned HOST_WIDE_INT base_type_t;
+  unsigned HOST_WIDE_INT split_type_t;
+  auto_vec<gassign*> v_assignments;
+  auto_vec<gassign*> s_assignments;
+  auto_vec<vec<constructor_elt, va_gc>*> vcv;
+  tree t0 = NULL_TREE, t1 = NULL_TREE, t = NULL_TREE, masktmp = NULL_TREE;
+  tree split_type_ptr = build_pointer_type (split_type);
+  tree base_type = TREE_TYPE (split_type);
+  tree base_type_ptr = build_pointer_type (base_type);
+  tree offset_ptr, offset_int;
+  base_type_t = TREE_INT_CST_LOW (TYPE_SIZE_UNIT (base_type));
+  split_type_t = TREE_INT_CST_LOW (TYPE_SIZE_UNIT (split_type));
+  unsigned HOST_WIDE_INT split_type_bits = CHAR_TYPE_SIZE * split_type_t;
+  tree split_type_width = build_int_cst (bitsizetype, split_type_bits);
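+  /* Walk the mask chunk by chunk: for each of the divisor output
+     sub-vectors build a split_elements-wide selector, zeroing the lanes
+     whose index cannot be served by the first chunk of either input
+     vector and queueing a scalar fix-up for them instead.  */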
+  for (i = 0; i < divisor; i++)
+  {
+    vec<constructor_elt, va_gc> *vc;
+    vec_alloc (vc, split_elements);
+    vcv.safe_push (vc);
+    for (j = 0; j < split_elements; j++)
+    {
+      bool scalar = false;
+      unsigned HOST_WIDE_INT offset_bytes = 0;
+      tree vec;
+      tree si = size_int (i * split_elements + j);
+      tree t = vector_element (gsi, mask, si, &masktmp);
+      unsigned HOST_WIDE_INT t_i = TREE_INT_CST_LOW (t);
+      if (t_i >= split_elements && t_i < elements)
+      {
+        scalar = true;
+        t = build_int_cst (base_type, 0);
+        vec = vec0;
+      }
+      else if (t_i >= elements && t_i < elements + split_elements)
+      {
+        t_i = t_i - elements + split_elements;
+        t = build_int_cst (base_type, t_i);
+      }
+      else if (t_i >= elements + split_elements)
+      {
+        scalar = true;
+        t_i -= elements;
+        t = build_int_cst (base_type, 0);
+        vec = vec1;
+      }
+
+      CONSTRUCTOR_APPEND_ELT (vcv[i], NULL_TREE, t);
+      if (scalar)
+      {
+        gassign *load_stmt;
+        offset_bytes = i * split_type_t + j * base_type_t;
+        offset_ptr = build_int_cst (base_type_ptr, offset_bytes);
+        offset_bytes = t_i * split_type_t;
+        offset_int = build_int_cst (bitsizetype, offset_bytes);
+        tree sub_bit_v0 = build3 (BIT_FIELD_REF, TREE_TYPE (split_type), vec,
+                                  build_int_cst (bitsizetype, split_type_t),
+                                  offset_int);
+        tree new_lhs = build2 (MEM_REF, TREE_TYPE (split_type), store_ptr,
+                               offset_ptr);
+        t = make_ssa_name (TREE_TYPE (split_type));
+        load_stmt = gimple_build_assign (t, sub_bit_v0);
+        s_assignments.safe_push (load_stmt);
+        s_assignments.safe_push (gimple_build_assign (new_lhs, t));
+      }
+    }
+  }
+
+  /* If there are too many scalar assignments, abandon split lowering.  */
+  if (s_assignments.length () / 2 > elements - 2 * (divisor + 1))
+    return false;
+
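+  /* Emit the vector path: for each output chunk extract the first
+     split_type chunk of both inputs, apply the chunk's selector, and
+     store the result at the matching offset of the destination.  */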
+  offset_int = build_int_cst (bitsizetype, 0);
+  for (i = 0; i < divisor; i++)
+  {
+    t0 = make_ssa_name (split_type);
+    t1 = make_ssa_name (split_type);
+
+    /* Calculate the address for storing the new lowered vector and
+       build the store of the lowered vector to *(out + offset).  */
+    tree offset_ptr = build_int_cst (split_type_ptr, i * split_type_t);
+    tree bit_field_ref_t0 = build3 (BIT_FIELD_REF, split_type, vec0,
+                                    split_type_width, offset_int);
+    tree bit_field_ref_t1 = build3 (BIT_FIELD_REF, split_type, vec1,
+                                    split_type_width, offset_int);
+
+    v_assignments.safe_push (gimple_build_assign (t0, bit_field_ref_t0));
+    v_assignments.safe_push (gimple_build_assign (t1, bit_field_ref_t1));
+    tree sel = build_vector_from_ctor (split_type, vcv[i]);
+    vec_perm_builder sel_int;
+    if (TREE_CODE (sel) == VECTOR_CST
+      && tree_to_vec_perm_builder (&sel_int, sel))
+    {
+      vec_perm_indices indices (sel_int, 2, split_elements);
+      tree vec_mask = vect_gen_perm_mask_checked (split_type, indices);
+      tree vec_perm_expr = build3 (VEC_PERM_EXPR,
+                                   split_type, t0, t1, vec_mask);
+
+      t = make_ssa_name (split_type);
+      v_assignments.safe_push (gimple_build_assign (t, vec_perm_expr));
+
+      tree new_lhs = build2 (MEM_REF, split_type, store_ptr, offset_ptr);
+      v_assignments.safe_push (gimple_build_assign (new_lhs, t));
+    }
+  }
+  gsi_remove (gsi, true);
+  for (i = 0; i < v_assignments.length (); i++)
+  {
+    gsi_insert_before (gsi, v_assignments[i], GSI_SAME_STMT);
+  }
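+  /* The scalar fix-ups must be inserted after the vector stores so
+     that they overwrite the zeroed lanes.  */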
+  for (i = 0; i < s_assignments.length (); i++)
+  {
+    gsi_insert_before (gsi, s_assignments[i], GSI_SAME_STMT);
+  }
+  gsi_remove (&store, true);
+
+  return true;
+}
+
 /* Check if VEC_PERM_EXPR within the given setting is supported
    by hardware, or lower it piecewise.
 
@@ -1503,6 +1696,9 @@ lower_vec_perm (gimple_stmt_iterator *gsi)
     warning_at (loc, OPT_Wvector_operation_performance,
                "vector shuffling operation will be expanded piecewise");
 
+  if (split_lower_vec_perm (gsi, res_vect_type, vec0, vec1, mask, elements))
+    return;
+
   vec_alloc (v, elements);
   bool constant_p = true;
   for (i = 0; i < elements; i++)
-- 
2.43.0
