from:"Sergei Lewis"

[PATCH v2 0/3] RISC-V: vectorised memory operations

2023-12-19 Thread Sergei Lewis

This patchset permits generation of inlined vectorised code for movmem, 
setmem and cmpmem, if and only if the operation size is 
at least one and at most eight vector registers' worth of data.

Further vectorisation rapidly becomes debatable due to code size concerns;
however, for these simple cases we do have an unambiguous performance win 
without sacrificing too much code size compared to a libc call.

Changes in v2:

* run clang-format over the code in addition to the 
  contrib/check_GNU_style.sh that was used for v1

* remove string.h include and refer to __builtin_* memory functions 
  in multilib tests

* respect stringop_strategy (don't vectorise if it doesn't include VECTOR)

* use an integer constraint for movmem length parameter

* use TARGET_MAX_LMUL unless riscv-autovec-lmul=dynamic 
  to ensure we respect the user's wishes if they request specific lmul

* add new unit tests to check that riscv-autovec-lmul is respected

* PR target/112109 added to changelog for patch 1/3 as requested

Sergei Lewis (3):
  RISC-V: movmem for RISCV with V extension
  RISC-V: setmem for RISCV with V extension
  RISC-V: cmpmem for RISCV with V extension

 gcc/config/riscv/riscv-protos.h   |   2 +
 gcc/config/riscv/riscv-string.cc  | 190 ++
 gcc/config/riscv/riscv.md |  51 +
 .../gcc.target/riscv/rvv/base/cmpmem-1.c  |  88 
 .../gcc.target/riscv/rvv/base/cmpmem-2.c  |  74 +++
 .../gcc.target/riscv/rvv/base/cmpmem-3.c  |  45 +
 .../gcc.target/riscv/rvv/base/cmpmem-4.c  |  62 ++
 .../gcc.target/riscv/rvv/base/movmem-1.c  |  60 ++
 .../gcc.target/riscv/rvv/base/setmem-1.c  | 103 ++
 .../gcc.target/riscv/rvv/base/setmem-2.c  |  51 +
 .../gcc.target/riscv/rvv/base/setmem-3.c  |  69 +++
 11 files changed, 795 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/cmpmem-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/cmpmem-2.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/cmpmem-3.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/cmpmem-4.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/movmem-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/setmem-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/setmem-2.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/setmem-3.c

-- 
2.34.1

[PATCH v2 1/3] RISC-V: movmem for RISCV with V extension

2023-12-19 Thread Sergei Lewis

gcc/ChangeLog

* config/riscv/riscv.md (movmem): Use riscv_vector::expand_block_move,
if and only if we know the entire operation can be performed using one 
vector
load followed by one vector store

gcc/testsuite/ChangeLog

PR target/112109
* gcc.target/riscv/rvv/base/movmem-1.c: New test
---
 gcc/config/riscv/riscv.md | 22 +++
 .../gcc.target/riscv/rvv/base/movmem-1.c  | 60 +++
 2 files changed, 82 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/movmem-1.c

diff --git a/gcc/config/riscv/riscv.md b/gcc/config/riscv/riscv.md
index ee8b71c22aa..1b3f66fd15c 100644
--- a/gcc/config/riscv/riscv.md
+++ b/gcc/config/riscv/riscv.md
@@ -2365,6 +2365,28 @@
 FAIL;
 })
 
+;; Inlining general memmove is a pessimisation: we can't avoid having to decide
+;; which direction to go at runtime, which is costly in instruction count
+;; however for situations where the entire move fits in one vector operation
+;; we can do all reads before doing any writes so we don't have to worry
+;; so generate the inline vector code in such situations
+;; nb. prefer scalar path for tiny memmoves.
+(define_expand "movmem"
+  [(parallel [(set (match_operand:BLK 0 "general_operand")
+   (match_operand:BLK 1 "general_operand"))
+(use (match_operand:P 2 "const_int_operand"))
+(use (match_operand:SI 3 "const_int_operand"))])]
+  "TARGET_VECTOR"
+{
+  if ((INTVAL (operands[2]) >= TARGET_MIN_VLEN/8)
+   && (INTVAL (operands[2]) <= TARGET_MIN_VLEN)
+   && riscv_vector::expand_block_move (operands[0], operands[1],
+operands[2]))
+DONE;
+  else
+FAIL;
+})
+
 ;; Expand in-line code to clear the instruction cache between operand[0] and
 ;; operand[1].
 (define_expand "clear_cache"
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/base/movmem-1.c 
b/gcc/testsuite/gcc.target/riscv/rvv/base/movmem-1.c
new file mode 100644
index 000..0ecc3f7e3b7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/base/movmem-1.c
@@ -0,0 +1,60 @@
+/* { dg-do compile } */
+/* { dg-add-options riscv_v } */
+/* { dg-additional-options "-O3 --param=riscv-autovec-lmul=dynamic" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#define MIN_VECTOR_BYTES (__riscv_v_min_vlen / 8)
+
+/* Tiny memmoves should not be vectorised.
+** f1:
+**  li\s+a2,\d+
+**  tail\s+memmove
+*/
+char *
+f1 (char *a, char const *b)
+{
+  return __builtin_memmove (a, b, MIN_VECTOR_BYTES - 1);
+}
+
+/* Vectorise+inline minimum vector register width with LMUL=1
+** f2:
+**  (
+**  vsetivli\s+zero,16,e8,m1,ta,ma
+**  |
+**  li\s+[ta][0-7],\d+
+**  vsetvli\s+zero,[ta][0-7],e8,m1,ta,ma
+**  )
+**  vle8\.v\s+v\d+,0\(a1\)
+**  vse8\.v\s+v\d+,0\(a0\)
+**  ret
+*/
+char *
+f2 (char *a, char const *b)
+{
+  return __builtin_memmove (a, b, MIN_VECTOR_BYTES);
+}
+
+/* Vectorise+inline up to LMUL=8
+** f3:
+**  li\s+[ta][0-7],\d+
+**  vsetvli\s+zero,[ta][0-7],e8,m8,ta,ma
+**  vle8\.v\s+v\d+,0\(a1\)
+**  vse8\.v\s+v\d+,0\(a0\)
+**  ret
+*/
+char *
+f3 (char *a, char const *b)
+{
+  return __builtin_memmove (a, b, MIN_VECTOR_BYTES * 8);
+}
+
+/* Don't vectorise if the move is too large for one operation
+** f4:
+**  li\s+a2,\d+
+**  tail\s+memmove
+*/
+char *
+f4 (char *a, char const *b)
+{
+  return __builtin_memmove (a, b, MIN_VECTOR_BYTES * 8 + 1);
+}
-- 
2.34.1

[PATCH v2 3/3] RISC-V: cmpmem for RISCV with V extension

2023-12-19 Thread Sergei Lewis

gcc/ChangeLog:

* config/riscv/riscv-protos.h (riscv_vector::expand_vec_cmpmem): New 
function
declaration.

* config/riscv/riscv-string.cc (riscv_vector::expand_vec_cmpmem): New
function; this generates an inline vectorised memory compare, if and only if
we know the entire operation can be performed in a single vector load per
input

* config/riscv/riscv.md (cmpmemsi): Try riscv_vector::expand_vec_cmpmem for
constant lengths

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/base/cmpmem-1.c: New codegen tests
* gcc.target/riscv/rvv/base/cmpmem-2.c: New execution tests
* gcc.target/riscv/rvv/base/cmpmem-3.c: New codegen tests
* gcc.target/riscv/rvv/base/cmpmem-4.c: New codegen tests
---
 gcc/config/riscv/riscv-protos.h   |   1 +
 gcc/config/riscv/riscv-string.cc  | 100 ++
 gcc/config/riscv/riscv.md |  15 +++
 .../gcc.target/riscv/rvv/base/cmpmem-1.c  |  88 +++
 .../gcc.target/riscv/rvv/base/cmpmem-2.c  |  74 +
 .../gcc.target/riscv/rvv/base/cmpmem-3.c  |  45 
 .../gcc.target/riscv/rvv/base/cmpmem-4.c  |  62 +++
 7 files changed, 385 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/cmpmem-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/cmpmem-2.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/cmpmem-3.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/cmpmem-4.c

diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index c4531589300..301aa9b8889 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -638,6 +638,7 @@ void expand_rawmemchr (machine_mode, rtx, rtx, rtx, bool = 
false);
 bool expand_strcmp (rtx, rtx, rtx, rtx, unsigned HOST_WIDE_INT, bool);
 void emit_vec_extract (rtx, rtx, rtx);
 bool expand_vec_setmem (rtx, rtx, rtx, rtx);
+bool expand_vec_cmpmem (rtx, rtx, rtx, rtx);
 
 /* Rounding mode bitfield for fixed point VXRM.  */
 enum fixed_point_rounding_mode
diff --git a/gcc/config/riscv/riscv-string.cc b/gcc/config/riscv/riscv-string.cc
index e506b92a552..3b634851753 100644
--- a/gcc/config/riscv/riscv-string.cc
+++ b/gcc/config/riscv/riscv-string.cc
@@ -1337,4 +1337,104 @@ expand_vec_setmem (rtx dst_in, rtx length_in, rtx 
fill_value_in,
   return true;
 }
 
+/* Used by cmpmemsi in riscv.md.  */
+
+bool
+expand_vec_cmpmem (rtx result_out, rtx blk_a_in, rtx blk_b_in, rtx length_in)
+{
+  HOST_WIDE_INT lmul;
+  /* Check we are able and allowed to vectorise this operation;
+ bail if not.  */
+  if (!check_vectorise_memory_operation (length_in, lmul))
+return false;
+
+  /* Strategy:
+ load entire blocks at a and b into vector regs
+ generate mask of bytes that differ
+ find first set bit in mask
+ find offset of first set bit in mask, use 0 if none set
+ result is ((char*)a[offset] - (char*)b[offset])
+   */
+
+  machine_mode vmode
+  = riscv_vector::get_vector_mode (QImode, BYTES_PER_RISCV_VECTOR * lmul)
+ .require ();
+  rtx blk_a_addr = copy_addr_to_reg (XEXP (blk_a_in, 0));
+  rtx blk_a = change_address (blk_a_in, vmode, blk_a_addr);
+  rtx blk_b_addr = copy_addr_to_reg (XEXP (blk_b_in, 0));
+  rtx blk_b = change_address (blk_b_in, vmode, blk_b_addr);
+
+  rtx vec_a = gen_reg_rtx (vmode);
+  rtx vec_b = gen_reg_rtx (vmode);
+
+  machine_mode mask_mode = get_mask_mode (vmode);
+  rtx mask = gen_reg_rtx (mask_mode);
+  rtx mismatch_ofs = gen_reg_rtx (Pmode);
+
+  rtx ne = gen_rtx_NE (mask_mode, vec_a, vec_b);
+  rtx vmsops[] = { mask, ne, vec_a, vec_b };
+  rtx vfops[] = { mismatch_ofs, mask };
+
+  /* If the length is exactly vlmax for the selected mode, do that.
+ Otherwise, use a predicated store.  */
+
+  if (known_eq (GET_MODE_SIZE (vmode), INTVAL (length_in)))
+{
+  emit_move_insn (vec_a, blk_a);
+  emit_move_insn (vec_b, blk_b);
+  emit_vlmax_insn (code_for_pred_cmp (vmode), riscv_vector::COMPARE_OP,
+  vmsops);
+
+  emit_vlmax_insn (code_for_pred_ffs (mask_mode, Pmode),
+  riscv_vector::CPOP_OP, vfops);
+}
+  else
+{
+  if (!satisfies_constraint_K (length_in))
+ length_in = force_reg (Pmode, length_in);
+
+  rtx memmask = CONSTM1_RTX (mask_mode);
+
+  rtx m_ops_a[] = { vec_a, memmask, blk_a };
+  rtx m_ops_b[] = { vec_b, memmask, blk_b };
+
+  emit_nonvlmax_insn (code_for_pred_mov (vmode),
+ riscv_vector::UNARY_OP_TAMA, m_ops_a, length_in);
+  emit_nonvlmax_insn (code_for_pred_mov (vmode),
+ riscv_vector::UNARY_OP_TAMA, m_ops_b, length_in);
+
+  emit_nonvlmax_insn (code_for_pred_cmp (vmode), riscv_vector::COMPARE_OP,
+ vmsops, length_in);
+
+  emit_nonvlmax_insn (code_for_pred_ffs (mask_mode, Pmode),
+ riscv_vector::CPOP_OP, vfops, length_in);
+}
+
+  /*

[PATCH v2 2/3] RISC-V: setmem for RISCV with V extension

2023-12-19 Thread Sergei Lewis

gcc/ChangeLog

* config/riscv/riscv-protos.h (riscv_vector::expand_vec_setmem): New 
function
declaration.

* config/riscv/riscv-string.cc (riscv_vector::expand_vec_setmem): New
function: this generates an inline vectorised memory set, if and only if we
know the entire operation can be performed in a single vector store

* config/riscv/riscv.md (setmem): Try riscv_vector::expand_vec_setmem
for constant lengths

gcc/testsuite/ChangeLog
* gcc.target/riscv/rvv/base/setmem-1.c: New tests
* gcc.target/riscv/rvv/base/setmem-2.c: New tests
* gcc.target/riscv/rvv/base/setmem-3.c: New tests
---
 gcc/config/riscv/riscv-protos.h   |   1 +
 gcc/config/riscv/riscv-string.cc  |  90 +++
 gcc/config/riscv/riscv.md |  14 +++
 .../gcc.target/riscv/rvv/base/setmem-1.c  | 103 ++
 .../gcc.target/riscv/rvv/base/setmem-2.c  |  51 +
 .../gcc.target/riscv/rvv/base/setmem-3.c  |  69 
 6 files changed, 328 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/setmem-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/setmem-2.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/setmem-3.c

diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index eaee53ce94e..c4531589300 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -637,6 +637,7 @@ void expand_popcount (rtx *);
 void expand_rawmemchr (machine_mode, rtx, rtx, rtx, bool = false);
 bool expand_strcmp (rtx, rtx, rtx, rtx, unsigned HOST_WIDE_INT, bool);
 void emit_vec_extract (rtx, rtx, rtx);
+bool expand_vec_setmem (rtx, rtx, rtx, rtx);
 
 /* Rounding mode bitfield for fixed point VXRM.  */
 enum fixed_point_rounding_mode
diff --git a/gcc/config/riscv/riscv-string.cc b/gcc/config/riscv/riscv-string.cc
index 11c1f74d0b3..e506b92a552 100644
--- a/gcc/config/riscv/riscv-string.cc
+++ b/gcc/config/riscv/riscv-string.cc
@@ -1247,4 +1247,94 @@ expand_strcmp (rtx result, rtx src1, rtx src2, rtx 
nbytes,
   return true;
 }
 
+/* Check we are permitted to vectorise a memory operation.
+   If so, return true and populate lmul_out.
+   Otherwise, return false and leave lmul_out unchanged.  */
+static bool
+check_vectorise_memory_operation (rtx length_in, HOST_WIDE_INT &lmul_out)
+{
+  /* If we either can't or have been asked not to vectorise, respect this.  */
+  if (!TARGET_VECTOR)
+return false;
+  if (!(stringop_strategy & STRATEGY_VECTOR))
+return false;
+
+  /* If we can't reason about the length, don't vectorise.  */
+  if (!CONST_INT_P (length_in))
+return false;
+
+  HOST_WIDE_INT length = INTVAL (length_in);
+
+  /* If it's tiny, default operation is likely better; maybe worth
+ considering fractional lmul in the future as well.  */
+  if (length < (TARGET_MIN_VLEN / 8))
+return false;
+
+  /* If we've been asked to use a specific LMUL,
+ check the operation fits and do that.  */
+  if (riscv_autovec_lmul != RVV_DYNAMIC)
+{
+  lmul_out = TARGET_MAX_LMUL;
+  return (length <= ((TARGET_MAX_LMUL * TARGET_MIN_VLEN) / 8));
+}
+
+  /* Find smallest lmul large enough for entire op.  */
+  HOST_WIDE_INT lmul = 1;
+  while ((lmul <= 8) && (length > ((lmul * TARGET_MIN_VLEN) / 8)))
+{
+  lmul <<= 1;
+}
+
+  if (lmul > 8)
+return false;
+
+  lmul_out = lmul;
+  return true;
+}
+
+/* Used by setmemdi in riscv.md.  */
+bool
+expand_vec_setmem (rtx dst_in, rtx length_in, rtx fill_value_in,
+  rtx alignment_in)
+{
+  HOST_WIDE_INT lmul;
+  /* Check we are able and allowed to vectorise this operation;
+ bail if not.  */
+  if (!check_vectorise_memory_operation (length_in, lmul))
+return false;
+
+  machine_mode vmode
+  = riscv_vector::get_vector_mode (QImode, BYTES_PER_RISCV_VECTOR * lmul)
+   .require ();
+  rtx dst_addr = copy_addr_to_reg (XEXP (dst_in, 0));
+  rtx dst = change_address (dst_in, vmode, dst_addr);
+
+  rtx fill_value = gen_reg_rtx (vmode);
+  rtx broadcast_ops[] = { fill_value, fill_value_in };
+
+  /* If the length is exactly vlmax for the selected mode, do that.
+ Otherwise, use a predicated store.  */
+  if (known_eq (GET_MODE_SIZE (vmode), INTVAL (length_in)))
+{
+  emit_vlmax_insn (code_for_pred_broadcast (vmode), UNARY_OP,
+ broadcast_ops);
+  emit_move_insn (dst, fill_value);
+}
+  else
+{
+  if (!satisfies_constraint_K (length_in))
+ length_in = force_reg (Pmode, length_in);
+  emit_nonvlmax_insn (code_for_pred_broadcast (vmode), UNARY_OP,
+ broadcast_ops, length_in);
+  machine_mode mask_mode
+ = riscv_vector::get_vector_mode (BImode, GET_MODE_NUNITS (vmode))
+ .require ();
+  rtx mask = CONSTM1_RTX (mask_mode);
+  emit_insn (gen_pred_store (vmode, dst, mask, fill_value, length_in,
+

Re: [PATCH v2 1/3] RISC-V: movmem for RISCV with V extension

2023-12-20 Thread Sergei Lewis

Hi,

this patchset has been tested with the following configurations:

rv64gcv_zvl128b
rv64gcv_zvl256b
rv32imafd_zve32x1p0
rv32gc_zve64f_zvl128b

Will fix the formatting in v3.

Thanks

On Wed, Dec 20, 2023 at 5:28 AM Jeff Law  wrote:

>
>
> On 12/19/23 02:53, Sergei Lewis wrote:
> > gcc/ChangeLog
> >
> >  * config/riscv/riscv.md (movmem): Use
> riscv_vector::expand_block_move,
> >  if and only if we know the entire operation can be performed using
> one vector
> >  load followed by one vector store
> >
> > gcc/testsuite/ChangeLog
> >
> >  PR target/112109
> >  * gcc.target/riscv/rvv/base/movmem-1.c: New test
> So this needs to be regression tested.  Given that it only affects RVV,
> I would suggest testing rv64gcv or rv32gcv.
>
>
>
> > +(define_expand "movmem"
> > +  [(parallel [(set (match_operand:BLK 0 "general_operand")
> > +   (match_operand:BLK 1 "general_operand"))
> > +(use (match_operand:P 2 "const_int_operand"))
> > +(use (match_operand:SI 3 "const_int_operand"))])]
> > +  "TARGET_VECTOR"
> > +{
> > +  if ((INTVAL (operands[2]) >= TARGET_MIN_VLEN/8)
> > + && (INTVAL (operands[2]) <= TARGET_MIN_VLEN)
> > + && riscv_vector::expand_block_move (operands[0], operands[1],
> > +  operands[2]))
> > +DONE;
> > +  else
> > +FAIL;
> > +})
> Just a formatting nit.  A space on each side of the '/' operator above.
>
>
> Jeff
>

Re: [PATCH v2 2/3] RISC-V: setmem for RISCV with V extension

2023-12-20 Thread Sergei Lewis

Hi,

This has been tested with the following configurations:
rv64gcv_zvl128b
rv64gcv_zvl256b
rv32imafd_zve32x1p0
rv32gc_zve64f_zvl128b

I'll drop the constraints and add the testing info to the cover email in
v3. I'll hold off submitting v3 until gcc-15 as requested.

On Wed, Dec 20, 2023 at 5:38 AM Jeff Law  wrote:

>
>
> On 12/19/23 02:53, Sergei Lewis wrote:
> > gcc/ChangeLog
> >
> >  * config/riscv/riscv-protos.h (riscv_vector::expand_vec_setmem):
> New function
> >  declaration.
> >
> >  * config/riscv/riscv-string.cc (riscv_vector::expand_vec_setmem):
> New
> >  function: this generates an inline vectorised memory set, if and
> only if we
> >  know the entire operation can be performed in a single vector store
> >
> >  * config/riscv/riscv.md (setmem): Try
> riscv_vector::expand_vec_setmem
> >  for constant lengths
> >
> > gcc/testsuite/ChangeLog
> >  * gcc.target/riscv/rvv/base/setmem-1.c: New tests
> >  * gcc.target/riscv/rvv/base/setmem-2.c: New tests
> >  * gcc.target/riscv/rvv/base/setmem-3.c: New tests
> As with patch 1/3 this needs to be regression tested.  The other
> concern, which I should have voiced with patch 1/3 is that this was
> submitted after the gcc-14 development window closed.  While we do have
> some degrees of freedom to accept backend specific new features, we
> really shouldn't be adding new features/optimizations at this point.  We
> really should just be fixing bugs and new features should be queued for
> gcc-15.
>
>
>
>
> > diff --git a/gcc/config/riscv/riscv.md b/gcc/config/riscv/riscv.md
> > index 1b3f66fd15c..dd34211ca80 100644
> > --- a/gcc/config/riscv/riscv.md
> > +++ b/gcc/config/riscv/riscv.md
> > @@ -2387,6 +2387,20 @@
> >   FAIL;
> >   })
> >
> > +(define_expand "setmemsi"
> > +  [(set (match_operand:BLK 0 "memory_operand") ;; Dest
> > +   (match_operand:QI  2 "nonmemory_operand")) ;; Value
> > +   (use (match_operand:SI  1 "const_int_operand")) ;; Length
> > +   (match_operand:SI   3 "const_int_operand")] ;; Align
> > +  "TARGET_VECTOR"
> > +{
> > +  if (riscv_vector::expand_vec_setmem (operands[0], operands[1],
> operands[2],
> > +  operands[3]))
> > +DONE;
> > +  else
> > +FAIL;
> > +})
> Is the :SI really needed for operands1 and operands3?  a CONST_INT node
> never has a mode.Or is the existence of the mode just to keep the
> gen* programs from generating a warning?  And if we're going to keep a
> mode, particularly on the length, shouldn't the length be in mode P?
>
>
> Jeff
>

[PATCH 0/3] RISC-V: vectorised memory operations

2023-12-11 Thread Sergei Lewis

This patchset permits generation of inlined vectorised code for movmem, 
setmem and cmpmem, if and only if the operation size is 
at least one and at most eight vector registers' worth of data.

Further vectorisation rapidly becomes debatable due to code size concerns;
however, for these simple cases we do have an unambiguous performance win 
without sacrificing too much code size compared to a libc call.

Signed-off-by: Sergei Lewis 

---

Sergei Lewis (3):
  RISC-V: movmem for RISCV with V extension
  RISC-V: setmem for RISCV with V extension
  RISC-V: cmpmem for RISCV with V extension

 gcc/config/riscv/riscv-protos.h   |   2 +
 gcc/config/riscv/riscv-string.cc  | 193 ++
 gcc/config/riscv/riscv.md |  51 +
 .../gcc.target/riscv/rvv/base/cmpmem-1.c  |  85 
 .../gcc.target/riscv/rvv/base/cmpmem-2.c  |  69 +++
 .../gcc.target/riscv/rvv/base/movmem-1.c  |  59 ++
 .../gcc.target/riscv/rvv/base/setmem-1.c  |  99 +
 7 files changed, 558 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/cmpmem-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/cmpmem-2.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/movmem-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/setmem-1.c

-- 
2.34.1

[PATCH 2/3] RISC-V: setmem for RISCV with V extension

2023-12-11 Thread Sergei Lewis

gcc/ChangeLog

* config/riscv/riscv-protos.h (riscv_vector::expand_vec_setmem): New 
function
declaration.

* config/riscv/riscv-string.cc (riscv_vector::expand_vec_setmem): New
function: this generates an inline vectorised memory set, if and only if we
know the entire operation can be performed in a single vector store

* config/riscv/riscv.md (setmem): Try riscv_vector::expand_vec_setmem
for constant lengths

gcc/testsuite/ChangeLog
* gcc.target/riscv/rvv/base/setmem-1.c: New tests
---
 gcc/config/riscv/riscv-protos.h   |  1 +
 gcc/config/riscv/riscv-string.cc  | 82 +++
 gcc/config/riscv/riscv.md | 14 +++
 .../gcc.target/riscv/rvv/base/setmem-1.c  | 99 +++
 4 files changed, 196 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/setmem-1.c

diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index 20bbb5b859c..950cb65c910 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -560,6 +560,7 @@ void expand_popcount (rtx *);
 void expand_rawmemchr (machine_mode, rtx, rtx, rtx, bool = false);
 bool expand_strcmp (rtx, rtx, rtx, rtx, unsigned HOST_WIDE_INT, bool);
 void emit_vec_extract (rtx, rtx, poly_int64);
+bool expand_vec_setmem (rtx, rtx, rtx, rtx);
 
 /* Rounding mode bitfield for fixed point VXRM.  */
 enum fixed_point_rounding_mode
diff --git a/gcc/config/riscv/riscv-string.cc b/gcc/config/riscv/riscv-string.cc
index 11c1f74d0b3..0abbd5f8b28 100644
--- a/gcc/config/riscv/riscv-string.cc
+++ b/gcc/config/riscv/riscv-string.cc
@@ -1247,4 +1247,86 @@ expand_strcmp (rtx result, rtx src1, rtx src2, rtx 
nbytes,
   return true;
 }
 
+
+/* Select appropriate LMUL for a single vector operation based on
+   byte size of data to be processed.
+   On success, return true and populate lmul_out.
+   If length_in is too wide for a single vector operation, return false
+   and leave lmul_out unchanged.  */
+
+static bool
+select_appropriate_lmul (HOST_WIDE_INT length_in,
+HOST_WIDE_INT &lmul_out)
+{
+  /* if it's tiny, default operation is likely better; maybe worth
+ considering fractional lmul in the future as well.  */
+  if (length_in < (TARGET_MIN_VLEN/8))
+return false;
+
+  /* find smallest lmul large enough for entire op.  */
+  HOST_WIDE_INT lmul = 1;
+  while ((lmul <= 8) && (length_in > ((lmul*TARGET_MIN_VLEN)/8)))
+{
+  lmul <<= 1;
+}
+
+  if (lmul > 8)
+return false;
+
+  lmul_out = lmul;
+  return true;
+}
+
+/* Used by setmemdi in riscv.md.  */
+bool
+expand_vec_setmem (rtx dst_in, rtx length_in, rtx fill_value_in,
+ rtx alignment_in)
+{
+  /* we're generating vector code.  */
+  if (!TARGET_VECTOR)
+return false;
+  /* if we can't reason about the length, let libc handle the operation.  */
+  if (!CONST_INT_P (length_in))
+return false;
+
+  HOST_WIDE_INT length = INTVAL (length_in);
+  HOST_WIDE_INT lmul;
+
+  /* select an lmul such that the data just fits into one vector operation;
+ bail if we can't.  */
+  if (!select_appropriate_lmul (length, lmul))
+return false;
+
+  machine_mode vmode = riscv_vector::get_vector_mode (QImode,
+ BYTES_PER_RISCV_VECTOR * lmul).require ();
+  rtx dst_addr = copy_addr_to_reg (XEXP (dst_in, 0));
+  rtx dst = change_address (dst_in, vmode, dst_addr);
+
+  rtx fill_value = gen_reg_rtx (vmode);
+  rtx broadcast_ops[] = {fill_value, fill_value_in};
+
+  /* If the length is exactly vlmax for the selected mode, do that.
+ Otherwise, use a predicated store.  */
+  if (known_eq (GET_MODE_SIZE (vmode), INTVAL (length_in)))
+{
+  emit_vlmax_insn (code_for_pred_broadcast (vmode),
+ UNARY_OP, broadcast_ops);
+  emit_move_insn (dst, fill_value);
+}
+  else
+{
+  if (!satisfies_constraint_K (length_in))
+ length_in= force_reg (Pmode, length_in);
+  emit_nonvlmax_insn (code_for_pred_broadcast (vmode), UNARY_OP,
+ broadcast_ops, length_in);
+  machine_mode mask_mode = riscv_vector::get_vector_mode
+ (BImode, GET_MODE_NUNITS (vmode)).require ();
+  rtx mask =  CONSTM1_RTX (mask_mode);
+  emit_insn (gen_pred_store (vmode, dst, mask, fill_value, length_in,
+ get_avl_type_rtx (riscv_vector::NONVLMAX)));
+}
+
+  return true;
+}
+
 }
diff --git a/gcc/config/riscv/riscv.md b/gcc/config/riscv/riscv.md
index 88fde290a8a..29d3b1aa342 100644
--- a/gcc/config/riscv/riscv.md
+++ b/gcc/config/riscv/riscv.md
@@ -2381,6 +2381,20 @@
 FAIL;
 })
 
+(define_expand "setmemsi"
+  [(set (match_operand:BLK 0 "memory_operand") ;; Dest
+ (match_operand:QI  2 "nonmemory_operand")) ;; Value
+   (use (match_operand:SI  1 "const_int_operand")) ;; Length
+   (match_operand:SI   3 "const_int_operand")] ;; Align
+  "TARGET_VECTOR"
+{
+  if (riscv_vector::expand_vec_setmem (operands[0], operands[1], operands[2],
+  operands[3]))
+DO

[PATCH 1/3] RISC-V: movmem for RISCV with V extension

2023-12-11 Thread Sergei Lewis

gcc/ChangeLog

* config/riscv/riscv.md (movmem): Use riscv_vector::expand_block_move,
if and only if we know the entire operation can be performed using one 
vector
load followed by one vector store

gcc/testsuite/ChangeLog

* gcc.target/riscv/rvv/base/movmem-1.c: New test
---
 gcc/config/riscv/riscv.md | 22 +++
 .../gcc.target/riscv/rvv/base/movmem-1.c  | 59 +++
 2 files changed, 81 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/movmem-1.c

diff --git a/gcc/config/riscv/riscv.md b/gcc/config/riscv/riscv.md
index eed997116b0..88fde290a8a 100644
--- a/gcc/config/riscv/riscv.md
+++ b/gcc/config/riscv/riscv.md
@@ -2359,6 +2359,28 @@
 FAIL;
 })
 
+;; inlining general memmove is a pessimisation: we can't avoid having to decide
+;; which direction to go at runtime, which is costly in instruction count
+;; however for situations where the entire move fits in one vector operation
+;; we can do all reads before doing any writes so we don't have to worry
+;; so generate the inline vector code in such situations
+;; nb. prefer scalar path for tiny memmoves
+(define_expand "movmem"
+  [(parallel [(set (match_operand:BLK 0 "general_operand")
+  (match_operand:BLK 1 "general_operand"))
+ (use (match_operand:P 2 ""))
+ (use (match_operand:SI 3 "const_int_operand"))])]
+  "TARGET_VECTOR"
+{
+  if ((INTVAL (operands[2]) >= TARGET_MIN_VLEN/8)
+   && (INTVAL (operands[2]) <= TARGET_MIN_VLEN)
+   && riscv_vector::expand_block_move (operands[0], operands[1],
+operands[2]))
+DONE;
+  else
+FAIL;
+})
+
 ;; Expand in-line code to clear the instruction cache between operand[0] and
 ;; operand[1].
 (define_expand "clear_cache"
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/base/movmem-1.c 
b/gcc/testsuite/gcc.target/riscv/rvv/base/movmem-1.c
new file mode 100644
index 000..b930241ae5d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/base/movmem-1.c
@@ -0,0 +1,59 @@
+/* { dg-do compile } */
+/* { dg-add-options riscv_v } */
+/* { dg-additional-options "-O3" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#include 
+
+#define MIN_VECTOR_BYTES (__riscv_v_min_vlen/8)
+
+/* tiny memmoves should not be vectorised
+** f1:
+**  li\s+a2,15
+**  tail\s+memmove
+*/
+char * f1 (char *a, char const *b)
+{
+  return memmove (a, b, 15);
+}
+
+/* vectorise+inline minimum vector register width with LMUL=1
+** f2:
+**  (
+**  vsetivli\s+zero,16,e8,m1,ta,ma
+**  |
+**  li\s+[ta][0-7],\d+
+**  vsetvli\s+zero,[ta][0-7],e8,m1,ta,ma
+**  )
+**  vle8\.v\s+v\d+,0\(a1\)
+**  vse8\.v\s+v\d+,0\(a0\)
+**  ret
+*/
+char * f2 (char *a, char const *b)
+{
+  return memmove (a, b, MIN_VECTOR_BYTES);
+}
+
+/* vectorise+inline up to LMUL=8
+** f3:
+**  li\s+[ta][0-7],\d+
+**  vsetvli\s+zero,[ta][0-7],e8,m8,ta,ma
+**  vle8\.v\s+v\d+,0\(a1\)
+**  vse8\.v\s+v\d+,0\(a0\)
+**  ret
+*/
+char * f3 (char *a, char const *b)
+{
+  return memmove (a, b, MIN_VECTOR_BYTES*8);
+}
+
+/* don't vectorise if the move is too large for one operation
+** f4:
+**  li\s+a2,\d+
+**  tail\s+memmove
+*/
+char * f4 (char *a, char const *b)
+{
+  return memmove (a, b, MIN_VECTOR_BYTES*8+1);
+}
+
-- 
2.34.1

[PATCH 3/3] RISC-V: cmpmem for RISCV with V extension

2023-12-11 Thread Sergei Lewis

gcc/ChangeLog:

* config/riscv/riscv-protos.h (riscv_vector::expand_vec_cmpmem): New 
function
declaration.

* config/riscv/riscv-string.cc (riscv_vector::expand_vec_cmpmem): New
function; this generates an inline vectorised memory compare, if and only if
we know the entire operation can be performed in a single vector load per
input

* config/riscv/riscv.md (cmpmemsi): Try riscv_vector::expand_vec_cmpmem for
constant lengths

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/base/cmpmem-1.c: New codegen tests
* gcc.target/riscv/rvv/base/cmpmem-2.c: New execution tests
---
 gcc/config/riscv/riscv-protos.h   |   1 +
 gcc/config/riscv/riscv-string.cc  | 111 ++
 gcc/config/riscv/riscv.md |  15 +++
 .../gcc.target/riscv/rvv/base/cmpmem-1.c  |  85 ++
 .../gcc.target/riscv/rvv/base/cmpmem-2.c  |  69 +++
 5 files changed, 281 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/cmpmem-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/cmpmem-2.c

diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index 950cb65c910..72378438552 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -561,6 +561,7 @@ void expand_rawmemchr (machine_mode, rtx, rtx, rtx, bool = 
false);
 bool expand_strcmp (rtx, rtx, rtx, rtx, unsigned HOST_WIDE_INT, bool);
 void emit_vec_extract (rtx, rtx, poly_int64);
 bool expand_vec_setmem (rtx, rtx, rtx, rtx);
+bool expand_vec_cmpmem (rtx, rtx, rtx, rtx);
 
 /* Rounding mode bitfield for fixed point VXRM.  */
 enum fixed_point_rounding_mode
diff --git a/gcc/config/riscv/riscv-string.cc b/gcc/config/riscv/riscv-string.cc
index 0abbd5f8b28..6128565310b 100644
--- a/gcc/config/riscv/riscv-string.cc
+++ b/gcc/config/riscv/riscv-string.cc
@@ -1329,4 +1329,115 @@ expand_vec_setmem (rtx dst_in, rtx length_in, rtx 
fill_value_in,
   return true;
 }
 
+
+/* Used by cmpmemsi in riscv.md.  */
+
+bool
+expand_vec_cmpmem (rtx result_out, rtx blk_a_in, rtx blk_b_in, rtx length_in)
+{
+  /* we're generating vector code.  */
+  if (!TARGET_VECTOR)
+return false;
+  /* if we can't reason about the length, let libc handle the operation.  */
+  if (!CONST_INT_P (length_in))
+return false;
+
+  HOST_WIDE_INT length = INTVAL (length_in);
+  HOST_WIDE_INT lmul;
+
+  /* select an lmul such that the data just fits into one vector operation;
+ bail if we can't.  */
+  if (!select_appropriate_lmul (length, lmul))
+return false;
+
+  /* strategy:
+ load entire blocks at a and b into vector regs
+ generate mask of bytes that differ
+ find first set bit in mask
+ find offset of first set bit in mask, use 0 if none set
+ result is ((char*)a[offset] - (char*)b[offset])
+   */
+
+  machine_mode vmode = riscv_vector::get_vector_mode (QImode,
+   BYTES_PER_RISCV_VECTOR * lmul).require ();
+  rtx blk_a_addr = copy_addr_to_reg (XEXP (blk_a_in, 0));
+  rtx blk_a = change_address (blk_a_in, vmode, blk_a_addr);
+  rtx blk_b_addr = copy_addr_to_reg (XEXP (blk_b_in, 0));
+  rtx blk_b = change_address (blk_b_in, vmode, blk_b_addr);
+
+  rtx vec_a = gen_reg_rtx (vmode);
+  rtx vec_b = gen_reg_rtx (vmode);
+
+  machine_mode mask_mode = get_mask_mode (vmode);
+  rtx mask = gen_reg_rtx (mask_mode);
+  rtx mismatch_ofs = gen_reg_rtx (Pmode);
+
+  rtx ne = gen_rtx_NE (mask_mode, vec_a, vec_b);
+  rtx vmsops[] = {mask, ne, vec_a, vec_b};
+  rtx vfops[] = {mismatch_ofs, mask};
+
+  /* If the length is exactly vlmax for the selected mode, do that.
+ Otherwise, use a predicated store.  */
+
+  if (known_eq (GET_MODE_SIZE (vmode), INTVAL (length_in)))
+{
+  emit_move_insn (vec_a, blk_a);
+  emit_move_insn (vec_b, blk_b);
+  emit_vlmax_insn (code_for_pred_cmp (vmode),
+ riscv_vector::COMPARE_OP, vmsops);
+
+  emit_vlmax_insn (code_for_pred_ffs (mask_mode, Pmode),
+ riscv_vector::CPOP_OP, vfops);
+}
+  else
+{
+  if (!satisfies_constraint_K (length_in))
+ length_in= force_reg (Pmode, length_in);
+
+  rtx memmask =  CONSTM1_RTX (mask_mode);
+
+  rtx m_ops_a[] = {vec_a, memmask, blk_a};
+  rtx m_ops_b[] = {vec_b, memmask, blk_b};
+
+  emit_nonvlmax_insn (code_for_pred_mov (vmode),
+ riscv_vector::UNARY_OP_TAMA, m_ops_a, length_in);
+  emit_nonvlmax_insn (code_for_pred_mov (vmode),
+ riscv_vector::UNARY_OP_TAMA, m_ops_b, length_in);
+
+  emit_nonvlmax_insn (code_for_pred_cmp (vmode),
+ riscv_vector::COMPARE_OP, vmsops, length_in);
+
+  emit_nonvlmax_insn (code_for_pred_ffs (mask_mode, Pmode),
+ riscv_vector::CPOP_OP, vfops, length_in);
+}
+
+  /* mismatch_ofs is -1 if blocks match, or the offset of
+ the first mismatch otherwise.  */
+rtx ltz = gen_reg_rtx (Xmode);
+emit_insn (gen_slt_3 (LT, Xmode, Xmode, ltz, mismatch_ofs, const0_rtx

Re: [PATCH 2/3] RISC-V: setmem for RISCV with V extension

2023-12-11 Thread Sergei Lewis

The thinking here is that using the largest possible LMUL when we know the
operation will fit in fewer registers potentially leaves performance on the
table - indirectly, due to the unnecessarily increased register pressure,
and also directly, depending on the implementation.

On Mon, Dec 11, 2023 at 10:05 AM juzhe.zh...@rivai.ai 
wrote:

> Hi, Thanks for contributing this.
>
> +/* Select appropriate LMUL for a single vector operation based on
> +   byte size of data to be processed.
> +   On success, return true and populate lmul_out.
> +   If length_in is too wide for a single vector operation, return false
> +   and leave lmul_out unchanged.  */
> +
> +static bool
> +select_appropriate_lmul (HOST_WIDE_INT length_in,
> +HOST_WIDE_INT &lmul_out)
> +{
>
> I don't think we need this, you only need to use TARGET_MAX_LMUL
>
>
> --
> juzhe.zh...@rivai.ai
>

Re: Re: [PATCH 2/3] RISC-V: setmem for RISCV with V extension

2023-12-11 Thread Sergei Lewis

...oh, and keep the current approach if riscv-autovec-lmul=dynamic.
Makes perfect sense - thanks!

On Mon, Dec 11, 2023 at 3:01 PM 钟居哲  wrote:

> I think we should leave it to user choice.
>
> --param=riscv-autovec-lmul=m1/m2/m4/m8/dynamic.
>
> So use TARGET_MAX_LMUL should be more reasonable.
>
> --
> juzhe.zh...@rivai.ai
>
>
> *From:* Sergei Lewis 
> *Date:* 2023-12-11 22:58
> *To:* juzhe.zh...@rivai.ai
> *CC:* gcc-patches ; Robin Dapp
> ; Kito.cheng ; jeffreyalaw
> 
> *Subject:* Re: [PATCH 2/3] RISC-V: setmem for RISCV with V extension
> The thinking here is that using the largest possible LMUL when we know the
> operation will fit in fewer registers potentially leaves performance on the
> table - indirectly, due to the unnecessarily increased register pressure,
> and also directly, depending on the implementation.
>
> On Mon, Dec 11, 2023 at 10:05 AM juzhe.zh...@rivai.ai <
> juzhe.zh...@rivai.ai> wrote:
>
>> Hi, Thanks for contributing this.
>>
>> +/* Select appropriate LMUL for a single vector operation based on
>> +   byte size of data to be processed.
>> +   On success, return true and populate lmul_out.
>> +   If length_in is too wide for a single vector operation, return false
>> +   and leave lmul_out unchanged.  */
>> +
>> +static bool
>> +select_appropriate_lmul (HOST_WIDE_INT length_in,
>> +HOST_WIDE_INT &lmul_out)
>> +{
>>
>> I don't think we need this, you only need to use TARGET_MAX_LMUL
>>
>>
>> --
>> juzhe.zh...@rivai.ai
>>
>

[PATCH v2 0/3] RISC-V: vectorised memory operations

[PATCH v2 1/3] RISC-V: movmem for RISCV with V extension

[PATCH v2 3/3] RISC-V: cmpmem for RISCV with V extension

[PATCH v2 2/3] RISC-V: setmem for RISCV with V extension

Re: [PATCH v2 1/3] RISC-V: movmem for RISCV with V extension

Re: [PATCH v2 2/3] RISC-V: setmem for RISCV with V extension

[PATCH 0/3] RISC-V: vectorised memory operations

[PATCH 2/3] RISC-V: setmem for RISCV with V extension

[PATCH 1/3] RISC-V: movmem for RISCV with V extension

[PATCH 3/3] RISC-V: cmpmem for RISCV with V extension

Re: [PATCH 2/3] RISC-V: setmem for RISCV with V extension

Re: Re: [PATCH 2/3] RISC-V: setmem for RISCV with V extension

12 matches

Site Navigation

Mail list logo

Footer information