On Mon, Jun 16, 2025 at 12:19 AM Jan Hubicka <hubi...@ucw.cz> wrote:
>
> >
> > Perhaps someone is interested in the following thread from LKML:
> >
> > "[PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops"
> >
> > https://lore.kernel.org/lkml/20250605164733.737543-1-mjgu...@gmail.com/
> >
> > There are several PRs regarding memcpy/memset linked from the above message.
> >
> > Please also note a message from Linus from the above thread:
> >
> > https://lore.kernel.org/lkml/CAHk-=wg1qqlwkpyvxxznxwbot48--lkjucjjf8phdhrxv0u...@mail.gmail.com/
>
> This is my understanding of the situation.
> Please correct me where I am wrong.
>
> According to Linus, calls in the kernel are more expensive than
> elsewhere due to mitigations.  I wonder if -minline-all-stringops
> would make sense here.
>
> Linus writes about an alternate entry point for memcpy with a
> non-standard calling convention, which we also discussed a few times in
> the past.  I think having a calling convention for memset/memcpy that
> only clobbers SI/DI/CX and nothing else (especially no SSE regs) makes
> sense.
>
> This should make offlined memcpy noticeably cheaper, especially when
> called from loops that need SSE, and the implementation can be done w/o
> clobbering extra registers for small blocks while it will have enough
> time to spill for large ones.
>
> The other patch does
> +KBUILD_CFLAGS += -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
> +KBUILD_CFLAGS += -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
> for non-native CPUs (so something we should fix for generic tuning).
>
> This is about our current default of rep stosq, which does not work
> well on Intel hardware.  We use a loop for blocks up to 32 bytes and
> rep stosq up to 8k.
>
> We now have X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB for Intel cores, but
> no changes for generic yet (it is on my TODO to do some more testing on
> Zen).
>
> So I think we can do the following:
>   1) decide whether to go with X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB
>      or replace rep_prefix_8_byte by unrolled_loop
>   2) fix the issue with repeated constants.  I.e. instead of
>
>        movq $0, ....
>        movq $0, ....
>        ....
>        movq $0, ....
>       which we currently generate for memset fitting in CLEAR_RATIO, emit
>        mov $0, tmpreg
>        movq tmpreg, ....
>        movq tmpreg, ....
>        ....
>        movq tmpreg, ....
>       which will make memset sequences smaller.  I agree with Richi that HJ's
>       patch that adds a new clear block expander is probably not the right
>       place for solving the problem.
>
>       Ideally we should catch repeated constants more generally since
>       this appears elsewhere too.
>       I am not quite sure where it fits best.  We already have a
>       machine specific task that loads 0 into an SSE register, which is
>       kind of similar to this as well.
>   3) Figure out what reasonable MOVE_RATIO/CLEAR_RATIO defaults are
>   4) Possibly go with the entry point idea?
> Honza
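
To put a rough number on item 2 above, here is a small sketch that
compares the encoded size of the two store sequences.  It is only an
illustration under assumptions that are not from the thread: the
destination is addressed through %rdi with offsets that fit in disp8,
the temporary is %rax, and the zero is materialized with
"xorl %eax, %eax" rather than the "mov $0, tmpreg" written above.  The
function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Encoded size, in bytes, of N consecutive 8-byte zero stores through
   one base register when every store carries its own immediate:
     movq $0, (%rdi)       REX.W + C7 + ModRM + imm32          = 7
     movq $0, disp8(%rdi)  REX.W + C7 + ModRM + disp8 + imm32  = 8
   Assumes %rdi as the base, so no SIB byte is needed.  */
size_t imm_form_bytes(size_t n)
{
  assert(n >= 1 && n <= 16);	/* Offsets 0..120 all fit in disp8.  */
  return 7 + 8 * (n - 1);
}

/* Same stores with the constant loaded into a register once:
     xorl %eax, %eax         31 /r                       = 2
     movq %rax, (%rdi)       REX.W + 89 + ModRM          = 3
     movq %rax, disp8(%rdi)  REX.W + 89 + ModRM + disp8  = 4
   Assumes %rax as the (hypothetical) temporary register.  */
size_t reg_form_bytes(size_t n)
{
  assert(n >= 1 && n <= 16);
  return 2 + 3 + 4 * (n - 1);
}
```

For nine stores (72 bytes) this model gives 71 bytes for the immediate
form and 37 bytes for the register form.  It says nothing about
register pressure or about which pass should do the transformation.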

Here is the v3 patch.  It no longer uses "rep mov/stos".  Lili, can you measure
its performance impact on Intel and AMD CPUs?

The updated commit message for the generic tuning is:

Update memcpy and memset inline strategies for -mtune=generic:

1. Don't align memory.
2. For known sizes, unroll loop with 4 moves or stores per iteration
   without aligning the loop, up to 256 bytes.
3. For unknown sizes, use memcpy/memset.
4. Since each loop iteration has 4 stores and zeroing with an unrolled
   loop may need 8 stores, change CLEAR_RATIO to 10 so that zeroing up
   to 72 bytes is fully unrolled with 9 stores without SSE.
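
As a rough illustration of how the four points above combine for a
known-size zeroing memset without SSE, here is a simplified model.  It
is a sketch, not GCC's actual decision code: the enum, the function
names and the greedy store count are assumptions made for this example,
and the real by-pieces instruction count also depends on alignment and
overlapping stores.

```c
#include <assert.h>
#include <stddef.h>

#define CLEAR_RATIO 10		/* New generic value from this patch.  */
#define UNROLLED_LOOP_MAX 256	/* Upper bound of the unrolled_loop entry.  */

enum zero_strategy { FULLY_UNROLLED, UNROLLED_LOOP, LIBCALL };

/* Count stores needed to zero N bytes using 8-byte pieces, then 4-,
   2- and 1-byte pieces for the tail (simplified, no overlap).  */
size_t store_count(size_t n)
{
  size_t count = n / 8;
  size_t tail = n % 8;
  for (size_t piece = 4; piece >= 1; piece /= 2)
    if (tail >= piece)
      {
	count++;
	tail -= piece;
      }
  return count;
}

/* Pick a strategy for zeroing a block of known size N.  */
enum zero_strategy pick_zero_strategy(size_t n)
{
  if (store_count(n) < CLEAR_RATIO)
    return FULLY_UNROLLED;	/* E.g. 72 bytes = 9 stores.  */
  if (n <= UNROLLED_LOOP_MAX)
    return UNROLLED_LOOP;	/* 4 stores per iteration.  */
  return LIBCALL;		/* Call memset.  */
}
```

Under this model 65 and 72 bytes are fully unrolled, 81 and 253 bytes
use the unrolled loop, and 257 bytes become a call to memset, which
matches what the new memset-strategy-25/27/28/29 tests expect.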

Use move_by_pieces and store_by_pieces for memcpy and memset epilogues
with the fixed epilogue size to enable overlapping moves and stores.
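
The effect of overlapping stores in the epilogue can be sketched as
follows.  This is a toy planner written for this message, not the
move_by_pieces/store_by_pieces implementation: the struct and function
names are hypothetical, and it assumes 8-byte words and a tail piece
whose width is the smallest power of two covering the leftover bytes,
shifted back so that it ends exactly at the end of the block.

```c
#include <assert.h>
#include <stddef.h>

struct piece { size_t offset; size_t width; };

/* Smallest power of two that is >= N (N >= 1).  */
static size_t round_up_pow2(size_t n)
{
  size_t p = 1;
  while (p < n)
    p <<= 1;
  return p;
}

/* Plan the stores for bytes [DONE, TOTAL) of a block whose first DONE
   bytes were already handled by the unrolled loop.  Returns the number
   of pieces written to OUT, which has room for CAP entries.  */
size_t plan_epilogue(size_t total, size_t done, struct piece *out, size_t cap)
{
  size_t n = 0;
  size_t off = done;

  assert(done <= total);
  while (total - off >= 8)
    {
      assert(n < cap);
      out[n++] = (struct piece) { off, 8 };
      off += 8;
    }
  if (off < total)
    {
      /* One final piece that may overlap bytes already written.  */
      size_t width = round_up_pow2(total - off);
      assert(width <= total && n < cap);
      out[n++] = (struct piece) { total - width, width };
    }
  return n;
}
```

For a 95-byte memset with 64 bytes done by the loop, this yields 8-byte
stores at offsets 64, 72, 80 and 87, the same offsets that
memset-strategy-30.c scans for.  For 81 bytes the tail is a single
1-byte store at offset 80, as in memset-strategy-29.c.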

gcc/

PR target/102294
PR target/119596
PR target/119703
PR target/119704
* builtins.cc (builtin_memset_gen_str): Make it global.
* builtins.h (builtin_memset_gen_str): New.
* config/i386/i386-expand.cc (expand_cpymem_epilogue): Use
move_by_pieces.
(expand_setmem_epilogue): Use store_by_pieces.
(ix86_expand_set_or_cpymem): Pass val_exp, instead of
vec_promoted_val, to expand_setmem_epilogue.
* config/i386/x86-tune-costs.h (generic_memcpy): Updated.
(generic_memset): Likewise.
(generic_cost): Change CLEAR_RATIO to 10.

gcc/testsuite/

PR target/102294
PR target/119596
PR target/119703
PR target/119704
* gcc.target/i386/auto-init-padding-3.c: Expect XMM stores.
* gcc.target/i386/auto-init-padding-9.c: Expect loop.
* gcc.target/i386/memcpy-strategy-12.c: New test.
* gcc.target/i386/memcpy-strategy-13.c: Likewise.
* gcc.target/i386/memset-strategy-25.c: Likewise.
* gcc.target/i386/memset-strategy-26.c: Likewise.
* gcc.target/i386/memset-strategy-27.c: Likewise.
* gcc.target/i386/memset-strategy-28.c: Likewise.
* gcc.target/i386/memset-strategy-29.c: Likewise.
* gcc.target/i386/memset-strategy-30.c: Likewise.
* gcc.target/i386/memset-strategy-31.c: Likewise.
* gcc.target/i386/mvc17.c: Fail with "rep mov".
* gcc.target/i386/pr111657-1.c: Scan for unrolled loop.  Fail
with "rep mov".
* gcc.target/i386/shrink_wrap_1.c: Also pass
-mmemset-strategy=rep_8byte:-1:align.
* gcc.target/i386/sw-1.c: Also pass -mstringop-strategy=rep_byte.


-- 
H.J.
From bcd7245314d3ba4eb55e9ea2bc0b7d165834f5b6 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.to...@gmail.com>
Date: Thu, 18 Mar 2021 18:43:10 -0700
Subject: [PATCH v3] x86: Update memcpy/memset inline strategies for
 -mtune=generic

Update memcpy and memset inline strategies for -mtune=generic:

1. Don't align memory.
2. For known sizes, unroll loop with 4 moves or stores per iteration
   without aligning the loop, up to 256 bytes.
3. For unknown sizes, use memcpy/memset.
4. Since each loop iteration has 4 stores and zeroing with an unrolled
   loop may need 8 stores, change CLEAR_RATIO to 10 so that zeroing up
   to 72 bytes is fully unrolled with 9 stores without SSE.

Use move_by_pieces and store_by_pieces for memcpy and memset epilogues
with the fixed epilogue size to enable overlapping moves and stores.

gcc/

	PR target/102294
	PR target/119596
	PR target/119703
	PR target/119704
	* builtins.cc (builtin_memset_gen_str): Make it global.
	* builtins.h (builtin_memset_gen_str): New.
	* config/i386/i386-expand.cc (expand_cpymem_epilogue): Use
	move_by_pieces.
	(expand_setmem_epilogue): Use store_by_pieces.
	(ix86_expand_set_or_cpymem): Pass val_exp, instead of
	vec_promoted_val, to expand_setmem_epilogue.
	* config/i386/x86-tune-costs.h (generic_memcpy): Updated.
	(generic_memset): Likewise.
	(generic_cost): Change CLEAR_RATIO to 10.

gcc/testsuite/

	PR target/102294
	PR target/119596
	PR target/119703
	PR target/119704
	* gcc.target/i386/auto-init-padding-3.c: Expect XMM stores.
	* gcc.target/i386/auto-init-padding-9.c: Expect loop.
	* gcc.target/i386/memcpy-strategy-12.c: New test.
	* gcc.target/i386/memcpy-strategy-13.c: Likewise.
	* gcc.target/i386/memset-strategy-25.c: Likewise.
	* gcc.target/i386/memset-strategy-26.c: Likewise.
	* gcc.target/i386/memset-strategy-27.c: Likewise.
	* gcc.target/i386/memset-strategy-28.c: Likewise.
	* gcc.target/i386/memset-strategy-29.c: Likewise.
	* gcc.target/i386/memset-strategy-30.c: Likewise.
	* gcc.target/i386/memset-strategy-31.c: Likewise.
	* gcc.target/i386/mvc17.c: Fail with "rep mov".
	* gcc.target/i386/pr111657-1.c: Scan for unrolled loop.  Fail
	with "rep mov".
	* gcc.target/i386/shrink_wrap_1.c: Also pass
	-mmemset-strategy=rep_8byte:-1:align.
	* gcc.target/i386/sw-1.c: Also pass -mstringop-strategy=rep_byte.

Signed-off-by: H.J. Lu <hjl.to...@gmail.com>
---
 gcc/builtins.cc                               |  2 +-
 gcc/builtins.h                                |  2 +
 gcc/config/i386/i386-expand.cc                | 47 +++++--------------
 gcc/config/i386/x86-tune-costs.h              | 35 +++++++++-----
 .../gcc.target/i386/auto-init-padding-3.c     |  7 +--
 .../gcc.target/i386/auto-init-padding-9.c     | 25 ++++++++--
 .../gcc.target/i386/memcpy-strategy-12.c      | 43 +++++++++++++++++
 .../gcc.target/i386/memcpy-strategy-13.c      | 11 +++++
 .../gcc.target/i386/memset-strategy-25.c      | 29 ++++++++++++
 .../gcc.target/i386/memset-strategy-26.c      | 15 ++++++
 .../gcc.target/i386/memset-strategy-27.c      | 11 +++++
 .../gcc.target/i386/memset-strategy-28.c      | 29 ++++++++++++
 .../gcc.target/i386/memset-strategy-29.c      | 34 ++++++++++++++
 .../gcc.target/i386/memset-strategy-30.c      | 35 ++++++++++++++
 .../gcc.target/i386/memset-strategy-31.c      | 28 +++++++++++
 gcc/testsuite/gcc.target/i386/mvc17.c         |  2 +-
 gcc/testsuite/gcc.target/i386/pr111657-1.c    | 24 +++++++++-
 gcc/testsuite/gcc.target/i386/shrink_wrap_1.c |  2 +-
 gcc/testsuite/gcc.target/i386/sw-1.c          |  2 +-
 19 files changed, 322 insertions(+), 61 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-25.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-26.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-27.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-28.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-29.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-30.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-31.c

diff --git a/gcc/builtins.cc b/gcc/builtins.cc
index 3064bff1ae6..e9c9f8eeab3 100644
--- a/gcc/builtins.cc
+++ b/gcc/builtins.cc
@@ -4268,7 +4268,7 @@ builtin_memset_read_str (void *data, void *prev,
    4 bytes wide, return the RTL for 0x01010101*data.  If PREV isn't
    nullptr, it has the RTL info from the previous iteration.  */
 
-static rtx
+rtx
 builtin_memset_gen_str (void *data, void *prev,
 			HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
 			fixed_size_mode mode)
diff --git a/gcc/builtins.h b/gcc/builtins.h
index 5a553a9c836..b552aee3905 100644
--- a/gcc/builtins.h
+++ b/gcc/builtins.h
@@ -160,6 +160,8 @@ extern char target_percent_c[3];
 extern char target_percent_s_newline[4];
 extern bool target_char_cst_p (tree t, char *p);
 extern rtx get_memory_rtx (tree exp, tree len);
+extern rtx builtin_memset_gen_str (void *, void *, HOST_WIDE_INT,
+				   fixed_size_mode mode);
 
 extern internal_fn associated_internal_fn (combined_fn, tree);
 extern internal_fn associated_internal_fn (tree);
diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 82e9f035d11..b7d181b7ffc 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -8221,19 +8221,11 @@ expand_cpymem_epilogue (rtx destmem, rtx srcmem,
   rtx src, dest;
   if (CONST_INT_P (count))
     {
-      HOST_WIDE_INT countval = INTVAL (count);
-      HOST_WIDE_INT epilogue_size = countval % max_size;
-      int i;
-
-      /* For now MAX_SIZE should be a power of 2.  This assert could be
-	 relaxed, but it'll require a bit more complicated epilogue
-	 expanding.  */
-      gcc_assert ((max_size & (max_size - 1)) == 0);
-      for (i = max_size; i >= 1; i >>= 1)
-	{
-	  if (epilogue_size & i)
-	    destmem = emit_memmov (destmem, &srcmem, destptr, srcptr, i);
-	}
+      unsigned HOST_WIDE_INT countval = UINTVAL (count);
+      unsigned HOST_WIDE_INT epilogue_size = countval % max_size;
+      unsigned int destalign = MEM_ALIGN (destmem);
+      move_by_pieces (destmem, srcmem, epilogue_size, destalign,
+		      RETURN_BEGIN);
       return;
     }
   if (max_size > 8)
@@ -8396,31 +8388,18 @@ expand_setmem_epilogue_via_loop (rtx destmem, rtx destptr, rtx value,
 
 /* Output code to set at most count & (max_size - 1) bytes starting by DEST.  */
 static void
-expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx vec_value,
-			rtx count, int max_size)
+expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value,
+			rtx origin_vaule, rtx count, int max_size)
 {
   rtx dest;
 
   if (CONST_INT_P (count))
     {
-      HOST_WIDE_INT countval = INTVAL (count);
-      HOST_WIDE_INT epilogue_size = countval % max_size;
-      int i;
-
-      /* For now MAX_SIZE should be a power of 2.  This assert could be
-	 relaxed, but it'll require a bit more complicated epilogue
-	 expanding.  */
-      gcc_assert ((max_size & (max_size - 1)) == 0);
-      for (i = max_size; i >= 1; i >>= 1)
-	{
-	  if (epilogue_size & i)
-	    {
-	      if (vec_value && i > GET_MODE_SIZE (GET_MODE (value)))
-		destmem = emit_memset (destmem, destptr, vec_value, i);
-	      else
-		destmem = emit_memset (destmem, destptr, value, i);
-	    }
-	}
+      unsigned HOST_WIDE_INT countval = UINTVAL (count);
+      unsigned HOST_WIDE_INT epilogue_size = countval % max_size;
+      unsigned int destalign = MEM_ALIGN (destmem);
+      store_by_pieces (destmem, epilogue_size, builtin_memset_gen_str,
+		       origin_vaule, destalign, true, RETURN_BEGIN);
       return;
     }
   if (max_size > 32)
@@ -9802,7 +9781,7 @@ ix86_expand_set_or_cpymem (rtx dst, rtx src, rtx count_exp, rtx val_exp,
 	{
 	  if (issetmem)
 	    expand_setmem_epilogue (dst, destreg, promoted_val,
-				    vec_promoted_val, count_exp,
+				    val_exp, count_exp,
 				    epilogue_size_needed);
 	  else
 	    expand_cpymem_epilogue (dst, src, destreg, srcreg, count_exp,
diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index b08081e37cf..e3d9381594b 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -4065,19 +4065,32 @@ struct processor_costs shijidadao_cost = {
 
 
 
-/* Generic should produce code tuned for Core-i7 (and newer chips)
-   and btver1 (and newer chips).  */
+/* Generic should produce code tuned for Haswell (and newer chips)
+   and znver1 (and newer chips):
+   1. Don't align memory.
+   2. For known sizes, unroll loop with 4 moves or stores per iteration
+      without aligning the loop, up to 256 bytes.
+   3. For unknown sizes, use memcpy/memset.
+   4. Since each loop iteration has 4 stores and zeroing with an
+      unrolled loop may need 8 stores, change CLEAR_RATIO to 10 so
+      that zeroing up to 72 bytes is fully unrolled with 9 stores
+      without SSE.
+ */
 
 static stringop_algs generic_memcpy[2] = {
-  {libcall, {{32, loop, false}, {8192, rep_prefix_4_byte, false},
-             {-1, libcall, false}}},
-  {libcall, {{32, loop, false}, {8192, rep_prefix_8_byte, false},
-             {-1, libcall, false}}}};
+  {libcall,
+   {{256, unrolled_loop, true},
+    {-1, libcall, true}}},
+  {libcall,
+   {{256, unrolled_loop, true},
+    {-1, libcall, true}}}};
 static stringop_algs generic_memset[2] = {
-  {libcall, {{32, loop, false}, {8192, rep_prefix_4_byte, false},
-             {-1, libcall, false}}},
-  {libcall, {{32, loop, false}, {8192, rep_prefix_8_byte, false},
-             {-1, libcall, false}}}};
+  {libcall,
+   {{256, unrolled_loop, true},
+    {-1, libcall, true}}},
+  {libcall,
+   {{256, unrolled_loop, true},
+    {-1, libcall, true}}}};
 static const
 struct processor_costs generic_cost = {
   {
@@ -4134,7 +4147,7 @@ struct processor_costs generic_cost = {
   COSTS_N_INSNS (1),			/* cost of movzx */
   8,					/* "large" insn */
   17,					/* MOVE_RATIO */
-  6,					/* CLEAR_RATIO */
+  10,					/* CLEAR_RATIO */
   {6, 6, 6},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
 					   Relative to reg-reg move (2).  */
diff --git a/gcc/testsuite/gcc.target/i386/auto-init-padding-3.c b/gcc/testsuite/gcc.target/i386/auto-init-padding-3.c
index 7c20a28508f..a12069a039d 100644
--- a/gcc/testsuite/gcc.target/i386/auto-init-padding-3.c
+++ b/gcc/testsuite/gcc.target/i386/auto-init-padding-3.c
@@ -23,8 +23,5 @@ int foo ()
   return var.four.internal1;
 }
 
-/* { dg-final { scan-assembler "movl\t\\\$0," } } */
-/* { dg-final { scan-assembler "movl\t\\\$16," { target { ! ia32 } } } } */
-/* { dg-final { scan-assembler "rep stosq" { target { ! ia32 } } } } */
-/* { dg-final { scan-assembler "movl\t\\\$32," { target ia32 } } } */
-/* { dg-final { scan-assembler "rep stosl" { target ia32 } } } */
+/* { dg-final { scan-assembler-times "pxor\t%xmm0, %xmm0" 1 } } */
+/* { dg-final { scan-assembler-times "movaps\t%xmm0, " 8 } } */
diff --git a/gcc/testsuite/gcc.target/i386/auto-init-padding-9.c b/gcc/testsuite/gcc.target/i386/auto-init-padding-9.c
index a87b68b255b..d7d0593db9c 100644
--- a/gcc/testsuite/gcc.target/i386/auto-init-padding-9.c
+++ b/gcc/testsuite/gcc.target/i386/auto-init-padding-9.c
@@ -2,6 +2,25 @@
    padding.  */ 
 /* { dg-do compile } */
 /* { dg-options "-ftrivial-auto-var-init=zero -march=x86-64" } */
+/* Keep labels and directives ('.cfi_startproc', '.cfi_endproc').  */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } {^\t?\.} } } */
+
+/*
+**foo:
+**...
+**	movl	\$0, %ecx
+**...
+**.L[0-9]+:
+**	movl	%esi, %edx
+**	movq	%rcx, \(%rax,%rdx\)
+**	movq	%rcx, 8\(%rax,%rdx\)
+**	movq	%rcx, 16\(%rax,%rdx\)
+**	movq	%rcx, 24\(%rax,%rdx\)
+**	addl	\$32, %esi
+**	cmpl	%edi, %esi
+**	jb	.L[0-9]+
+**...
+*/
 
 struct test_trailing_hole {
         int one;
@@ -18,8 +37,4 @@ int foo ()
   return var[2].four;
 }
 
-/* { dg-final { scan-assembler "movl\t\\\$0," } } */
-/* { dg-final { scan-assembler "movl\t\\\$20," { target { ! ia32 } } } } */
-/* { dg-final { scan-assembler "rep stosq" { target { ! ia32 } } } } */
-/* { dg-final { scan-assembler "movl\t\\\$40," { target ia32} } } */
-/* { dg-final { scan-assembler "rep stosl" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "rep stos" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c
new file mode 100644
index 00000000000..22ed9ec6601
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c
@@ -0,0 +1,43 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -mno-sse" } */
+/* Keep labels and directives ('.cfi_startproc', '.cfi_endproc').  */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } {^\t?\.} } } */
+
+/*
+**foo:
+**.LFB[0-9]+:
+**	.cfi_startproc
+**	xorl	%edx, %edx
+**.L[0-9]+:
+**	movl	%edx, %eax
+**	addl	\$32, %edx
+**	movq	\(%rsi,%rax\), %r10
+**	movq	8\(%rsi,%rax\), %r9
+**	movq	16\(%rsi,%rax\), %r8
+**	movq	24\(%rsi,%rax\), %rcx
+**	movq	%r10, \(%rdi,%rax\)
+**	movq	%r9, 8\(%rdi,%rax\)
+**	movq	%r8, 16\(%rdi,%rax\)
+**	movq	%rcx, 24\(%rdi,%rax\)
+**	cmpl	\$224, %edx
+**	jb	.L[0-9]+
+**	addq	%rdx, %rsi
+**	movq	\(%rsi\), %rax
+**	movq	%rax, \(%rdi,%rdx\)
+**	movq	8\(%rsi\), %rax
+**	movq	%rax, 8\(%rdi,%rdx\)
+**	movq	16\(%rsi\), %rax
+**	movq	%rax, 16\(%rdi,%rdx\)
+**	movq	21\(%rsi\), %rax
+**	movq	%rax, 21\(%rdi,%rdx\)
+**	ret
+**...
+*/
+
+void
+foo (char *dest, char *src)
+{
+  __builtin_memcpy (dest, src, 253);
+}
+
+/* { dg-final { scan-assembler-not "rep mov" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c
new file mode 100644
index 00000000000..109bd675a51
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -mno-avx" } */
+/* { dg-final { scan-assembler "jmp\tmemcpy" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler "call\tmemcpy" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "rep movsb" } } */
+
+void
+foo (char *dest, char *src)
+{
+  __builtin_memcpy (dest, src, 257);
+}
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-25.c b/gcc/testsuite/gcc.target/i386/memset-strategy-25.c
new file mode 100644
index 00000000000..040439d1671
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-25.c
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -mno-sse" } */
+/* Keep labels and directives ('.cfi_startproc', '.cfi_endproc').  */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } {^\t?\.} } } */
+
+/*
+**foo:
+**.LFB[0-9]+:
+**	.cfi_startproc
+**	xorl	%eax, %eax
+**.L[0-9]+:
+**	movl	%eax, %edx
+**	addl	\$32, %eax
+**	movq	\$0, \(%rdi,%rdx\)
+**	movq	\$0, 8\(%rdi,%rdx\)
+**	movq	\$0, 16\(%rdi,%rdx\)
+**	movq	\$0, 24\(%rdi,%rdx\)
+**	cmpl	\$224, %eax
+**	jb	.L[0-9]+
+**...
+*/
+
+void
+foo (char *dest)
+{
+  __builtin_memset (dest, 0, 253);
+}
+
+/* { dg-final { scan-assembler-not "rep stos" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-26.c b/gcc/testsuite/gcc.target/i386/memset-strategy-26.c
new file mode 100644
index 00000000000..c53bce52e17
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-26.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -mno-sse" } */
+/* { dg-final { scan-assembler-not "jmp\tmemset" } } */
+/* { dg-final { scan-assembler-not "rep stosb" } } */
+
+struct foo
+{
+  char buf[41];
+};
+
+void
+zero(struct foo *f)
+{
+  __builtin_memset(f->buf, 0, sizeof(f->buf));
+}
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-27.c b/gcc/testsuite/gcc.target/i386/memset-strategy-27.c
new file mode 100644
index 00000000000..685d6e5a5c2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-27.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -mno-avx" } */
+/* { dg-final { scan-assembler "jmp\tmemset" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler "call\tmemset" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "rep stosb" } } */
+
+void
+foo (char *dest)
+{
+  __builtin_memset (dest, 0, 257);
+}
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-28.c b/gcc/testsuite/gcc.target/i386/memset-strategy-28.c
new file mode 100644
index 00000000000..1d173edf930
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-28.c
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -mno-sse" } */
+/* Keep labels and directives ('.cfi_startproc', '.cfi_endproc').  */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } {^\t?\.} } } */
+
+/*
+**foo:
+**.LFB[0-9]+:
+**	.cfi_startproc
+**	movq	\$0, \(%rdi\)
+**	movq	\$0, 8\(%rdi\)
+**	movq	\$0, 16\(%rdi\)
+**	movq	\$0, 24\(%rdi\)
+**	movq	\$0, 32\(%rdi\)
+**	movq	\$0, 40\(%rdi\)
+**	movq	\$0, 48\(%rdi\)
+**	movq	\$0, 56\(%rdi\)
+**	movb	\$0, 64\(%rdi\)
+**	ret
+**...
+*/
+
+void
+foo (char *dest)
+{
+  __builtin_memset (dest, 0, 65);
+}
+
+/* { dg-final { scan-assembler-not "rep stos" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-29.c b/gcc/testsuite/gcc.target/i386/memset-strategy-29.c
new file mode 100644
index 00000000000..54aa03e6b35
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-29.c
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -mno-sse" } */
+/* Keep labels and directives ('.cfi_startproc', '.cfi_endproc').  */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } {^\t?\.} } } */
+
+/*
+**foo:
+**...
+**.LFB[0-9]+:
+**	.cfi_startproc
+**	xorl	%eax, %eax
+**.L[0-9]+:
+**	movl	%eax, %edx
+**	addl	\$32, %eax
+**	movq	\$0, \(%rdi,%rdx\)
+**	movq	\$0, 8\(%rdi,%rdx\)
+**	movq	\$0, 16\(%rdi,%rdx\)
+**	movq	\$0, 24\(%rdi,%rdx\)
+**	cmpl	\$64, %eax
+**	jb	.L[0-9]+
+**	movq	\$0, \(%rdi,%rax\)
+**	movq	\$0, 8\(%rdi,%rax\)
+**	movb	\$0, 16\(%rdi,%rax\)
+**	ret
+**...
+*/
+
+void
+foo (char *dest)
+{
+  __builtin_memset (dest, 0, 81);
+}
+
+/* { dg-final { scan-assembler-not "rep stos" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-30.c b/gcc/testsuite/gcc.target/i386/memset-strategy-30.c
new file mode 100644
index 00000000000..4799adcef5d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-30.c
@@ -0,0 +1,35 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -mno-sse" } */
+/* Keep labels and directives ('.cfi_startproc', '.cfi_endproc').  */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } {^\t?\.} } } */
+
+/*
+**foo:
+**...
+**.LFB[0-9]+:
+**	.cfi_startproc
+**	xorl	%eax, %eax
+**.L[0-9]+:
+**	movl	%eax, %edx
+**	addl	\$32, %eax
+**	movq	\$0, \(%rdi,%rdx\)
+**	movq	\$0, 8\(%rdi,%rdx\)
+**	movq	\$0, 16\(%rdi,%rdx\)
+**	movq	\$0, 24\(%rdi,%rdx\)
+**	cmpl	\$64, %eax
+**	jb	.L[0-9]+
+**	movq	\$0, 16\(%rdi,%rax\)
+**	movq	\$0, \(%rdi,%rax\)
+**	movq	\$0, 8\(%rdi,%rax\)
+**	movq	\$0, 23\(%rdi,%rax\)
+**	ret
+**...
+*/
+
+void
+foo (char *dest)
+{
+  __builtin_memset (dest, 0, 95);
+}
+
+/* { dg-final { scan-assembler-not "rep stos" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-31.c b/gcc/testsuite/gcc.target/i386/memset-strategy-31.c
new file mode 100644
index 00000000000..b2bb107b353
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-31.c
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -mno-avx -msse2" } */
+/* Keep labels and directives ('.cfi_startproc', '.cfi_endproc').  */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } {^\t?\.} } } */
+
+/*
+**foo:
+**.LFB[0-9]+:
+**...
+**.L[0-9]+:
+**	movl	%eax, %edx
+**	addl	\$32, %eax
+**	movq	\$0, \(%rdi,%rdx\)
+**	movq	\$0, 8\(%rdi,%rdx\)
+**	movq	\$0, 16\(%rdi,%rdx\)
+**	movq	\$0, 24\(%rdi,%rdx\)
+**	cmpl	\$224, %eax
+**	jb	.L[0-9]+
+**...
+*/
+
+void
+foo (char *dest)
+{
+  __builtin_memset (dest, 0, 254);
+}
+
+/* { dg-final { scan-assembler-not "rep stos" } } */
diff --git a/gcc/testsuite/gcc.target/i386/mvc17.c b/gcc/testsuite/gcc.target/i386/mvc17.c
index 8b83c1aecb3..dbf35ac36dc 100644
--- a/gcc/testsuite/gcc.target/i386/mvc17.c
+++ b/gcc/testsuite/gcc.target/i386/mvc17.c
@@ -1,7 +1,7 @@
 /* { dg-do compile } */
 /* { dg-require-ifunc "" } */
 /* { dg-options "-O2 -march=x86-64" } */
-/* { dg-final { scan-assembler-times "rep mov" 1 } } */
+/* { dg-final { scan-assembler-not "rep mov" } } */
 
 __attribute__((target_clones("default","arch=icelake-server")))
 void
diff --git a/gcc/testsuite/gcc.target/i386/pr111657-1.c b/gcc/testsuite/gcc.target/i386/pr111657-1.c
index a4ba21073f5..fa9f4cfe5c5 100644
--- a/gcc/testsuite/gcc.target/i386/pr111657-1.c
+++ b/gcc/testsuite/gcc.target/i386/pr111657-1.c
@@ -1,5 +1,26 @@
 /* { dg-do assemble } */
 /* { dg-options "-O2 -mno-sse -mtune=generic -save-temps" } */
+/* Keep labels and directives ('.cfi_startproc', '.cfi_endproc').  */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } {^\t?\.} } } */
+
+/*
+**bar:
+**...
+**.L[0-9]+:
+**	movl	%edx, %eax
+**	addl	\$32, %edx
+**	movq	%gs:m\(%rax\), %r9
+**	movq	%gs:m\+8\(%rax\), %r8
+**	movq	%gs:m\+16\(%rax\), %rsi
+**	movq	%gs:m\+24\(%rax\), %rcx
+**	movq	%r9, \(%rdi,%rax\)
+**	movq	%r8, 8\(%rdi,%rax\)
+**	movq	%rsi, 16\(%rdi,%rax\)
+**	movq	%rcx, 24\(%rdi,%rax\)
+**	cmpl	\$224, %edx
+**	jb	.L[0-9]+
+**...
+*/
 
 typedef unsigned long uword __attribute__ ((mode (word)));
 
@@ -8,5 +29,4 @@ struct a { uword arr[30]; };
 __seg_gs struct a m;
 void bar (struct a *dst) { *dst = m; }
 
-/* { dg-final { scan-assembler "gs\[ \t\]+rep\[; \t\]+movs(l|q)" { target { ! x32 } } } } */
-/* { dg-final { scan-assembler-not "gs\[ \t\]+rep\[; \t\]+movs(l|q)" { target x32 } } } */
+/* { dg-final { scan-assembler-not "rep movs" } } */
diff --git a/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c b/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c
index 4b286671e90..30b82ab695a 100644
--- a/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c
+++ b/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile { target { ! ia32 } } } */
-/* { dg-options "-O2 -fdump-rtl-pro_and_epilogue -fno-stack-protector" } */
+/* { dg-options "-O2 -mmemset-strategy=rep_8byte:-1:align -fdump-rtl-pro_and_epilogue -fno-stack-protector" } */
 
 enum machine_mode
 {
diff --git a/gcc/testsuite/gcc.target/i386/sw-1.c b/gcc/testsuite/gcc.target/i386/sw-1.c
index b0432279644..14db3cee206 100644
--- a/gcc/testsuite/gcc.target/i386/sw-1.c
+++ b/gcc/testsuite/gcc.target/i386/sw-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -mtune=generic -fshrink-wrap -fdump-rtl-pro_and_epilogue -fno-stack-protector" } */
+/* { dg-options "-O2 -mtune=generic -mstringop-strategy=rep_byte -fshrink-wrap -fdump-rtl-pro_and_epilogue -fno-stack-protector" } */
 /* { dg-additional-options "-mno-avx" { target ia32 } } */
 /* { dg-skip-if "No shrink-wrapping preformed" { x86_64-*-mingw* } } */
 
-- 
2.49.0
