On Mon, Jun 16, 2025 at 12:19 AM Jan Hubicka <hubi...@ucw.cz> wrote:
>
> > Perhaps someone is interested in the following thread from LKML:
> >
> > "[PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops"
> >
> > https://lore.kernel.org/lkml/20250605164733.737543-1-mjgu...@gmail.com/
> >
> > There are several PRs regarding memcpy/memset linked from the above message.
> >
> > Please also note a message from Linus from the above thread:
> >
> > https://lore.kernel.org/lkml/CAHk-=wg1qqlwkpyvxxznxwbot48--lkjucjjf8phdhrxv0u...@mail.gmail.com/
>
> This is my understanding of the situation.
> Please correct me where I am wrong.
>
> According to Linus, calls in the kernel are more expensive than
> elsewhere due to mitigations.  I wonder if -minline-all-stringops
> would make sense here.
>
> Linus writes about an alternate entry point for memcpy with a non-standard
> calling convention, which we also discussed a few times in the past.
> I think having a calling convention for memset/memcpy that only clobbers
> SI/DI/CX and nothing else (especially no SSE regs) makes sense.
>
> This should make the offlined memcpy noticeably cheaper, especially when
> called from loops that need SSE, and the implementation can be done w/o
> clobbering extra registers for small blocks, while it will have enough
> time to spill for large ones.
>
> The other patch does
>
>   +KBUILD_CFLAGS += -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
>   +KBUILD_CFLAGS += -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
>
> for non-native CPUs (so something we should fix for generic tuning).
>
> This is about our current default of rep stosq, which does not work well
> on Intel hardware.  We do a loop for blocks up to 32 bytes and rep stosq
> up to 8k.
>
> We now have X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB for Intel cores, but
> no changes for generic yet (it is on my TODO to do some more testing on
> Zen).
>
> So I think we can do the following:
>
> 1) Decide whether to go with X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB
>    or replace rep_prefix_8_byte by unrolled_loop.
>
> 2) Fix the issue with repeated constants.  I.e. instead of
>
>      movq $0, ....
>      movq $0, ....
>      ....
>      movq $0, ....
>
>    which we currently generate for a memset fitting in CLEAR_RATIO, emit
>
>      mov  $0, tmpreg
>      movq tmpreg, ....
>      movq tmpreg, ....
>      ....
>      movq tmpreg, ....
>
>    which will make memset sequences smaller.  I agree with Richi that HJ's
>    patch adding a new clear-block expander is probably not the right place
>    for solving the problem.
>
>    Ideally we should catch repeated constants more generally, since
>    this appears elsewhere too.  I am not quite sure where it fits best.
>    We already have a machine-specific pass that loads 0 into an SSE
>    register, which is kind of similar to this as well.
>
> 3) Figure out what reasonable MOVE_RATIO/CLEAR_RATIO defaults are.
>
> 4) Possibly go with the entry point idea?
>
> Honza
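To make Honza's point (2) above concrete, here is a minimal C sketch of the
kind of source that currently expands to the repeated "movq $0" sequence
(the 48-byte size and the -O2 -mno-sse flags are illustrative assumptions
only; any fixed-size zeroing that fits within the CLEAR_RATIO limit behaves
the same way):

  /* Hypothetical example, not part of the patch.  With generic's current
     CLEAR_RATIO of 6 and no SSE, this is expected to be expanded inline
     as six "movq $0, N(%rdi)" stores; loading 0 into a temporary register
     once and storing that register instead would shrink each store
     encoding by several bytes.  */
  void
  clear_block (char *p)
  {
    __builtin_memset (p, 0, 48);
  }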
Here is the v3 patch.  It no longer uses "rep mov/stos".  Lili, can you
measure its performance impact on Intel and AMD CPUs?  The updated generic
tuning is:

Update memcpy and memset inline strategies for -mtune=generic:

1. Don't align memory.
2. For known sizes, unroll the loop with 4 moves or stores per iteration,
   without aligning the loop, up to 256 bytes.
3. For unknown sizes, use memcpy/memset.
4. Since each loop iteration has 4 stores, and 8 stores for zeroing with
   the unrolled loop may be needed, change CLEAR_RATIO to 10 so that
   zeroing up to 72 bytes is fully unrolled with 9 stores without SSE.

Use move_by_pieces and store_by_pieces for memcpy and memset epilogues
with the fixed epilogue size to enable overlapping moves and stores
(the C sketch after this message illustrates the overlap for the
253-byte case).

gcc/

	PR target/102294
	PR target/119596
	PR target/119703
	PR target/119704
	* builtins.cc (builtin_memset_gen_str): Make it global.
	* builtins.h (builtin_memset_gen_str): New.
	* config/i386/i386-expand.cc (expand_cpymem_epilogue): Use
	move_by_pieces.
	(expand_setmem_epilogue): Use store_by_pieces.
	(ix86_expand_set_or_cpymem): Pass val_exp, instead of
	vec_promoted_val, to expand_setmem_epilogue.
	* config/i386/x86-tune-costs.h (generic_memcpy): Updated.
	(generic_memset): Likewise.
	(generic_cost): Change CLEAR_RATIO to 10.

gcc/testsuite/

	PR target/102294
	PR target/119596
	PR target/119703
	PR target/119704
	* gcc.target/i386/auto-init-padding-3.c: Expect XMM stores.
	* gcc.target/i386/auto-init-padding-9.c: Expect loop.
	* gcc.target/i386/memcpy-strategy-12.c: New test.
	* gcc.target/i386/memcpy-strategy-13.c: Likewise.
	* gcc.target/i386/memset-strategy-25.c: Likewise.
	* gcc.target/i386/memset-strategy-26.c: Likewise.
	* gcc.target/i386/memset-strategy-27.c: Likewise.
	* gcc.target/i386/memset-strategy-28.c: Likewise.
	* gcc.target/i386/memset-strategy-29.c: Likewise.
	* gcc.target/i386/memset-strategy-30.c: Likewise.
	* gcc.target/i386/memset-strategy-31.c: Likewise.
	* gcc.target/i386/mvc17.c: Fail with "rep mov".
	* gcc.target/i386/pr111657-1.c: Scan for unrolled loop.  Fail
	with "rep mov".
	* gcc.target/i386/shrink_wrap_1.c: Also pass
	-mmemset-strategy=rep_8byte:-1:align.
	* gcc.target/i386/sw-1.c: Also pass -mstringop-strategy=rep_byte.

-- 
H.J.
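Before the attached patch, a rough C sketch of what the overlapping epilogue
amounts to for the 253-byte copy exercised by the new memcpy-strategy-12.c
test (the helper name and the plain memcpy calls are illustrative only; the
compiler emits the moves directly):

  #include <string.h>

  /* Mirrors the asm expected by memcpy-strategy-12.c: a 32-byte unrolled
     loop covers the first 224 bytes, then the epilogue finishes the
     remaining 29 bytes with four 8-byte moves, the last one starting at
     offset 245 = 253 - 8 so that it overlaps the previous move by 3 bytes
     instead of needing 4/2/1-byte tail copies.  */
  void
  copy_253 (char *dest, const char *src)
  {
    size_t i;
    for (i = 0; i < 224; i += 32)       /* 7 iterations, 4 quadword moves each.  */
      memcpy (dest + i, src + i, 32);
    memcpy (dest + 224, src + 224, 8);
    memcpy (dest + 232, src + 232, 8);
    memcpy (dest + 240, src + 240, 8);
    memcpy (dest + 245, src + 245, 8);  /* Overlaps bytes 245-247 already copied.  */
  }

The memset epilogues get the same treatment via store_by_pieces.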
From bcd7245314d3ba4eb55e9ea2bc0b7d165834f5b6 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.to...@gmail.com>
Date: Thu, 18 Mar 2021 18:43:10 -0700
Subject: [PATCH v3] x86: Update memcpy/memset inline strategies for
 -mtune=generic

Update memcpy and memset inline strategies for -mtune=generic:

1. Don't align memory.
2. For known sizes, unroll the loop with 4 moves or stores per iteration,
   without aligning the loop, up to 256 bytes.
3. For unknown sizes, use memcpy/memset.
4. Since each loop iteration has 4 stores, and 8 stores for zeroing with
   the unrolled loop may be needed, change CLEAR_RATIO to 10 so that
   zeroing up to 72 bytes is fully unrolled with 9 stores without SSE.

Use move_by_pieces and store_by_pieces for memcpy and memset epilogues
with the fixed epilogue size to enable overlapping moves and stores.

gcc/

	PR target/102294
	PR target/119596
	PR target/119703
	PR target/119704
	* builtins.cc (builtin_memset_gen_str): Make it global.
	* builtins.h (builtin_memset_gen_str): New.
	* config/i386/i386-expand.cc (expand_cpymem_epilogue): Use
	move_by_pieces.
	(expand_setmem_epilogue): Use store_by_pieces.
	(ix86_expand_set_or_cpymem): Pass val_exp, instead of
	vec_promoted_val, to expand_setmem_epilogue.
	* config/i386/x86-tune-costs.h (generic_memcpy): Updated.
	(generic_memset): Likewise.
	(generic_cost): Change CLEAR_RATIO to 10.

gcc/testsuite/

	PR target/102294
	PR target/119596
	PR target/119703
	PR target/119704
	* gcc.target/i386/auto-init-padding-3.c: Expect XMM stores.
	* gcc.target/i386/auto-init-padding-9.c: Expect loop.
	* gcc.target/i386/memcpy-strategy-12.c: New test.
	* gcc.target/i386/memcpy-strategy-13.c: Likewise.
	* gcc.target/i386/memset-strategy-25.c: Likewise.
	* gcc.target/i386/memset-strategy-26.c: Likewise.
	* gcc.target/i386/memset-strategy-27.c: Likewise.
	* gcc.target/i386/memset-strategy-28.c: Likewise.
	* gcc.target/i386/memset-strategy-29.c: Likewise.
	* gcc.target/i386/memset-strategy-30.c: Likewise.
	* gcc.target/i386/memset-strategy-31.c: Likewise.
	* gcc.target/i386/mvc17.c: Fail with "rep mov".
	* gcc.target/i386/pr111657-1.c: Scan for unrolled loop.  Fail
	with "rep mov".
	* gcc.target/i386/shrink_wrap_1.c: Also pass
	-mmemset-strategy=rep_8byte:-1:align.
	* gcc.target/i386/sw-1.c: Also pass -mstringop-strategy=rep_byte.

Signed-off-by: H.J. Lu <hjl.to...@gmail.com>
---
 gcc/builtins.cc                               |  2 +-
 gcc/builtins.h                                |  2 +
 gcc/config/i386/i386-expand.cc                | 47 +++++--------------
 gcc/config/i386/x86-tune-costs.h              | 35 +++++++++-----
 .../gcc.target/i386/auto-init-padding-3.c     |  7 +--
 .../gcc.target/i386/auto-init-padding-9.c     | 25 ++++++++--
 .../gcc.target/i386/memcpy-strategy-12.c      | 43 +++++++++++++++++
 .../gcc.target/i386/memcpy-strategy-13.c      | 11 +++++
 .../gcc.target/i386/memset-strategy-25.c      | 29 ++++++++++++
 .../gcc.target/i386/memset-strategy-26.c      | 15 ++++++
 .../gcc.target/i386/memset-strategy-27.c      | 11 +++++
 .../gcc.target/i386/memset-strategy-28.c      | 29 ++++++++++++
 .../gcc.target/i386/memset-strategy-29.c      | 34 ++++++++++++++
 .../gcc.target/i386/memset-strategy-30.c      | 35 ++++++++++++++
 .../gcc.target/i386/memset-strategy-31.c      | 28 +++++++++++
 gcc/testsuite/gcc.target/i386/mvc17.c         |  2 +-
 gcc/testsuite/gcc.target/i386/pr111657-1.c    | 24 +++++++++-
 gcc/testsuite/gcc.target/i386/shrink_wrap_1.c |  2 +-
 gcc/testsuite/gcc.target/i386/sw-1.c          |  2 +-
 19 files changed, 322 insertions(+), 61 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-25.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-26.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-27.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-28.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-29.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-30.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-31.c

diff --git a/gcc/builtins.cc b/gcc/builtins.cc
index 3064bff1ae6..e9c9f8eeab3 100644
--- a/gcc/builtins.cc
+++ b/gcc/builtins.cc
@@ -4268,7 +4268,7 @@ builtin_memset_read_str (void *data, void *prev,
    4 bytes wide, return the RTL for 0x01010101*data.
    If PREV isn't nullptr, it has the RTL info from the previous
    iteration.  */
-static rtx
+rtx
 builtin_memset_gen_str (void *data, void *prev,
 			HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
 			fixed_size_mode mode)
diff --git a/gcc/builtins.h b/gcc/builtins.h
index 5a553a9c836..b552aee3905 100644
--- a/gcc/builtins.h
+++ b/gcc/builtins.h
@@ -160,6 +160,8 @@ extern char target_percent_c[3];
 extern char target_percent_s_newline[4];
 extern bool target_char_cst_p (tree t, char *p);
 extern rtx get_memory_rtx (tree exp, tree len);
+extern rtx builtin_memset_gen_str (void *, void *, HOST_WIDE_INT,
+				   fixed_size_mode mode);
 extern internal_fn associated_internal_fn (combined_fn, tree);
 extern internal_fn associated_internal_fn (tree);
 
diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 82e9f035d11..b7d181b7ffc 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -8221,19 +8221,11 @@ expand_cpymem_epilogue (rtx destmem, rtx srcmem,
   rtx src, dest;
   if (CONST_INT_P (count))
     {
-      HOST_WIDE_INT countval = INTVAL (count);
-      HOST_WIDE_INT epilogue_size = countval % max_size;
-      int i;
-
-      /* For now MAX_SIZE should be a power of 2.  This assert could be
-	 relaxed, but it'll require a bit more complicated epilogue
-	 expanding.  */
-      gcc_assert ((max_size & (max_size - 1)) == 0);
-      for (i = max_size; i >= 1; i >>= 1)
-	{
-	  if (epilogue_size & i)
-	    destmem = emit_memmov (destmem, &srcmem, destptr, srcptr, i);
-	}
+      unsigned HOST_WIDE_INT countval = UINTVAL (count);
+      unsigned HOST_WIDE_INT epilogue_size = countval % max_size;
+      unsigned int destalign = MEM_ALIGN (destmem);
+      move_by_pieces (destmem, srcmem, epilogue_size, destalign,
+		      RETURN_BEGIN);
       return;
     }
   if (max_size > 8)
@@ -8396,31 +8388,18 @@ expand_setmem_epilogue_via_loop (rtx destmem, rtx destptr, rtx value,
 /* Output code to set at most count & (max_size - 1) bytes starting by
    DEST.  */
 static void
-expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx vec_value,
-			rtx count, int max_size)
+expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value,
+			rtx origin_value, rtx count, int max_size)
 {
   rtx dest;
   if (CONST_INT_P (count))
     {
-      HOST_WIDE_INT countval = INTVAL (count);
-      HOST_WIDE_INT epilogue_size = countval % max_size;
-      int i;
-
-      /* For now MAX_SIZE should be a power of 2.  This assert could be
-	 relaxed, but it'll require a bit more complicated epilogue
-	 expanding.  */
-      gcc_assert ((max_size & (max_size - 1)) == 0);
-      for (i = max_size; i >= 1; i >>= 1)
-	{
-	  if (epilogue_size & i)
-	    {
-	      if (vec_value && i > GET_MODE_SIZE (GET_MODE (value)))
-		destmem = emit_memset (destmem, destptr, vec_value, i);
-	      else
-		destmem = emit_memset (destmem, destptr, value, i);
-	    }
-	}
+      unsigned HOST_WIDE_INT countval = UINTVAL (count);
+      unsigned HOST_WIDE_INT epilogue_size = countval % max_size;
+      unsigned int destalign = MEM_ALIGN (destmem);
+      store_by_pieces (destmem, epilogue_size, builtin_memset_gen_str,
+		       origin_value, destalign, true, RETURN_BEGIN);
       return;
     }
   if (max_size > 32)
@@ -9802,7 +9781,7 @@ ix86_expand_set_or_cpymem (rtx dst, rtx src, rtx count_exp, rtx val_exp,
     {
       if (issetmem)
 	expand_setmem_epilogue (dst, destreg, promoted_val,
-				vec_promoted_val, count_exp,
+				val_exp, count_exp,
 				epilogue_size_needed);
       else
 	expand_cpymem_epilogue (dst, src, destreg, srcreg, count_exp,
diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index b08081e37cf..e3d9381594b 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -4065,19 +4065,32 @@ struct processor_costs shijidadao_cost = {
 
-/* Generic should produce code tuned for Core-i7 (and newer chips)
-   and btver1 (and newer chips).  */
+/* Generic should produce code tuned for Haswell (and newer chips)
+   and znver1 (and newer chips):
+   1. Don't align memory.
+   2. For known sizes, unroll the loop with 4 moves or stores per
+      iteration, without aligning the loop, up to 256 bytes.
+   3. For unknown sizes, use memcpy/memset.
+   4. Since each loop iteration has 4 stores, and 8 stores for zeroing
+      with the unrolled loop may be needed, change CLEAR_RATIO to 10 so
+      that zeroing up to 72 bytes is fully unrolled with 9 stores without
+      SSE.
+ */
 static stringop_algs generic_memcpy[2] = {
-  {libcall, {{32, loop, false}, {8192, rep_prefix_4_byte, false},
-             {-1, libcall, false}}},
-  {libcall, {{32, loop, false}, {8192, rep_prefix_8_byte, false},
-             {-1, libcall, false}}}};
+  {libcall,
+   {{256, unrolled_loop, true},
+    {-1, libcall, true}}},
+  {libcall,
+   {{256, unrolled_loop, true},
+    {-1, libcall, true}}}};
 static stringop_algs generic_memset[2] = {
-  {libcall, {{32, loop, false}, {8192, rep_prefix_4_byte, false},
-             {-1, libcall, false}}},
-  {libcall, {{32, loop, false}, {8192, rep_prefix_8_byte, false},
-             {-1, libcall, false}}}};
+  {libcall,
+   {{256, unrolled_loop, true},
+    {-1, libcall, true}}},
+  {libcall,
+   {{256, unrolled_loop, true},
+    {-1, libcall, true}}}};
 static const
 struct processor_costs generic_cost = {
   {
@@ -4134,7 +4147,7 @@ struct processor_costs generic_cost = {
   COSTS_N_INSNS (1),			/* cost of movzx */
   8,					/* "large" insn */
   17,					/* MOVE_RATIO */
-  6,					/* CLEAR_RATIO */
+  10,					/* CLEAR_RATIO */
   {6, 6, 6},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
 					   Relative to reg-reg move (2).  */
diff --git a/gcc/testsuite/gcc.target/i386/auto-init-padding-3.c b/gcc/testsuite/gcc.target/i386/auto-init-padding-3.c
index 7c20a28508f..a12069a039d 100644
--- a/gcc/testsuite/gcc.target/i386/auto-init-padding-3.c
+++ b/gcc/testsuite/gcc.target/i386/auto-init-padding-3.c
@@ -23,8 +23,5 @@ int foo ()
   return var.four.internal1;
 }
 
-/* { dg-final { scan-assembler "movl\t\\\$0," } } */
-/* { dg-final { scan-assembler "movl\t\\\$16," { target { ! ia32 } } } } */
-/* { dg-final { scan-assembler "rep stosq" { target { ! ia32 } } } } */
-/* { dg-final { scan-assembler "movl\t\\\$32," { target ia32 } } } */
-/* { dg-final { scan-assembler "rep stosl" { target ia32 } } } */
+/* { dg-final { scan-assembler-times "pxor\t%xmm0, %xmm0" 1 } } */
+/* { dg-final { scan-assembler-times "movaps\t%xmm0, " 8 } } */
diff --git a/gcc/testsuite/gcc.target/i386/auto-init-padding-9.c b/gcc/testsuite/gcc.target/i386/auto-init-padding-9.c
index a87b68b255b..d7d0593db9c 100644
--- a/gcc/testsuite/gcc.target/i386/auto-init-padding-9.c
+++ b/gcc/testsuite/gcc.target/i386/auto-init-padding-9.c
@@ -2,6 +2,25 @@
    padding.  */
 /* { dg-do compile } */
 /* { dg-options "-ftrivial-auto-var-init=zero -march=x86-64" } */
+/* Keep labels and directives ('.cfi_startproc', '.cfi_endproc').  */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } {^\t?\.} } } */
+
+/*
+**foo:
+**...
+**	movl	\$0, %ecx
+**...
+**.L[0-9]+:
+**	movl	%esi, %edx
+**	movq	%rcx, \(%rax,%rdx\)
+**	movq	%rcx, 8\(%rax,%rdx\)
+**	movq	%rcx, 16\(%rax,%rdx\)
+**	movq	%rcx, 24\(%rax,%rdx\)
+**	addl	\$32, %esi
+**	cmpl	%edi, %esi
+**	jb	.L[0-9]+
+**...
+*/
 
 struct test_trailing_hole {
   int one;
@@ -18,8 +37,4 @@ int foo ()
   return var[2].four;
 }
 
-/* { dg-final { scan-assembler "movl\t\\\$0," } } */
-/* { dg-final { scan-assembler "movl\t\\\$20," { target { ! ia32 } } } } */
-/* { dg-final { scan-assembler "rep stosq" { target { ! ia32 } } } } */
-/* { dg-final { scan-assembler "movl\t\\\$40," { target ia32} } } */
-/* { dg-final { scan-assembler "rep stosl" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "rep stos" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c
new file mode 100644
index 00000000000..22ed9ec6601
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c
@@ -0,0 +1,43 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -mno-sse" } */
+/* Keep labels and directives ('.cfi_startproc', '.cfi_endproc').  */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } {^\t?\.} } } */
+
+/*
+**foo:
+**.LFB[0-9]+:
+**	.cfi_startproc
+**	xorl	%edx, %edx
+**.L[0-9]+:
+**	movl	%edx, %eax
+**	addl	\$32, %edx
+**	movq	\(%rsi,%rax\), %r10
+**	movq	8\(%rsi,%rax\), %r9
+**	movq	16\(%rsi,%rax\), %r8
+**	movq	24\(%rsi,%rax\), %rcx
+**	movq	%r10, \(%rdi,%rax\)
+**	movq	%r9, 8\(%rdi,%rax\)
+**	movq	%r8, 16\(%rdi,%rax\)
+**	movq	%rcx, 24\(%rdi,%rax\)
+**	cmpl	\$224, %edx
+**	jb	.L[0-9]+
+**	addq	%rdx, %rsi
+**	movq	\(%rsi\), %rax
+**	movq	%rax, \(%rdi,%rdx\)
+**	movq	8\(%rsi\), %rax
+**	movq	%rax, 8\(%rdi,%rdx\)
+**	movq	16\(%rsi\), %rax
+**	movq	%rax, 16\(%rdi,%rdx\)
+**	movq	21\(%rsi\), %rax
+**	movq	%rax, 21\(%rdi,%rdx\)
+**	ret
+**...
+*/
+
+void
+foo (char *dest, char *src)
+{
+  __builtin_memcpy (dest, src, 253);
+}
+
+/* { dg-final { scan-assembler-not "rep mov" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c
new file mode 100644
index 00000000000..109bd675a51
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -mno-avx" } */
+/* { dg-final { scan-assembler "jmp\tmemcpy" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler "call\tmemcpy" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "rep movsb" } } */
+
+void
+foo (char *dest, char *src)
+{
+  __builtin_memcpy (dest, src, 257);
+}
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-25.c b/gcc/testsuite/gcc.target/i386/memset-strategy-25.c
new file mode 100644
index 00000000000..040439d1671
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-25.c
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -mno-sse" } */
+/* Keep labels and directives ('.cfi_startproc', '.cfi_endproc').  */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } {^\t?\.} } } */
+
+/*
+**foo:
+**.LFB[0-9]+:
+**	.cfi_startproc
+**	xorl	%eax, %eax
+**.L[0-9]+:
+**	movl	%eax, %edx
+**	addl	\$32, %eax
+**	movq	\$0, \(%rdi,%rdx\)
+**	movq	\$0, 8\(%rdi,%rdx\)
+**	movq	\$0, 16\(%rdi,%rdx\)
+**	movq	\$0, 24\(%rdi,%rdx\)
+**	cmpl	\$224, %eax
+**	jb	.L[0-9]+
+**...
+*/
+
+void
+foo (char *dest)
+{
+  __builtin_memset (dest, 0, 253);
+}
+
+/* { dg-final { scan-assembler-not "rep stos" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-26.c b/gcc/testsuite/gcc.target/i386/memset-strategy-26.c
new file mode 100644
index 00000000000..c53bce52e17
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-26.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -mno-sse" } */
+/* { dg-final { scan-assembler-not "jmp\tmemset" } } */
+/* { dg-final { scan-assembler-not "rep stosb" } } */
+
+struct foo
+{
+  char buf[41];
+};
+
+void
+zero(struct foo *f)
+{
+  __builtin_memset(f->buf, 0, sizeof(f->buf));
+}
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-27.c b/gcc/testsuite/gcc.target/i386/memset-strategy-27.c
new file mode 100644
index 00000000000..685d6e5a5c2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-27.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -mno-avx" } */
+/* { dg-final { scan-assembler "jmp\tmemset" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler "call\tmemset" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "rep stosb" } } */
+
+void
+foo (char *dest)
+{
+  __builtin_memset (dest, 0, 257);
+}
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-28.c b/gcc/testsuite/gcc.target/i386/memset-strategy-28.c
new file mode 100644
index 00000000000..1d173edf930
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-28.c
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -mno-sse" } */
+/* Keep labels and directives ('.cfi_startproc', '.cfi_endproc').  */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } {^\t?\.} } } */
+
+/*
+**foo:
+**.LFB[0-9]+:
+**	.cfi_startproc
+**	movq	\$0, \(%rdi\)
+**	movq	\$0, 8\(%rdi\)
+**	movq	\$0, 16\(%rdi\)
+**	movq	\$0, 24\(%rdi\)
+**	movq	\$0, 32\(%rdi\)
+**	movq	\$0, 40\(%rdi\)
+**	movq	\$0, 48\(%rdi\)
+**	movq	\$0, 56\(%rdi\)
+**	movb	\$0, 64\(%rdi\)
+**	ret
+**...
+*/
+
+void
+foo (char *dest)
+{
+  __builtin_memset (dest, 0, 65);
+}
+
+/* { dg-final { scan-assembler-not "rep stos" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-29.c b/gcc/testsuite/gcc.target/i386/memset-strategy-29.c
new file mode 100644
index 00000000000..54aa03e6b35
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-29.c
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -mno-sse" } */
+/* Keep labels and directives ('.cfi_startproc', '.cfi_endproc').  */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } {^\t?\.} } } */
+
+/*
+**foo:
+**...
+**.LFB[0-9]+:
+**	.cfi_startproc
+**	xorl	%eax, %eax
+**.L[0-9]+:
+**	movl	%eax, %edx
+**	addl	\$32, %eax
+**	movq	\$0, \(%rdi,%rdx\)
+**	movq	\$0, 8\(%rdi,%rdx\)
+**	movq	\$0, 16\(%rdi,%rdx\)
+**	movq	\$0, 24\(%rdi,%rdx\)
+**	cmpl	\$64, %eax
+**	jb	.L[0-9]+
+**	movq	\$0, \(%rdi,%rax\)
+**	movq	\$0, 8\(%rdi,%rax\)
+**	movb	\$0, 16\(%rdi,%rax\)
+**	ret
+**...
+*/
+
+void
+foo (char *dest)
+{
+  __builtin_memset (dest, 0, 81);
+}
+
+/* { dg-final { scan-assembler-not "rep stos" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-30.c b/gcc/testsuite/gcc.target/i386/memset-strategy-30.c
new file mode 100644
index 00000000000..4799adcef5d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-30.c
@@ -0,0 +1,35 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -mno-sse" } */
+/* Keep labels and directives ('.cfi_startproc', '.cfi_endproc').  */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } {^\t?\.} } } */
+
+/*
+**foo:
+**...
+**.LFB[0-9]+:
+**	.cfi_startproc
+**	xorl	%eax, %eax
+**.L[0-9]+:
+**	movl	%eax, %edx
+**	addl	\$32, %eax
+**	movq	\$0, \(%rdi,%rdx\)
+**	movq	\$0, 8\(%rdi,%rdx\)
+**	movq	\$0, 16\(%rdi,%rdx\)
+**	movq	\$0, 24\(%rdi,%rdx\)
+**	cmpl	\$64, %eax
+**	jb	.L[0-9]+
+**	movq	\$0, 16\(%rdi,%rax\)
+**	movq	\$0, \(%rdi,%rax\)
+**	movq	\$0, 8\(%rdi,%rax\)
+**	movq	\$0, 23\(%rdi,%rax\)
+**	ret
+**...
+*/
+
+void
+foo (char *dest)
+{
+  __builtin_memset (dest, 0, 95);
+}
+
+/* { dg-final { scan-assembler-not "rep stos" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-31.c b/gcc/testsuite/gcc.target/i386/memset-strategy-31.c
new file mode 100644
index 00000000000..b2bb107b353
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-31.c
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -mno-avx -msse2" } */
+/* Keep labels and directives ('.cfi_startproc', '.cfi_endproc').  */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } {^\t?\.} } } */
+
+/*
+**foo:
+**.LFB[0-9]+:
+**...
+**.L[0-9]+:
+**	movl	%eax, %edx
+**	addl	\$32, %eax
+**	movq	\$0, \(%rdi,%rdx\)
+**	movq	\$0, 8\(%rdi,%rdx\)
+**	movq	\$0, 16\(%rdi,%rdx\)
+**	movq	\$0, 24\(%rdi,%rdx\)
+**	cmpl	\$224, %eax
+**	jb	.L[0-9]+
+**...
+*/
+
+void
+foo (char *dest)
+{
+  __builtin_memset (dest, 0, 254);
+}
+
+/* { dg-final { scan-assembler-not "rep stos" } } */
diff --git a/gcc/testsuite/gcc.target/i386/mvc17.c b/gcc/testsuite/gcc.target/i386/mvc17.c
index 8b83c1aecb3..dbf35ac36dc 100644
--- a/gcc/testsuite/gcc.target/i386/mvc17.c
+++ b/gcc/testsuite/gcc.target/i386/mvc17.c
@@ -1,7 +1,7 @@
 /* { dg-do compile } */
 /* { dg-require-ifunc "" } */
 /* { dg-options "-O2 -march=x86-64" } */
-/* { dg-final { scan-assembler-times "rep mov" 1 } } */
+/* { dg-final { scan-assembler-not "rep mov" } } */
 
 __attribute__((target_clones("default","arch=icelake-server")))
 void
diff --git a/gcc/testsuite/gcc.target/i386/pr111657-1.c b/gcc/testsuite/gcc.target/i386/pr111657-1.c
index a4ba21073f5..fa9f4cfe5c5 100644
--- a/gcc/testsuite/gcc.target/i386/pr111657-1.c
+++ b/gcc/testsuite/gcc.target/i386/pr111657-1.c
@@ -1,5 +1,26 @@
 /* { dg-do assemble } */
 /* { dg-options "-O2 -mno-sse -mtune=generic -save-temps" } */
+/* Keep labels and directives ('.cfi_startproc', '.cfi_endproc').  */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } {^\t?\.} } } */
+
+/*
+**bar:
+**...
+**.L[0-9]+:
+**	movl	%edx, %eax
+**	addl	\$32, %edx
+**	movq	%gs:m\(%rax\), %r9
+**	movq	%gs:m\+8\(%rax\), %r8
+**	movq	%gs:m\+16\(%rax\), %rsi
+**	movq	%gs:m\+24\(%rax\), %rcx
+**	movq	%r9, \(%rdi,%rax\)
+**	movq	%r8, 8\(%rdi,%rax\)
+**	movq	%rsi, 16\(%rdi,%rax\)
+**	movq	%rcx, 24\(%rdi,%rax\)
+**	cmpl	\$224, %edx
+**	jb	.L[0-9]+
+**...
+*/
 
 typedef unsigned long uword __attribute__ ((mode (word)));
 
@@ -8,5 +29,4 @@ struct a { uword arr[30]; };
 __seg_gs struct a m;
 
 void bar (struct a *dst) { *dst = m; }
-/* { dg-final { scan-assembler "gs\[ \t\]+rep\[; \t\]+movs(l|q)" { target { ! x32 } } } } */
-/* { dg-final { scan-assembler-not "gs\[ \t\]+rep\[; \t\]+movs(l|q)" { target x32 } } } */
+/* { dg-final { scan-assembler-not "rep movs" } } */
diff --git a/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c b/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c
index 4b286671e90..30b82ab695a 100644
--- a/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c
+++ b/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile { target { ! ia32 } } } */
-/* { dg-options "-O2 -fdump-rtl-pro_and_epilogue -fno-stack-protector" } */
+/* { dg-options "-O2 -mmemset-strategy=rep_8byte:-1:align -fdump-rtl-pro_and_epilogue -fno-stack-protector" } */
 
 enum machine_mode
 {
diff --git a/gcc/testsuite/gcc.target/i386/sw-1.c b/gcc/testsuite/gcc.target/i386/sw-1.c
index b0432279644..14db3cee206 100644
--- a/gcc/testsuite/gcc.target/i386/sw-1.c
+++ b/gcc/testsuite/gcc.target/i386/sw-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -mtune=generic -fshrink-wrap -fdump-rtl-pro_and_epilogue -fno-stack-protector" } */
+/* { dg-options "-O2 -mtune=generic -mstringop-strategy=rep_byte -fshrink-wrap -fdump-rtl-pro_and_epilogue -fno-stack-protector" } */
 /* { dg-additional-options "-mno-avx" { target ia32 } } */
 /* { dg-skip-if "No shrink-wrapping preformed" { x86_64-*-mingw* } } */
 
-- 
2.49.0