On Tue, Jun 24, 2025 at 1:18 PM Jan Hubicka <hubi...@ucw.cz> wrote:
>
> > Here is the v3 patch.  It no longer uses "rep mov/stos".  Lili, can
> > you measure its performance impact on Intel and AMD CPUs?
> >
> > The updated generic has:
> >
> > Update memcpy and memset inline strategies for -mtune=generic:
> >
> > 1. Don't align memory.
>
> This looks OK to me (recent microarchitectures seem to be good at
> handling misaligned accesses in most cases, though we still risk
> partial memory stalls).
>
> > 2. For known sizes, unroll loop with 4 moves or stores per iteration
> >    without aligning the loop, up to 256 bytes.
>
> The main reason the limit was bigger is the situation where we know
> the expected size of the copied block from profile feedback, or we
> have a small upper bound.  Calling memcpy means spilling all SSE
> registers to the stack and increasing integer register pressure, too,
> which may be counterproductive, and I do not think it is caught by
> the benchmarking done.
>
> I hacked up the following micro-benchmark:
>
> #include <string.h>
> #include <stdlib.h>
> #include <stdio.h>
>
> int width = 1024, height = 1024;
> char *buffer1;
> char *buffer2;
>
> __attribute__ ((noipa))
> void
> copy_triangle (int width, int from, int to, int start, float slope1,
>                float slope2)
> {
>   for (int i = from; i < to; i++)
>     {
>       memcpy (buffer1 + start + (int)((i - from) * slope1) + i * width,
>               buffer2 + start + (int)((i - from) * slope1) + i * width,
>               (int)((i - from) * (slope2 - slope1)));
>     }
> }
>
> int
> main ()
> {
>   buffer1 = malloc (width * height);
>   buffer2 = malloc (width * height);
>   for (int i = 0; i < 100000; i++)
>     copy_triangle (width, 0, 255, 0, 0, 1);
> }
>
> which copies a triangle of the given size from buffer2 to buffer1.
> With profile feedback we know the expected size of the block and use
> the table to inline memcpy.  It has two read-only values in xmm
> registers which need to be reloaded from memory if a libcall is used.
> For two values it seems that for triangles of size 255 it is already
> a win to call memcpy; for smaller ones it is better to use the inline
> sequence (this can be tested by comparing -O2 with
> -O2 -minline-all-stringops).
>
> Of course one can modify the benchmark to use more XMM registers and
> the tradeoffs will change, but it is hard to guess register pressure
> at expansion time...  The situation is also likely different for the
> kernel, where mitigations make a memcpy call expensive.
>
> I wonder what kind of benefits you see from going from 8k to 256
> bytes here?  I also wonder if the inline sequences can be improved.
> It seems that the offline memcpy for blocks >128 already benefits
> from doing vector moves...
>
> > 3. For unknown sizes, use memcpy/memset.
> > 4. Since each loop iteration has 4 stores, and up to 8 stores may be
> >    needed for zeroing with an unrolled loop, change CLEAR_RATIO to
> >    10 so that zeroing up to 72 bytes is fully unrolled with 9 stores
> >    without SSE.
>
> I guess it is OK.  I still think we ought to solve the code bloat due
> to the repeated 4-byte $0 immediate, but I hope we can do that
> incrementally.
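
To make point 4 concrete, here is a minimal sketch (a hypothetical
example, not code from the patch) of the kind of zeroing CLEAR_RATIO 10
is meant to let the compiler fully unroll:

/* With CLEAR_RATIO 10, clearing up to 72 bytes should expand inline
   as at most 9 scalar stores, with no SSE and no loop.  Zeroing this
   72-byte struct would then become 9 8-byte "mov $0" stores.  */
struct payload { char buf[72]; };

void
clear_payload (struct payload *p)
{
  __builtin_memset (p, 0, sizeof *p);
}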
>
> > Use move_by_pieces and store_by_pieces for memcpy and memset
> > epilogues with the fixed epilogue size to enable overlapping moves
> > and stores.
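
The overlap trick is worth spelling out.  As a C sketch (a hypothetical
helper, not code from the patch), clearing a block of 9 to 16 bytes
needs only two possibly overlapping 8-byte stores instead of a chain of
8-, 4-, 2- and 1-byte stores:

#include <stdint.h>
#include <string.h>

/* Clear N bytes, 8 < N <= 16, with two 8-byte stores whose ranges may
   overlap in the middle; this is the shape store_by_pieces can emit
   when the epilogue size is known at expansion time.  */
static void
clear_tail (char *p, size_t n)
{
  uint64_t zero = 0;
  memcpy (p, &zero, 8);          /* bytes [0, 8)      */
  memcpy (p + n - 8, &zero, 8);  /* bytes [n - 8, n)  */
}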
> >
> > gcc/
> >
> >     PR target/102294
> >     PR target/119596
> >     PR target/119703
> >     PR target/119704
> >     * builtins.cc (builtin_memset_gen_str): Make it global.
> >     * builtins.h (builtin_memset_gen_str): New.
> >     * config/i386/i386-expand.cc (expand_cpymem_epilogue): Use
> >     move_by_pieces.
> >     (expand_setmem_epilogue): Use store_by_pieces.
> >     (ix86_expand_set_or_cpymem): Pass val_exp, instead of
> >     vec_promoted_val, to expand_setmem_epilogue.
> >     * config/i386/x86-tune-costs.h (generic_memcpy): Updated.
> >     (generic_memset): Likewise.
> >     (generic_cost): Change CLEAR_RATIO to 10.
> >
> > gcc/testsuite/
> >
> >     PR target/102294
> >     PR target/119596
> >     PR target/119703
> >     PR target/119704
> >     * gcc.target/i386/auto-init-padding-3.c: Expect XMM stores.
> >     * gcc.target/i386/auto-init-padding-9.c: Expect loop.
> >     * gcc.target/i386/memcpy-strategy-12.c: New test.
> >     * gcc.target/i386/memcpy-strategy-13.c: Likewise.
> >     * gcc.target/i386/memset-strategy-25.c: Likewise.
> >     * gcc.target/i386/memset-strategy-26.c: Likewise.
> >     * gcc.target/i386/memset-strategy-27.c: Likewise.
> >     * gcc.target/i386/memset-strategy-28.c: Likewise.
> >     * gcc.target/i386/memset-strategy-29.c: Likewise.
> >     * gcc.target/i386/memset-strategy-30.c: Likewise.
> >     * gcc.target/i386/memset-strategy-31.c: Likewise.
> >     * gcc.target/i386/mvc17.c: Fail with "rep mov".
> >     * gcc.target/i386/pr111657-1.c: Scan for unrolled loop.  Fail
> >     with "rep mov".
> >     * gcc.target/i386/shrink_wrap_1.c: Also pass
> >     -mmemset-strategy=rep_8byte:-1:align.
> >     * gcc.target/i386/sw-1.c: Also pass -mstringop-strategy=rep_byte.
> >
> > --
> > H.J.
>
> > From bcd7245314d3ba4eb55e9ea2bc0b7d165834f5b6 Mon Sep 17 00:00:00 2001
> > From: "H.J. Lu" <hjl.to...@gmail.com>
> > Date: Thu, 18 Mar 2021 18:43:10 -0700
> > Subject: [PATCH v3] x86: Update memcpy/memset inline strategies for
> >  -mtune=generic
> >
> > Update memcpy and memset inline strategies for -mtune=generic:
> >
> > 1. Don't align memory.
> > 2. For known sizes, unroll loop with 4 moves or stores per iteration
> >    without aligning the loop, up to 256 bytes.
> > 3. For unknown sizes, use memcpy/memset.
> > 4. Since each loop iteration has 4 stores, and up to 8 stores may be
> >    needed for zeroing with an unrolled loop, change CLEAR_RATIO to
> >    10 so that zeroing up to 72 bytes is fully unrolled with 9 stores
> >    without SSE.
> >
> > Use move_by_pieces and store_by_pieces for memcpy and memset
> > epilogues with the fixed epilogue size to enable overlapping moves
> > and stores.
> >
> > gcc/
> >
> >     PR target/102294
> >     PR target/119596
> >     PR target/119703
> >     PR target/119704
> >     * builtins.cc (builtin_memset_gen_str): Make it global.
> >     * builtins.h (builtin_memset_gen_str): New.
> >     * config/i386/i386-expand.cc (expand_cpymem_epilogue): Use
> >     move_by_pieces.
> >     (expand_setmem_epilogue): Use store_by_pieces.
> >     (ix86_expand_set_or_cpymem): Pass val_exp, instead of
> >     vec_promoted_val, to expand_setmem_epilogue.
> >     * config/i386/x86-tune-costs.h (generic_memcpy): Updated.
> >     (generic_memset): Likewise.
> >     (generic_cost): Change CLEAR_RATIO to 10.
> >
> > gcc/testsuite/
> >
> >     PR target/102294
> >     PR target/119596
> >     PR target/119703
> >     PR target/119704
> >     * gcc.target/i386/auto-init-padding-3.c: Expect XMM stores.
> >     * gcc.target/i386/auto-init-padding-9.c: Expect loop.
> >     * gcc.target/i386/memcpy-strategy-12.c: New test.
> >     * gcc.target/i386/memcpy-strategy-13.c: Likewise.
> >     * gcc.target/i386/memset-strategy-25.c: Likewise.
> >     * gcc.target/i386/memset-strategy-26.c: Likewise.
> >     * gcc.target/i386/memset-strategy-27.c: Likewise.
> >     * gcc.target/i386/memset-strategy-28.c: Likewise.
> >     * gcc.target/i386/memset-strategy-29.c: Likewise.
> >     * gcc.target/i386/memset-strategy-30.c: Likewise.
> >     * gcc.target/i386/memset-strategy-31.c: Likewise.
> >     * gcc.target/i386/mvc17.c: Fail with "rep mov".
> >     * gcc.target/i386/pr111657-1.c: Scan for unrolled loop.  Fail
> >     with "rep mov".
> >     * gcc.target/i386/shrink_wrap_1.c: Also pass
> >     -mmemset-strategy=rep_8byte:-1:align.
> >     * gcc.target/i386/sw-1.c: Also pass -mstringop-strategy=rep_byte.
> >
> > Signed-off-by: H.J. Lu <hjl.to...@gmail.com>
> > ---
> >  gcc/builtins.cc                               |  2 +-
> >  gcc/builtins.h                                |  2 +
> >  gcc/config/i386/i386-expand.cc                | 47 +++++-------
> >  gcc/config/i386/x86-tune-costs.h              | 35 +++++++++-----
> >  .../gcc.target/i386/auto-init-padding-3.c     |  7 +--
> >  .../gcc.target/i386/auto-init-padding-9.c     | 25 ++++++++--
> >  .../gcc.target/i386/memcpy-strategy-12.c      | 43 +++++++++++++++++
> >  .../gcc.target/i386/memcpy-strategy-13.c      | 11 +++++
> >  .../gcc.target/i386/memset-strategy-25.c      | 29 ++++++++++++
> >  .../gcc.target/i386/memset-strategy-26.c      | 15 ++++++
> >  .../gcc.target/i386/memset-strategy-27.c      | 11 +++++
> >  .../gcc.target/i386/memset-strategy-28.c      | 29 ++++++++++++
> >  .../gcc.target/i386/memset-strategy-29.c      | 34 ++++++++++++++
> >  .../gcc.target/i386/memset-strategy-30.c      | 35 ++++++++++++++
> >  .../gcc.target/i386/memset-strategy-31.c      | 28 +++++++++++
> >  gcc/testsuite/gcc.target/i386/mvc17.c         |  2 +-
> >  gcc/testsuite/gcc.target/i386/pr111657-1.c    | 24 +++++++++-
> >  gcc/testsuite/gcc.target/i386/shrink_wrap_1.c |  2 +-
> >  gcc/testsuite/gcc.target/i386/sw-1.c          |  2 +-
> >  19 files changed, 322 insertions(+), 61 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-25.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-26.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-27.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-28.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-29.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-30.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-31.c
> >
> > diff --git a/gcc/builtins.cc b/gcc/builtins.cc
> > index 3064bff1ae6..e9c9f8eeab3 100644
> > --- a/gcc/builtins.cc
> > +++ b/gcc/builtins.cc
> > @@ -4268,7 +4268,7 @@ builtin_memset_read_str (void *data, void *prev,
> >     4 bytes wide, return the RTL for 0x01010101*data.  If PREV isn't
> >     nullptr, it has the RTL info from the previous iteration.  */
> >
> > -static rtx
> > +rtx
> >  builtin_memset_gen_str (void *data, void *prev,
> >                          HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> >                          fixed_size_mode mode)
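
As the comment above this hunk notes, builtin_memset_gen_str (now
exported) broadcasts the fill value across the mode.  Roughly, the
arithmetic is the following sketch (an illustration of the
0x01010101*data trick only, not the RTL the function actually builds):

#include <stdint.h>

/* Replicate a fill byte across a 4-byte word: 0xab -> 0xabababab.  */
static uint32_t
broadcast_fill_byte (uint8_t c)
{
  return (uint32_t) c * 0x01010101u;
}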
> > diff --git a/gcc/builtins.h b/gcc/builtins.h
> > index 5a553a9c836..b552aee3905 100644
> > --- a/gcc/builtins.h
> > +++ b/gcc/builtins.h
> > @@ -160,6 +160,8 @@ extern char target_percent_c[3];
> >  extern char target_percent_s_newline[4];
> >  extern bool target_char_cst_p (tree t, char *p);
> >  extern rtx get_memory_rtx (tree exp, tree len);
> > +extern rtx builtin_memset_gen_str (void *, void *, HOST_WIDE_INT,
> > +                                   fixed_size_mode mode);
> >
> >  extern internal_fn associated_internal_fn (combined_fn, tree);
> >  extern internal_fn associated_internal_fn (tree);
> > diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> > index 82e9f035d11..b7d181b7ffc 100644
> > --- a/gcc/config/i386/i386-expand.cc
> > +++ b/gcc/config/i386/i386-expand.cc
> > @@ -8221,19 +8221,11 @@ expand_cpymem_epilogue (rtx destmem, rtx srcmem,
> >    rtx src, dest;
> >    if (CONST_INT_P (count))
> >      {
> > -      HOST_WIDE_INT countval = INTVAL (count);
> > -      HOST_WIDE_INT epilogue_size = countval % max_size;
> > -      int i;
> > -
> > -      /* For now MAX_SIZE should be a power of 2.  This assert could be
> > -         relaxed, but it'll require a bit more complicated epilogue
> > -         expanding.  */
> > -      gcc_assert ((max_size & (max_size - 1)) == 0);
> > -      for (i = max_size; i >= 1; i >>= 1)
> > -        {
> > -          if (epilogue_size & i)
> > -            destmem = emit_memmov (destmem, &srcmem, destptr, srcptr, i);
> > -        }
> > +      unsigned HOST_WIDE_INT countval = UINTVAL (count);
> > +      unsigned HOST_WIDE_INT epilogue_size = countval % max_size;
> > +      unsigned int destalign = MEM_ALIGN (destmem);
> > +      move_by_pieces (destmem, srcmem, epilogue_size, destalign,
> > +                      RETURN_BEGIN);
>
> This change is independent of the rest of the change; can you commit
> it independently?
>
> >        return;
> >      }
> >    if (max_size > 8)
> > @@ -8396,31 +8388,18 @@ expand_setmem_epilogue_via_loop (rtx destmem, rtx destptr, rtx value,
> >
> >  /* Output code to set at most count & (max_size - 1) bytes starting by
> >     DEST.  */
> >  static void
> > -expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx vec_value,
> > -                        rtx count, int max_size)
> > +expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value,
> > +                        rtx origin_vaule, rtx count, int max_size)
> >  {
> >    rtx dest;
> >
> >    if (CONST_INT_P (count))
> >      {
> > -      HOST_WIDE_INT countval = INTVAL (count);
> > -      HOST_WIDE_INT epilogue_size = countval % max_size;
> > -      int i;
> > -
> > -      /* For now MAX_SIZE should be a power of 2.  This assert could be
> > -         relaxed, but it'll require a bit more complicated epilogue
> > -         expanding.  */
> > -      gcc_assert ((max_size & (max_size - 1)) == 0);
> > -      for (i = max_size; i >= 1; i >>= 1)
> > -        {
> > -          if (epilogue_size & i)
> > -            {
> > -              if (vec_value && i > GET_MODE_SIZE (GET_MODE (value)))
> > -                destmem = emit_memset (destmem, destptr, vec_value, i);
> > -              else
> > -                destmem = emit_memset (destmem, destptr, value, i);
> > -            }
> > -        }
> > +      unsigned HOST_WIDE_INT countval = UINTVAL (count);
> > +      unsigned HOST_WIDE_INT epilogue_size = countval % max_size;
> > +      unsigned int destalign = MEM_ALIGN (destmem);
> > +      store_by_pieces (destmem, epilogue_size, builtin_memset_gen_str,
> > +                       origin_vaule, destalign, true, RETURN_BEGIN);
>
> Likewise here.  Note the typo in the new parameter name: origin_vaule
> should be origin_value.
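
The cpymem epilogue above gets the matching benefit on the copy side.
A sketch (again a hypothetical helper, not code from the patch) of the
tail copy move_by_pieces can produce for a known epilogue size:

#include <stdint.h>
#include <string.h>

/* Copy the last N bytes, 8 < N <= 16, with two possibly overlapping
   8-byte moves.  Both loads happen before the stores, so the sequence
   is safe even if the middle bytes overlap.  */
static void
copy_tail (char *dst, const char *src, size_t n)
{
  uint64_t head, tail;
  memcpy (&head, src, 8);
  memcpy (&tail, src + n - 8, 8);
  memcpy (dst, &head, 8);
  memcpy (dst + n - 8, &tail, 8);
}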
> >        return;
> >      }
> >    if (max_size > 32)
> > @@ -9802,7 +9781,7 @@ ix86_expand_set_or_cpymem (rtx dst, rtx src, rtx count_exp, rtx val_exp,
> >      {
> >        if (issetmem)
> >          expand_setmem_epilogue (dst, destreg, promoted_val,
> > -                                vec_promoted_val, count_exp,
> > +                                val_exp, count_exp,
> >                                  epilogue_size_needed);
> >        else
> >          expand_cpymem_epilogue (dst, src, destreg, srcreg, count_exp,
> > diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> > index b08081e37cf..e3d9381594b 100644
> > --- a/gcc/config/i386/x86-tune-costs.h
> > +++ b/gcc/config/i386/x86-tune-costs.h
> > @@ -4065,19 +4065,32 @@ struct processor_costs shijidadao_cost = {
> >
> >
> >
> > -/* Generic should produce code tuned for Core-i7 (and newer chips)
> > -   and btver1 (and newer chips).  */
> > +/* Generic should produce code tuned for Haswell (and newer chips)
> > +   and znver1 (and newer chips):
> > +   1. Don't align memory.
> > +   2. For known sizes, unroll loop with 4 moves or stores per
> > +      iteration without aligning the loop, up to 256 bytes.
> > +   3. For unknown sizes, use memcpy/memset.
>
> This is not what the table does.  It says that for sizes expected to
> be <= 256, use the unrolled loop; for the rest, use memcpy.  For known
> sizes < 256, STORE_BY_PIECES will prevail and we won't use the loop.
>
> The patch looks reasonable to me now.  Thanks for the patience with
> this - doing stringops right in all situations is a bit of a complex
> problem...
>
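To restate the resulting behavior in source terms (illustrative only;
the exact thresholds depend on the final table and on the flags used):

#include <string.h>

/* Known size below the 256-byte cutoff: expected to be expanded
   inline (by-pieces stores or an unrolled loop), no libcall.  */
void
copy_known (char *dst, const char *src)
{
  memcpy (dst, src, 200);
}

/* Size unknown at compile time: expected to remain a memcpy call.  */
void
copy_unknown (char *dst, const char *src, size_t n)
{
  memcpy (dst, src, n);
}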
I am checking it in.

Thanks.

--
H.J.