On Tue, Jun 24, 2025 at 1:18 PM Jan Hubicka <hubi...@ucw.cz> wrote:
>
> > Here is the v3 patch.  It no longer uses "rep mov/stos".  Lili, can you
> > measure its performance impact on Intel and AMD CPUs?
> >
> > The updated generic has
> >
> > Update memcpy and memset inline strategies for -mtune=generic:
> >
> > 1. Don't align memory.
> This looks OK to me (recent microarchs seem to be good at handling
> misaligned accesses in most cases, though we always risk partial memory
> stalls then).
> > 2. For known sizes, unroll loop with 4 moves or stores per iteration
> >    without aligning the loop, up to 256 bytes.
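> >
> > A sketch of the shape of the inline sequence in item 2 (assuming
> > 16-byte SSE moves; move16 is a placeholder for one such move, and this
> > is illustrative, not the exact expansion):
> >
> >   /* Four moves per iteration, no alignment prologue.  */
> >   for (i = 0; i + 64 <= n; i += 64)
> >     {
> >       move16 (d + i,      s + i);
> >       move16 (d + i + 16, s + i + 16);
> >       move16 (d + i + 32, s + i + 32);
> >       move16 (d + i + 48, s + i + 48);
> >     }
> >   /* The tail is handled by the by-pieces epilogue.  */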
>
> The main reason why the limit was bigger is the situation where we know
> the expected size of the copied block from profile feedback, or we have a
> small upper bound.  Calling memcpy means spilling all SSE registers to the
> stack and increasing integer register pressure, too, which may be
> counterproductive, and I do not think it is caught by the benchmarking done.
>
> I hacked up the following micro-benchmark:
> #include <string.h>
> #include <stdlib.h>
> #include <stdio.h>
> int width = 1024, height = 1024;
> char *buffer1;
> char *buffer2;
>
> /* Copy one triangle, row by row, from buffer2 into buffer1; the length
>    of the copied span grows by (slope2 - slope1) bytes per row.  */
> __attribute__ ((noipa))
> void
> copy_triangle (int width, int from, int to, int start, float slope1,
>                float slope2)
> {
>         for (int i = from; i < to; i++)
>         {
>           memcpy (buffer1 + start + (int)((i-from) * slope1) + i * width,
>                   buffer2 + start + (int)((i-from) * slope1) + i * width,
>                   (int)((i-from)*(slope2-slope1)));
>         }
> }
> int
> main()
> {
>         buffer1 = malloc (width *height);
>         buffer2 = malloc (width *height);
>         for (int i = 0; i < 100000; i++)
>                 copy_triangle (width, 0, 255, 0, 0, 1);
> }
>
> which copies a triangle of the given size from buffer2 to buffer1.  With
> profile feedback we know the expected size of the block and use the table
> to inline memcpy.  The function keeps two read-only values in xmm
> registers, which need to be reloaded from memory if a libcall is used.
> For two values it seems that for triangles of size 255 it is already a
> win to call memcpy; for smaller ones it is better to use the inline
> sequence (this can be tested by compiling with -O2 vs. -O2
> -minline-all-stringops).
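>
> For example, assuming the benchmark above is saved as bench.c (the flags
> are real; the rest is just one way to run the comparison):
>
>   $ gcc -O2 bench.c && time ./a.out                         # strategy table
>   $ gcc -O2 -minline-all-stringops bench.c && time ./a.out  # force inline
>   $ gcc -O2 -fprofile-generate bench.c && ./a.out           # train
>   $ gcc -O2 -fprofile-use bench.c && time ./a.out           # expected size known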
>
> Of course one can modify the benchmark to use more XMM registers and the
> tradeoffs will change, but it is hard to guess register pressure at
> expansion time...  The situation is also likely different for the kernel,
> due to mitigations making a memcpy call expensive.
>
> I wonder what kind of benefit you see from going from 8k down to 256 bytes
> here?  I also wonder if the inline sequences can be improved, though.  It
> seems that the out-of-line memcpy already benefits from doing vector moves
> for blocks >128 bytes...
> > 3. For unknown sizes, use memcpy/memset.
> > 4. Since each loop iteration has 4 stores, and 8 stores may be needed
> >    for zeroing with the unrolled loop, change CLEAR_RATIO to 10 so that
> >    zeroing of up to 72 bytes is fully unrolled with 9 stores without SSE.
>
> I guess it is OK.  I still think we ought to solve the code bloat due to
> the repeated 4-byte $0 immediates, but I hope we can do that incrementally.
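>
> For reference, a minimal example of what item 4 above targets; a sketch
> assuming CLEAR_RATIO is 10, i.e. at most 9 stores:
>
>   struct s { char buf[72]; };
>
>   void
>   clear (struct s *p)
>   {
>     /* With CLEAR_RATIO 10 this should expand to 9 scalar 8-byte
>        "mov $0" stores, fully unrolled, without touching SSE.  */
>     __builtin_memset (p, 0, sizeof *p);
>   }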
> >
> > Use move_by_pieces and store_by_pieces for memcpy and memset epilogues
> > with the fixed epilogue size to enable overlapping moves and stores.
> >
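> > For illustration, the overlap this enables (a sketch of the idea, not
> > the exact expansion): a fixed 15-byte copy can use two 8-byte moves
> > whose last byte overlaps, instead of 8 + 4 + 2 + 1 byte moves:
> >
> >   void
> >   copy15 (char *d, const char *s)
> >   {
> >     __builtin_memcpy (d, s, 8);          /* bytes 0..7  */
> >     __builtin_memcpy (d + 7, s + 7, 8);  /* bytes 7..14; byte 7 overlaps  */
> >   }
> >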
> > gcc/
> >
> > PR target/102294
> > PR target/119596
> > PR target/119703
> > PR target/119704
> > * builtins.cc (builtin_memset_gen_str): Make it global.
> > * builtins.h (builtin_memset_gen_str): New.
> > * config/i386/i386-expand.cc (expand_cpymem_epilogue): Use
> > move_by_pieces.
> > (expand_setmem_epilogue): Use store_by_pieces.
> > (ix86_expand_set_or_cpymem): Pass val_exp, instead of
> > vec_promoted_val, to expand_setmem_epilogue.
> > * config/i386/x86-tune-costs.h (generic_memcpy): Updated.
> > (generic_memset): Likewise.
> > (generic_cost): Change CLEAR_RATIO to 10.
> >
> > gcc/testsuite/
> >
> > PR target/102294
> > PR target/119596
> > PR target/119703
> > PR target/119704
> > * gcc.target/i386/auto-init-padding-3.c: Expect XMM stores.
> > * gcc.target/i386/auto-init-padding-9.c: Expect loop.
> > * gcc.target/i386/memcpy-strategy-12.c: New test.
> > * gcc.target/i386/memcpy-strategy-13.c: Likewise.
> > * gcc.target/i386/memset-strategy-25.c: Likewise.
> > * gcc.target/i386/memset-strategy-26.c: Likewise.
> > * gcc.target/i386/memset-strategy-27.c: Likewise.
> > * gcc.target/i386/memset-strategy-28.c: Likewise.
> > * gcc.target/i386/memset-strategy-29.c: Likewise.
> > * gcc.target/i386/memset-strategy-30.c: Likewise.
> > * gcc.target/i386/memset-strategy-31.c: Likewise.
> > * gcc.target/i386/mvc17.c: Fail with "rep mov".
> > * gcc.target/i386/pr111657-1.c: Scan for unrolled loop.  Fail
> > with "rep mov".
> > * gcc.target/i386/shrink_wrap_1.c: Also pass
> > -mmemset-strategy=rep_8byte:-1:align.
> > * gcc.target/i386/sw-1.c: Also pass -mstringop-strategy=rep_byte.
> >
> >
> > --
> > H.J.
>
> > From bcd7245314d3ba4eb55e9ea2bc0b7d165834f5b6 Mon Sep 17 00:00:00 2001
> > From: "H.J. Lu" <hjl.to...@gmail.com>
> > Date: Thu, 18 Mar 2021 18:43:10 -0700
> > Subject: [PATCH v3] x86: Update memcpy/memset inline strategies for
> >  -mtune=generic
> >
> > Update memcpy and memset inline strategies for -mtune=generic:
> >
> > 1. Don't align memory.
> > 2. For known sizes, unroll loop with 4 moves or stores per iteration
> >    without aligning the loop, up to 256 bytes.
> > 3. For unknown sizes, use memcpy/memset.
> > 4. Since each loop iteration has 4 stores, and 8 stores may be needed
> >    for zeroing with the unrolled loop, change CLEAR_RATIO to 10 so that
> >    zeroing of up to 72 bytes is fully unrolled with 9 stores without SSE.
> >
> > Use move_by_pieces and store_by_pieces for memcpy and memset epilogues
> > with the fixed epilogue size to enable overlapping moves and stores.
> >
> > gcc/
> >
> >       PR target/102294
> >       PR target/119596
> >       PR target/119703
> >       PR target/119704
> >       * builtins.cc (builtin_memset_gen_str): Make it global.
> >       * builtins.h (builtin_memset_gen_str): New.
> >       * config/i386/i386-expand.cc (expand_cpymem_epilogue): Use
> >       move_by_pieces.
> >       (expand_setmem_epilogue): Use store_by_pieces.
> >       (ix86_expand_set_or_cpymem): Pass val_exp, instead of
> >       vec_promoted_val, to expand_setmem_epilogue.
> >       * config/i386/x86-tune-costs.h (generic_memcpy): Updated.
> >       (generic_memset): Likewise.
> >       (generic_cost): Change CLEAR_RATIO to 10.
> >
> > gcc/testsuite/
> >
> >       PR target/102294
> >       PR target/119596
> >       PR target/119703
> >       PR target/119704
> >       * gcc.target/i386/auto-init-padding-3.c: Expect XMM stores.
> >       * gcc.target/i386/auto-init-padding-9.c: Expect loop.
> >       * gcc.target/i386/memcpy-strategy-12.c: New test.
> >       * gcc.target/i386/memcpy-strategy-13.c: Likewise.
> >       * gcc.target/i386/memset-strategy-25.c: Likewise.
> >       * gcc.target/i386/memset-strategy-26.c: Likewise.
> >       * gcc.target/i386/memset-strategy-27.c: Likewise.
> >       * gcc.target/i386/memset-strategy-28.c: Likewise.
> >       * gcc.target/i386/memset-strategy-29.c: Likewise.
> >       * gcc.target/i386/memset-strategy-30.c: Likewise.
> >       * gcc.target/i386/memset-strategy-31.c: Likewise.
> >       * gcc.target/i386/mvc17.c: Fail with "rep mov".
> >       * gcc.target/i386/pr111657-1.c: Scan for unrolled loop.  Fail
> >       with "rep mov".
> >       * gcc.target/i386/shrink_wrap_1.c: Also pass
> >       -mmemset-strategy=rep_8byte:-1:align.
> >       * gcc.target/i386/sw-1.c: Also pass -mstringop-strategy=rep_byte.
> >
> > Signed-off-by: H.J. Lu <hjl.to...@gmail.com>
> > ---
> >  gcc/builtins.cc                               |  2 +-
> >  gcc/builtins.h                                |  2 +
> >  gcc/config/i386/i386-expand.cc                | 47 +++++--------------
> >  gcc/config/i386/x86-tune-costs.h              | 35 +++++++++-----
> >  .../gcc.target/i386/auto-init-padding-3.c     |  7 +--
> >  .../gcc.target/i386/auto-init-padding-9.c     | 25 ++++++++--
> >  .../gcc.target/i386/memcpy-strategy-12.c      | 43 +++++++++++++++++
> >  .../gcc.target/i386/memcpy-strategy-13.c      | 11 +++++
> >  .../gcc.target/i386/memset-strategy-25.c      | 29 ++++++++++++
> >  .../gcc.target/i386/memset-strategy-26.c      | 15 ++++++
> >  .../gcc.target/i386/memset-strategy-27.c      | 11 +++++
> >  .../gcc.target/i386/memset-strategy-28.c      | 29 ++++++++++++
> >  .../gcc.target/i386/memset-strategy-29.c      | 34 ++++++++++++++
> >  .../gcc.target/i386/memset-strategy-30.c      | 35 ++++++++++++++
> >  .../gcc.target/i386/memset-strategy-31.c      | 28 +++++++++++
> >  gcc/testsuite/gcc.target/i386/mvc17.c         |  2 +-
> >  gcc/testsuite/gcc.target/i386/pr111657-1.c    | 24 +++++++++-
> >  gcc/testsuite/gcc.target/i386/shrink_wrap_1.c |  2 +-
> >  gcc/testsuite/gcc.target/i386/sw-1.c          |  2 +-
> >  19 files changed, 322 insertions(+), 61 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-25.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-26.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-27.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-28.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-29.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-30.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-31.c
> >
> > diff --git a/gcc/builtins.cc b/gcc/builtins.cc
> > index 3064bff1ae6..e9c9f8eeab3 100644
> > --- a/gcc/builtins.cc
> > +++ b/gcc/builtins.cc
> > @@ -4268,7 +4268,7 @@ builtin_memset_read_str (void *data, void *prev,
> >     4 bytes wide, return the RTL for 0x01010101*data.  If PREV isn't
> >     nullptr, it has the RTL info from the previous iteration.  */
> >
> > -static rtx
> > +rtx
> >  builtin_memset_gen_str (void *data, void *prev,
> >                       HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> >                       fixed_size_mode mode)
> > diff --git a/gcc/builtins.h b/gcc/builtins.h
> > index 5a553a9c836..b552aee3905 100644
> > --- a/gcc/builtins.h
> > +++ b/gcc/builtins.h
> > @@ -160,6 +160,8 @@ extern char target_percent_c[3];
> >  extern char target_percent_s_newline[4];
> >  extern bool target_char_cst_p (tree t, char *p);
> >  extern rtx get_memory_rtx (tree exp, tree len);
> > +extern rtx builtin_memset_gen_str (void *, void *, HOST_WIDE_INT,
> > +                                fixed_size_mode mode);
> >
> >  extern internal_fn associated_internal_fn (combined_fn, tree);
> >  extern internal_fn associated_internal_fn (tree);
> > diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> > index 82e9f035d11..b7d181b7ffc 100644
> > --- a/gcc/config/i386/i386-expand.cc
> > +++ b/gcc/config/i386/i386-expand.cc
> > @@ -8221,19 +8221,11 @@ expand_cpymem_epilogue (rtx destmem, rtx srcmem,
> >    rtx src, dest;
> >    if (CONST_INT_P (count))
> >      {
> > -      HOST_WIDE_INT countval = INTVAL (count);
> > -      HOST_WIDE_INT epilogue_size = countval % max_size;
> > -      int i;
> > -
> > -      /* For now MAX_SIZE should be a power of 2.  This assert could be
> > -      relaxed, but it'll require a bit more complicated epilogue
> > -      expanding.  */
> > -      gcc_assert ((max_size & (max_size - 1)) == 0);
> > -      for (i = max_size; i >= 1; i >>= 1)
> > -     {
> > -       if (epilogue_size & i)
> > -         destmem = emit_memmov (destmem, &srcmem, destptr, srcptr, i);
> > -     }
> > +      unsigned HOST_WIDE_INT countval = UINTVAL (count);
> > +      unsigned HOST_WIDE_INT epilogue_size = countval % max_size;
> > +      unsigned int destalign = MEM_ALIGN (destmem);
> > +      move_by_pieces (destmem, srcmem, epilogue_size, destalign,
> > +                   RETURN_BEGIN);
>
> This change is independent of the rest of the patch; can you commit it
> separately?
> >        return;
> >      }
> >    if (max_size > 8)
> > @@ -8396,31 +8388,18 @@ expand_setmem_epilogue_via_loop (rtx destmem, rtx destptr, rtx value,
> >
> >  /* Output code to set at most count & (max_size - 1) bytes starting by DEST.  */
> >  static void
> > -expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx vec_value,
> > -                     rtx count, int max_size)
> > +expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value,
> > +                     rtx origin_vaule, rtx count, int max_size)
> >  {
> >    rtx dest;
> >
> >    if (CONST_INT_P (count))
> >      {
> > -      HOST_WIDE_INT countval = INTVAL (count);
> > -      HOST_WIDE_INT epilogue_size = countval % max_size;
> > -      int i;
> > -
> > -      /* For now MAX_SIZE should be a power of 2.  This assert could be
> > -      relaxed, but it'll require a bit more complicated epilogue
> > -      expanding.  */
> > -      gcc_assert ((max_size & (max_size - 1)) == 0);
> > -      for (i = max_size; i >= 1; i >>= 1)
> > -     {
> > -       if (epilogue_size & i)
> > -         {
> > -           if (vec_value && i > GET_MODE_SIZE (GET_MODE (value)))
> > -             destmem = emit_memset (destmem, destptr, vec_value, i);
> > -           else
> > -             destmem = emit_memset (destmem, destptr, value, i);
> > -         }
> > -     }
> > +      unsigned HOST_WIDE_INT countval = UINTVAL (count);
> > +      unsigned HOST_WIDE_INT epilogue_size = countval % max_size;
> > +      unsigned int destalign = MEM_ALIGN (destmem);
> > +      store_by_pieces (destmem, epilogue_size, builtin_memset_gen_str,
> > +                    origin_vaule, destalign, true, RETURN_BEGIN);
>
> Likewise here. Note the typo in origin_vaule.
> >        return;
> >      }
> >    if (max_size > 32)
> > @@ -9802,7 +9781,7 @@ ix86_expand_set_or_cpymem (rtx dst, rtx src, rtx count_exp, rtx val_exp,
> >       {
> >         if (issetmem)
> >           expand_setmem_epilogue (dst, destreg, promoted_val,
> > -                                 vec_promoted_val, count_exp,
> > +                                 val_exp, count_exp,
> >                                   epilogue_size_needed);
> >         else
> >           expand_cpymem_epilogue (dst, src, destreg, srcreg, count_exp,
> > diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> > index b08081e37cf..e3d9381594b 100644
> > --- a/gcc/config/i386/x86-tune-costs.h
> > +++ b/gcc/config/i386/x86-tune-costs.h
> > @@ -4065,19 +4065,32 @@ struct processor_costs shijidadao_cost = {
> >
> >
> >
> > -/* Generic should produce code tuned for Core-i7 (and newer chips)
> > -   and btver1 (and newer chips).  */
> > +/* Generic should produce code tuned for Haswell (and newer chips)
> > +   and znver1 (and newer chips):
> > +   1. Don't align memory.
> > +   2. For known sizes, unroll loop with 4 moves or stores per iteration
> > +      without aligning the loop, up to 256 bytes.
> > +   3. For unknown sizes, use memcpy/memset.
> This is not what the table does.  It says that for expected sizes <= 256
> we use the unrolled loop, and for the rest we use memcpy.
>
> For known sizes < 256, STORE_BY_PIECES will prevail and we won't use the
> loop.
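>
> A sketch of an entry of that shape, with illustrative values (not the
> exact ones from the patch):
>
>   static stringop_algs example_memcpy = {
>     libcall,                      /* unknown size: call memcpy  */
>     {{256, unrolled_loop, true},  /* expected size <= 256: unrolled
>                                      loop, without aligning  */
>      {-1, libcall, true}}};       /* anything larger: call memcpy  */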
>
> The patch looks reasonable to me now.  Thanks for your patience with this -
> doing stringops right in all situations is a bit of a complex problem...
>

I am checking it in.

Thanks.

-- 
H.J.
