> Here is the v3 patch.  It no longer uses "rep mov/stos".  Lili, can you
> measure its performance impact on Intel and AMD CPUs?
> 
> The updated generic tuning has:
> 
> Update memcpy and memset inline strategies for -mtune=generic:
> 
> 1. Don't align memory.
This looks OK to me (recent microarchs seem to be good at handling
misaligned accesses in most cases, though we always risk partial memory
stalls then).
> 2. For known sizes, unroll loop with 4 moves or stores per iteration
>    without aligning the loop, up to 256 bytes.

The main reason why the limit was bigger is the situation where we know
the expected size of the copied block from profile feedback, or we have
a small upper bound.  Calling memcpy means spilling all SSE registers to
the stack and increasing integer register pressure, too, which may be
counterproductive, and I do not think this is caught by the benchmarking
done.
I hacked up the following micro-benchmark:

#include <string.h>
#include <stdlib.h>
#include <stdio.h>

int width = 1024, height = 1024;
char *buffer1;
char *buffer2;

/* noipa keeps the constant arguments from being propagated in, so the
   memcpy sizes are only known via profile feedback.  */
__attribute__ ((noipa))
void
copy_triangle (int width, int from, int to, int start,
               float slope1, float slope2)
{
        for (int i = from; i < to; i++)
        {
          /* Each row copies a segment whose length grows linearly with
             i, so memcpy is called with a known size distribution.  */
          memcpy (buffer1 + start + (int)((i-from) * slope1) + i * width,
                  buffer2 + start + (int)((i-from) * slope1) + i * width,
                  (int)((i-from)*(slope2-slope1)));
        }
}

int
main ()
{
        buffer1 = malloc (width * height);
        buffer2 = malloc (width * height);
        for (int i = 0; i < 100000; i++)
                copy_triangle (width, 0, 255, 0, 0, 1);
}

which copies a triangle of the given size from buffer2 to buffer1.  With
profile feedback we know the expected size of the block and use the
table to inline memcpy.  The loop keeps two read-only values (slope1 and
slope2) in xmm registers, which need to be reloaded from memory if a
libcall is used.  With two such values it seems that for triangles of
size 255 it is already a win to use memcpy; for smaller ones it is
better to use the inline sequence (this can be tested by compiling with
-O2 versus -O2 -minline-all-stringops).
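(For completeness, the expected size comes from the usual FDO cycle:
build with -fprofile-generate, run the workload, then rebuild with
-fprofile-use; GCC value-profiles the memcpy length and the expansion
consults that expected size.)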

Of course one can modify the benchmark to use more XMM registers and the
tradeoffs will change, but it is hard to guess register pressure at
expansion time...  The situation is also likely different for the kernel
due to mitigations making a memcpy call expensive.

I wonder what kind of benefits you see from going from 8k to 256 bytes
here?  I also wonder if the inline sequences can be improved, though.
It seems that the out-of-line memcpy for blocks >128 bytes already
benefits from doing vector moves...
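For concreteness, the inline expansion being discussed has roughly the
following shape - an illustrative C sketch of mine, not code from the
patch (the real sequence is emitted as RTL using 16-byte SSE moves):

#include <stddef.h>
#include <string.h>

void
copy_unrolled (char *dst, const char *src, size_t n)
{
  size_t i = 0;

  /* Main loop: four 16-byte moves per iteration, loop not aligned.  */
  for (; i + 64 <= n; i += 64)
    {
      memcpy (dst + i,      src + i,      16);
      memcpy (dst + i + 16, src + i + 16, 16);
      memcpy (dst + i + 32, src + i + 32, 16);
      memcpy (dst + i + 48, src + i + 48, 16);
    }

  /* Epilogue for the remaining n % 64 bytes; the patch expands this
     part via move_by_pieces, which permits overlapping moves.  */
  if (i < n)
    memcpy (dst + i, src + i, n - i);
}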
> 3. For unknown sizes, use memcpy/memset.
> 4. Since each loop iteration has 4 stores, and zeroing with the
>    unrolled loop may need 8 stores, change CLEAR_RATIO to 10 so that
>    zeroing up to 72 bytes is fully unrolled with 9 stores without SSE.

I guess it is OK.  I still think we ought to solve the code bloat due to
the repeated 4-byte $0 immediate, but I hope we can do that
incrementally.
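For the record, the case item 4 targets looks like this (my example,
not taken from the patch):

#include <string.h>

struct s { char buf[72]; };

/* With CLEAR_RATIO of 10, this 72-byte clear can be fully unrolled
   into 9 integer 8-byte stores, each carrying its own 4-byte $0
   immediate (the bloat mentioned above), with no SSE register used.  */
void
clear (struct s *p)
{
  memset (p, 0, sizeof *p);
}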
> 
> Use move_by_pieces and store_by_pieces for memcpy and memset epilogues
> with the fixed epilogue size to enable overlapping moves and stores.
> 
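The overlap trick enabled here looks like the following - my
illustration with a made-up size, not code from the patch:

#include <string.h>

/* A fixed 13-byte epilogue is covered by two 8-byte moves whose ranges
   overlap (bytes 5..7 are written twice) instead of 8 + 4 + 1 moves:  */
void
copy13 (char *dst, const char *src)
{
  memcpy (dst, src, 8);          /* bytes 0..7  */
  memcpy (dst + 5, src + 5, 8);  /* bytes 5..12  */
}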
> gcc/
> 
> PR target/102294
> PR target/119596
> PR target/119703
> PR target/119704
> * builtins.cc (builtin_memset_gen_str): Make it global.
> * builtins.h (builtin_memset_gen_str): New.
> * config/i386/i386-expand.cc (expand_cpymem_epilogue): Use
> move_by_pieces.
> (expand_setmem_epilogue): Use store_by_pieces.
> (ix86_expand_set_or_cpymem): Pass val_exp, instead of
> vec_promoted_val, to expand_setmem_epilogue.
> * config/i386/x86-tune-costs.h (generic_memcpy): Updated.
> (generic_memset): Likewise.
> (generic_cost): Change CLEAR_RATIO to 10.
> 
> gcc/testsuite/
> 
> PR target/102294
> PR target/119596
> PR target/119703
> PR target/119704
> * gcc.target/i386/auto-init-padding-3.c: Expect XMM stores.
> * gcc.target/i386/auto-init-padding-9.c: Expect loop.
> * gcc.target/i386/memcpy-strategy-12.c: New test.
> * gcc.target/i386/memcpy-strategy-13.c: Likewise.
> * gcc.target/i386/memset-strategy-25.c: Likewise.
> * gcc.target/i386/memset-strategy-26.c: Likewise.
> * gcc.target/i386/memset-strategy-27.c: Likewise.
> * gcc.target/i386/memset-strategy-28.c: Likewise.
> * gcc.target/i386/memset-strategy-29.c: Likewise.
> * gcc.target/i386/memset-strategy-30.c: Likewise.
> * gcc.target/i386/memset-strategy-31.c: Likewise.
> * gcc.target/i386/mvc17.c: Fail with "rep mov".
> * gcc.target/i386/pr111657-1.c: Scan for unrolled loop.  Fail
> with "rep mov".
> * gcc.target/i386/shrink_wrap_1.c: Also pass
> -mmemset-strategy=rep_8byte:-1:align.
> * gcc.target/i386/sw-1.c: Also pass -mstringop-strategy=rep_byte.
> 
> 
> -- 
> H.J.

> From bcd7245314d3ba4eb55e9ea2bc0b7d165834f5b6 Mon Sep 17 00:00:00 2001
> From: "H.J. Lu" <hjl.to...@gmail.com>
> Date: Thu, 18 Mar 2021 18:43:10 -0700
> Subject: [PATCH v3] x86: Update memcpy/memset inline strategies for
>  -mtune=generic
> 
> Update memcpy and memset inline strategies for -mtune=generic:
> 
> 1. Don't align memory.
> 2. For known sizes, unroll loop with 4 moves or stores per iteration
>    without aligning the loop, up to 256 bytes.
> 3. For unknown sizes, use memcpy/memset.
> 4. Since each loop iteration has 4 stores, and zeroing with the
>    unrolled loop may need 8 stores, change CLEAR_RATIO to 10 so that
>    zeroing up to 72 bytes is fully unrolled with 9 stores without SSE.
> 
> Use move_by_pieces and store_by_pieces for memcpy and memset epilogues
> with the fixed epilogue size to enable overlapping moves and stores.
> 
> gcc/
> 
>       PR target/102294
>       PR target/119596
>       PR target/119703
>       PR target/119704
>       * builtins.cc (builtin_memset_gen_str): Make it global.
>       * builtins.h (builtin_memset_gen_str): New.
>       * config/i386/i386-expand.cc (expand_cpymem_epilogue): Use
>       move_by_pieces.
>       (expand_setmem_epilogue): Use store_by_pieces.
>       (ix86_expand_set_or_cpymem): Pass val_exp, instead of
>       vec_promoted_val, to expand_setmem_epilogue.
>       * config/i386/x86-tune-costs.h (generic_memcpy): Updated.
>       (generic_memset): Likewise.
>       (generic_cost): Change CLEAR_RATIO to 10.
> 
> gcc/testsuite/
> 
>       PR target/102294
>       PR target/119596
>       PR target/119703
>       PR target/119704
>       * gcc.target/i386/auto-init-padding-3.c: Expect XMM stores.
>       * gcc.target/i386/auto-init-padding-9.c: Expect loop.
>       * gcc.target/i386/memcpy-strategy-12.c: New test.
>       * gcc.target/i386/memcpy-strategy-13.c: Likewise.
>       * gcc.target/i386/memset-strategy-25.c: Likewise.
>       * gcc.target/i386/memset-strategy-26.c: Likewise.
>       * gcc.target/i386/memset-strategy-27.c: Likewise.
>       * gcc.target/i386/memset-strategy-28.c: Likewise.
>       * gcc.target/i386/memset-strategy-29.c: Likewise.
>       * gcc.target/i386/memset-strategy-30.c: Likewise.
>       * gcc.target/i386/memset-strategy-31.c: Likewise.
>       * gcc.target/i386/mvc17.c: Fail with "rep mov".
>       * gcc.target/i386/pr111657-1.c: Scan for unrolled loop.  Fail
>       with "rep mov".
>       * gcc.target/i386/shrink_wrap_1.c: Also pass
>       -mmemset-strategy=rep_8byte:-1:align.
>       * gcc.target/i386/sw-1.c: Also pass -mstringop-strategy=rep_byte.
> 
> Signed-off-by: H.J. Lu <hjl.to...@gmail.com>
> ---
>  gcc/builtins.cc                               |  2 +-
>  gcc/builtins.h                                |  2 +
>  gcc/config/i386/i386-expand.cc                | 47 +++++--------------
>  gcc/config/i386/x86-tune-costs.h              | 35 +++++++++-----
>  .../gcc.target/i386/auto-init-padding-3.c     |  7 +--
>  .../gcc.target/i386/auto-init-padding-9.c     | 25 ++++++++--
>  .../gcc.target/i386/memcpy-strategy-12.c      | 43 +++++++++++++++++
>  .../gcc.target/i386/memcpy-strategy-13.c      | 11 +++++
>  .../gcc.target/i386/memset-strategy-25.c      | 29 ++++++++++++
>  .../gcc.target/i386/memset-strategy-26.c      | 15 ++++++
>  .../gcc.target/i386/memset-strategy-27.c      | 11 +++++
>  .../gcc.target/i386/memset-strategy-28.c      | 29 ++++++++++++
>  .../gcc.target/i386/memset-strategy-29.c      | 34 ++++++++++++++
>  .../gcc.target/i386/memset-strategy-30.c      | 35 ++++++++++++++
>  .../gcc.target/i386/memset-strategy-31.c      | 28 +++++++++++
>  gcc/testsuite/gcc.target/i386/mvc17.c         |  2 +-
>  gcc/testsuite/gcc.target/i386/pr111657-1.c    | 24 +++++++++-
>  gcc/testsuite/gcc.target/i386/shrink_wrap_1.c |  2 +-
>  gcc/testsuite/gcc.target/i386/sw-1.c          |  2 +-
>  19 files changed, 322 insertions(+), 61 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-25.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-26.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-27.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-28.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-29.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-30.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-31.c
> 
> diff --git a/gcc/builtins.cc b/gcc/builtins.cc
> index 3064bff1ae6..e9c9f8eeab3 100644
> --- a/gcc/builtins.cc
> +++ b/gcc/builtins.cc
> @@ -4268,7 +4268,7 @@ builtin_memset_read_str (void *data, void *prev,
>     4 bytes wide, return the RTL for 0x01010101*data.  If PREV isn't
>     nullptr, it has the RTL info from the previous iteration.  */
>  
> -static rtx
> +rtx
>  builtin_memset_gen_str (void *data, void *prev,
>                       HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
>                       fixed_size_mode mode)
> diff --git a/gcc/builtins.h b/gcc/builtins.h
> index 5a553a9c836..b552aee3905 100644
> --- a/gcc/builtins.h
> +++ b/gcc/builtins.h
> @@ -160,6 +160,8 @@ extern char target_percent_c[3];
>  extern char target_percent_s_newline[4];
>  extern bool target_char_cst_p (tree t, char *p);
>  extern rtx get_memory_rtx (tree exp, tree len);
> +extern rtx builtin_memset_gen_str (void *, void *, HOST_WIDE_INT,
> +                                fixed_size_mode mode);
>  
>  extern internal_fn associated_internal_fn (combined_fn, tree);
>  extern internal_fn associated_internal_fn (tree);
> diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> index 82e9f035d11..b7d181b7ffc 100644
> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> @@ -8221,19 +8221,11 @@ expand_cpymem_epilogue (rtx destmem, rtx srcmem,
>    rtx src, dest;
>    if (CONST_INT_P (count))
>      {
> -      HOST_WIDE_INT countval = INTVAL (count);
> -      HOST_WIDE_INT epilogue_size = countval % max_size;
> -      int i;
> -
> -      /* For now MAX_SIZE should be a power of 2.  This assert could be
> -      relaxed, but it'll require a bit more complicated epilogue
> -      expanding.  */
> -      gcc_assert ((max_size & (max_size - 1)) == 0);
> -      for (i = max_size; i >= 1; i >>= 1)
> -     {
> -       if (epilogue_size & i)
> -         destmem = emit_memmov (destmem, &srcmem, destptr, srcptr, i);
> -     }
> +      unsigned HOST_WIDE_INT countval = UINTVAL (count);
> +      unsigned HOST_WIDE_INT epilogue_size = countval % max_size;
> +      unsigned int destalign = MEM_ALIGN (destmem);
> +      move_by_pieces (destmem, srcmem, epilogue_size, destalign,
> +                   RETURN_BEGIN);

This change is independent of the rest of the patch; can you commit it
separately?
>        return;
>      }
>    if (max_size > 8)
> @@ -8396,31 +8388,18 @@ expand_setmem_epilogue_via_loop (rtx destmem, rtx destptr, rtx value,
>  
>  /* Output code to set at most count & (max_size - 1) bytes starting by DEST.  */
>  static void
> -expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx vec_value,
> -                     rtx count, int max_size)
> +expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value,
> +                     rtx origin_vaule, rtx count, int max_size)
>  {
>    rtx dest;
>  
>    if (CONST_INT_P (count))
>      {
> -      HOST_WIDE_INT countval = INTVAL (count);
> -      HOST_WIDE_INT epilogue_size = countval % max_size;
> -      int i;
> -
> -      /* For now MAX_SIZE should be a power of 2.  This assert could be
> -      relaxed, but it'll require a bit more complicated epilogue
> -      expanding.  */
> -      gcc_assert ((max_size & (max_size - 1)) == 0);
> -      for (i = max_size; i >= 1; i >>= 1)
> -     {
> -       if (epilogue_size & i)
> -         {
> -           if (vec_value && i > GET_MODE_SIZE (GET_MODE (value)))
> -             destmem = emit_memset (destmem, destptr, vec_value, i);
> -           else
> -             destmem = emit_memset (destmem, destptr, value, i);
> -         }
> -     }
> +      unsigned HOST_WIDE_INT countval = UINTVAL (count);
> +      unsigned HOST_WIDE_INT epilogue_size = countval % max_size;
> +      unsigned int destalign = MEM_ALIGN (destmem);
> +      store_by_pieces (destmem, epilogue_size, builtin_memset_gen_str,
> +                    origin_vaule, destalign, true, RETURN_BEGIN);

Likewise here. Note the typo in origin_vaule.
>        return;
>      }
>    if (max_size > 32)
> @@ -9802,7 +9781,7 @@ ix86_expand_set_or_cpymem (rtx dst, rtx src, rtx count_exp, rtx val_exp,
>       {
>         if (issetmem)
>           expand_setmem_epilogue (dst, destreg, promoted_val,
> -                                 vec_promoted_val, count_exp,
> +                                 val_exp, count_exp,
>                                   epilogue_size_needed);
>         else
>           expand_cpymem_epilogue (dst, src, destreg, srcreg, count_exp,
> diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> index b08081e37cf..e3d9381594b 100644
> --- a/gcc/config/i386/x86-tune-costs.h
> +++ b/gcc/config/i386/x86-tune-costs.h
> @@ -4065,19 +4065,32 @@ struct processor_costs shijidadao_cost = {
>  
>  
>  
> -/* Generic should produce code tuned for Core-i7 (and newer chips)
> -   and btver1 (and newer chips).  */
> +/* Generic should produce code tuned for Haswell (and newer chips)
> +   and znver1 (and newer chips):
> +   1. Don't align memory.
> +   2. For known sizes, unroll loop with 4 moves or stores per iteration
> +      without aligning the loop, up to 256 bytes.
> +   3. For unknown sizes, use memcpy/memset.
This is not what the table does.  It says that for expected sizes <=256
we use the unrolled loop, and for the rest we use memcpy.

For known sizes <256, STORE_BY_PIECES will prevail and we won't use the
loop.
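i.e. the entry has roughly the following shape (reconstructed by me
from the discussion and the usual x86-tune-costs.h format, not copied
from the patch, so the exact fields may differ):

/* First initializer is 32-bit, second 64-bit; unknown size uses the
   libcall, known/expected sizes up to 256 bytes use the unrolled loop
   with no alignment, everything larger falls back to the libcall.  */
static stringop_algs generic_memcpy[2] = {
  {libcall, {{256, unrolled_loop, true},
             {-1, libcall, true}}},
  {libcall, {{256, unrolled_loop, true},
             {-1, libcall, true}}}};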

The patch looks reasonable to me now.  Thanks for your patience with
this - doing stringops right in all situations is a bit of a complex
problem...

Honza
