On Fri, Apr 23, 2021 at 1:35 AM H.J. Lu via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> For op_by_pieces operations between two areas of memory on non-strict
> alignment target, add -foverlap-op-by-pieces=[off|on|max-memset] to
> generate overlapping operations to minimize number of operations if it
> is not a stack push which must not overlap.
>
> When operating on LENGTH bytes of memory, -foverlap-op-by-pieces=on
> starts with the widest usable integer size, MAX_SIZE, for LENGTH bytes
> and finishes with the smallest usable integer size, MIN_SIZE, for the
> remaining bytes where MAX_SIZE >= MIN_SIZE.  If MIN_SIZE > the remaining
> bytes, the last operation is performed on MIN_SIZE bytes of overlapping
> memory from the previous operation.
>
> For memset with non-zero byte, -foverlap-op-by-pieces=max-memset generates
> an overlapping fill with MAX_SIZE if the number of the remaining bytes is
> greater than one.
>
> Tested on Linux/x86-64 with both -foverlap-op-by-pieces enabled and
> disabled by default.

Neither the user documentation nor the patch description tells me what
"generate overlapping operations" does.  I _suspect_ it's doing an
offset-adjusted read/write of the last piece of a memory region to
avoid doing more than one smaller operation.  Thus for a region
of size 7 and 4-byte granular ops you'd do operations at
offset 0 and 3 rather than one at 0, a two-byte at offset 4 and
a one-byte at offset 6.
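
To make that guess concrete, here is a minimal C sketch of the two
expansions for a 7-byte copy.  It is not taken from the patch; the
function names are made up for illustration, and it assumes a target
where unaligned 4-byte accesses are fine and buffers that do not
overlap each other (memcpy semantics).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Non-overlapping expansion: 4 bytes at offset 0, 2 bytes at offset 4,
   1 byte at offset 6.  Three load/store pairs.  */
void copy7_by_pieces(unsigned char *dst, const unsigned char *src)
{
    uint32_t w;
    uint16_t h;

    assert(dst != NULL && src != NULL);
    memcpy(&w, src, 4);
    memcpy(dst, &w, 4);
    memcpy(&h, src + 4, 2);
    memcpy(dst + 4, &h, 2);
    dst[6] = src[6];
}

/* Overlapping expansion: 4 bytes at offset 0 and 4 bytes at offset 3.
   Byte 3 is copied twice with the same value.  Two load/store pairs.  */
void copy7_overlapping(unsigned char *dst, const unsigned char *src)
{
    uint32_t w;

    assert(dst != NULL && src != NULL);
    memcpy(&w, src, 4);
    memcpy(dst, &w, 4);
    memcpy(&w, src + 3, 4);
    memcpy(dst + 3, &w, 4);
}
```

Both functions touch only bytes 0..6 of each buffer; the second one
simply trades the two small tail operations for one full-width access
that starts 3 bytes into the region.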

When the tail is of power-of-two size you still generate non-overlapping
ops?
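
As far as I can tell from the quoted description and the
scan-assembler patterns in the new tests, the "on" mode behaves
roughly like the planner sketched below.  This is only a model under
stated assumptions: piece sizes are powers of two, every power of two
up to max_size is usable, and alignment and mode availability are
ignored.  The names plan_overlapping and struct piece are hypothetical
and do not appear in the patch.

```c
#include <assert.h>
#include <stddef.h>

struct piece {
    size_t offset;  /* byte offset of the access within the region */
    size_t size;    /* width of the access in bytes */
};

/* Largest power of two that is <= n, for n >= 1.  */
static size_t floor_pow2(size_t n)
{
    size_t p = 1;
    while (p <= n / 2)
        p *= 2;
    return p;
}

/* Smallest power of two that is >= n, for n >= 1.  */
static size_t ceil_pow2(size_t n)
{
    size_t p = 1;
    while (p < n)
        p *= 2;
    return p;
}

/* Fill OUT with the accesses for LEN bytes and return how many were
   written.  MAX_SIZE must be a power of two.  */
size_t plan_overlapping(size_t len, size_t max_size,
                        struct piece *out, size_t cap)
{
    size_t n = 0, offset = 0, size, gap;

    assert(max_size >= 1 && (max_size & (max_size - 1)) == 0);
    if (len == 0)
        return 0;

    /* Start with the widest size that still fits in LEN.  */
    size = floor_pow2(len < max_size ? len : max_size);
    for (;;) {
        while (len >= size) {
            assert(n < cap);
            out[n].offset = offset;
            out[n].size = size;
            n++;
            offset += size;
            len -= size;
        }
        if (len == 0)
            return n;

        /* Finish with the smallest size covering the tail; if it is
           wider than the tail, back up into already-handled bytes.  */
        size = ceil_pow2(len);
        gap = size - len;
        offset -= gap;
        len += gap;
    }
}
```

Under this model a length of 7 with max_size 4 gives pieces at
(offset 0, size 4) and (offset 3, size 4), while a length of 6 with
max_size 8 gives (0, 4) and (4, 2), i.e. a power-of-two tail is left
non-overlapping.  A length of 21 with max_size 16 gives (0, 16) and
(13, 8).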

For memmove there's a correctness issue, so you have to make sure
to first do the loads for the last two ops before performing their
stores, which increases register pressure.
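
A small sketch of the ordering concern, assuming a 7-byte move done as
two overlapping 4-byte pieces.  The names move7_interleaved and
move7_loads_first are hypothetical and this is plain C, not GCC
internals; it only illustrates why the loads have to come first when
source and destination may overlap.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Load/store, load/store.  Fine for memcpy, but if DST and SRC
   overlap, the first store can clobber bytes that the second load
   still needs.  */
void move7_interleaved(unsigned char *dst, const unsigned char *src)
{
    uint32_t w;

    assert(dst != NULL && src != NULL);
    memcpy(&w, src, 4);
    memcpy(dst, &w, 4);
    memcpy(&w, src + 3, 4);
    memcpy(dst + 3, &w, 4);
}

/* Both loads first, then both stores.  Correct for overlapping
   regions, at the cost of keeping two values live at once.  */
void move7_loads_first(unsigned char *dst, const unsigned char *src)
{
    uint32_t head, tail;

    assert(dst != NULL && src != NULL);
    memcpy(&head, src, 4);
    memcpy(&tail, src + 3, 4);
    memcpy(dst, &head, 4);
    memcpy(dst + 3, &tail, 4);
}
```

Moving 7 bytes forward by one inside the same 8-byte buffer is enough
to make the interleaved version produce a different result from
memmove, while the loads-first version matches it.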

I'm not sure we want a -f option to control this - not all targets will
be able to support this.  So I'd use a target hook or rather extend
the existing use_by_pieces_infrastructure_p hook with an alternate
return (some flags bitmask I guess).  We already have one extra
target hook, compare_by_pieces_branch_ratio, so by that precedent
using a separate hook might also be OK.

Adding a -m option in targets that want this user-controllable would
be OK of course.

Richard.

> gcc/
>
>         PR middle-end/90773
>         * common.opt (-foverlap-op-by-pieces): New.
>         * expr.c (by_pieces_ninsns): If -foverlap-op-by-pieces is enabled,
>         round up size and alignment to the widest integer mode for maximum
>         size.
>         (op_by_pieces_d): Add get_usable_mode, m_push and
>         m_non_zero_memset.
>         (op_by_pieces_d::op_by_pieces_d): Add 2 bool arguments to
>         initialize m_push and m_non_zero_memset.
>         (op_by_pieces_d::get_usable_mode): New.
>         (op_by_pieces_d::run): Use get_usable_mode to get the largest
>         usable integer mode and generate overlapping operations for
>         -foverlap-op-by-pieces.
>         (PUSHG_P): New.
>         (move_by_pieces_d::move_by_pieces_d): Updated for op_by_pieces_d
>         change.
>         (store_by_pieces_d::store_by_pieces_d): Likewise.
>         (clear_by_pieces): Likewise.
>         * toplev.c (process_options): Issue an error when
>         -foverlap-op-by-pieces is used for strict alignment target.
>         * doc/invoke.texi: Document -foverlap-op-by-pieces.
>
> gcc/testsuite/
>
>         PR middle-end/90773
>         * g++.dg/pr90773-1.h: New test.
>         * g++.dg/pr90773-1a.C: Likewise.
>         * g++.dg/pr90773-1b.C: Likewise.
>         * g++.dg/pr90773-1c.C: Likewise.
>         * g++.dg/pr90773-1d.C: Likewise.
>         * gcc.target/i386/pr90773-1.c: Likewise.
>         * gcc.target/i386/pr90773-2.c: Likewise.
>         * gcc.target/i386/pr90773-3.c: Likewise.
>         * gcc.target/i386/pr90773-4.c: Likewise.
>         * gcc.target/i386/pr90773-5.c: Likewise.
>         * gcc.target/i386/pr90773-6.c: Likewise.
>         * gcc.target/i386/pr90773-7.c: Likewise.
>         * gcc.target/i386/pr90773-8.c: Likewise.
>         * gcc.target/i386/pr90773-9.c: Likewise.
>         * gcc.target/i386/pr90773-10.c: Likewise.
>         * gcc.target/i386/pr90773-11.c: Likewise.
> ---
>  gcc/common.opt                             |  19 +++
>  gcc/doc/invoke.texi                        |  14 ++
>  gcc/expr.c                                 | 159 ++++++++++++++++-----
>  gcc/testsuite/g++.dg/pr90773-1.h           |  14 ++
>  gcc/testsuite/g++.dg/pr90773-1a.C          |  13 ++
>  gcc/testsuite/g++.dg/pr90773-1b.C          |   5 +
>  gcc/testsuite/g++.dg/pr90773-1c.C          |   5 +
>  gcc/testsuite/g++.dg/pr90773-1d.C          |  19 +++
>  gcc/testsuite/gcc.target/i386/pr90773-1.c  |  17 +++
>  gcc/testsuite/gcc.target/i386/pr90773-10.c |  13 ++
>  gcc/testsuite/gcc.target/i386/pr90773-11.c |  13 ++
>  gcc/testsuite/gcc.target/i386/pr90773-2.c  |  20 +++
>  gcc/testsuite/gcc.target/i386/pr90773-3.c  |  23 +++
>  gcc/testsuite/gcc.target/i386/pr90773-4.c  |  13 ++
>  gcc/testsuite/gcc.target/i386/pr90773-5.c  |  13 ++
>  gcc/testsuite/gcc.target/i386/pr90773-6.c  |  11 ++
>  gcc/testsuite/gcc.target/i386/pr90773-7.c  |  11 ++
>  gcc/testsuite/gcc.target/i386/pr90773-8.c  |  13 ++
>  gcc/testsuite/gcc.target/i386/pr90773-9.c  |  13 ++
>  gcc/toplev.c                               |   8 ++
>  20 files changed, 383 insertions(+), 33 deletions(-)
>  create mode 100644 gcc/testsuite/g++.dg/pr90773-1.h
>  create mode 100644 gcc/testsuite/g++.dg/pr90773-1a.C
>  create mode 100644 gcc/testsuite/g++.dg/pr90773-1b.C
>  create mode 100644 gcc/testsuite/g++.dg/pr90773-1c.C
>  create mode 100644 gcc/testsuite/g++.dg/pr90773-1d.C
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-10.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-11.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-3.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-4.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-5.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-6.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-7.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-8.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-9.c
>
> diff --git a/gcc/common.opt b/gcc/common.opt
> index a75b44ee47e..7f5b38c7810 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -2123,6 +2123,25 @@ foptimize-sibling-calls
>  Common Var(flag_optimize_sibling_calls) Optimization
>  Optimize sibling and tail recursive calls.
>
> +foverlap-op-by-pieces
> +Common RejectNegative Alias(foverlap-op-by-pieces=,on)
> +
> +foverlap-op-by-pieces=
> +Common Joined RejectNegative Enum(overlap_op_by_pieces) Var(flag_overlap_op_by_pieces) Init(0)
> +-foverlap-op-by-pieces=[off|on|max-memset]      Generate overlapping operations between two areas of memory.
> +
> +Enum
> +Name(overlap_op_by_pieces) Type(int)
> +
> +EnumValue
> +Enum(overlap_op_by_pieces) String(off) Value(0)
> +
> +EnumValue
> +Enum(overlap_op_by_pieces) String(on) Value(1)
> +
> +EnumValue
> +Enum(overlap_op_by_pieces) String(max-memset) Value(2)
> +
>  fpartial-inlining
>  Common Var(flag_partial_inlining) Optimization
>  Perform partial inlining.
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index e98b0962b9f..dbdd1095216 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -530,6 +530,7 @@ Objective-C and Objective-C++ Dialects}.
>  -fno-sched-spec  -fno-signed-zeros @gol
>  -fno-toplevel-reorder  -fno-trapping-math  -fno-zero-initialized-in-bss @gol
>  -fomit-frame-pointer  -foptimize-sibling-calls @gol
> +-foverlap-op-by-pieces=@r{[}off@r{|}on@r{|}max-memset@r{]} @gol
>  -fpartial-inlining  -fpeel-loops  -fpredictive-commoning @gol
>  -fprefetch-loop-arrays @gol
>  -fprofile-correction @gol
> @@ -10360,6 +10361,19 @@ their @code{_FORTIFY_SOURCE} counterparts into 
> faster alternatives.
>
>  Enabled at levels @option{-O2}, @option{-O3}.
>
> +@item -foverlap-op-by-pieces=@r{[}off@r{|}on@r{|}max-memset@r{]}
> +@opindex -foverlap-op-by-pieces
> +The value @code{on} tells the compiler to generate overlapping
> +operations between two areas of memory by using the largest integer
> +operation to minimize number of operations if it is not a stack push.
> +The value @code{max-memset} tells the compiler to generate an
> +overlapping fill with non-zero byte in the maximum single fill size
> +if the last fill size is greater than one.  The value @code{off}
> +turns off this optimization.
> +
> +This option is only valid for targets which do not require strict
> +alignment.
> +
>  @item -fno-inline
>  @opindex fno-inline
>  @opindex finline
> diff --git a/gcc/expr.c b/gcc/expr.c
> index a0e19465965..375a5497309 100644
> --- a/gcc/expr.c
> +++ b/gcc/expr.c
> @@ -815,12 +815,27 @@ by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned 
> int align,
>                   unsigned int max_size, by_pieces_operation op)
>  {
>    unsigned HOST_WIDE_INT n_insns = 0;
> +  scalar_int_mode mode;
> +
> +  if (flag_overlap_op_by_pieces && op != COMPARE_BY_PIECES)
> +    {
> +      /* NB: Round up L and ALIGN to the widest integer mode for
> +        MAX_SIZE.  */
> +      mode = widest_int_mode_for_size (max_size);
> +      if (optab_handler (mov_optab, mode) != CODE_FOR_nothing)
> +       {
> +         unsigned HOST_WIDE_INT up = ROUND_UP (l, GET_MODE_SIZE (mode));
> +         if (up > l)
> +           l = up;
> +         align = GET_MODE_ALIGNMENT (mode);
> +       }
> +    }
>
>    align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
>
>    while (max_size > 1 && l > 0)
>      {
> -      scalar_int_mode mode = widest_int_mode_for_size (max_size);
> +      mode = widest_int_mode_for_size (max_size);
>        enum insn_code icode;
>
>        unsigned int modesize = GET_MODE_SIZE (mode);
> @@ -1041,6 +1056,9 @@ pieces_addr::maybe_postinc (HOST_WIDE_INT size)
>
>  class op_by_pieces_d
>  {
> + private:
> +  scalar_int_mode get_usable_mode (scalar_int_mode mode, unsigned int);
> +
>   protected:
>    pieces_addr m_to, m_from;
>    unsigned HOST_WIDE_INT m_len;
> @@ -1048,6 +1066,10 @@ class op_by_pieces_d
>    unsigned int m_align;
>    unsigned int m_max_size;
>    bool m_reverse;
> +  /* True if this is a stack push.  */
> +  bool m_push;
> +  /* True if this is a memset with a non-zero byte.  */
> +  bool m_non_zero_memset;
>
>    /* Virtual functions, overriden by derived classes for the specific
>       operation.  */
> @@ -1059,7 +1081,7 @@ class op_by_pieces_d
>
>   public:
>    op_by_pieces_d (rtx, bool, rtx, bool, by_pieces_constfn, void *,
> -                 unsigned HOST_WIDE_INT, unsigned int);
> +                 unsigned HOST_WIDE_INT, unsigned int, bool, bool);
>    void run ();
>  };
>
> @@ -1074,10 +1096,12 @@ op_by_pieces_d::op_by_pieces_d (rtx to, bool to_load,
>                                 by_pieces_constfn from_cfn,
>                                 void *from_cfn_data,
>                                 unsigned HOST_WIDE_INT len,
> -                               unsigned int align)
> +                               unsigned int align, bool push,
> +                               bool non_zero_memset)
>    : m_to (to, to_load, NULL, NULL),
>      m_from (from, from_load, from_cfn, from_cfn_data),
> -    m_len (len), m_max_size (MOVE_MAX_PIECES + 1)
> +    m_len (len), m_max_size (MOVE_MAX_PIECES + 1),
> +    m_push (push), m_non_zero_memset (non_zero_memset)
>  {
>    int toi = m_to.get_addr_inc ();
>    int fromi = m_from.get_addr_inc ();
> @@ -1108,6 +1132,25 @@ op_by_pieces_d::op_by_pieces_d (rtx to, bool to_load,
>    m_align = align;
>  }
>
> +/* This function returns the largest usable integer mode for LEN bytes
> +   whose size is no bigger than size of MODE.  */
> +
> +scalar_int_mode
> +op_by_pieces_d::get_usable_mode (scalar_int_mode mode, unsigned int len)
> +{
> +  unsigned int size;
> +  do
> +    {
> +      size = GET_MODE_SIZE (mode);
> +      if (len >= size && prepare_mode (mode, m_align))
> +       break;
> +      /* NB: widest_int_mode_for_size checks SIZE > 1.  */
> +      mode = widest_int_mode_for_size (size);
> +    }
> +  while (1);
> +  return mode;
> +}
> +
>  /* This function contains the main loop used for expanding a block
>     operation.  First move what we can in the largest integer mode,
>     then go to successively smaller modes.  For every access, call
> @@ -1116,42 +1159,80 @@ op_by_pieces_d::op_by_pieces_d (rtx to, bool to_load,
>  void
>  op_by_pieces_d::run ()
>  {
> -  while (m_max_size > 1 && m_len > 0)
> +  if (m_len == 0)
> +    return;
> +
> +  /* NB: widest_int_mode_for_size checks M_MAX_SIZE > 1.  */
> +  scalar_int_mode mode = widest_int_mode_for_size (m_max_size);
> +  mode = get_usable_mode (mode, m_len);
> +
> +  do
>      {
> -      scalar_int_mode mode = widest_int_mode_for_size (m_max_size);
> +      unsigned int size = GET_MODE_SIZE (mode);
> +      rtx to1 = NULL_RTX, from1;
>
> -      if (prepare_mode (mode, m_align))
> +      while (m_len >= size)
>         {
> -         unsigned int size = GET_MODE_SIZE (mode);
> -         rtx to1 = NULL_RTX, from1;
> +         if (m_reverse)
> +           m_offset -= size;
>
> -         while (m_len >= size)
> -           {
> -             if (m_reverse)
> -               m_offset -= size;
> +         to1 = m_to.adjust (mode, m_offset);
> +         from1 = m_from.adjust (mode, m_offset);
>
> -             to1 = m_to.adjust (mode, m_offset);
> -             from1 = m_from.adjust (mode, m_offset);
> +         m_to.maybe_predec (-(HOST_WIDE_INT)size);
> +         m_from.maybe_predec (-(HOST_WIDE_INT)size);
>
> -             m_to.maybe_predec (-(HOST_WIDE_INT)size);
> -             m_from.maybe_predec (-(HOST_WIDE_INT)size);
> +         generate (to1, from1, mode);
>
> -             generate (to1, from1, mode);
> +         m_to.maybe_postinc (size);
> +         m_from.maybe_postinc (size);
>
> -             m_to.maybe_postinc (size);
> -             m_from.maybe_postinc (size);
> +         if (!m_reverse)
> +           m_offset += size;
>
> -             if (!m_reverse)
> -               m_offset += size;
> +         m_len -= size;
> +       }
>
> -             m_len -= size;
> -           }
> +      finish_mode (mode);
>
> -         finish_mode (mode);
> -       }
> +      if (m_len == 0)
> +       return;
>
> -      m_max_size = GET_MODE_SIZE (mode);
> +      if (!m_push && flag_overlap_op_by_pieces)
> +       {
> +         /* NB: Generate overlapping operations if it is not a stack
> +            push since stack push must not overlap.  */
> +         if (m_len == 1
> +             || !m_non_zero_memset
> +             || flag_overlap_op_by_pieces < 2)
> +           {
> +             /* If the remaining length is 1, this is not memset with
> +                non-zero byte or max-memset isn't enabled, get the
> +                smallest integer mode for M_LEN bytes.  */
> +             mode = smallest_int_mode_for_size (m_len * BITS_PER_UNIT);
> +             mode = get_usable_mode (mode, GET_MODE_SIZE (mode));
> +           }
> +         int gap = GET_MODE_SIZE (mode) - m_len;
> +         if (gap > 0)
> +           {
> +             /* If size of MODE > M_LEN, generate the last operation
> +                in MODE for the remaining bytes with overlapping memory
> +                from the previous operation.  */
> +             if (m_reverse)
> +               m_offset += gap;
> +             else
> +               m_offset -= gap;
> +             m_len += gap;
> +           }
> +       }
> +      else
> +       {
> +         /* NB: widest_int_mode_for_size checks SIZE > 1.  */
> +         mode = widest_int_mode_for_size (size);
> +         mode = get_usable_mode (mode, m_len);
> +       }
>      }
> +  while (1);
>
>    /* The code above should have handled everything.  */
>    gcc_assert (!m_len);
> @@ -1160,6 +1241,12 @@ op_by_pieces_d::run ()
>  /* Derived class from op_by_pieces_d, providing support for block move
>     operations.  */
>
> +#ifdef PUSH_ROUNDING
> +#define PUSHG_P(to)  ((to) == nullptr)
> +#else
> +#define PUSHG_P(to)  false
> +#endif
> +
>  class move_by_pieces_d : public op_by_pieces_d
>  {
>    insn_gen_fn m_gen_fun;
> @@ -1169,7 +1256,8 @@ class move_by_pieces_d : public op_by_pieces_d
>   public:
>    move_by_pieces_d (rtx to, rtx from, unsigned HOST_WIDE_INT len,
>                     unsigned int align)
> -    : op_by_pieces_d (to, false, from, true, NULL, NULL, len, align)
> +    : op_by_pieces_d (to, false, from, true, NULL, NULL, len, align,
> +                     PUSHG_P (to), false)
>    {
>    }
>    rtx finish_retmode (memop_ret);
> @@ -1263,8 +1351,10 @@ class store_by_pieces_d : public op_by_pieces_d
>
>   public:
>    store_by_pieces_d (rtx to, by_pieces_constfn cfn, void *cfn_data,
> -                    unsigned HOST_WIDE_INT len, unsigned int align)
> -    : op_by_pieces_d (to, false, NULL_RTX, true, cfn, cfn_data, len, align)
> +                    unsigned HOST_WIDE_INT len, unsigned int align,
> +                    bool non_zero_memset)
> +    : op_by_pieces_d (to, false, NULL_RTX, true, cfn, cfn_data, len,
> +                     align, false, non_zero_memset)
>    {
>    }
>    rtx finish_retmode (memop_ret);
> @@ -1411,7 +1501,8 @@ store_by_pieces (rtx to, unsigned HOST_WIDE_INT len,
>                  memsetp ? SET_BY_PIECES : STORE_BY_PIECES,
>                  optimize_insn_for_speed_p ()));
>
> -  store_by_pieces_d data (to, constfun, constfundata, len, align);
> +  store_by_pieces_d data (to, constfun, constfundata, len, align,
> +                         memsetp);
>    data.run ();
>
>    if (retmode != RETURN_BEGIN)
> @@ -1438,7 +1529,8 @@ clear_by_pieces (rtx to, unsigned HOST_WIDE_INT len, 
> unsigned int align)
>    if (len == 0)
>      return;
>
> -  store_by_pieces_d data (to, clear_by_pieces_1, NULL, len, align);
> +  store_by_pieces_d data (to, clear_by_pieces_1, NULL, len, align,
> +                         false);
>    data.run ();
>  }
>
> @@ -1460,7 +1552,8 @@ class compare_by_pieces_d : public op_by_pieces_d
>    compare_by_pieces_d (rtx op0, rtx op1, by_pieces_constfn op1_cfn,
>                        void *op1_cfn_data, HOST_WIDE_INT len, int align,
>                        rtx_code_label *fail_label)
> -    : op_by_pieces_d (op0, true, op1, true, op1_cfn, op1_cfn_data, len, align)
> +    : op_by_pieces_d (op0, true, op1, true, op1_cfn, op1_cfn_data, len,
> +                     align, false, false)
>    {
>      m_fail_label = fail_label;
>    }
> diff --git a/gcc/testsuite/g++.dg/pr90773-1.h 
> b/gcc/testsuite/g++.dg/pr90773-1.h
> new file mode 100644
> index 00000000000..abdb78b078b
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/pr90773-1.h
> @@ -0,0 +1,14 @@
> +class fixed_wide_int_storage {
> +public:
> +  long val[10];
> +  int len;
> +  fixed_wide_int_storage ()
> +    {
> +      len = sizeof (val) / sizeof (val[0]);
> +      for (int i = 0; i < len; i++)
> +       val[i] = i;
> +    }
> +};
> +
> +extern void foo (fixed_wide_int_storage);
> +extern int record_increment(void);
> diff --git a/gcc/testsuite/g++.dg/pr90773-1a.C 
> b/gcc/testsuite/g++.dg/pr90773-1a.C
> new file mode 100644
> index 00000000000..3ab8d929f74
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/pr90773-1a.C
> @@ -0,0 +1,13 @@
> +// { dg-do compile }
> +// { dg-options "-O2" }
> +// { dg-additional-options "-mno-avx -msse2 -mtune=skylake" { target { 
> i?86-*-* x86_64-*-* } } }
> +
> +#include "pr90773-1.h"
> +
> +int
> +record_increment(void)
> +{
> +  fixed_wide_int_storage x;
> +  foo (x);
> +  return 0;
> +}
> diff --git a/gcc/testsuite/g++.dg/pr90773-1b.C 
> b/gcc/testsuite/g++.dg/pr90773-1b.C
> new file mode 100644
> index 00000000000..9713b2dd612
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/pr90773-1b.C
> @@ -0,0 +1,5 @@
> +// { dg-do compile }
> +// { dg-options "-O2" }
> +// { dg-additional-options "-mno-avx512f -march=skylake" { target { i?86-*-* 
> x86_64-*-* } } }
> +
> +#include "pr90773-1a.C"
> diff --git a/gcc/testsuite/g++.dg/pr90773-1c.C 
> b/gcc/testsuite/g++.dg/pr90773-1c.C
> new file mode 100644
> index 00000000000..699357a88dc
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/pr90773-1c.C
> @@ -0,0 +1,5 @@
> +// { dg-do compile }
> +// { dg-options "-O2" }
> +// { dg-additional-options "-march=skylake-avx512" { target { i?86-*-* 
> x86_64-*-* } } }
> +
> +#include "pr90773-1a.C"
> diff --git a/gcc/testsuite/g++.dg/pr90773-1d.C 
> b/gcc/testsuite/g++.dg/pr90773-1d.C
> new file mode 100644
> index 00000000000..bf9d8543c1b
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/pr90773-1d.C
> @@ -0,0 +1,19 @@
> +// { dg-do run }
> +// { dg-options "-O2" }
> +// { dg-additional-options "-march=native" { target { i?86-*-* x86_64-*-* } 
> } }
> +// { dg-additional-sources "pr90773-1a.C" }
> +
> +#include "pr90773-1.h"
> +
> +void
> +foo (fixed_wide_int_storage x)
> +{
> +  for (int i = 0; i < x.len; i++)
> +    if (x.val[i] != i)
> +      __builtin_abort ();
> +}
> +
> +int main ()
> +{
> +  return record_increment ();
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/pr90773-1.c 
> b/gcc/testsuite/gcc.target/i386/pr90773-1.c
> new file mode 100644
> index 00000000000..86fec27dad0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr90773-1.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mtune=generic -foverlap-op-by-pieces" } */
> +
> +extern char *dst, *src;
> +
> +void
> +foo (void)
> +{
> +  __builtin_memcpy (dst, src, 15);
> +}
> +
> +/* { dg-final { scan-assembler-times "movq\[\\t \]+\\(%\[\^,\]+\\)," 1 { 
> target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-times "movq\[\\t \]+7\\(%\[\^,\]+\\)," 1 { 
> target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+\\(%\[\^,\]+\\)," 1 { 
> target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+4\\(%\[\^,\]+\\)," 1 { 
> target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+8\\(%\[\^,\]+\\)," 1 { 
> target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+11\\(%\[\^,\]+\\)," 1 { 
> target ia32 } } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr90773-10.c 
> b/gcc/testsuite/gcc.target/i386/pr90773-10.c
> new file mode 100644
> index 00000000000..5985877cc10
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr90773-10.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mtune=generic -foverlap-op-by-pieces=max-memset" } */
> +
> +extern char *dst;
> +
> +void
> +foo (int c)
> +{
> +  __builtin_memset (dst, c, 5);
> +}
> +
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+.+, \\(%\[\^,\]+\\)" 1 } 
> } */
> +/* { dg-final { scan-assembler-times "movb\[\\t \]+.+, 4\\(%\[\^,\]+\\)" 1 } 
> } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr90773-11.c 
> b/gcc/testsuite/gcc.target/i386/pr90773-11.c
> new file mode 100644
> index 00000000000..9bf57aa3a44
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr90773-11.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mtune=generic -foverlap-op-by-pieces=max-memset" } */
> +
> +extern char *dst;
> +
> +void
> +foo (int c)
> +{
> +  __builtin_memset (dst, c, 6);
> +}
> +
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+.+, \\(%\[\^,\]+\\)" 1 } 
> } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+.+, 2\\(%\[\^,\]+\\)" 1 } 
> } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr90773-2.c 
> b/gcc/testsuite/gcc.target/i386/pr90773-2.c
> new file mode 100644
> index 00000000000..ebdf9dac6e8
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr90773-2.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mtune=generic -foverlap-op-by-pieces" } */
> +/* { dg-additional-options "-mno-avx -msse2" { target { ! ia32 } } } */
> +/* { dg-additional-options "-mno-sse" { target ia32 } } */
> +
> +extern char *dst, *src;
> +
> +void
> +foo (void)
> +{
> +  __builtin_memcpy (dst, src, 19);
> +}
> +
> +/* { dg-final { scan-assembler-times "movdqu\[\\t \]+\\(%\[\^,\]+\\)," 1 { 
> target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+15\\(%\[\^,\]+\\)," 1 { 
> target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+\\(%\[\^,\]+\\)," 1 { 
> target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+4\\(%\[\^,\]+\\)," 1 { 
> target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+8\\(%\[\^,\]+\\)," 1 { 
> target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+12\\(%\[\^,\]+\\)," 1 { 
> target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+15\\(%\[\^,\]+\\)," 1 { 
> target ia32 } } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr90773-3.c 
> b/gcc/testsuite/gcc.target/i386/pr90773-3.c
> new file mode 100644
> index 00000000000..d876f878f60
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr90773-3.c
> @@ -0,0 +1,23 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mtune=generic -foverlap-op-by-pieces" } */
> +/* { dg-additional-options "-mno-avx -msse2" { target { ! ia32 } } } */
> +/* { dg-additional-options "-mno-sse" { target ia32 } } */
> +
> +extern char *dst, *src;
> +
> +void
> +foo (void)
> +{
> +  __builtin_memcpy (dst, src, 31);
> +}
> +
> +/* { dg-final { scan-assembler-times "movdqu\[\\t \]+\\(%\[\^,\]+\\)," 1 { 
> target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-times "movdqu\[\\t \]+15\\(%\[\^,\]+\\)," 1 { 
> target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+\\(%\[\^,\]+\\)," 1 { 
> target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+4\\(%\[\^,\]+\\)," 1 { 
> target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+8\\(%\[\^,\]+\\)," 1 { 
> target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+12\\(%\[\^,\]+\\)," 1 { 
> target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+16\\(%\[\^,\]+\\)," 1 { 
> target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+20\\(%\[\^,\]+\\)," 1 { 
> target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+24\\(%\[\^,\]+\\)," 1 { 
> target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+27\\(%\[\^,\]+\\)," 1 { 
> target ia32 } } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr90773-4.c 
> b/gcc/testsuite/gcc.target/i386/pr90773-4.c
> new file mode 100644
> index 00000000000..0df1b2fc247
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr90773-4.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic -foverlap-op-by-pieces" 
> } */
> +
> +extern char *dst;
> +
> +void
> +foo (void)
> +{
> +  __builtin_memset (dst, 0, 31);
> +}
> +
> +/* { dg-final { scan-assembler-times "movups\[\\t \]+%xmm\[0-9\]+, 
> \\(%\[\^,\]+\\)" 1 } } */
> +/* { dg-final { scan-assembler-times "movups\[\\t \]+%xmm\[0-9\]+, 
> 15\\(%\[\^,\]+\\)" 1 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr90773-5.c 
> b/gcc/testsuite/gcc.target/i386/pr90773-5.c
> new file mode 100644
> index 00000000000..65c9fe88696
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr90773-5.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic -foverlap-op-by-pieces" 
> } */
> +
> +extern char *dst;
> +
> +void
> +foo (void)
> +{
> +  __builtin_memset (dst, 0, 21);
> +}
> +
> +/* { dg-final { scan-assembler-times "movups\[\\t \]+%xmm\[0-9\]+, 
> \\(%\[\^,\]+\\)" 1 } } */
> +/* { dg-final { scan-assembler-times "movq\[\\t \]+\\\$0+, 
> 13\\(%\[\^,\]+\\)" 1 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr90773-6.c 
> b/gcc/testsuite/gcc.target/i386/pr90773-6.c
> new file mode 100644
> index 00000000000..0c84d492974
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr90773-6.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic -foverlap-op-by-pieces" 
> } */
> +
> +void
> +foo (char *dst, char *src)
> +{
> +  __builtin_memcpy (dst, src, 255);
> +}
> +
> +/* { dg-final { scan-assembler-times "movdqu\[\\t 
> \]+\[0-9\]*\\(%\[\^,\]+\\)," 16 } } */
> +/* { dg-final { scan-assembler-not "mov\[bwlq\]" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr90773-7.c 
> b/gcc/testsuite/gcc.target/i386/pr90773-7.c
> new file mode 100644
> index 00000000000..732b4d3d992
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr90773-7.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-O2 -mno-avx -msse2 -mtune=skylake -foverlap-op-by-pieces" 
> } */
> +
> +void
> +foo (char *dst)
> +{
> +  __builtin_memset (dst, 0, 255);
> +}
> +
> +/* { dg-final { scan-assembler-times "movups\[\\t \]+%xmm\[0-9\]+, 
> \[0-9\]*\\(%\[\^,\]+\\)" 16 } } */
> +/* { dg-final { scan-assembler-not "mov\[bwlq\]" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr90773-8.c 
> b/gcc/testsuite/gcc.target/i386/pr90773-8.c
> new file mode 100644
> index 00000000000..7ff5ba12daf
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr90773-8.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mtune=generic -foverlap-op-by-pieces=max-memset" } */
> +
> +extern char *dst;
> +
> +void
> +foo (void)
> +{
> +  __builtin_memset (dst, 0, 5);
> +}
> +
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+.+, \\(%\[\^,\]+\\)" 1 } 
> } */
> +/* { dg-final { scan-assembler-times "movb\[\\t \]+.+, 4\\(%\[\^,\]+\\)" 1 } 
> } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr90773-9.c 
> b/gcc/testsuite/gcc.target/i386/pr90773-9.c
> new file mode 100644
> index 00000000000..c2fc3ba59a7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr90773-9.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mtune=generic -foverlap-op-by-pieces=max-memset" } */
> +
> +extern char *dst;
> +
> +void
> +foo (void)
> +{
> +  __builtin_memset (dst, 0, 6);
> +}
> +
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+.+, \\(%\[\^,\]+\\)" 1 } 
> } */
> +/* { dg-final { scan-assembler-times "movw\[\\t \]+.+, 4\\(%\[\^,\]+\\)" 1 } 
> } */
> diff --git a/gcc/toplev.c b/gcc/toplev.c
> index d8cc254adef..23c88c788a2 100644
> --- a/gcc/toplev.c
> +++ b/gcc/toplev.c
> @@ -1323,6 +1323,14 @@ process_options (void)
>         }
>      }
>
> +  if (flag_overlap_op_by_pieces && STRICT_ALIGNMENT)
> +    {
> +      error_at (UNKNOWN_LOCATION,
> +               "%<-foverlap-op-by-pieces%> is not supported for "
> +               "strict alignment target");
> +      flag_overlap_op_by_pieces = 0;
> +    }
> +
>    /* One region RA really helps to decrease the code size.  */
>    if (flag_ira_region == IRA_REGION_AUTODETECT)
>      flag_ira_region
> --
> 2.30.2
>
