On Fri, Dec 16, 2016 at 2:54 AM, Jeff Law <l...@redhat.com> wrote:
>
> This is the first of a four-part patch kit addressing deficiencies in our DSE
> implementation.
>
>
> This patch addresses the P2 regression 33562 which has been a low priority
> regression since gcc-4.3.  To summarize, DSE no longer has the ability to
> detect an aggregate store as dead if subsequent stores are done in a
> piecemeal fashion.
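>
> For instance, in something like the following sketch (not the exact
> testcase from the BZ), the aggregate store is dead but currently goes
> undetected:
>
>   struct S { int a, b; };
>   void g (struct S *);
>   void f (void)
>   {
>     struct S s = { 0, 0 };  /* dead: every byte is overwritten below */
>     s.a = 1;
>     s.b = 2;
>     g (&s);
>   }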
>
> I originally tackled this by changing how we lower complex objects. That was
> sufficient to address 33562, but was reasonably rejected.
>
> This version attacks the problem by improving DSE to track stores to memory
> at a byte level.  That allows us to determine if a series of stores
> completely covers an earlier store (thus making the earlier store dead).
>
> A useful side effect of this is that we can detect when parts of a store are
> dead and potentially rewrite the store.  This patch implements that for
> complex object initializations.  While not strictly part of 33562, it's so
> closely related that I felt it belongs as part of this patch.
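>
> As a concrete sketch (essentially the situation the new ssa-dse-20.c
> test below checks for):
>
>   _Complex int t = 0;
>   __imag__ t = 2;
>
> Only the realpart of the initialization remains live, so the first
> store can be rewritten to set just REALPART_EXPR <t> instead of the
> whole complex object.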
>
> This originally limited the size of the tracked memory space to 64 bytes.  I
> bumped the limit after working through the CONSTRUCTOR and mem* trimming
> patches.  The 256 byte limit is still fairly arbitrary and I wouldn't lose
> sleep if we throttled back to 64 or 128 bytes.
>
> Later patches in the kit will build upon this patch.  So if pieces look like
> skeleton code, that's because they are.
>
>
> Bootstrapped and regression tested on x86_64-linux-gnu.  OK for the trunk?

Apart from what Trevor says about using sbitmaps (please try to avoid the
initial zeroing) and the missed freeing (you can use auto_[s]bitmap?),
some comments below.
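
I.e. something along the lines of (untested):

  auto_sbitmap live_bytes (PARAM_VALUE (PARAM_DSE_MAX_OBJECT_SIZE));

which also takes care of the freeing when it goes out of scope.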

>         PR tree-optimization/33562
>         * params.def (PARM_DSE_MAX_OBJECT_SIZE): New PARAM.
>         * tree-ssa-dse.c: Include params.h.
>         (initialize_ao_ref_for_dse): New, partially extracted from
>         dse_optimize_stmt.
>         (valid_ao_ref_for_dse): New.
>         (clear_bytes_written_by, trim_complex_store): Likewise.
>         (trim_partially_dead_store): Likewise.
>         (dse_possible_dead_store_p): Track what bytes were originally
>         stored into memory by the statement as well as the subset of
>         bytes that are still live.  If we "fail", but have identified
>         the store as partially dead, try to rewrite it to store fewer
>         bytes of data.  Exit the main loop if we find a full kill as a
>         single statement or via a group of statements.
>         (dse_optimize_stmt): Use initialize_ao_ref_for_dse.
>
>
>         * gcc.dg/tree-ssa/complex-4.c: No longer xfailed.
>         * gcc.dg/tree-ssa/complex-5.c: Likewise.
>         * gcc.dg/tree-ssa/ssa-dse-9.c: Likewise.
>         * gcc.dg/tree-ssa/ssa-dse-18.c: New test.
>         * gcc.dg/tree-ssa/ssa-dse-19.c: Likewise.
>         * gcc.dg/tree-ssa/ssa-dse-20.c: Likewise.
>         * gcc.dg/tree-ssa/ssa-dse-21.c: Likewise.
>
> diff --git a/gcc/params.def b/gcc/params.def
> index 50f75a7..ddc3d65 100644
> --- a/gcc/params.def
> +++ b/gcc/params.def
> @@ -532,6 +532,11 @@ DEFPARAM(PARAM_AVG_LOOP_NITER,
>          "Average number of iterations of a loop.",
>          10, 1, 0)
>
> +DEFPARAM(PARAM_DSE_MAX_OBJECT_SIZE,
> +        "dse-max-object-size",
> +        "Maximum size (in bytes) of objects tracked by dead store
> elimination.",
> +        256, 0, 0)
> +
>  DEFPARAM(PARAM_SCEV_MAX_EXPR_SIZE,
>          "scev-max-expr-size",
>          "Bound on size of expressions used in the scalar evolutions
> analyzer.",
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/complex-4.c b/gcc/testsuite/gcc.dg/tree-ssa/complex-4.c
> index 87a2638..3155741 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/complex-4.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/complex-4.c
> @@ -10,4 +10,4 @@ int f(void)
>    return g(&t);
>  }
>
> -/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" { xfail *-*-* } } } */
> +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/complex-5.c b/gcc/testsuite/gcc.dg/tree-ssa/complex-5.c
> index e2cd403..e6d027f 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/complex-5.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/complex-5.c
> @@ -8,4 +8,4 @@ int f(void)
>   __imag__ t = 2;
>  }
>
> -/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" { xfail *-*-* } } } */
> +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-18.c
> new file mode 100644
> index 0000000..92b2df8
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-18.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +int g(_Complex int*);
> +int f(void)
> +{
> +  _Complex int t = 0;
> +  int i, j;
> + __imag__ t += 2;
> +  return g(&t);
> +}
> +
> +
> +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "REALPART_EXPR" 1 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "IMAGPART_EXPR" 1 "optimized" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-19.c
> new file mode 100644
> index 0000000..718b746
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-19.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +int g(_Complex int*);
> +int f(void)
> +{
> +  _Complex int t = 0;
> +  int i, j;
> + __real__ t += 2;
> +  return g(&t);
> +}
> +
> +
> +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "REALPART_EXPR" 1 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "IMAGPART_EXPR" 1 "optimized" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-20.c
> new file mode 100644
> index 0000000..4e14d9b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-20.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O1 -fno-tree-dce -fdump-tree-optimized" } */
> +_Complex int t = 0;
> +int f(void)
> +{
> +  t = 0;
> + __imag__ t = 2;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "REALPART_EXPR" 1 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "IMAGPART_EXPR" 1 "optimized" } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-21.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-21.c
> new file mode 100644
> index 0000000..d1e0b85
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-21.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O1 -fno-tree-dce -fdump-tree-optimized" } */
> +_Complex int t = 0;
> +int f(void)
> +{
> +  t = 0;
> + __real__ t = 2;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "REALPART_EXPR" 1 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "IMAGPART_EXPR" 1 "optimized" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-9.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-9.c
> index 594c20c..ae48ddd 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-9.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-9.c
> @@ -11,4 +11,4 @@ foo ()
>  }
>
>  /* We should eliminate the first assignment.  */
> -/* { dg-final { scan-tree-dump-times "VDEF" 2 "dse1" { xfail *-*-* } } } */
> +/* { dg-final { scan-tree-dump-times "VDEF" 2 "dse1" } } */
> diff --git a/gcc/tree-ssa-dse.c b/gcc/tree-ssa-dse.c
> index 778b363..eea185c 100644
> --- a/gcc/tree-ssa-dse.c
> +++ b/gcc/tree-ssa-dse.c
> @@ -33,6 +33,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-dfa.h"
>  #include "domwalk.h"
>  #include "tree-cfgcleanup.h"
> +#include "params.h"
>
>  /* This file implements dead store elimination.
>
> @@ -68,6 +69,158 @@ along with GCC; see the file COPYING3.  If not see
>     remove their dead edges eventually.  */
>  static bitmap need_eh_cleanup;
>
> +/* STMT is a statement that may write into memory.  Analyze it and
> +   initialize WRITE to describe how STMT affects memory.
> +
> +   Return TRUE if the statement was analyzed, FALSE otherwise.
> +
> +   It is always safe to return FALSE.  But typically better optimization
> +   can be achieved by analyzing more statements.  */
> +
> +static bool
> +initialize_ao_ref_for_dse (gimple *stmt, ao_ref *write)
> +{
> +  /* It's advantageous to handle certain mem* functions.  */
> +  if (gimple_call_builtin_p (stmt, BUILT_IN_NORMAL))
> +    {
> +      switch (DECL_FUNCTION_CODE (gimple_call_fndecl (stmt)))
> +       {
> +         case BUILT_IN_MEMCPY:
> +         case BUILT_IN_MEMMOVE:
> +         case BUILT_IN_MEMSET:
> +           {
> +             tree size = NULL_TREE;
> +             if (gimple_call_num_args (stmt) == 3)
> +               size = gimple_call_arg (stmt, 2);
> +             tree ptr = gimple_call_arg (stmt, 0);
> +             ao_ref_init_from_ptr_and_size (write, ptr, size);
> +             return true;
> +           }
> +         default:
> +           break;
> +       }
> +    }
> +  else if (is_gimple_assign (stmt))
> +    {
> +      ao_ref_init (write, gimple_assign_lhs (stmt));
> +      return true;
> +    }
> +  return false;
> +}
> +
> +/* Given REF from the alias oracle, return TRUE if it is a valid
> +   memory reference for dead store elimination, false otherwise.
> +
> +   In particular, the reference must have a known base, known maximum
> +   size, start at a byte offset and have a size that is one or more
> +   bytes.  */
> +
> +static bool
> +valid_ao_ref_for_dse (ao_ref *ref)
> +{
> +  return (ao_ref_base (ref)
> +         && ref->max_size != -1
> +         && (ref->offset % BITS_PER_UNIT) == 0
> +         && (ref->size % BITS_PER_UNIT) == 0);
> +}
> +
> +/* Clear any bytes written by STMT from the bitmap LIVE_BYTES.  The base
> +   address written by STMT must match the one found in REF, which must
> +   have its base address previously initialized.
> +
> +   This routine must be conservative.  If we don't know the offset or
> +   actual size written, assume nothing was written.  */
> +
> +static void
> +clear_bytes_written_by (bitmap live_bytes, gimple *stmt, ao_ref *ref)
> +{
> +  ao_ref write;
> +  if (!initialize_ao_ref_for_dse (stmt, &write))
> +    return;
> +
> +  /* Verify we have the same base memory address and that the write
> +     has a known size.  If so, then clear the appropriate bytes in
> +     the LIVE_BYTES bitmap.  */
> +  if (valid_ao_ref_for_dse (&write)
> +      && write.base == ref->base
> +      && write.size == write.max_size)
> +    bitmap_clear_range (live_bytes,
> +                       write.offset / BITS_PER_UNIT,
> +                       write.size / BITS_PER_UNIT);
> +}
> +
> +/* STMT initializes an object from COMPLEX_CST where one or more of the
> +   bytes written are dead stores.  ORIG is the bitmap of bytes stored by
> +   STMT.  LIVE is the bitmap of stores that are actually live.
> +
> +   Attempt to rewrite STMT so that only the real or imaginary part of
> +   the object is actually stored.  */
> +
> +static void
> +trim_complex_store (bitmap orig, bitmap live, gimple *stmt)
> +{
> +  bitmap dead = BITMAP_ALLOC (NULL);
> +  bitmap_and_compl (dead, orig, live);
> +
> +  /* So we have to verify that either the real or imag part as a whole
> +     is dead.  They are always the same size.  Thus for one to be dead
> +     the number of live bytes would have to be the same as the number of
> +     dead bytes.  */
> +  if (bitmap_count_bits (live) == bitmap_count_bits (dead))

popcount is expensive, so is this really a short-cut?

> +    {
> +      /* Hopefully we have live bits that look like 0xff00 or 0xff (assuming
> +        8 bytes for the underlying real/imag part).  If so we can optimize
> +        this case.  */
> +      if (bitmap_first_set_bit (live) == 0
> +         && bitmap_first_set_bit (dead) == bitmap_count_bits (live))

Hmm, but does that properly handle byte-wise reads from it?  Like
reading the realpart plus the last byte from the imagpart?
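
E.g. a sketch of the kind of pattern I mean (with 4-byte ints):

  extern void g (char *);
  _Complex int t = 0;                        /* stores bytes 0-7 */
  __builtin_memset ((char *) &t + 4, 1, 3);  /* kills bytes 4-6 only */
  g ((char *) &t);                           /* may read byte 7 */

where the dead bytes do not cover a whole realpart or imagpart.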

> +       {
> +         /* TREE_REALPART is live */
> +         tree x = TREE_REALPART (gimple_assign_rhs1 (stmt));
> +         tree y = gimple_assign_lhs (stmt);
> +         y = build1 (REALPART_EXPR, TREE_TYPE (x), y);
> +         gimple_assign_set_lhs (stmt, y);
> +         gimple_assign_set_rhs1 (stmt, x);
> +       }
> +      else if (bitmap_first_set_bit (dead) == 0
> +         && bitmap_first_set_bit (live) == bitmap_count_bits (dead))
> +       {

Likewise.

> +         /* TREE_IMAGPART is live */
> +         tree x = TREE_IMAGPART (gimple_assign_rhs1 (stmt));
> +         tree y = gimple_assign_lhs (stmt);
> +         y = build1 (IMAGPART_EXPR, TREE_TYPE (x), y);
> +         gimple_assign_set_lhs (stmt, y);
> +         gimple_assign_set_rhs1 (stmt, x);
> +       }
> +      /* Other cases indicate parts of both the real and imag subobjects
> +        are live.  We do not try to optimize those cases.  */
> +    }
> +  BITMAP_FREE (dead);
> +}
> +
> +/* STMT is a memory write where one or more bytes written are dead
> +   stores.  ORIG is the bitmap of bytes stored by STMT.  LIVE is the
> +   bitmap of stores that are actually live.
> +
> +   Attempt to rewrite STMT so that it writes fewer memory locations.  Right
> +   now we only support trimming at the start or end of the memory region.
> +   It's not clear how much there is to be gained by trimming from the middle
> +   of the region.  */
> +
> +static void
> +trim_partially_dead_store (bitmap orig, bitmap live, gimple *stmt)
> +{
> +  if (is_gimple_assign (stmt))
> +    {
> +      switch (gimple_assign_rhs_code (stmt))
> +       {
> +       case COMPLEX_CST:
> +         trim_complex_store (orig, live, stmt);

so patch #1 only handles stores from COMPLEX_CST?  Ok...

> +         break;
> +       default:
> +         break;
> +       }
> +    }
> +}
>
>  /* A helper of dse_optimize_stmt.
>     Given a GIMPLE_ASSIGN in STMT that writes to REF, find a candidate
> @@ -79,9 +232,32 @@ dse_possible_dead_store_p (ao_ref *ref, gimple *stmt, gimple **use_stmt)
>  {
>    gimple *temp;
>    unsigned cnt = 0;
> +  bitmap live_bytes = NULL;
> +  bitmap orig_live_bytes = NULL;
>
>    *use_stmt = NULL;
>
> +  /* REF is a memory write.  Go ahead and get its base, size, extent
> +     information and encode the bytes written into LIVE_BYTES.  We can
> +     handle any case where we have a known base and maximum size.
> +
> +     However, experimentation has shown that bit level tracking is not
> +     useful in practice, so we only track at the byte level.
> +
> +     Furthermore, experimentation has shown that 99% of the cases
> +     that require byte tracking are 64 bytes or less.  */
> +  if (valid_ao_ref_for_dse (ref)
> +      && (ref->max_size / BITS_PER_UNIT
> +         <= PARAM_VALUE (PARAM_DSE_MAX_OBJECT_SIZE)))
> +    {
> +      live_bytes = BITMAP_ALLOC (NULL);
> +      orig_live_bytes = BITMAP_ALLOC (NULL);
> +      bitmap_set_range (live_bytes,
> +                       ref->offset / BITS_PER_UNIT,
> +                       ref->max_size / BITS_PER_UNIT);
> +      bitmap_copy (orig_live_bytes, live_bytes);

So I'd use a once-per-pass allocated sbitmap here.  I don't see why you need
the orig_live_bytes bitmap though (just keep that implicitly by the known
range?)
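
That is, since live_bytes starts out as the full byte range of the
write, "partially dead" could be tested without a second bitmap,
roughly (untested):

  live_bytes && bitmap_count_bits (live_bytes) != ref->max_size / BITS_PER_UNIT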

> +    }
> +
>    /* Find the first dominated statement that clobbers (part of) the
>       memory stmt stores to with no intermediate statement that may use
>       part of the memory stmt stores.  That is, find a store that may
> @@ -164,7 +341,15 @@ dse_possible_dead_store_p (ao_ref *ref, gimple *stmt, gimple **use_stmt)
>         }
>
>        if (fail)
> -       return false;
> +       {
> +         /* STMT might be partially dead and we may be able to reduce
> +            how many memory locations it stores into.  */
> +         if (live_bytes
> +             && !bitmap_equal_p (live_bytes, orig_live_bytes)
> +             && !gimple_clobber_p (stmt))
> +           trim_partially_dead_store (orig_live_bytes, live_bytes, stmt);

The actual transform in dse_possible_dead_store_p looks a bit misplaced.
I see it's somehow convenient but then maybe returning an enum from this
function might be cleaner.  Well, I'm not too torn about this, so maybe
just rename the function a bit (no good suggestion either).
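
Something like (just a sketch):

  enum dse_store_status { DSE_STORE_LIVE, DSE_STORE_MAYBE_PARTIAL_DEAD,
                          DSE_STORE_DEAD };

returned from the walk, with the caller doing the actual trimming in
the partially-dead case.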

The rest of the patch (the infrastructure) looks reasonable.

Richard.

> +         return false;
> +       }
>
>        /* If we didn't find any definition this means the store is dead
>           if it isn't a store to global reachable memory.  In this case
> @@ -177,12 +362,18 @@ dse_possible_dead_store_p (ao_ref *ref, gimple *stmt, gimple **use_stmt)
>           temp = stmt;
>           break;
>         }
> +
> +      if (live_bytes && temp)
> +       clear_bytes_written_by (live_bytes, temp, ref);
>      }
> -  /* Continue walking until we reach a kill.  */
> -  while (!stmt_kills_ref_p (temp, ref));
> +  /* Continue walking until we reach a full kill as a single statement
> +     or there are no more live bytes.  */
> +  while (!stmt_kills_ref_p (temp, ref)
> +        && !(live_bytes && bitmap_empty_p (live_bytes)));
>
>    *use_stmt = temp;
> -
> +  BITMAP_FREE (live_bytes);
> +  BITMAP_FREE (orig_live_bytes);
>    return true;
>  }
>
> @@ -214,6 +405,10 @@ dse_optimize_stmt (gimple_stmt_iterator *gsi)
>           || TREE_CODE (gimple_assign_lhs (stmt)) != MEM_REF))
>      return;
>
> +  ao_ref ref;
> +  if (!initialize_ao_ref_for_dse (stmt, &ref))
> +    return;
> +
>    /* We know we have virtual definitions.  We can handle assignments and
>       some builtin calls.  */
>    if (gimple_call_builtin_p (stmt, BUILT_IN_NORMAL))
> @@ -225,12 +420,6 @@ dse_optimize_stmt (gimple_stmt_iterator *gsi)
>           case BUILT_IN_MEMSET:
>             {
>               gimple *use_stmt;
> -             ao_ref ref;
> -             tree size = NULL_TREE;
> -             if (gimple_call_num_args (stmt) == 3)
> -               size = gimple_call_arg (stmt, 2);
> -             tree ptr = gimple_call_arg (stmt, 0);
> -             ao_ref_init_from_ptr_and_size (&ref, ptr, size);
>               if (!dse_possible_dead_store_p (&ref, stmt, &use_stmt))
>                 return;
>
> @@ -244,6 +433,7 @@ dse_optimize_stmt (gimple_stmt_iterator *gsi)
>               tree lhs = gimple_call_lhs (stmt);
>               if (lhs)
>                 {
> +                 tree ptr = gimple_call_arg (stmt, 0);
>                   gimple *new_stmt = gimple_build_assign (lhs, ptr);
>                   unlink_stmt_vdef (stmt);
>                   if (gsi_replace (gsi, new_stmt, true))
> @@ -274,13 +464,8 @@ dse_optimize_stmt (gimple_stmt_iterator *gsi)
>        if (operand_equal_p (gimple_assign_rhs1 (stmt),
>                            gimple_assign_lhs (stmt), 0))
>         use_stmt = stmt;
> -      else
> -       {
> -         ao_ref ref;
> -         ao_ref_init (&ref, gimple_assign_lhs (stmt));
> -         if (!dse_possible_dead_store_p (&ref, stmt, &use_stmt))
> -           return;
> -       }
> +      else if (!dse_possible_dead_store_p (&ref, stmt, &use_stmt))
> +       return;
>
>        /* Now we know that use_stmt kills the LHS of stmt.  */
>
>
