On Fri, Dec 16, 2016 at 2:54 AM, Jeff Law <l...@redhat.com> wrote:
>
> This is the first of a 4-part patchkit to address deficiencies in our DSE
> implementation.
>
> This patch addresses the P2 regression 33562, which has been a low-priority
> regression since gcc-4.3.  To summarize, DSE no longer has the ability to
> detect an aggregate store as dead if subsequent stores are done in a
> piecemeal fashion.
>
> I originally tackled this by changing how we lower complex objects.  That
> was sufficient to address 33562, but was reasonably rejected.
>
> This version attacks the problem by improving DSE to track stores to
> memory at a byte level.  That allows us to determine if a series of stores
> completely covers an earlier store (thus making the earlier store dead).
>
> A useful side effect of this is we can detect when parts of a store are
> dead and potentially rewrite the store.  This patch implements that for
> complex object initializations.  While not strictly part of 33562, it's so
> closely related that I felt it belongs as part of this patch.
>
> This patch originally limited the size of the tracked memory space to 64
> bytes.  I bumped the limit after working through the CONSTRUCTOR and mem*
> trimming patches.  The 256 byte limit is still fairly arbitrary and I
> wouldn't lose sleep if we throttled back to 64 or 128 bytes.
>
> Later patches in the kit will build upon this patch.  So if pieces look
> like skeleton code, that's because they are.
>
> Bootstrapped and regression tested on x86_64-linux-gnu.  OK for the trunk?
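To make sure we're looking at the same thing, the kind of sequence the byte
tracking targets is essentially the new ssa-dse-20.c testcase below, trimmed
down:

_Complex int t;

void f (void)
{
  t = 0;           /* Writes both the real and the imaginary part.  */
  __imag__ t = 2;  /* Makes the imaginary half of the store above dead.  */
}

where the first store is only partially dead and gets rewritten to store just
"__real__ t = 0" instead of being removed outright.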
Apart from what Trevor says about using sbitmaps (try to avoid the initial
zeroing please) and the missed freeing (you can use auto_[s]bitmap?) some
comments below.

>         PR tree-optimization/33562
>         * params.def (PARAM_DSE_MAX_OBJECT_SIZE): New PARAM.
>         * tree-ssa-dse.c: Include params.h.
>         (initialize_ao_ref_for_dse): New, partially extracted from
>         dse_optimize_stmt.
>         (valid_ao_ref_for_dse): New.
>         (clear_bytes_written_by, trim_complex_store): Likewise.
>         (trim_partially_dead_store): Likewise.
>         (dse_possible_dead_store_p): Track what bytes were originally
>         stored into memory by the statement as well as the subset of bytes
>         that are still live.  If we "fail", but have identified the store
>         as partially dead, try to rewrite it to store fewer bytes of data.
>         Exit the main loop if we find a full kill as a single statement
>         or via a group of statements.
>         (dse_optimize_stmt): Use initialize_ao_ref_for_dse.
>
>         * gcc.dg/tree-ssa/complex-4.c: No longer xfailed.
>         * gcc.dg/tree-ssa/complex-5.c: Likewise.
>         * gcc.dg/tree-ssa/ssa-dse-9.c: Likewise.
>         * gcc.dg/tree-ssa/ssa-dse-18.c: New test.
>         * gcc.dg/tree-ssa/ssa-dse-19.c: Likewise.
>         * gcc.dg/tree-ssa/ssa-dse-20.c: Likewise.
>         * gcc.dg/tree-ssa/ssa-dse-21.c: Likewise.
>
> diff --git a/gcc/params.def b/gcc/params.def
> index 50f75a7..ddc3d65 100644
> --- a/gcc/params.def
> +++ b/gcc/params.def
> @@ -532,6 +532,11 @@ DEFPARAM(PARAM_AVG_LOOP_NITER,
>          "Average number of iterations of a loop.",
>          10, 1, 0)
>
> +DEFPARAM(PARAM_DSE_MAX_OBJECT_SIZE,
> +        "dse-max-object-size",
> +        "Maximum size (in bytes) of objects tracked by dead store elimination.",
> +        256, 0, 0)
> +
>  DEFPARAM(PARAM_SCEV_MAX_EXPR_SIZE,
>          "scev-max-expr-size",
>          "Bound on size of expressions used in the scalar evolutions analyzer.",
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/complex-4.c b/gcc/testsuite/gcc.dg/tree-ssa/complex-4.c
> index 87a2638..3155741 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/complex-4.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/complex-4.c
> @@ -10,4 +10,4 @@ int f(void)
>    return g(&t);
>  }
>
> -/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" { xfail *-*-* } } } */
> +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/complex-5.c b/gcc/testsuite/gcc.dg/tree-ssa/complex-5.c
> index e2cd403..e6d027f 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/complex-5.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/complex-5.c
> @@ -8,4 +8,4 @@ int f(void)
>    __imag__ t = 2;
>  }
>
> -/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" { xfail *-*-* } } } */
> +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-18.c
> new file mode 100644
> index 0000000..92b2df8
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-18.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +int g(_Complex int*);
> +int f(void)
> +{
> +  _Complex int t = 0;
> +  int i, j;
> +  __imag__ t += 2;
> +  return g(&t);
> +}
> +
> +
> +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "REALPART_EXPR" 1 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "IMAGPART_EXPR" 1 "optimized" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-19.c
> new file mode 100644
> index 0000000..718b746
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-19.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +int g(_Complex int*);
> +int f(void)
> +{
> +  _Complex int t = 0;
> +  int i, j;
> +  __real__ t += 2;
> +  return g(&t);
> +}
> +
> +
> +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "REALPART_EXPR" 1 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "IMAGPART_EXPR" 1 "optimized" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-20.c
> new file mode 100644
> index 0000000..4e14d9b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-20.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O1 -fno-tree-dce -fdump-tree-optimized" } */
> +_Complex int t = 0;
> +int f(void)
> +{
> +  t = 0;
> +  __imag__ t = 2;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "REALPART_EXPR" 1 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "IMAGPART_EXPR" 1 "optimized" } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-21.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-21.c
> new file mode 100644
> index 0000000..d1e0b85
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-21.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O1 -fno-tree-dce -fdump-tree-optimized" } */
> +_Complex int t = 0;
> +int f(void)
> +{
> +  t = 0;
> +  __real__ t = 2;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "REALPART_EXPR" 1 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "IMAGPART_EXPR" 1 "optimized" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-9.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-9.c
> index 594c20c..ae48ddd 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-9.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-9.c
> @@ -11,4 +11,4 @@ foo ()
>  }
>
>  /* We should eliminate the first assignment.  */
> -/* { dg-final { scan-tree-dump-times "VDEF" 2 "dse1" { xfail *-*-* } } } */
> +/* { dg-final { scan-tree-dump-times "VDEF" 2 "dse1" } } */
> diff --git a/gcc/tree-ssa-dse.c b/gcc/tree-ssa-dse.c
> index 778b363..eea185c 100644
> --- a/gcc/tree-ssa-dse.c
> +++ b/gcc/tree-ssa-dse.c
> @@ -33,6 +33,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-dfa.h"
>  #include "domwalk.h"
>  #include "tree-cfgcleanup.h"
> +#include "params.h"
>
>  /* This file implements dead store elimination.
>
> @@ -68,6 +69,158 @@ along with GCC; see the file COPYING3.  If not see
>     remove their dead edges eventually.  */
>  static bitmap need_eh_cleanup;
>
> +/* STMT is a statement that may write into memory.  Analyze it and
> +   initialize WRITE to describe how STMT affects memory.
> +
> +   Return TRUE if the statement was analyzed, FALSE otherwise.
> +
> +   It is always safe to return FALSE.  But typically better optimization
> +   can be achieved by analyzing more statements.  */
> +
> +static bool
> +initialize_ao_ref_for_dse (gimple *stmt, ao_ref *write)
> +{
> +  /* It's advantageous to handle certain mem* functions.  */
> +  if (gimple_call_builtin_p (stmt, BUILT_IN_NORMAL))
> +    {
> +      switch (DECL_FUNCTION_CODE (gimple_call_fndecl (stmt)))
> +       {
> +         case BUILT_IN_MEMCPY:
> +         case BUILT_IN_MEMMOVE:
> +         case BUILT_IN_MEMSET:
> +           {
> +             tree size = NULL_TREE;
> +             if (gimple_call_num_args (stmt) == 3)
> +               size = gimple_call_arg (stmt, 2);
> +             tree ptr = gimple_call_arg (stmt, 0);
> +             ao_ref_init_from_ptr_and_size (write, ptr, size);
> +             return true;
> +           }
> +         default:
> +           break;
> +       }
> +    }
> +  else if (is_gimple_assign (stmt))
> +    {
> +      ao_ref_init (write, gimple_assign_lhs (stmt));
> +      return true;
> +    }
> +  return false;
> +}
> +
> +/* Given REF from the alias oracle, return TRUE if it is a valid
> +   memory reference for dead store elimination, FALSE otherwise.
> +
> +   In particular, the reference must have a known base, known maximum
> +   size, start at a byte offset and have a size that is one or more
> +   bytes.  */
> +
> +static bool
> +valid_ao_ref_for_dse (ao_ref *ref)
> +{
> +  return (ao_ref_base (ref)
> +         && ref->max_size != -1
> +         && (ref->offset % BITS_PER_UNIT) == 0
> +         && (ref->size % BITS_PER_UNIT) == 0);
> +}
> +
> +/* Clear any bytes written by STMT from the bitmap LIVE_BYTES.  The base
> +   address written by STMT must match the one found in REF, which must
> +   have its base address previously initialized.
> +
> +   This routine must be conservative.  If we don't know the offset or
> +   actual size written, assume nothing was written.  */
> +
> +static void
> +clear_bytes_written_by (bitmap live_bytes, gimple *stmt, ao_ref *ref)
> +{
> +  ao_ref write;
> +  if (!initialize_ao_ref_for_dse (stmt, &write))
> +    return;
> +
> +  /* Verify we have the same base memory address and that the write
> +     has a known size.  If so, then clear the appropriate bytes in
> +     the LIVE_BYTES bitmap.  */
> +  if (valid_ao_ref_for_dse (&write)
> +      && write.base == ref->base
> +      && write.size == write.max_size)
> +    bitmap_clear_range (live_bytes,
> +                       write.offset / BITS_PER_UNIT,
> +                       write.size / BITS_PER_UNIT);
> +}
> +
> +/* STMT initializes an object from COMPLEX_CST where one or more of the
> +   bytes written are dead stores.  ORIG is the bitmap of bytes stored by
> +   STMT.  LIVE is the bitmap of stores that are actually live.
> +
> +   Attempt to rewrite STMT so that only the real or imaginary part of
> +   the object is actually stored.  */
> +
> +static void
> +trim_complex_store (bitmap orig, bitmap live, gimple *stmt)
> +{
> +  bitmap dead = BITMAP_ALLOC (NULL);
> +  bitmap_and_compl (dead, orig, live);
> +
> +  /* So we have to verify that either the real or imag part as a whole
> +     is dead.  They are always the same size.  Thus for one to be dead
> +     the number of live bytes would have to be the same as the number of
> +     dead bytes.  */
> +  if (bitmap_count_bits (live) == bitmap_count_bits (dead))

popcount is expensive, so is this really a short-cut?

> +    {
> +      /* Hopefully we have live bits that look like 0xff00 or 0xff (assuming
> +        8 bytes for the underlying real/imag part).  If so we can optimize
> +        this case.  */
> +      if (bitmap_first_set_bit (live) == 0
> +         && bitmap_first_set_bit (dead) == bitmap_count_bits (live))

Hmm, but does that properly handle byte-wise reads from it?  Like reading
the realpart plus the last byte from the imagpart?
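That is, something like the following hand-written example of the access
pattern I mean (assuming 4-byte int; this is not from the patch or the
testsuite):

_Complex int t;

int f (void)
{
  t = 0;
  /* Kills the imagpart bytes of the store above except the last one.  */
  __builtin_memset ((char *) &t + sizeof (int), 2, sizeof (int) - 1);
  /* Reads the realpart plus the last byte of the imagpart, so the live
     bytes of the first store are not a whole real or imag half.  */
  return __real__ t + ((char *) &t)[sizeof t - 1];
}

Here neither half of the first store is wholly dead, so the trimming must
punt rather than rewrite the store to a REALPART_EXPR.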
> +       {
> +         /* TREE_REALPART is live */
> +         tree x = TREE_REALPART (gimple_assign_rhs1 (stmt));
> +         tree y = gimple_assign_lhs (stmt);
> +         y = build1 (REALPART_EXPR, TREE_TYPE (x), y);
> +         gimple_assign_set_lhs (stmt, y);
> +         gimple_assign_set_rhs1 (stmt, x);
> +       }
> +      else if (bitmap_first_set_bit (dead) == 0
> +              && bitmap_first_set_bit (live) == bitmap_count_bits (dead))
> +       {

Likewise.

> +         /* TREE_IMAGPART is live */
> +         tree x = TREE_IMAGPART (gimple_assign_rhs1 (stmt));
> +         tree y = gimple_assign_lhs (stmt);
> +         y = build1 (IMAGPART_EXPR, TREE_TYPE (x), y);
> +         gimple_assign_set_lhs (stmt, y);
> +         gimple_assign_set_rhs1 (stmt, x);
> +       }
> +      /* Other cases indicate parts of both the real and imag subobjects
> +        are live.  We do not try to optimize those cases.  */
> +    }
> +  BITMAP_FREE (dead);
> +}
> +
> +/* STMT is a memory write where one or more bytes written are dead
> +   stores.  ORIG is the bitmap of bytes stored by STMT.  LIVE is the
> +   bitmap of stores that are actually live.
> +
> +   Attempt to rewrite STMT so that it writes fewer memory locations.
> +   Right now we only support trimming at the start or end of the memory
> +   region.  It's not clear how much there is to be gained by trimming
> +   from the middle of the region.  */
> +
> +static void
> +trim_partially_dead_store (bitmap orig, bitmap live, gimple *stmt)
> +{
> +  if (is_gimple_assign (stmt))
> +    {
> +      switch (gimple_assign_rhs_code (stmt))
> +       {
> +       case COMPLEX_CST:
> +         trim_complex_store (orig, live, stmt);

So patch #1 only handles stores from COMPLEX_CST?  Ok...

> +         break;
> +       default:
> +         break;
> +       }
> +    }
> +}
>
>  /* A helper of dse_optimize_stmt.
>     Given a GIMPLE_ASSIGN in STMT that writes to REF, find a candidate
> @@ -79,9 +232,32 @@ dse_possible_dead_store_p (ao_ref *ref, gimple *stmt, gimple **use_stmt)
>  {
>    gimple *temp;
>    unsigned cnt = 0;
> +  bitmap live_bytes = NULL;
> +  bitmap orig_live_bytes = NULL;
>
>    *use_stmt = NULL;
>
> +  /* REF is a memory write.  Go ahead and get its base, size, extent
> +     information and encode the bytes written into LIVE_BYTES.  We can
> +     handle any case where we have a known base and maximum size.
> +
> +     However, experimentation has shown that bit level tracking is not
> +     useful in practice, so we only track at the byte level.
> +
> +     Furthermore, experimentation has shown that 99% of the cases
> +     that require byte tracking are 64 bytes or less.  */
> +  if (valid_ao_ref_for_dse (ref)
> +      && (ref->max_size / BITS_PER_UNIT
> +         <= PARAM_VALUE (PARAM_DSE_MAX_OBJECT_SIZE)))
> +    {
> +      live_bytes = BITMAP_ALLOC (NULL);
> +      orig_live_bytes = BITMAP_ALLOC (NULL);
> +      bitmap_set_range (live_bytes,
> +                       ref->offset / BITS_PER_UNIT,
> +                       ref->max_size / BITS_PER_UNIT);
> +      bitmap_copy (orig_live_bytes, live_bytes);

So I'd use a once-per-pass allocated sbitmap here.  I don't see why you
need the orig_live_bytes bitmap though (just keep that implicitly by the
known range?)

> +    }
> +
>    /* Find the first dominated statement that clobbers (part of) the
>       memory stmt stores to with no intermediate statement that may use
>       part of the memory stmt stores.  That is, find a store that may
> @@ -164,7 +341,15 @@ dse_possible_dead_store_p (ao_ref *ref, gimple *stmt, gimple **use_stmt)
>      }
>
>    if (fail)
> -    return false;
> +    {
> +      /* STMT might be partially dead and we may be able to reduce
> +        how many memory locations it stores into.  */
> +      if (live_bytes
> +         && !bitmap_equal_p (live_bytes, orig_live_bytes)
> +         && !gimple_clobber_p (stmt))
> +       trim_partially_dead_store (orig_live_bytes, live_bytes, stmt);

The actual transform in dse_possible_dead_store_p looks a bit misplaced.
I see it's somehow convenient, but then maybe returning an enum from this
function might be cleaner (see the sketch at the end of this mail).  Well,
I'm not too torn about this, so maybe just rename the function a bit (no
good suggestion either).

The rest of the patch (the infrastructure) looks reasonable.

Richard.

> +      return false;
> +    }
>
>    /* If we didn't find any definition this means the store is dead
>       if it isn't a store to global reachable memory.  In this case
> @@ -177,12 +362,18 @@ dse_possible_dead_store_p (ao_ref *ref, gimple *stmt, gimple **use_stmt)
>        temp = stmt;
>        break;
>      }
> +
> +      if (live_bytes && temp)
> +       clear_bytes_written_by (live_bytes, temp, ref);
>      }
> -  /* Continue walking until we reach a kill.  */
> -  while (!stmt_kills_ref_p (temp, ref));
> +  /* Continue walking until we reach a full kill as a single statement
> +     or there are no more live bytes.  */
> +  while (!stmt_kills_ref_p (temp, ref)
> +        && !(live_bytes && bitmap_empty_p (live_bytes)));
>
>    *use_stmt = temp;
> -
> +  BITMAP_FREE (live_bytes);
> +  BITMAP_FREE (orig_live_bytes);
>    return true;
>  }
>
> @@ -214,6 +405,10 @@ dse_optimize_stmt (gimple_stmt_iterator *gsi)
>        || TREE_CODE (gimple_assign_lhs (stmt)) != MEM_REF))
>      return;
>
> +  ao_ref ref;
> +  if (!initialize_ao_ref_for_dse (stmt, &ref))
> +    return;
> +
>    /* We know we have virtual definitions.  We can handle assignments and
>       some builtin calls.  */
>    if (gimple_call_builtin_p (stmt, BUILT_IN_NORMAL))
> @@ -225,12 +420,6 @@ dse_optimize_stmt (gimple_stmt_iterator *gsi)
>         case BUILT_IN_MEMSET:
>           {
>             gimple *use_stmt;
> -           ao_ref ref;
> -           tree size = NULL_TREE;
> -           if (gimple_call_num_args (stmt) == 3)
> -             size = gimple_call_arg (stmt, 2);
> -           tree ptr = gimple_call_arg (stmt, 0);
> -           ao_ref_init_from_ptr_and_size (&ref, ptr, size);
>             if (!dse_possible_dead_store_p (&ref, stmt, &use_stmt))
>               return;
>
> @@ -244,6 +433,7 @@ dse_optimize_stmt (gimple_stmt_iterator *gsi)
>             tree lhs = gimple_call_lhs (stmt);
>             if (lhs)
>               {
> +               tree ptr = gimple_call_arg (stmt, 0);
>                 gimple *new_stmt = gimple_build_assign (lhs, ptr);
>                 unlink_stmt_vdef (stmt);
>                 if (gsi_replace (gsi, new_stmt, true))
> @@ -274,13 +464,8 @@ dse_optimize_stmt (gimple_stmt_iterator *gsi)
>        if (operand_equal_p (gimple_assign_rhs1 (stmt),
>                            gimple_assign_lhs (stmt), 0))
>         use_stmt = stmt;
> -      else
> -       {
> -         ao_ref ref;
> -         ao_ref_init (&ref, gimple_assign_lhs (stmt));
> -         if (!dse_possible_dead_store_p (&ref, stmt, &use_stmt))
> -           return;
> -       }
> +      else if (!dse_possible_dead_store_p (&ref, stmt, &use_stmt))
> +       return;
>
>        /* Now we know that use_stmt kills the LHS of stmt.  */
>
>
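P.S.  To make the "return an enum" idea above concrete, I was thinking of a
shape like this (just a sketch, all names made up):

enum dse_store_status
{
  DSE_STORE_LIVE,               /* The store has uses; leave it alone.  */
  DSE_STORE_MAYBE_PARTIAL_DEAD, /* Some tracked bytes died; the caller may
                                   call trim_partially_dead_store.  */
  DSE_STORE_DEAD                /* Fully killed; the caller deletes it.  */
};

The walker would then stay purely analytical and the transform would live
in dse_optimize_stmt.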