Re: [PR81165] discount killed stmts when sizing blocks for threading

Richard Biener Tue, 12 Dec 2017 00:27:57 -0800

On Tue, Dec 12, 2017 at 6:17 AM, Alexandre Oliva <aol...@redhat.com> wrote:
> On Dec 11, 2017, Jeff Law <l...@redhat.com> wrote:
>
>>> +  gcc_assert (path->length () == 0 || path->last ()->e == e);
>>> +  if (path->length () == 0)
>>> +    return estimate_threading_killed_stmts (e->dest);
>>> +
>>> +  int total = 0;
>>> +  int i = 0;
>>> +  jump_thread_edge *je;
>>> +  FOR_EACH_VEC_ELT (*path, i, je)
>>> +    switch (je->type)
>>> +      {
>>> +      case EDGE_START_JUMP_THREAD:
>>> +      case EDGE_COPY_SRC_BLOCK:
>>> +      case EDGE_COPY_SRC_JOINER_BLOCK:
>>> +    total += estimate_threading_killed_stmts (je->e->dest);
>>> +    break;
>>> +
>>> +      default:
>>> +    gcc_unreachable ();
>>> +
>>> +      case EDGE_FSM_THREAD:
>>> +      case EDGE_NO_COPY_SRC_BLOCK:
>>> +    break;
>>> +      }
>>> +
>>> +  return total;
>>> +}
>
>
>> So I'd mark je->e->src rather than je->e->dest.
>
> Ah, but then we have to mark e->dest regardless of path.  Which does
> indeed make it all look nicer.
>
>> And you only need this
>> handling for EDGE_COPY_SRC_BLOCK.  So I don't think you need a switch,
>> just a simple je->type == EDGE_COPY_SRC_BLOCK.
>
>> While we do make a copy for EDGE_COPY_SRC_JOINER_BLOCK, the conditional
>> at the end of that block is not dead, so we can't really do anything
>> special with it.
>
> I see, thanks.
>
> I've updated it according to richi's and your feedbacks.  Regstrapped on
> {x86_64,i686}-linux-gnu.  Ok to install?


Thanks for the numbers on size, OK for the parts I reviewed, leaving the
final ack to Jeff.

Richard.

>
> We limit the amount of copying for jump threading based on counting
> stmts.  This counting is overly pessimistic, because we will very
> often delete stmts as a consequence of jump threading: when the final
> conditional jump of a block is removed, earlier SSA names computed
> exclusively for use in that conditional are killed.  Furthermore, PHI
> nodes in blocks with only two predecessors are trivially replaced with
> their now-single values after threading.
>
> This patch scans blocks to be copied in the path constructed so far
> and estimates the number of stmts that will be removed in the copies,
> bumping up the stmt count limit.
>
> for  gcc/ChangeLog
>
>         PR tree-optimization/81165
>         * tree-ssa-threadedge.c (uses_in_bb): New.
>         (estimate_threading_killed_stmts): New.
>         (estimate_threading_killed_stmts): New overload.
>         (record_temporary_equivalences_from_stmts_at_dest): Add path
>         parameter; adjust caller.  Expand limit when it's hit.
>
> for  gcc/testsuite/ChangeLog
>
>         PR tree-optimization/81165
>         * gcc.dg/pr81165.c: New.
> ---
>  gcc/testsuite/gcc.dg/pr81165.c |   59 +++++++++++++
>  gcc/tree-ssa-threadedge.c      |  176 
> +++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 232 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/pr81165.c
>
> diff --git a/gcc/testsuite/gcc.dg/pr81165.c b/gcc/testsuite/gcc.dg/pr81165.c
> new file mode 100644
> index 000000000000..8508d893bed6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/pr81165.c
> @@ -0,0 +1,59 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -fdump-tree-optimized" } */
> +/* { dg-final { scan-tree-dump-not " \[/%\] " "optimized" } } */
> +
> +/* Testcase submitted for PR81165, with its main function removed as
> +   it's turned into a compile test.  We want to make sure that all of
> +   the divide/remainder computations are removed by tree optimizers.
> +
> +   We can figure out that we don't need to compute at runtime even the
> +   condition to enter the loop: the initial i==0 would have to be
> +   greater than the sum of two small unsigned values: 1U>>t1 is in the
> +   range 0..1, whereas the char value is bounded by the range 0..127,
> +   being 128 % a positive number (zero would invoke undefined
> +   behavior, so we can assume it doesn't happen).  (We know it's
> +   nonnegative because it's 10 times a number that has no more than
> +   the bits for 16, 8 and 1 set.)
> +
> +   We don't realize that the loop is useless right away: jump
> +   threading helps remove some of the complexity, particularly of the
> +   computation within the loop: t1 is compared with 1, but it can
> +   never be 1.  (We could assume as much, since its being 1 would
> +   divide by zero, but we don't.)
> +
> +   If we don't enter the conditional block, t1 remains at 2; if we do,
> +   it's set to either -1.  If we jump thread at the end of the
> +   conditional block, we can figure out the ranges exclude 1 and the
> +   jump body is completely optimized out.  However, we used to fail to
> +   consider the block for jump threading due to the amount of
> +   computation in it, without realizing most of it would die in
> +   consequence of the threading.
> +
> +   We now take the dying code into account when deciding whether or
> +   not to try jump threading.  That might enable us to optimize the
> +   function into { if (x2 != 0 || (x1 & 1) == 0) abort (); }.  At the
> +   time of this writing, with the patch, we get close, but the test on
> +   x2 only gets as far as ((1 >> x2) == 0).  Without the patch, some
> +   of the loop remains.  */
> +
> +short x0 = 15;
> +
> +void func (){
> +  volatile int x1 = 1U;
> +  volatile char x2 = 0;
> +  char t0 = 0;
> +  unsigned long t1 = 2LU;
> +  int i = 0;
> +
> +  if(1>>x2) {
> +    t0 = -1;
> +    t1 = (1&(short)(x1^8U))-1;
> +  }
> +
> +  while(i > (int)((1U>>t1)+(char)(128%(10*(25LU&(29%x0)))))) {
> +    i += (int)(12L/(1!=(int)t1));
> +  }
> +
> +  if (t0 != -1) __builtin_abort();
> +  if (t1 != 0L) __builtin_abort();
> +}
> diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c
> index 91793bfa59d3..0f5b943aa9a0 100644
> --- a/gcc/tree-ssa-threadedge.c
> +++ b/gcc/tree-ssa-threadedge.c
> @@ -170,6 +170,160 @@ threadedge_valueize (tree t)
>    return t;
>  }
>
> +/* Return how many uses of T there are within BB, as long as there
> +   aren't any uses outside BB.  If there are any uses outside BB,
> +   return -1 if there's at most one use within BB, or -2 if there is
> +   more than one use within BB.  */
> +
> +static int
> +uses_in_bb (tree t, basic_block bb)
> +{
> +  int uses = 0;
> +  bool outside_bb = false;
> +
> +  imm_use_iterator iter;
> +  use_operand_p use_p;
> +  FOR_EACH_IMM_USE_FAST (use_p, iter, t)
> +    {
> +      if (is_gimple_debug (USE_STMT (use_p)))
> +       continue;
> +
> +      if (gimple_bb (USE_STMT (use_p)) != bb)
> +       outside_bb = true;
> +      else
> +       uses++;
> +
> +      if (outside_bb && uses > 1)
> +       return -2;
> +    }
> +
> +  if (outside_bb)
> +    return -1;
> +
> +  return uses;
> +}
> +
> +/* Starting from the final control flow stmt in BB, assuming it will
> +   be removed, follow uses in to-be-removed stmts back to their defs
> +   and count how many defs are to become dead and be removed as
> +   well.  */
> +
> +static int
> +estimate_threading_killed_stmts (basic_block bb)
> +{
> +  int killed_stmts = 0;
> +  hash_map<tree, int> ssa_remaining_uses;
> +  auto_vec<gimple *, 4> dead_worklist;
> +
> +  /* If the block has only two predecessors, threading will turn phi
> +     dsts into either src, so count them as dead stmts.  */
> +  bool drop_all_phis = EDGE_COUNT (bb->preds) == 2;
> +
> +  if (drop_all_phis)
> +    for (gphi_iterator gsi = gsi_start_phis (bb);
> +        !gsi_end_p (gsi); gsi_next (&gsi))
> +      {
> +       gphi *phi = gsi.phi ();
> +       tree dst = gimple_phi_result (phi);
> +
> +       /* We don't count virtual PHIs as stmts in
> +          record_temporary_equivalences_from_phis.  */
> +       if (virtual_operand_p (dst))
> +         continue;
> +
> +       killed_stmts++;
> +      }
> +
> +  if (gsi_end_p (gsi_last_bb (bb)))
> +    return killed_stmts;
> +
> +  gimple *stmt = gsi_stmt (gsi_last_bb (bb));
> +  if (gimple_code (stmt) != GIMPLE_COND
> +      && gimple_code (stmt) != GIMPLE_GOTO
> +      && gimple_code (stmt) != GIMPLE_SWITCH)
> +    return killed_stmts;
> +
> +  dead_worklist.quick_push (stmt);
> +  while (!dead_worklist.is_empty ())
> +    {
> +      stmt = dead_worklist.pop ();
> +
> +      ssa_op_iter iter;
> +      use_operand_p use_p;
> +      FOR_EACH_SSA_USE_OPERAND (use_p, stmt, iter, SSA_OP_USE)
> +       {
> +         tree t = USE_FROM_PTR (use_p);
> +         gimple *def = SSA_NAME_DEF_STMT (t);
> +
> +         if (gimple_bb (def) == bb
> +             && (gimple_code (def) != GIMPLE_PHI
> +                 || !drop_all_phis)
> +             && !gimple_has_side_effects (def))
> +           {
> +             int *usesp = ssa_remaining_uses.get (t);
> +             int uses;
> +
> +             if (usesp)
> +               uses = *usesp;
> +             else
> +               uses = uses_in_bb (t, bb);
> +
> +             gcc_assert (uses);
> +
> +             /* Don't bother recording the expected use count if we
> +                won't find any further uses within BB.  */
> +             if (!usesp && (uses < -1 || uses > 1))
> +               {
> +                 usesp = &ssa_remaining_uses.get_or_insert (t);
> +                 *usesp = uses;
> +               }
> +
> +             if (uses < 0)
> +               continue;
> +
> +             --uses;
> +             if (usesp)
> +               *usesp = uses;
> +
> +             if (!uses)
> +               {
> +                 killed_stmts++;
> +                 if (usesp)
> +                   ssa_remaining_uses.remove (t);
> +                 if (gimple_code (def) != GIMPLE_PHI)
> +                   dead_worklist.safe_push (def);
> +               }
> +           }
> +       }
> +    }
> +
> +  if (dump_file)
> +    fprintf (dump_file, "threading bb %i kills %i stmts\n",
> +            bb->index, killed_stmts);
> +
> +  return killed_stmts;
> +}
> +
> +/* Estimate, over all src blocks to be copied in PATH and E's dest,
> +   the number of stmts that become dead as a consequence of removing
> +   their final control flow stmts.  If PATH is not empty, E must be
> +   its last entry.  */
> +
> +static int
> +estimate_threading_killed_stmts (edge e, vec<jump_thread_edge *> *path)
> +{
> +  gcc_assert (path->length () == 0 || path->last ()->e == e);
> +  int total = estimate_threading_killed_stmts (e->dest);
> +
> +  int i;
> +  jump_thread_edge *je;
> +  FOR_EACH_VEC_ELT_REVERSE (*path, i, je)
> +    if (je->type == EDGE_COPY_SRC_BLOCK)
> +      total += estimate_threading_killed_stmts (je->e->src);
> +
> +  return total;
> +}
> +
>  /* Try to simplify each statement in E->dest, ultimately leading to
>     a simplification of the COND_EXPR at the end of E->dest.
>
> @@ -191,7 +345,8 @@ static gimple *
>  record_temporary_equivalences_from_stmts_at_dest (edge e,
>      const_and_copies *const_and_copies,
>      avail_exprs_stack *avail_exprs_stack,
> -    pfn_simplify simplify)
> +    pfn_simplify simplify,
> +    vec<jump_thread_edge *> *path)
>  {
>    gimple *stmt = NULL;
>    gimple_stmt_iterator gsi;
> @@ -233,7 +388,22 @@ record_temporary_equivalences_from_stmts_at_dest (edge e,
>          expansion, then do not thread through this block.  */
>        stmt_count++;
>        if (stmt_count > max_stmt_count)
> -       return NULL;
> +       {
> +         /* If any of the stmts in the PATH's dests are going to be
> +            killed due to threading, grow the max count
> +            accordingly.  */
> +         if (max_stmt_count
> +             == PARAM_VALUE (PARAM_MAX_JUMP_THREAD_DUPLICATION_STMTS))
> +           {
> +             max_stmt_count += estimate_threading_killed_stmts (e, path);
> +             if (dump_file)
> +               fprintf (dump_file, "threading bb %i up to %i stmts\n",
> +                        e->dest->index, max_stmt_count);
> +           }
> +         /* If we're still past the limit, we're done.  */
> +         if (stmt_count > max_stmt_count)
> +           return NULL;
> +       }
>
>        /* If this is not a statement that sets an SSA_NAME to a new
>          value, then do not try to simplify this statement as it will
> @@ -1000,7 +1170,7 @@ thread_through_normal_block (edge e,
>    gimple *stmt
>      = record_temporary_equivalences_from_stmts_at_dest (e, const_and_copies,
>                                                         avail_exprs_stack,
> -                                                       simplify);
> +                                                       simplify, path);
>
>    /* There's two reasons STMT might be null, and distinguishing
>       between them is important.
>
>
> --
> Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
> You must be the change you wish to see in the world. -- Gandhi
> Be Free! -- http://FSFLA.org/   FSF Latin America board member
> Free Software Evangelist|Red Hat Brasil GNU Toolchain Engineer

Re: [PR81165] discount killed stmts when sizing blocks for threading

Reply via email to