On Thu, 30 Jun 2016, Richard Biener wrote:

> 
> The following patch fixes PR71632 by removing delayed expansion of
> TERed defs.  Instead it adds code to apply the scheduling effect
> to the GIMPLE IL (so you also get better interleaved GIMPLE stmt
> / generated RTL dumps in .expand).
> 
> This removes the quadratic re-expansion of TERed RHS as seen in
> the testcase.
> 
> It affects initial RTL generation in following ways:
> 
>  1) it may expand stmts with unused LHS
>  2) it may expand dead code when use stmts are expanded and those
>     use get_gimple_for_ssa, expanding its operands (again).
>  3) it no longer automatically tries to combine with defs doing
>     memory loads (in case we TERed the load and expansion of the
>     RHS resulted in a MEM).
>  4) the depth-first, left-to-right "expansion" of ops might not
>     be a 100% match of what expand currently does
> 
> I expect that 1) and 2) are mostly a non-issue (and not worse than
> the effect seen in the PR).  I also expect that 3) is "quickly"
> recovered by fwprop or combine.
> 
> While I'm quite sure 3) exists I'm not 100% sure we never return
> a non-constant/reg for SSA name expansion (considering expand
> modifiers like EXPAND_SUM).
> 
> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
> 
> I think this is also an important step towards eventually doing
> a meaningful GIMPLE-level scheduling (for register pressure,
> also to remove this awkward code-placement code from reassoc
> for example).  Without this TER would disrupt the result quite
> a bit (so the scheduling part of TER would be disabled when scheduling
> on GIMPLE).
> 
> Comments?  I'll throw this on one of our spec testers tonight any
> testing on non-x86 archs appreciated.

So testing works ok apart from a few fails:

x86_64/-m64

FAIL: gcc.target/i386/pr70155-22.c scan-assembler-not \\\\*movdi_internal
FAIL: gcc.target/i386/xorps-sse2.c scan-assembler xorps[ \\t]
FAIL: gcc.target/i386/xorps-sse2.c scan-assembler-not pxor

x86_64/-m32

XPASS: gcc.target/i386/addr-sel-1.c scan-assembler b\\\\+1
FAIL: gcc.target/i386/movbe-2.c scan-assembler-times movbe[ \\t] 4
FAIL: gcc.target/i386/movq-2.c scan-assembler movzbl[ \\t]*123
FAIL: gcc.target/i386/pr71245-1.c scan-assembler-not (fistp|fild)
FAIL: gcc.target/i386/pr71245-2.c scan-assembler-not movlps
FAIL: gcc.target/i386/xorps-sse2.c scan-assembler xorps[ \\t]
FAIL: gcc.target/i386/xorps-sse2.c scan-assembler-not pxor

I so far looked at gcc.target/i386/pr70155-22.c and without the patch
RTL expansion creates

(insn 7 6 0 2 (parallel [
            (set (mem/c:TI (symbol_ref:DI ("c") [flags 0x40]  <var_decl 
0x7f370f13be10 c>) [1 c+0 S16 A128])
                (plus:TI (mem/c:TI (symbol_ref:DI ("c") [flags 0x40]  
<var_decl 0x7f370f13be10 c>) [1 c+0 S16 A128])
                    (const_int 1 [0x1])))
            (clobber (reg:CC 17 flags))
        ]) 
/space/rguenther/src/svn/trunk/gcc/testsuite/gcc.target/i386/pr70155-22.c:10 
210 {*addti3_doubleword}
     (nil))

directly while with the patch we get the load and store separated from the
add but then at combine-time:

Trying 7, 8 -> 9:
Successfully matched this instruction:
(set (mem/c:TI (symbol_ref:DI ("c") [flags 0x40]  <var_decl 0x7f0d8f0d4e10 
c>) [1 c+0 S16 A128])
    (plus:TI (mem/c:TI (symbol_ref:DI ("c") [flags 0x40]  <var_decl 
0x7f0d8f0d4e10 c>) [1 c+0 S16 A128])
        (const_int 1 [0x1])))
rejecting combination of insns 7, 8 and 9
original costs 8 + 8 + 4 = 20
replacement cost 24

so for some reason RTX costs in the backend reject this variant.  Still
the testcase wants it to happen.

I suspect the other testcases run into similar issues.  Like
gcc.target/i386/xorps-sse2.c somehow magically wanting the
vector integer xor operation in the GIMPLE IL be rewritten to vector float
which probably happens through some simplify-rtx tricks at RTL
expansion time (not sure if it is a good idea to do that, the
vector int constant could be a NaN/denormal when loaded as vector float
and thus trigger a similar issue as the testcase wants to avoid,
re-interpret of int vector as float causing slowdowns).

Richard.

> Thanks,
> Richard.
> 
> PS: avoid_deep_ter_for_debug could go away as well I think.
> 
> 2016-06-30  Richard Biener  <rguent...@suse.de>
> 
>       PR middle-end/71632
>       * expr.c (expand_expr_real_1): Do not expand TERed SSA name
>       def rhs.
>       * cfgexpand.c (expand_gimple_basic_block): Expand defs of
>       TERed SSA names.  Move debug stmt use handling ...
>       * tree-outof-ssa.c (move_uses_r): ... here.  New helper
>       for ...
>       (rewrite_trees): ... code to apply the scheduling effect of TER
>       to the GIMPLE IL.
>       (remove_ssa_form): Move rewrite_trees call.
> 
>       * gcc.dg/torture/pr71632.c: New testcase.
> 
> Index: gcc/expr.c
> ===================================================================
> *** gcc/expr.c        (revision 237873)
> --- gcc/expr.c        (working copy)
> *************** expand_expr_real_1 (tree exp, rtx target
> *** 9695,9704 ****
>                             LAST_VIRTUAL_REGISTER + 1);
>       }
>   
> !       g = get_gimple_for_ssa_name (exp);
> !       /* For EXPAND_INITIALIZER try harder to get something simpler.  */
> !       if (g == NULL
> !       && modifier == EXPAND_INITIALIZER
>         && !SSA_NAME_IS_DEFAULT_DEF (exp)
>         && (optimize || !SSA_NAME_VAR (exp)
>             || DECL_IGNORED_P (SSA_NAME_VAR (exp)))
> --- 9677,9685 ----
>                             LAST_VIRTUAL_REGISTER + 1);
>       }
>   
> !       g = NULL;
> !       /* For EXPAND_INITIALIZER try to get something simpler.  */
> !       if (modifier == EXPAND_INITIALIZER
>         && !SSA_NAME_IS_DEFAULT_DEF (exp)
>         && (optimize || !SSA_NAME_VAR (exp)
>             || DECL_IGNORED_P (SSA_NAME_VAR (exp)))
> Index: gcc/cfgexpand.c
> ===================================================================
> *** gcc/cfgexpand.c   (revision 237873)
> --- gcc/cfgexpand.c   (working copy)
> *************** expand_gimple_basic_block (basic_block b
> *** 5510,5611 ****
>         basic_block new_bb;
>   
>         stmt = gsi_stmt (gsi);
> - 
> -       /* If this statement is a non-debug one, and we generate debug
> -      insns, then this one might be the last real use of a TERed
> -      SSA_NAME, but where there are still some debug uses further
> -      down.  Expanding the current SSA name in such further debug
> -      uses by their RHS might lead to wrong debug info, as coalescing
> -      might make the operands of such RHS be placed into the same
> -      pseudo as something else.  Like so:
> -        a_1 = a_0 + 1;   // Assume a_1 is TERed and a_0 is dead
> -        use(a_1);
> -        a_2 = ...
> -            #DEBUG ... => a_1
> -      As a_0 and a_2 don't overlap in lifetime, assume they are coalesced.
> -      If we now would expand a_1 by it's RHS (a_0 + 1) in the debug use,
> -      the write to a_2 would actually have clobbered the place which
> -      formerly held a_0.
> - 
> -      So, instead of that, we recognize the situation, and generate
> -      debug temporaries at the last real use of TERed SSA names:
> -        a_1 = a_0 + 1;
> -            #DEBUG #D1 => a_1
> -        use(a_1);
> -        a_2 = ...
> -            #DEBUG ... => #D1
> -      */
> -       if (MAY_HAVE_DEBUG_INSNS
> -       && SA.values
> -       && !is_gimple_debug (stmt))
> -     {
> -       ssa_op_iter iter;
> -       tree op;
> -       gimple *def;
> - 
> -       location_t sloc = curr_insn_location ();
> - 
> -       /* Look for SSA names that have their last use here (TERed
> -          names always have only one real use).  */
> -       FOR_EACH_SSA_TREE_OPERAND (op, stmt, iter, SSA_OP_USE)
> -         if ((def = get_gimple_for_ssa_name (op)))
> -           {
> -             imm_use_iterator imm_iter;
> -             use_operand_p use_p;
> -             bool have_debug_uses = false;
> - 
> -             FOR_EACH_IMM_USE_FAST (use_p, imm_iter, op)
> -               {
> -                 if (gimple_debug_bind_p (USE_STMT (use_p)))
> -                   {
> -                     have_debug_uses = true;
> -                     break;
> -                   }
> -               }
> - 
> -             if (have_debug_uses)
> -               {
> -                 /* OP is a TERed SSA name, with DEF its defining
> -                    statement, and where OP is used in further debug
> -                    instructions.  Generate a debug temporary, and
> -                    replace all uses of OP in debug insns with that
> -                    temporary.  */
> -                 gimple *debugstmt;
> -                 tree value = gimple_assign_rhs_to_tree (def);
> -                 tree vexpr = make_node (DEBUG_EXPR_DECL);
> -                 rtx val;
> -                 machine_mode mode;
> - 
> -                 set_curr_insn_location (gimple_location (def));
> - 
> -                 DECL_ARTIFICIAL (vexpr) = 1;
> -                 TREE_TYPE (vexpr) = TREE_TYPE (value);
> -                 if (DECL_P (value))
> -                   mode = DECL_MODE (value);
> -                 else
> -                   mode = TYPE_MODE (TREE_TYPE (value));
> -                 DECL_MODE (vexpr) = mode;
> - 
> -                 val = gen_rtx_VAR_LOCATION
> -                     (mode, vexpr, (rtx)value, VAR_INIT_STATUS_INITIALIZED);
> - 
> -                 emit_debug_insn (val);
> - 
> -                 FOR_EACH_IMM_USE_STMT (debugstmt, imm_iter, op)
> -                   {
> -                     if (!gimple_debug_bind_p (debugstmt))
> -                       continue;
> - 
> -                     FOR_EACH_IMM_USE_ON_STMT (use_p, imm_iter)
> -                       SET_USE (use_p, vexpr);
> - 
> -                     update_stmt (debugstmt);
> -                   }
> -               }
> -           }
> -       set_curr_insn_location (sloc);
> -     }
> - 
>         currently_expanding_gimple_stmt = stmt;
>   
>         /* Expand this statement, then evaluate the resulting RTL and
> --- 5510,5515 ----
> *************** expand_gimple_basic_block (basic_block b
> *** 5732,5749 ****
>           }
>         else
>           {
> -           def_operand_p def_p;
> -           def_p = SINGLE_SSA_DEF_OPERAND (stmt, SSA_OP_DEF);
> - 
> -           if (def_p != NULL)
> -             {
> -               /* Ignore this stmt if it is in the list of
> -                  replaceable expressions.  */
> -               if (SA.values
> -                   && bitmap_bit_p (SA.values,
> -                                    SSA_NAME_VERSION (DEF_FROM_PTR (def_p))))
> -                 continue;
> -             }
>             last = expand_gimple_stmt (stmt);
>             maybe_dump_rtl_for_gimple_stmt (stmt, last);
>           }
> --- 5636,5641 ----
> Index: gcc/tree-outof-ssa.c
> ===================================================================
> *** gcc/tree-outof-ssa.c      (revision 237873)
> --- gcc/tree-outof-ssa.c      (working copy)
> *************** along with GCC; see the file COPYING3.
> *** 42,47 ****
> --- 42,49 ----
>   #include "tree-ssa-coalesce.h"
>   #include "tree-outof-ssa.h"
>   #include "dojump.h"
> + #include "tree-ssa.h"
> + #include "gimple-walk.h"
>   
>   /* FIXME: A lot of code here deals with expanding to RTL.  All that code
>      should be in cfgexpand.c.  */
> *************** eliminate_useless_phis (void)
> *** 870,875 ****
> --- 872,941 ----
>   }
>   
>   
> + /* walk_gimple_op callback to schedule TERed defs before uses.  */
> + 
> + static tree
> + move_uses_r (tree *tp, int *, void *data)
> + {
> +   walk_stmt_info *wi = (walk_stmt_info *)data;
> + 
> +   if (TREE_CODE (*tp) == SSA_NAME
> +       && bitmap_bit_p ((bitmap)wi->info, SSA_NAME_VERSION (*tp)))
> +     {
> +       gimple *def = SSA_NAME_DEF_STMT (*tp);
> +       /* We are only interested in uses.  */
> +       if (def == wi->stmt)
> +     return NULL_TREE;
> + 
> +       /* Avoid useless work.  We should enhance gsi_move_before
> +      so it can avoid the redundant stmt updating it
> +      performs as it uses gsi_remove/gsi_insert_before.  */
> +       if (wi->gsi.ptr->prev == def)
> +     {
> +       gsi_prev (&wi->gsi);
> +       return NULL_TREE;
> +     }
> + 
> +       /* If this statement is a non-debug one, and we generate
> +      debug insns, then this one might be the last real use of
> +      a TERed SSA_NAME, but where there are still some debug
> +      uses further down.  Expanding the current SSA name in
> +      such further debug uses by their RHS might lead to wrong
> +      debug info, as coalescing might make the operands of
> +      such RHS be placed into the same pseudo as something
> +      else.  Like so:
> +        a_1 = a_0 + 1;   // Assume a_1 is TERed and a_0 is dead
> +        use(a_1);
> +        a_2 = ...
> +        #DEBUG ... => a_1
> +      As a_0 and a_2 don't overlap in lifetime, assume they
> +      are coalesced.
> +      If we now would expand a_1 by it's RHS (a_0 + 1) in the
> +      debug use, the write to a_2 would actually have
> +      clobbered the place which formerly held a_0.
> + 
> +      So, instead of that, we recognize the situation, and
> +      generate debug temporaries at the last real use of
> +      TERed SSA names:
> +        a_1 = a_0 + 1;
> +        #DEBUG #D1 => a_1
> +        use(a_1);
> +        a_2 = ...
> +        #DEBUG ... => #D1  */
> +       insert_debug_temp_for_var_def (NULL, *tp);
> + 
> +       /* wi->gsi starts at 'stmt' and the walker visits operands
> +          left-to-right.  Implement depth-first and left-to-right
> +      evaluation (the depth step is performed by walking the
> +      here inserted stmts backwards).  */
> +       gimple_stmt_iterator gsi2 = gsi_for_stmt (def);
> +       gsi_remove (&gsi2, false);
> +       gsi_insert_before (&wi->gsi, def, GSI_NEW_STMT);
> +     }
> + 
> +   return NULL_TREE;
> + }
> + 
>   /* This function will rewrite the current program using the variable mapping
>      found in MAP.  If the replacement vector VALUES is provided, any
>      occurrences of partitions with non-null entries in the vector will be
> *************** eliminate_useless_phis (void)
> *** 877,888 ****
>      variable.  */
>   
>   static void
> ! rewrite_trees (var_map map)
>   {
>     if (!flag_checking)
>       return;
>   
> -   basic_block bb;
>     /* Search for PHIs where the destination has no partition, but one
>        or more arguments has a partition.  This should not happen and can
>        create incorrect code.  */
> --- 943,979 ----
>      variable.  */
>   
>   static void
> ! rewrite_trees (var_map map, struct ssaexpand *sa)
>   {
> +   basic_block bb;
> + 
> +   /* Apply TER stmt re-ordering.  */
> +   if (sa->values)
> +     FOR_EACH_BB_FN (bb, cfun)
> +       {
> +     /* Re-materialize replaceable expressions before stmts using
> +        them.  We walk the BBs backwards so we simulate the way
> +        delayed expansion worked by expanding things depth-first
> +        and left-to-right (in most cases).  */
> +     for (gimple_stmt_iterator gsi = gsi_last_bb (bb);
> +          !gsi_end_p (gsi); gsi_prev (&gsi))
> +       {
> +         gimple *stmt = gsi_stmt (gsi);
> +         if (is_gimple_debug (stmt))
> +           continue;
> + 
> +         walk_stmt_info wi;
> +         memset (&wi, 0, sizeof (walk_stmt_info));
> +         wi.stmt = stmt;
> +         wi.gsi = gsi;
> +         wi.info = SA.values;
> +         walk_gimple_op (stmt, move_uses_r, &wi);
> +       }
> +       }
> + 
>     if (!flag_checking)
>       return;
>   
>     /* Search for PHIs where the destination has no partition, but one
>        or more arguments has a partition.  This should not happen and can
>        create incorrect code.  */
> *************** remove_ssa_form (bool perform_ter, struc
> *** 994,1004 ****
>       dump_replaceable_exprs (dump_file, values);
>       }
>   
> -   rewrite_trees (map);
> - 
>     sa->map = map;
>     sa->values = values;
>     sa->partitions_for_parm_default_defs = get_parm_default_def_partitions 
> (map);
>   }
>   
>   
> --- 1085,1095 ----
>       dump_replaceable_exprs (dump_file, values);
>       }
>   
>     sa->map = map;
>     sa->values = values;
>     sa->partitions_for_parm_default_defs = get_parm_default_def_partitions 
> (map);
> + 
> +   rewrite_trees (map, sa);
>   }
>   
>   
> 
> Index: gcc/testsuite/gcc.dg/torture/pr71632.c
> ===================================================================
> --- gcc/testsuite/gcc.dg/torture/pr71632.c    (revision 0)
> +++ gcc/testsuite/gcc.dg/torture/pr71632.c    (working copy)
> @@ -0,0 +1,52 @@
> +/* { dg-do compile } */
> +
> +void
> +foo (double d, double *p, double *q)
> +{
> +  int i;
> +  for (i = 0; i < 64; i++)
> +    {
> +      double t1 = d > p[0] ? 1.0 : 0.0;
> +      double t2 = t1 > p[1] ? 1.0 : 0.0;
> +      double t3 = t2 > p[2] ? 1.0 : 0.0;
> +      double t4 = t3 > p[3] ? 1.0 : 0.0;
> +      double t5 = t4 > p[4] ? 1.0 : 0.0;
> +      double t6 = t5 > p[5] ? 1.0 : 0.0;
> +      double t7 = t6 > p[6] ? 1.0 : 0.0;
> +      double t8 = t7 > p[7] ? 1.0 : 0.0;
> +      double t9 = t8 > p[8] ? 1.0 : 0.0;
> +      double t10 = t9 > p[9] ? 1.0 : 0.0;
> +      double t11 = t10 > p[10] ? 1.0 : 0.0;
> +      double t12 = t11 > p[11] ? 1.0 : 0.0;
> +      double t13 = t12 > p[12] ? 1.0 : 0.0;
> +      double t14 = t13 > p[13] ? 1.0 : 0.0;
> +      double t15 = t14 > p[14] ? 1.0 : 0.0;
> +      double t16 = t15 > p[15] ? 1.0 : 0.0;
> +      double t17 = t16 > p[16] ? 1.0 : 0.0;
> +      double t18 = t17 > p[17] ? 1.0 : 0.0;
> +      double t19 = t18 > p[18] ? 1.0 : 0.0;
> +      double t20 = t19 > p[19] ? 1.0 : 0.0;
> +      double t21 = t20 > p[20] ? 1.0 : 0.0;
> +      double t22 = t21 > p[21] ? 1.0 : 0.0;
> +      double t23 = t22 > p[22] ? 1.0 : 0.0;
> +      double t24 = t23 > p[23] ? 1.0 : 0.0;
> +      double t25 = t24 > p[24] ? 1.0 : 0.0;
> +      double t26 = t25 > p[25] ? 1.0 : 0.0;
> +      double t27 = t26 > p[26] ? 1.0 : 0.0;
> +      double t28 = t27 > p[27] ? 1.0 : 0.0;
> +      double t29 = t28 > p[28] ? 1.0 : 0.0;
> +      double t30 = t29 > p[29] ? 1.0 : 0.0;
> +      double t31 = t30 > p[30] ? 1.0 : 0.0;
> +      double t32 = t31 > p[31] ? 1.0 : 0.0;
> +      double t33 = t32 > p[32] ? 1.0 : 0.0;
> +      double t34 = t33 > p[33] ? 1.0 : 0.0;
> +      double t35 = t34 > p[34] ? 1.0 : 0.0;
> +      double t36 = t35 > p[35] ? 1.0 : 0.0;
> +      double t37 = t36 > p[36] ? 1.0 : 0.0;
> +      double t38 = t37 > p[37] ? 1.0 : 0.0;
> +      double t39 = t38 > p[38] ? 1.0 : 0.0;
> +      *q = t39;
> +      p += 39;
> +      q++;
> +    }
> +}
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 
21284 (AG Nuernberg)

Reply via email to