On Thu, Sep 17, 2015 at 5:58 PM, Simon Dardis <simon.dar...@imgtec.com> wrote:
> I've since taken another look at this recently and I've tracked the issue 
> down to
> tree-predcom.c, specifically ref_at_iteration almost always generating 
> MEM_REFs.
> With MEM_REFs, GCC's RTL GCSE cannot compare them as equal and hence
> remove them. A previous version of the code did generate ARRAY_REFs
> (pre 204458), but that was changed to generate MEM_REFs for pr/58653.

Ok, so the issue is the full redundancy in

  <bb 5>:
  # count_28 = PHI <0(4), count_23(7)>
  if (pretmp_42 > 1.0e+0)
    goto <bb 6>;
  else
    goto <bb 9>;

  <bb 6>:
  _8 = 1.0e+0 / pretmp_42;
  _12 = _8 * _8;
  poly[1] = _12;

  <bb 7>:
  # prephitmp_30 = PHI <_12(6), _36(9)>
  # T_lsm.8_22 = PHI <_8(6), pretmp_42(9)>
  poly_I_lsm0.10_38 = MEM[(double *)&poly + 8B];
...
  if (count_23 != iterations_6(D))
    goto <bb 5>;
  else
    goto <bb 8>;

  <bb 9>:
  _36 = pretmp_42 * pretmp_42;
  poly[1] = _36;
  goto <bb 7>;

?

On x86_64 I have only one load from poly[1] left in cprop1, the one
using the MEM_REF.  It's

(insn 31 30 32 6 (set (reg:DF 93 [ poly_I_lsm0.10 ])
        (mem/c:DF (const:DI (plus:DI (symbol_ref:DI ("poly") [flags
0x2]  <var_decl 0x7f6eab532e10 poly>)
                    (const_int 8 [0x8]))) [1 MEM[(double *)&poly +
8B]+0 S8 A64])) 126 {*movdf_internal}
     (nil))

vs.

(insn 67 65 5 8 (set (mem/c:DF (const:DI (plus:DI (symbol_ref:DI
("poly") [flags 0x2]  <var_decl 0x7f6eab532e10 poly>)
                    (const_int 8 [0x8]))) [1 poly+8 S8 A64])
        (reg:DF 90)) t.c:22 126 {*movdf_internal}

for the partial redundancy across the backedge.  I don't see why GCSE
should even care for the MEM_EXPR
here!

> Would something like:
> --- a/gcc/tree-predcom.c
> +++ b/gcc/tree-predcom.c
> @@ -1409,7 +1409,21 @@ ref_at_iteration (data_reference_p dr, int iter, 
> gimple_seq *stmts)
>                              addr, alias_ptr),
>                      DECL_SIZE (field), bitsize_zero_node);
>      }
> -  else
> +  /* Generate an ARRAY_REF for array references when all details are 
> INTEGER_CST
> +     rather than a MEM_REF so that CSE passes can potientially optimize 
> them.  */
> +  else if (TREE_CODE (DR_REF (dr)) == ARRAY_REF
> +          && TREE_CODE (DR_STEP (dr)) == INTEGER_CST
> +          && TREE_CODE (DR_INIT (dr)) == INTEGER_CST
> +          && TREE_CODE (DR_OFFSET (dr)) == INTEGER_CST)
> +    {
> +       /* Reverse engineer the element from memory offset.  */
> +       tree offset = size_binop (MINUS_EXPR, coff, off);
> +       tree sizdiv = TYPE_SIZE (TREE_TYPE (TREE_TYPE (DR_BASE_OBJECT (dr))));
> +       sizdiv = div_if_zero_remainder (EXACT_DIV_EXPR, sizdiv, ssize_int 
> (BITS_PER_UNIT));
> +       tree element = div_if_zero_remainder (EXACT_DIV_EXPR, offset, sizdiv);
> +       if (element != NULL_TREE)
> +         return build4 (ARRAY_REF, TREE_TYPE (DR_REF (dr)), DR_BASE_OBJECT 
> (dr),
> +                        element, NULL_TREE, NULL_TREE);
> +    }
>      return fold_build2 (MEM_REF, TREE_TYPE (DR_REF (dr)), addr, alias_ptr);
>
> be an appropriate start to fixing this? That fix appears to work in in my 
> testing.

No, that seems to be the wrong fix and in the light of wrong-code
issues we had with the middle-end
generating ARRAY_REFs even more so.

What we can do and what would maybe help this case is in
ref_at_iteration detect the DR_STEP == 0 case
and simply return unshare_expr (DR_REF (dr)) in that case.  Well, iff
you don't want to explore the
GCSE issue...

Index: gcc/tree-predcom.c
===================================================================
--- gcc/tree-predcom.c  (revision 227899)
+++ gcc/tree-predcom.c  (working copy)
@@ -1386,6 +1386,10 @@ replace_ref_with (gimple stmt, tree new_
 static tree
 ref_at_iteration (data_reference_p dr, int iter, gimple_seq *stmts)
 {
+  /* If the reference doesn't change just return it.  */
+  if (integer_zerop (DR_STEP (dr)))
+    return unshare_expr (DR_REF (dr));
+
   tree off = DR_OFFSET (dr);
   tree coff = DR_INIT (dr);
   if (iter == 0)

hmm, no, then we end up with

  # prephitmp_30 = PHI <_12(6), _36(9)>
  # T_lsm.8_22 = PHI <_8(6), pretmp_42(9)>
  poly_I_lsm0.10_38 = poly[1];
  _24 = MEM[(double *)&poly + 8B];

beause DOM has a similar issue preventing it from CSEing those two.

Richard.

> Thanks,
> Simon
>
> -----Original Message-----
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: 31 August 2015 11:40
> To: Jeff Law
> Cc: Simon Dardis; gcc@gcc.gnu.org
> Subject: Re: Predictive commoning leads to register to register moves through 
> memory.
>
> On Fri, Aug 28, 2015 at 5:48 PM, Jeff Law <l...@redhat.com> wrote:
>> On 08/28/2015 09:43 AM, Simon Dardis wrote:
>>
>>> Following Jeff's advice[1] to extract more information from GCC, I've
>>> narrowed the cause down to the predictive commoning pass inserting
>>> the load in a loop header style basic block. However, the next pass
>>> in GCC, tree-cunroll promptly removes the loop and joins the loop
>>> header to the body of the (non)loop. More oddly, disabling
>>> conditional store elimination pass or the dominator optimizations
>>> pass or disabling of jump-threading with --param
>>> max-jump-thread-duplication-stmts=0 nets the above assembly code. Any
>>> ideas on an approach for this issue?
>>
>> I'd probably start by looking at the .optimized tree dump in both
>> cases to understand the difference, then (most liklely) tracing that
>> through the RTL optimizers into the register allocator.
>
> It's the known issue of LIM (here the one after pcom and complete unrolling 
> of the inner loop) being too aggressive with store-motion.  Here the comptete 
> array is replaced with registers for the outer loop.  Were 'poly' a local 
> variable we'd have optimized it away completely.
>
>   <bb 6>:
>   _8 = 1.0e+0 / pretmp_42;
>   _12 = _8 * _8;
>   poly[1] = _12;
>
>   <bb 7>:
>   # prephitmp_30 = PHI <_12(6), _36(9)>
>   # T_lsm.8_22 = PHI <_8(6), pretmp_42(9)>
>   poly_I_lsm0.10_38 = MEM[(double *)&poly + 8B];
>   _2 = prephitmp_30 * poly_I_lsm0.10_38;
>   _54 = _2 * poly_I_lsm0.10_38;
>   _67 = poly_I_lsm0.10_38 * _54;
>   _80 = poly_I_lsm0.10_38 * _67;
>   _93 = poly_I_lsm0.10_38 * _80;
>   _106 = poly_I_lsm0.10_38 * _93;
>   _19 = poly_I_lsm0.10_38 * _106;
>   count_23 = count_28 + 1;
>   if (count_23 != iterations_6(D))
>     goto <bb 5>;
>   else
>     goto <bb 8>;
>
>   <bb 8>:
>   poly[2] = _2;
>   poly[3] = _54;
>   poly[4] = _67;
>   poly[5] = _80;
>   poly[6] = _93;
>   poly[7] = _106;
>   poly[8] = _19;
>   i1 = 9;
>   T = T_lsm.8_22;
>
> note that DOM misses to CSE poly[1] (a known defect), but heh, doing that 
> would only increase register pressure even more.
>
> Note the above is on x86_64.
>
> Richard.
>
>
>> jeff

Reply via email to