On Thu, Sep 17, 2015 at 5:58 PM, Simon Dardis <simon.dar...@imgtec.com> wrote: > I've since taken another look at this recently and I've tracked the issue > down to > tree-predcom.c, specifically ref_at_iteration almost always generating > MEM_REFs. > With MEM_REFs, GCC's RTL GCSE cannot compare them as equal and hence > remove them. A previous version of the code did generate ARRAY_REFs > (pre 204458), but that was changed to generate MEM_REFs for pr/58653.
Ok, so the issue is the full redundancy in <bb 5>: # count_28 = PHI <0(4), count_23(7)> if (pretmp_42 > 1.0e+0) goto <bb 6>; else goto <bb 9>; <bb 6>: _8 = 1.0e+0 / pretmp_42; _12 = _8 * _8; poly[1] = _12; <bb 7>: # prephitmp_30 = PHI <_12(6), _36(9)> # T_lsm.8_22 = PHI <_8(6), pretmp_42(9)> poly_I_lsm0.10_38 = MEM[(double *)&poly + 8B]; ... if (count_23 != iterations_6(D)) goto <bb 5>; else goto <bb 8>; <bb 9>: _36 = pretmp_42 * pretmp_42; poly[1] = _36; goto <bb 7>; ? On x86_64 I have only one load from poly[1] left in cprop1, the one using the MEM_REF. It's (insn 31 30 32 6 (set (reg:DF 93 [ poly_I_lsm0.10 ]) (mem/c:DF (const:DI (plus:DI (symbol_ref:DI ("poly") [flags 0x2] <var_decl 0x7f6eab532e10 poly>) (const_int 8 [0x8]))) [1 MEM[(double *)&poly + 8B]+0 S8 A64])) 126 {*movdf_internal} (nil)) vs. (insn 67 65 5 8 (set (mem/c:DF (const:DI (plus:DI (symbol_ref:DI ("poly") [flags 0x2] <var_decl 0x7f6eab532e10 poly>) (const_int 8 [0x8]))) [1 poly+8 S8 A64]) (reg:DF 90)) t.c:22 126 {*movdf_internal} for the partial redundancy across the backedge. I don't see why GCSE should even care for the MEM_EXPR here! > Would something like: > --- a/gcc/tree-predcom.c > +++ b/gcc/tree-predcom.c > @@ -1409,7 +1409,21 @@ ref_at_iteration (data_reference_p dr, int iter, > gimple_seq *stmts) > addr, alias_ptr), > DECL_SIZE (field), bitsize_zero_node); > } > - else > + /* Generate an ARRAY_REF for array references when all details are > INTEGER_CST > + rather than a MEM_REF so that CSE passes can potientially optimize > them. */ > + else if (TREE_CODE (DR_REF (dr)) == ARRAY_REF > + && TREE_CODE (DR_STEP (dr)) == INTEGER_CST > + && TREE_CODE (DR_INIT (dr)) == INTEGER_CST > + && TREE_CODE (DR_OFFSET (dr)) == INTEGER_CST) > + { > + /* Reverse engineer the element from memory offset. */ > + tree offset = size_binop (MINUS_EXPR, coff, off); > + tree sizdiv = TYPE_SIZE (TREE_TYPE (TREE_TYPE (DR_BASE_OBJECT (dr)))); > + sizdiv = div_if_zero_remainder (EXACT_DIV_EXPR, sizdiv, ssize_int > (BITS_PER_UNIT)); > + tree element = div_if_zero_remainder (EXACT_DIV_EXPR, offset, sizdiv); > + if (element != NULL_TREE) > + return build4 (ARRAY_REF, TREE_TYPE (DR_REF (dr)), DR_BASE_OBJECT > (dr), > + element, NULL_TREE, NULL_TREE); > + } > return fold_build2 (MEM_REF, TREE_TYPE (DR_REF (dr)), addr, alias_ptr); > > be an appropriate start to fixing this? That fix appears to work in in my > testing. No, that seems to be the wrong fix and in the light of wrong-code issues we had with the middle-end generating ARRAY_REFs even more so. What we can do and what would maybe help this case is in ref_at_iteration detect the DR_STEP == 0 case and simply return unshare_expr (DR_REF (dr)) in that case. Well, iff you don't want to explore the GCSE issue... Index: gcc/tree-predcom.c =================================================================== --- gcc/tree-predcom.c (revision 227899) +++ gcc/tree-predcom.c (working copy) @@ -1386,6 +1386,10 @@ replace_ref_with (gimple stmt, tree new_ static tree ref_at_iteration (data_reference_p dr, int iter, gimple_seq *stmts) { + /* If the reference doesn't change just return it. */ + if (integer_zerop (DR_STEP (dr))) + return unshare_expr (DR_REF (dr)); + tree off = DR_OFFSET (dr); tree coff = DR_INIT (dr); if (iter == 0) hmm, no, then we end up with # prephitmp_30 = PHI <_12(6), _36(9)> # T_lsm.8_22 = PHI <_8(6), pretmp_42(9)> poly_I_lsm0.10_38 = poly[1]; _24 = MEM[(double *)&poly + 8B]; beause DOM has a similar issue preventing it from CSEing those two. Richard. > Thanks, > Simon > > -----Original Message----- > From: Richard Biener [mailto:richard.guent...@gmail.com] > Sent: 31 August 2015 11:40 > To: Jeff Law > Cc: Simon Dardis; gcc@gcc.gnu.org > Subject: Re: Predictive commoning leads to register to register moves through > memory. > > On Fri, Aug 28, 2015 at 5:48 PM, Jeff Law <l...@redhat.com> wrote: >> On 08/28/2015 09:43 AM, Simon Dardis wrote: >> >>> Following Jeff's advice[1] to extract more information from GCC, I've >>> narrowed the cause down to the predictive commoning pass inserting >>> the load in a loop header style basic block. However, the next pass >>> in GCC, tree-cunroll promptly removes the loop and joins the loop >>> header to the body of the (non)loop. More oddly, disabling >>> conditional store elimination pass or the dominator optimizations >>> pass or disabling of jump-threading with --param >>> max-jump-thread-duplication-stmts=0 nets the above assembly code. Any >>> ideas on an approach for this issue? >> >> I'd probably start by looking at the .optimized tree dump in both >> cases to understand the difference, then (most liklely) tracing that >> through the RTL optimizers into the register allocator. > > It's the known issue of LIM (here the one after pcom and complete unrolling > of the inner loop) being too aggressive with store-motion. Here the comptete > array is replaced with registers for the outer loop. Were 'poly' a local > variable we'd have optimized it away completely. > > <bb 6>: > _8 = 1.0e+0 / pretmp_42; > _12 = _8 * _8; > poly[1] = _12; > > <bb 7>: > # prephitmp_30 = PHI <_12(6), _36(9)> > # T_lsm.8_22 = PHI <_8(6), pretmp_42(9)> > poly_I_lsm0.10_38 = MEM[(double *)&poly + 8B]; > _2 = prephitmp_30 * poly_I_lsm0.10_38; > _54 = _2 * poly_I_lsm0.10_38; > _67 = poly_I_lsm0.10_38 * _54; > _80 = poly_I_lsm0.10_38 * _67; > _93 = poly_I_lsm0.10_38 * _80; > _106 = poly_I_lsm0.10_38 * _93; > _19 = poly_I_lsm0.10_38 * _106; > count_23 = count_28 + 1; > if (count_23 != iterations_6(D)) > goto <bb 5>; > else > goto <bb 8>; > > <bb 8>: > poly[2] = _2; > poly[3] = _54; > poly[4] = _67; > poly[5] = _80; > poly[6] = _93; > poly[7] = _106; > poly[8] = _19; > i1 = 9; > T = T_lsm.8_22; > > note that DOM misses to CSE poly[1] (a known defect), but heh, doing that > would only increase register pressure even more. > > Note the above is on x86_64. > > Richard. > > >> jeff