I've taken another look at this recently and tracked the issue down to tree-predcom.c, specifically ref_at_iteration almost always generating MEM_REFs. With MEM_REFs, GCC's RTL GCSE cannot compare the references as equal and hence cannot remove the redundant ones. A previous version of the code did generate ARRAY_REFs (pre 204458), but that was changed to generate MEM_REFs for pr/58653.
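For anyone not steeped in the two tree codes, here is a rough C-level analogy. It is not GCC internals, only a sketch: the array and both helper names are made up for illustration. An ARRAY_REF is like indexing the base object, while a MEM_REF is like reading through the base address plus a byte offset. Both reach the same storage, but they are spelled differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the array discussed in this thread.  */
static double poly[9];

/* ARRAY_REF-like access: base object plus element index.  */
static double
load_array_ref (size_t index)
{
  assert (index < sizeof poly / sizeof poly[0]);
  return poly[index];
}

/* MEM_REF-like access: address of the base plus a byte offset.
   memcpy is used so the read is well defined for any in-bounds offset.  */
static double
load_mem_ref (size_t byte_offset)
{
  double value;
  assert (byte_offset + sizeof value <= sizeof poly);
  memcpy (&value, (const char *) poly + byte_offset, sizeof value);
  return value;
}
```

With these helpers, load_array_ref (1) and load_mem_ref (sizeof (double)) read the same element, even though the two accesses look nothing alike.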

Would something like:

--- a/gcc/tree-predcom.c
+++ b/gcc/tree-predcom.c
@@ -1409,7 +1409,21 @@ ref_at_iteration (data_reference_p dr, int iter, gimple_seq *stmts)
                             addr, alias_ptr),
                     DECL_SIZE (field), bitsize_zero_node);
     }
-  else
+  /* Generate an ARRAY_REF for array references when all details are INTEGER_CST
+     rather than a MEM_REF so that CSE passes can potentially optimize them.  */
+  else if (TREE_CODE (DR_REF (dr)) == ARRAY_REF
+           && TREE_CODE (DR_STEP (dr)) == INTEGER_CST
+           && TREE_CODE (DR_INIT (dr)) == INTEGER_CST
+           && TREE_CODE (DR_OFFSET (dr)) == INTEGER_CST)
+    {
+      /* Reverse engineer the element from memory offset.  */
+      tree offset = size_binop (MINUS_EXPR, coff, off);
+      tree sizdiv = TYPE_SIZE (TREE_TYPE (TREE_TYPE (DR_BASE_OBJECT (dr))));
+      sizdiv = div_if_zero_remainder (EXACT_DIV_EXPR, sizdiv, ssize_int (BITS_PER_UNIT));
+      tree element = div_if_zero_remainder (EXACT_DIV_EXPR, offset, sizdiv);
+      if (element != NULL_TREE)
+        return build4 (ARRAY_REF, TREE_TYPE (DR_REF (dr)), DR_BASE_OBJECT (dr),
+                       element, NULL_TREE, NULL_TREE);
+    }
     return fold_build2 (MEM_REF, TREE_TYPE (DR_REF (dr)), addr, alias_ptr);

be an appropriate start to fixing this? That fix appears to work in my testing.

Thanks,
Simon

-----Original Message-----
From: Richard Biener [mailto:richard.guent...@gmail.com]
Sent: 31 August 2015 11:40
To: Jeff Law
Cc: Simon Dardis; gcc@gcc.gnu.org
Subject: Re: Predictive commoning leads to register to register moves through memory.

On Fri, Aug 28, 2015 at 5:48 PM, Jeff Law <l...@redhat.com> wrote:
> On 08/28/2015 09:43 AM, Simon Dardis wrote:
>
>> Following Jeff's advice[1] to extract more information from GCC, I've
>> narrowed the cause down to the predictive commoning pass inserting
>> the load in a loop header style basic block. However, the next pass
>> in GCC, tree-cunroll promptly removes the loop and joins the loop
>> header to the body of the (non)loop.
>> More oddly, disabling
>> conditional store elimination pass or the dominator optimizations
>> pass or disabling of jump-threading with --param
>> max-jump-thread-duplication-stmts=0 nets the above assembly code. Any
>> ideas on an approach for this issue?
>
> I'd probably start by looking at the .optimized tree dump in both
> cases to understand the difference, then (most likely) tracing that
> through the RTL optimizers into the register allocator.

It's the known issue of LIM (here the one after pcom and complete unrolling of the inner loop) being too aggressive with store-motion. Here the complete array is replaced with registers for the outer loop. Were 'poly' a local variable we'd have optimized it away completely.

  <bb 6>:
  _8 = 1.0e+0 / pretmp_42;
  _12 = _8 * _8;
  poly[1] = _12;

  <bb 7>:
  # prephitmp_30 = PHI <_12(6), _36(9)>
  # T_lsm.8_22 = PHI <_8(6), pretmp_42(9)>
  poly_I_lsm0.10_38 = MEM[(double *)&poly + 8B];
  _2 = prephitmp_30 * poly_I_lsm0.10_38;
  _54 = _2 * poly_I_lsm0.10_38;
  _67 = poly_I_lsm0.10_38 * _54;
  _80 = poly_I_lsm0.10_38 * _67;
  _93 = poly_I_lsm0.10_38 * _80;
  _106 = poly_I_lsm0.10_38 * _93;
  _19 = poly_I_lsm0.10_38 * _106;
  count_23 = count_28 + 1;
  if (count_23 != iterations_6(D))
    goto <bb 5>;
  else
    goto <bb 8>;

  <bb 8>:
  poly[2] = _2;
  poly[3] = _54;
  poly[4] = _67;
  poly[5] = _80;
  poly[6] = _93;
  poly[7] = _106;
  poly[8] = _19;
  i1 = 9;
  T = T_lsm.8_22;

Note that DOM fails to CSE poly[1] (a known defect), but heh, doing that would only increase register pressure even more.

Note the above is on x86_64.

Richard.

> jeff
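
A footnote on the "reverse engineer the element from memory offset" step in the proposed patch. The arithmetic can be sketched in plain C as follows. This is only an illustration under stated assumptions, not the GCC code: element_from_offset and BITS_PER_BYTE are hypothetical names, with BITS_PER_BYTE standing in for BITS_PER_UNIT and assumed to be 8. As I understand div_if_zero_remainder, it yields no result when the division is inexact, and the sketch mirrors that by returning false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for BITS_PER_UNIT; assumed to be 8 in this sketch.  */
#define BITS_PER_BYTE 8

/* Recover an array element index from a byte offset into the array.
   Returns false when either division would leave a remainder, in which
   case no ARRAY_REF-style index exists for that offset.  */
static bool
element_from_offset (size_t byte_offset, size_t elem_size_bits,
                     size_t *element)
{
  assert (element != NULL);

  /* Element size in bits must convert exactly to a non-zero byte size.  */
  if (elem_size_bits == 0 || elem_size_bits % BITS_PER_BYTE != 0)
    return false;
  size_t elem_size_bytes = elem_size_bits / BITS_PER_BYTE;

  /* The byte offset must land exactly on an element boundary.  */
  if (byte_offset % elem_size_bytes != 0)
    return false;

  *element = byte_offset / elem_size_bytes;
  return true;
}
```

For the dump above, an 8-byte offset with 64-bit doubles recovers element 1, which is why MEM[(double *)&poly + 8B] and poly[1] name the same location. An offset such as 12 bytes has no exact index, so the MEM_REF form would have to be kept.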