On Mon, Feb 20, 2012 at 11:19 PM, Ulrich Weigand <uweig...@de.ibm.com> wrote:
> Hello,
>
> we've noticed that the loop optimizer sometimes inserts weirdly
> inefficient code to compute the value of an induction variable
> at the end of the loop.
>
> As a test case stripped down to the core of the problem, consider:
>
> int test (int start, int end)
> {
>   int i;
>
>   for (i = start; i < end; i++)
>     ;
>
>   return i;
> }
>
> That function really ought to be equivalent to
>
> int test2 (int start, int end)
> {
>   return start < end ? end : start;
> }
>
> And indeed on i386-linux-gnu, we get mostly identical code
> (except for the branch direction prediction).
>
> However, on arm-linux-gnueabi (using -mthumb and -march=armv7-a),
> we see for the first function:
>
> test:
>         cmp     r0, r1
>         bge     .L2
>         adds    r3, r0, #1
>         mvns    r0, r0
>         adds    r1, r1, r0
>         adds    r0, r3, r1
> .L2:
>         bx      lr
>
> instead of simply (as for the second function):
>
> test2:
>         cmp     r1, r0
>         it      ge
>         movge   r0, r1
>         bx      lr
>
>
> Where does that weird addition-and-negation sequence come from?
> In 100t.dceloop we still have:
>
> <bb 4>:
>   # i_9 = PHI <i_5(5), start_2(D)(3)>
>   i_5 = i_9 + 1;
>   if (end_4(D) > i_5)
>     goto <bb 5>;
>   else
>     goto <bb 6>;
>
> <bb 5>:
>   goto <bb 4>;
>
> <bb 6>:
>   # i_1 = PHI <i_5(4)>
>
> But 102t.sccp has:
>
> <bb 4>:
>   # i_9 = PHI <i_5(5), start_2(D)(3)>
>   i_5 = i_9 + 1;
>   if (end_4(D) > i_5)
>     goto <bb 5>;
>   else
>     goto <bb 6>;
>
> <bb 5>:
>   goto <bb 4>;
>
> <bb 6>:
>   D.4123_10 = start_2(D) + 1;
>   D.4124_11 = ~start_2(D);
>   D.4125_12 = (unsigned int) D.4124_11;
>   D.4126_13 = (unsigned int) end_4(D);
>   D.4127_14 = D.4125_12 + D.4126_13;
>   D.4128_15 = (int) D.4127_14;
>   i_1 = D.4123_10 + D.4128_15;
>
> This is the sequence that ends up in the final assembler code.
>
> Note that this computes:
>
>   i_1 = (start_2(D) + 1)
>         + (int)((unsigned int) ~start_2(D) + (unsigned int) end_4(D))
>
> which is (since those casts are no-ops on the assembler level):
>
>   i_1 = start_2(D) + 1 + ~start_2(D) + end_4(D)
>
> which is, due to two's-complement arithmetic:
>
>   i_1 = start_2(D) - start_2(D) + end_4(D)
>
> that is, simply:
>
>   i_1 = end_4(D)
>
>
> Unfortunately, no tree-based optimization pass after 102t.sccp is able to
> clean up that mess. This is true even on *i386*: the only reason we get
> nice assembler there is due to *RTX* optimization, notably in combine.
> This hits on i386 because of an intermediate stage that adds two registers
> and the constant 1 (which matches the lea pattern). On arm, there is no
> instruction that would handle that intermediate stage, and combine is not
> powerful enough to perform the whole simplification in a single step.
>
>
> Now that sequence of gimple operations is generated in scev_const_prop
> in tree-scalar-evolution.c. First, the phi node corresponding to the
> loop exit value is evaluated:
>
>   def = analyze_scalar_evolution_in_loop (ex_loop, loop, def, NULL);
>
> which returns a chrec of the form
>
>   { (start + 1), +, 1 }
>
> This is then further simplified by
>
>   def = compute_overall_effect_of_inner_loop (ex_loop, def);
>
> which first computes the number of loop iterations:
>
>   tree nb_iter = number_of_latch_executions (loop);
>
> which returns a tree representing
>
>   (unsigned int) ~start + (unsigned int) end
>
> (which at this point seems to be the most efficient valid way to
> compute end - start - 1).
>
> The chrec is then evaluated via chrec_apply, which basically computes
>
>   (start + 1) + 1 * (int)((unsigned int) ~start + (unsigned int) end)
>
> which ends up being converted to the long gimple sequence seen above.
>
>
> Now I guess my questions are:
>
> - In the computation shown above, should that tree have been folded
>   into just "end" to start with? The tree is constructed at its
The issue is that (start + 1) + 1 * (int) ... is computed using signed
integer arithmetic, which may invoke undefined behavior when it wraps.
Thus we cannot re-associate it. I suppose if you'd arrange the code to
do the arithmetic in unsigned and only cast the result back to the
desired type, we might fold it ...

>   outermost level in chrec_fold_plus, which simply calls fold_build2.
>   While simplify-rtx.c has code to detect sequences of operations
>   that cancel out while folding a PLUS or MINUS RTX, there does
>   not appear to be corresponding code in fold-const.c to handle
>   the equivalent optimization at the tree level.

There is. fold-const.c has some re-association logic to catch this,
and we have tree-ssa-reassoc.c. All of those only handle same-sign
operands, though, I think.

> - If not, should there be another tree pass later on that ought to
>   clean this up? I notice there is code to handle plus/minus
>   sequences in tree-ssa-reassoc.c. But this doesn't seem to be
>   able to handle this particular case, possibly because of the presence
>   of the casts (nop_exprs) ...
>
> - Anywhere else I'm overlooking right now?
>
>
> I'd appreciate any tips / suggestions how to fix this ...

Look at doing the computation in unsigned arithmetic.

Richard.

> Thanks,
> Ulrich
>
> --
>   Dr. Ulrich Weigand
>   GNU Toolchain for Linux on System z and Cell BE
>   ulrich.weig...@de.ibm.com