On Mon, Feb 20, 2012 at 11:19 PM, Ulrich Weigand <uweig...@de.ibm.com> wrote:
> Hello,
>
> we've noticed that the loop optimizer sometimes inserts weirdly
> inefficient code to compute the value of an induction variable
> at the end of the loop.
>
> As a test case stripped down to the core of the problem, consider:
>
> int test (int start, int end)
> {
>  int i;
>
>  for (i = start; i < end; i++)
>    ;
>
>  return i;
> }
>
> That function really ought to be equivalent to
>
> int test2 (int start, int end)
> {
>  return start < end ? end : start;
> }
>
> And indeed on i386-linux-gnu, we get mostly identical code
> (except for the branch direction prediction).
>
> However, on arm-linux-gnueabi (using -mthumb and -march=armv7-a),
> we see for the first function:
>
> test:
>        cmp     r0, r1
>        bge     .L2
>        adds    r3, r0, #1
>        mvns    r0, r0
>        adds    r1, r1, r0
>        adds    r0, r3, r1
> .L2:
>        bx      lr
>
> instead of simply (as for the second function):
>
> test2:
>        cmp     r1, r0
>        it      ge
>        movge   r0, r1
>        bx      lr
>
>
>
> Where does that weird addition-and-negation sequence come from?
> In 100t.dceloop we still have:
>
> <bb 4>:
>  # i_9 = PHI <i_5(5), start_2(D)(3)>
>  i_5 = i_9 + 1;
>  if (end_4(D) > i_5)
>    goto <bb 5>;
>  else
>    goto <bb 6>;
>
> <bb 5>:
>  goto <bb 4>;
>
> <bb 6>:
>  # i_1 = PHI <i_5(4)>
>
>
> But 102t.sccp has:
>
> <bb 4>:
>  # i_9 = PHI <i_5(5), start_2(D)(3)>
>  i_5 = i_9 + 1;
>  if (end_4(D) > i_5)
>    goto <bb 5>;
>  else
>    goto <bb 6>;
>
> <bb 5>:
>  goto <bb 4>;
>
> <bb 6>:
>  D.4123_10 = start_2(D) + 1;
>  D.4124_11 = ~start_2(D);
>  D.4125_12 = (unsigned int) D.4124_11;
>  D.4126_13 = (unsigned int) end_4(D);
>  D.4127_14 = D.4125_12 + D.4126_13;
>  D.4128_15 = (int) D.4127_14;
>  i_1 = D.4123_10 + D.4128_15;
>
> This is the sequence that ends up in the final assembler code.
>
> Note that this computes:
>
>  i_1 = (start_2(D) + 1)
>         + (int)((unsigned int) ~start_2(D) + (unsigned int) end_4(D))
>
> which is (since those casts are no-ops on the assembler level):
>
>  i_1 = start_2(D) + 1 + ~start_2(D) + end_4(D)
>
> which, since ~x == -x - 1 in two's-complement arithmetic, is:
>
>  i_1 = start_2(D) - start_2(D) + end_4(D)
>
> that is simply:
>
>  i_1 = end_4(D)
>
>
> Unfortunately, no tree-based optimization pass after 102t.sccp is able to
> clean up that mess.  This is true even on *i386*: the only reason we get
> nice assembler there is due to *RTX* optimization, notably in combine.
> This hits on i386 because of an intermediate stage that adds two registers
> and the constant 1 (which matches the lea pattern).  On arm, there is no
> instruction that would handle that intermediate stage, and combine is not
> powerful enough to perform the whole simplification in a single step.
>
>
> Now that sequence of gimple operations is generated in scev_const_prop
> in tree-scalar-evolution.c.  First, the phi node corresponding to the
> loop exit value is evaluated:
>
>  def = analyze_scalar_evolution_in_loop (ex_loop, loop, def, NULL);
>
> which returns a chrec of the form
>
>  { (start + 1), +, 1 }
>
> This is then further simplified by
>
>  def = compute_overall_effect_of_inner_loop (ex_loop, def);
>
> which first computes the number of loop iterations:
>
>  tree nb_iter = number_of_latch_executions (loop);
>
> which returns a tree representing
>
>  (unsigned int) ~start + (unsigned int) end
>
> (which at this point seems to be the most efficient way to validly
> compute end - start - 1, i.e. without risking signed overflow).
>
> The chrec is then evaluated via chrec_apply which basically computes
>
>  (start + 1) + 1 * (int)((unsigned int) ~start + (unsigned int) end)
>
> which ends up being converted to the long gimple sequence seen above.
>
>
> Now I guess my questions are:
>
> - In the computation shown above, should that tree have been folded
>  into just "end" to start with?  The tree is constructed at its

The issue is that (start + 1) + 1 * (int) ... is computed using signed
integer arithmetic, which may invoke undefined behavior when it wraps.
Thus we cannot re-associate it.  I suppose if you'd arrange the code
to do the arithmetic in unsigned and only cast the result back to the
desired type, we might fold it ...

>  outermost level in chrec_fold_plus, which simply calls fold_build2.
>  While simplify-rtx.c has code to detect sequences of operations
>  that cancel out while folding a PLUS or MINUS RTX, there does
>  not appear to be corresponding code in fold-const.c to handle
>  the equivalent optimization at the tree level.

There is: fold-const.c has some re-association logic to catch this,
and we have tree-ssa-reassoc.c.  All of those only handle same-sign
operands, though, I think.

> - If not, should there be another tree pass later on that ought to
>  clean this up?  I notice there is code to handle plus/minus
>  sequences in tree-ssa-reassoc.c.  But this doesn't seem to be
>  able to handle this particular case, possibly because of the
>  presence of the casts (nop_exprs) ...
>
> - Anywhere else I'm overlooking right now?
>
>
> I'd appreciate any tips / suggestions how to fix this ...

Look at doing the computation in unsigned arithmetic.

Richard.

>
> Thanks,
> Ulrich
>
> --
>  Dr. Ulrich Weigand
>  GNU Toolchain for Linux on System z and Cell BE
>  ulrich.weig...@de.ibm.com
>
