On Mon, Jun 23, 2014 at 11:49 AM, Bin Cheng <bin.ch...@arm.com> wrote:
> Hi,
> For below simplified case:
>
> #define LEN (32000)
> __attribute__((aligned(16))) float a[LEN],b[LEN];
>
> int foo (int M)
> {
>   for (int i = 0; i < M; i++)
>     a[i+M] = a[i] + b[i];
> }
>
> Compiling it with command like:
> $ aarch64-elf-gcc -O3 -S foo.c -o foo.S -std=c99
>
> The assembly code of vectorized loop is in below form:
>         mov  x1, 0
>         mov  w2, 0
> .L4:
>         ldr     q0, [x1, x3]
>         add     w2, w2, 1
>         ldr     q1, [x1, x4]
>         cmp     w2, w5
>         fadd    v0.4s, v0.4s, v1.4s
>         str     q0, [x6, x1]
>         add     x1, x1, 16
>         bcc     .L4
>
> Induction variable w2 is unnecessary and can be eliminated with x1.  This is
> safe because X1 will never overflow during all iterations of loop. The
> optimal assembly should be like:
>         mov  x1, 0
> .L4:
>         ldr     q0, [x1, x2]
>         ldr     q1, [x1, x4]
>         fadd    v0.4s, v0.4s, v1.4s
>         str     q0, [x5, x1]
>         add     x1, x1, 16
>         cmp     x1, x3
>         bcc     .L4
>
> This case can be handled if we do more complex overflow check on unsigned
> type in function iv_elimination_compare_lt.
>
> Also there is another blocker for the transformation, function
> number_of_iterations_lt calls fold_build2 to build folded form of
> may_be_zero, while iv_elimination_compare_lt only handles simple form tree
> expressions.  It's possible to have iv_elimination_compare_lt to do some
> undo transformation on may_be_zero, but I found it's difficult for cases
> involving signed/unsigned conversion like case loop-41.c.  Since I think
> there is no obvious benefit to fold may_be_zero here (somehow because the
> two operands are already in folded forms), this patch just calls build2_loc
> instead.

But it may fold to true/false, no?

> This optimization is picked up by patch B, but patch A is necessary since
> the check in iv_elimination_compare_lt of two aff_trees isn't enough when
> two different types (one in signed type, the other in unsigned) involved.  I
> have to use tree comparison here instead.  Considering below simple case:
>
> Analyzing # of iterations of loop 5
>   exit condition [1, + , 1](no_overflow) <= i_1
>   bounds on difference of bases: -3 ... 1
>   result:
>     zero if i_1 + 1 < 1
>     # of iterations (unsigned int) i_1, bounded by 2
>   number of iterations (unsigned int) i_1; zero if i_1 + 1 < 1
>
> use 0
>   compare
>   in statement if (S.7_9 > i_1)
>
>   at position
>   type integer(kind=4)
>   base 1
>   step 1
>   is a biv
>   related candidates
>
> candidate 0 (important)
>   var_before ivtmp.28
>   var_after ivtmp.28
>   incremented at end
>   type unsigned int
>   base 0
>   step 1
>
> When GCC trying to eliminate use 0 with cand 0, the miscellaneous trees in
> iv_elimination_compare_lt are like below with i_1 of signed type:
> B:             i_1 + 1
> A:             0
> niter->niter:  (unsigned int)i_1
>
> Apparently, (B-A-1) is "i_1", which doesn't equal to "(unsigned int)i_1".
> Without this patch, it is considered equal to each other.

just looking at this part.  Do you have a testcase that exhibits a difference
when just applying patch A?  So I can have a look here?

>From the code in iv_elimination_compare_lt I can't figure why we'd
end up with i_1 instead of (unsigned int)i_1 as we convert to a common
type.

I suppose the issue may be that at tree_to_aff_combination time
we strip all nops with STRIP_NOPS but when operating on ->rest
via convert/scale or add we do not strip them again.

But then 'nit' should be i_1, not (unsigned int)i_1.  So the analysis above
really doesn't look correct.

Just to make sure we don't paper over an issue in tree-affine.c.

Thus - testcase?  On x86 we don't run into this place in
iv_elimination_compare_lt (on an unpatched tree).

CCing Zdenek for the meat of patch B.

Thanks,
Richard.

> Note that the induction variable IS necessary on 32 bit systems since
> otherwise there is type overflow.
>
> These two patch fix the mentioned problem.
> They pass bootstrap and regression test on x86_64/x86/aarch64/arm, so any
> comments?
>
> Thanks,
> bin
>
> PATCH A)
>
> 2014-06-23  Bin Cheng  <bin.ch...@arm.com>
>
>         * tree-ssa-loop-ivopts.c (iv_elimination_compare_lt): Check number
>         of iteration using tree comparison.
>
> PATCH B)
>
> 2014-06-23  Bin Cheng  <bin.ch...@arm.com>
>
>         * tree-ssa-loop-niter.c (number_of_iterations_lt): Build unfolded
>         form of may_be_zero.
>         * tree-ssa-loop-ivopts.c (iv_nowrap_period)
>         (nowrap_cand_for_loop_niter_p): New functions.
>         (period_greater_niter_exit): New function refactored from
>         may_eliminate_iv.
>         (iv_elimination_compare_lt): New parameter.  Relax overflow check.
>         Handle special forms may_be_zero expression.
>         (may_eliminate_iv): Call period_greater_niter_exit.  Pass new
>         argument for iv_elimination_compare_lt.
>
> gcc/testsuite/ChangeLog
> 2014-06-23  Bin Cheng  <bin.ch...@arm.com>
>
>         * gcc.dg/tree-ssa/loop-40.c: New test.
>         * gcc.dg/tree-ssa/loop-41.c: New test.

Reply via email to