On 17/11/15 11:05, Richard Biener wrote:
On Tue, Nov 17, 2015 at 12:20 AM, Tom de Vries <tom_devr...@mentor.com> wrote:
On 16/11/15 13:45, Richard Biener wrote:
+ NEXT_PASS (pass_scev_cprop);
What's that for? It's supposed to help removing loops - I don't
expect kernels to vanish.
I'm using pass_scev_cprop for the "final value replacement"
functionality.
Added comment.
That functionality is intented to enable loop removal.
Let me try to explain in a bit more detail.
I.
Consider a parloops testcase test.c, with a use of the final value of the
iteration variable (return i):
...
unsigned int
foo (int n, int *a)
{
int i;
for (i = 0; i < n; ++i)
a[i] = 1;
return i;
}
...
Say we compile with:
...
$ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details
...
We can see here in the parloops dump-file that the loop was parallelized:
...
SUCCESS: may be parallelized
...
Now say that we run with -fno-tree-scev-cprop in addition. Instead we find
in the parloops dump-file:
...
phi is i_1 = PHI <i_10(4)>
arg of phi to exit: value i_10 used outside loop
checking if it a part of reduction pattern:
FAILED: it is not a part of reduction.
...
Auto-parallelization fails in this case because there is a loop exit phi
(the one in bb 6 defining i_1) which is not part of a reduction:
...
<bb 4>:
# i_13 = PHI <0(3), i_10(5)>
_5 = (long unsigned int) i_13;
_6 = _5 * 4;
_8 = a_7(D) + _6;
*_8 = 1;
i_10 = i_13 + 1;
if (n_4(D) > i_10)
goto <bb 5>;
else
goto <bb 6>;
<bb 5>:
goto <bb 4>;
<bb 6>:
# i_1 = PHI <i_10(4)>
_20 = (unsigned int) i_1;
...
With -ftree-scev-cprop, we find in the pass_scev_cprop dump-file:
...
final value replacement:
i_1 = PHI <i_10(4)>
with
i_1 = n_4(D);
...
And the resulting loop no longer has any loop exit phis, so
auto-parallelization succeeds:
...
<bb 4>:
# i_13 = PHI <0(3), i_10(5)>
_5 = (long unsigned int) i_13;
_6 = _5 * 4;
_8 = a_7(D) + _6;
*_8 = 1;
i_10 = i_13 + 1;
if (n_4(D) > i_10)
goto <bb 5>;
else
goto <bb 6>;
<bb 5>:
goto <bb 4>;
<bb 6>:
_20 = (unsigned int) n_4(D);
...
[ I've filed PR68373 - "autopar fails on loop exit phi with argument defined
outside loop", for a slightly different testcase where despite the final
value replacement autopar still fails. ]
II.
Now, back to oacc kernels.
Consider test-case kernels-loop-n.f95 (will add this one to the test-cases):
...
module test
contains
subroutine foo(n)
implicit none
integer :: n
integer, dimension (0:n-1) :: a, b, c
integer :: i, ii
do i = 0, n - 1
a(i) = i * 2
end do
do i = 0, n -1
b(i) = i * 4
end do
!$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
do ii = 0, n - 1
c(ii) = a(ii) + b(ii)
end do
!$acc end kernels
do i = 0, n - 1
if (c(i) .ne. a(i) + b(i)) call abort
end do
end subroutine foo
end module test
...
The loop at the start of the kernels pass group contains an in-memory
iteration variable, with a store to '*_9 = _38'.
...
<bb 4>:
_13 = *.omp_data_i_4(D).c;
c.21_14 = *_13;
_16 = *_9;
_17 = (integer(kind=8)) _16;
_18 = *.omp_data_i_4(D).a;
a.22_19 = *_18;
_23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
_24 = *.omp_data_i_4(D).b;
b.23_25 = *_24;
_29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
_30 = _23 + _29;
MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
_38 = _16 + 1;
*_9 = _38;
if (_8 == _16)
goto <bb 3>;
else
goto <bb 4>;
...
After pass_lim/pass_copy_prop, we've rewritten that into using a local
iteration variable, but we've generated a read of the final value of the
iteration variable outside the loop, which means auto-parallelization will
fail:
...
<bb 5>:
# D__lsm.29_12 = PHI <D__lsm.29_15(4), _38(7)>
_17 = (integer(kind=8)) D__lsm.29_12;
_23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
_29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
_30 = _23 + _29;
MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
_38 = D__lsm.29_12 + 1;
if (_8 == D__lsm.29_12)
goto <bb 6>;
else
goto <bb 7>;
<bb 6>:
# D__lsm.29_27 = PHI <_38(5)>
*_9 = D__lsm.29_27;
goto <bb 3>;
So this store is not actually necessary?
a.
In the case of this example, the store is dead.
There is a corresponding load at the point that we split off the region:
...
<bb 9>:
#pragma omp return
<bb 10>:
D.3635 = .omp_data_arr.25.ii;
ii = *D.3635;
...
This load is later removed, given that ii is unused after the region.
But once the region is split off, there's nothing in the context of the
store to suggest that it's dead.
And to get rid of the load of ii before the region is split off, we
would have to implement some sort of liveness analysis on pre-ssa code.
b.
There's the case where there is an explicit use of ii after the region,
in which case the store is not dead.
c.
And there's the case were we use a data clause on the region, f.i.
'create (ii)' to indicate that the variable is neither copied in nor
copied out of the region (the default for a scalar in a kernels region
is 'copy', meaning copy-in-and-out).
[ This means the value of ii after the region is uninitialized. So even
if there's a read from ii after the region, we cannot consider it
connected to the store, given that the value written by the store on the
accelerator will not be copied back to the host. ]
In this case, we already don't have any load of ii after the region:
...
<bb 9>:
#pragma omp return
<bb 10>:
.omp_data_sizes.28 = {CLOBBER};
.omp_data_arr.27 = {CLOBBER};
...
We could insert clobbers for the bits of .omp_data_arr at the end of the
region to indicate that those are not used. That might enable dse to get
rid of the dead store.
But, I think we want a generic solution that handles cases a, b and c,
which means we have to solve the most difficult case, which is b, where
the store is not dead.
Or just in an inconvenient place?
I don't think the place of the store is inconvenient, it would be worse
to have the store in the loop.
What is inconvenient about the store is the fact that it reads the final
value of the iteration variable (which inhibits parloops).
<bb 7>:
goto <bb 5>;
...
This makes it similar to the parloops example above, and that's why I've
added pass_scev_cprop in the kernels pass group.
[ And for some kernels test-cases with constant loop bound, it's not the
final value replacement bit that does the substitution, but the first bit in
scev_const_prop using resolve_mixers. So that's a related reason to use
pass_scev_cprop. ]
IMHO autopar needs to handle induction itself.
I'm not sure what you mean. Could you elaborate? Autopar handles
induction variables, but it doesn't handle exit phis reading the final
value of the induction variable. Is that what you want fixed? How?
And the above LIM example
is none for why you need two LIM passes...
Indeed. I'm planning a separate reply to explain in more detail the need
for the two pass_lims.
Thanks,
- Tom