Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

Tom de Vries Tue, 17 Nov 2015 06:55:39 -0800

On 17/11/15 11:05, Richard Biener wrote:

On Tue, Nov 17, 2015 at 12:20 AM, Tom de Vries wrote:

On 16/11/15 13:45, Richard Biener wrote:


+             NEXT_PASS (pass_scev_cprop);


What's that for?  It's supposed to help removing loops - I don't
expect kernels to vanish.


I'm using pass_scev_cprop for the "final value replacement"
functionality.
Added comment.

That functionality is intended to enable loop removal.



Let me try to explain in a bit more detail.


I.

Consider a parloops testcase test.c, with a use of the final value of the
iteration variable (return i):
...
unsigned int
foo (int n, int *a)
{
   int i;
   for (i = 0; i < n; ++i)
     a[i] = 1;

   return i;
}
...

Say we compile with:
...
$ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details
...

We can see here in the parloops dump-file that the loop was parallelized:
...
   SUCCESS: may be parallelized
...

Now say that we run with -fno-tree-scev-cprop in addition. Instead we find
in the parloops dump-file:
...
phi is i_1 = PHI <i_10(4)>
arg of phi to exit:   value i_10 used outside loop
   checking if it a part of reduction pattern:
   FAILED: it is not a part of reduction.
...

Auto-parallelization fails in this case because there is a loop exit phi
(the one in bb 6 defining i_1) which is not part of a reduction:
...
   <bb 4>:
   # i_13 = PHI <0(3), i_10(5)>
   _5 = (long unsigned int) i_13;
   _6 = _5 * 4;
   _8 = a_7(D) + _6;
   *_8 = 1;
   i_10 = i_13 + 1;
   if (n_4(D) > i_10)
     goto <bb 5>;
   else
     goto <bb 6>;

   <bb 5>:
   goto <bb 4>;

   <bb 6>:
   # i_1 = PHI <i_10(4)>
   _20 = (unsigned int) i_1;
...

With -ftree-scev-cprop, we find in the pass_scev_cprop dump-file:
...
final value replacement:
   i_1 = PHI <i_10(4)>
   with
   i_1 = n_4(D);
...

And the resulting loop no longer has any loop exit phis, so
auto-parallelization succeeds:
...
   <bb 4>:
   # i_13 = PHI <0(3), i_10(5)>
   _5 = (long unsigned int) i_13;
   _6 = _5 * 4;
   _8 = a_7(D) + _6;
   *_8 = 1;
   i_10 = i_13 + 1;
   if (n_4(D) > i_10)
     goto <bb 5>;
   else
     goto <bb 6>;

   <bb 5>:
   goto <bb 4>;

   <bb 6>:
   _20 = (unsigned int) n_4(D);
...

[ I've filed PR68373 - "autopar fails on loop exit phi with argument defined
outside loop", for a slightly different testcase where despite the final
value replacement autopar still fails. ]
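
The effect of final value replacement on the test.c example above can be
sketched at source level. This is only an illustration: foo_fvr and
check_same_result are hypothetical names that do not appear in the patch,
and the n <= 0 guard is an assumption about the part of the function that
the dumps above do not show.

```c
#include <assert.h>

/* Original form from test.c: the return reads the final value of i,
   which shows up as a loop exit phi in the parloops dump.  */
unsigned int
foo (int n, int *a)
{
  int i;
  for (i = 0; i < n; ++i)
    a[i] = 1;

  return i;
}

/* Hypothetical hand-written equivalent of the code after final value
   replacement: the value returned on the loop exit path is computed from
   the loop bound n, so nothing after the loop reads i.  */
unsigned int
foo_fvr (int n, int *a)
{
  int i;

  /* Assumed guard: if the loop is never entered, i stays 0.  */
  if (n <= 0)
    return 0;

  for (i = 0; i < n; ++i)
    a[i] = 1;

  return n;
}

/* Hypothetical helper: both forms must agree on the return value and on
   the contents of the array.  */
void
check_same_result (int n)
{
  int a[16] = { 0 }, b[16] = { 0 };

  assert (n <= 16);
  assert (foo (n, a) == foo_fvr (n, b));
  for (int k = 0; k < 16; ++k)
    assert (a[k] == b[k]);
}
```

Calling check_same_result with a few bounds (negative, zero, positive)
exercises both the guarded path and the loop exit path.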


II.

Now, back to oacc kernels.

Consider test-case kernels-loop-n.f95 (will add this one to the test-cases):
...
module test
contains
   subroutine foo(n)
     implicit none
     integer :: n
     integer, dimension (0:n-1) :: a, b, c
     integer                    :: i, ii
     do i = 0, n - 1
        a(i) = i * 2
     end do

     do i = 0, n -1
        b(i) = i * 4
     end do

     !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
     do ii = 0, n - 1
        c(ii) = a(ii) + b(ii)
     end do
     !$acc end kernels

     do i = 0, n - 1
        if (c(i) .ne. a(i) + b(i)) call abort
     end do

   end subroutine foo
end module test
...

The loop at the start of the kernels pass group contains an in-memory
iteration variable, with a store to '*_9 = _38'.
...
   <bb 4>:
   _13 = *.omp_data_i_4(D).c;
   c.21_14 = *_13;
   _16 = *_9;
   _17 = (integer(kind=8)) _16;
   _18 = *.omp_data_i_4(D).a;
   a.22_19 = *_18;
   _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
   _24 = *.omp_data_i_4(D).b;
   b.23_25 = *_24;
   _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
   _30 = _23 + _29;
   MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
   _38 = _16 + 1;
   *_9 = _38;
   if (_8 == _16)
     goto <bb 3>;
   else
     goto <bb 4>;
...

After pass_lim/pass_copy_prop, we've rewritten that into using a local
iteration variable, but we've generated a read of the final value of the
iteration variable outside the loop, which means auto-parallelization will
fail:
...
   <bb 5>:
   # D__lsm.29_12 = PHI <D__lsm.29_15(4), _38(7)>
   _17 = (integer(kind=8)) D__lsm.29_12;
   _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
   _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
   _30 = _23 + _29;
   MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
   _38 = D__lsm.29_12 + 1;
   if (_8 == D__lsm.29_12)
     goto <bb 6>;
   else
     goto <bb 7>;

   <bb 6>:
   # D__lsm.29_27 = PHI <_38(5)>
   *_9 = D__lsm.29_27;
   goto <bb 3>;


So this store is not actually necessary?


a.
In the case of this example, the store is dead.

There is a corresponding load at the point that we split off the region:
...
  <bb 9>:
  #pragma omp return

  <bb 10>:
  D.3635 = .omp_data_arr.25.ii;
  ii = *D.3635;
...

This load is later removed, given that ii is unused after the region.
But once the region is split off, there's nothing in the context of the
store to suggest that it's dead.

And to get rid of the load of ii before the region is split off, we
would have to implement some sort of liveness analysis on pre-ssa code.

b.

There's the case where there is an explicit use of ii after the region,
in which case the store is not dead.

c.

And there's the case where we use a data clause on the region, for
instance 'create (ii)', to indicate that the variable is neither copied
in nor copied out of the region (the default for a scalar in a kernels
region is 'copy', meaning copy-in-and-out).

[ This means the value of ii after the region is uninitialized. So even
if there's a read from ii after the region, we cannot consider it
connected to the store, given that the value written by the store on the
accelerator will not be copied back to the host. ]


In this case, we already don't have any load of ii after the region:
...
  <bb 9>:
  #pragma omp return

  <bb 10>:
  .omp_data_sizes.28 = {CLOBBER};
  .omp_data_arr.27 = {CLOBBER};
...

We could insert clobbers for the bits of .omp_data_arr at the end of the
region to indicate that those are not used. That might enable dse to get
rid of the dead store.

But, I think we want a generic solution that handles cases a, b and c,
which means we have to solve the most difficult case, which is b, where
the store is not dead.

 Or just in an inconvenient place?

I don't think the place of the store is inconvenient, it would be worse
to have the store in the loop.

What is inconvenient about the store is the fact that it reads the final
value of the iteration variable (which inhibits parloops).

   <bb 7>:
   goto <bb 5>;
...

This makes it similar to the parloops example above, and that's why I've
added pass_scev_cprop in the kernels pass group.

[ And for some kernels test-cases with constant loop bound, it's not the
final value replacement bit that does the substitution, but the first bit in
scev_const_prop using resolve_mixers. So that's a related reason to use
pass_scev_cprop. ]
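
The three stages described for the kernels loop can be sketched in C as
follows. This is a rough source-level analogy, not the GIMPLE the passes
actually produce: kernel_mem, kernel_lsm, kernel_fvr and
check_kernel_variants are hypothetical names, and the pointer ii stands in
for the '*_9' location from the dumps above.

```c
#include <assert.h>

/* Stage 1: in-memory iteration variable, as at the start of the kernels
   pass group.  Every iteration loads and stores *ii.  */
void
kernel_mem (int n, const int *a, const int *b, int *c, int *ii)
{
  for (*ii = 0; *ii < n; ++*ii)
    c[*ii] = a[*ii] + b[*ii];
}

/* Stage 2: after pass_lim/pass_copy_prop.  The loop uses a local
   iteration variable, but the store after the loop reads its final
   value, which is what inhibits parloops.  */
void
kernel_lsm (int n, const int *a, const int *b, int *c, int *ii)
{
  int i;
  for (i = 0; i < n; ++i)
    c[i] = a[i] + b[i];

  *ii = i;
}

/* Stage 3: after final value replacement.  The stored value is computed
   from the loop bound, so the store no longer depends on the loop.  */
void
kernel_fvr (int n, const int *a, const int *b, int *c, int *ii)
{
  int i;
  for (i = 0; i < n; ++i)
    c[i] = a[i] + b[i];

  *ii = n > 0 ? n : 0;
}

/* Hypothetical helper: all three stages must produce the same c and the
   same final value of the iteration variable.  */
void
check_kernel_variants (int n)
{
  int a[8], b[8], c1[8] = { 0 }, c2[8] = { 0 }, c3[8] = { 0 };
  int ii1 = -1, ii2 = -1, ii3 = -1;

  assert (n <= 8);
  for (int k = 0; k < 8; ++k)
    {
      a[k] = k * 2;
      b[k] = k * 4;
    }

  kernel_mem (n, a, b, c1, &ii1);
  kernel_lsm (n, a, b, c2, &ii2);
  kernel_fvr (n, a, b, c3, &ii3);

  assert (ii1 == ii2 && ii2 == ii3);
  for (int k = 0; k < 8; ++k)
    assert (c1[k] == c2[k] && c2[k] == c3[k]);
}
```

Only in the last form is the loop free of uses of the iteration variable
after the loop, which matches the situation in the parloops example from
part I once final value replacement has run.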


IMHO autopar needs to handle induction itself.

I'm not sure what you mean. Could you elaborate? Autopar handles
induction variables, but it doesn't handle exit phis reading the final
value of the induction variable. Is that what you want fixed? How?

And the above LIM example
is none for why you need two LIM passes...

Indeed. I'm planning a separate reply to explain in more detail the need
for the two pass_lims.


Thanks,
- Tom
