Hi Chung-Lin!

On Tue, 25 Sep 2018 21:11:42 +0800, Chung-Lin Tang <chunglin_t...@mentor.com> wrote:
> These are the testsuite/libgomp.oacc-c-c++-common/* changes.
> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c
> @@ -114,6 +114,7 @@ main (int argc, char **argv)
>
>    for (i = 0; i < N; i++)
>      {
> +      stream = (CUstream) acc_get_cuda_stream (i & 1);
>        r = cuLaunchKernel (delay, 1, 1, 1, 1, 1, 1, 0, stream, kargs, 0);

What's the motivation for this change?

And then:

> @@ -122,11 +123,11 @@ main (int argc, char **argv)
>          }
>      }
>
> -  acc_wait_async (0, 1);
> -
>    if (acc_async_test (0) != 0)
>      abort ();
>
> +  acc_wait_async (0, 1);
> +
>    if (acc_async_test (1) != 0)
>      abort ();

I somehow feel that this change...

> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c
> @@ -133,7 +133,7 @@ main (int argc, char **argv)
>
>    for (i = 0; i <= N; i++)
>      {
> -      if (acc_async_test (i) != 0)
> +      if (acc_async_test (i) == 0)
>        abort ();
>      }

..., and this change are needed because we're now more strictly synchronizing with the local (host) thread.

Regarding the case of "libgomp.oacc-c-c++-common/lib-81.c", as currently present:

    [...]
      for (i = 0; i < N; i++)
        {
          r = cuLaunchKernel (delay, 1, 1, 1, 1, 1, 1, 0, streams[i], kargs, 0);
          if (r != CUDA_SUCCESS)
            {
              fprintf (stderr, "cuLaunchKernel failed: %d\n", r);
              abort ();
            }
        }

This launches N kernels on N separate async queues/CUDA streams, [0..N).

      acc_wait_all_async (N);

Then, the "acc_wait_all_async (N)" -- in my understanding! -- should *not* synchronize with the local (host) thread, but instead just set up the additional async queue/CUDA stream N to "depend" on [0..N).

      for (i = 0; i <= N; i++)
        {
          if (acc_async_test (i) != 0)
            abort ();
        }

Thus, all [0..N) should then still be "acc_async_test (i) != 0" (still running).

      acc_wait (N);

Here, the "acc_wait (N)" would synchronize the local (host) thread with async queue/CUDA stream N and thus recursively with [0..N).

      for (i = 0; i <= N; i++)
        {
          if (acc_async_test (i) != 1)
            abort ();
        }
    [...]
So, then all these async queues/CUDA streams here indeed are "acc_async_test (i) != 1", that is, idle.

Now, the more strict synchronization with the local (host) thread is not wrong in terms of correctness, but I suppose it will impact the performance of otherwise asynchronous operations, which now get synchronized too much?  Or, of course, I'm misunderstanding something...

(For the avoidance of doubt, I would accept the "async re-work" as is, but we should eventually clarify this, and restore the behavior we -- apparently -- had before, where we didn't synchronize so much?  So, technically, the "async re-work" would constitute a regression for this kind of usage?)

Regards,
 Thomas