On 10/30/2017 08:15 AM, Jakub Jelinek wrote:
> On Fri, Oct 27, 2017 at 03:57:28PM +0200, Tom de Vries wrote:
>> how about this approach:
>> 1 - Move async_run from plugin-hsa.c to default_async_run
>> 2 - Implement omp async support for nvptx
>> ?
>>
>> The first patch moves the GOMP_OFFLOAD_async_run implementation from
>> plugin-hsa.c to target.c, making it the default implementation if the
>> plugin does not define the GOMP_OFFLOAD_async_run symbol.
>>
>> The second patch removes the GOMP_OFFLOAD_async_run symbol from the
>> nvptx plugin, activating the default implementation, and makes sure
>> GOMP_OFFLOAD_run can be called from a fresh thread.
>>
>> I've tested this with libgomp.c/c.exp; the previously failing
>> target-33.c and target-34.c now pass, and there are no regressions.
>>
>> OK for trunk after complete testing (and adding a function comment for
>> default_async_run)?
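For concreteness, the thread-based default looks roughly like this (a
sketch modeled on the plugin-hsa.c approach, not the actual patch; in
target.c the kernel launch would go through the device descriptor's
run_func rather than calling GOMP_OFFLOAD_run directly, and error
handling is omitted):

#include <pthread.h>
#include <stdlib.h>

extern void GOMP_OFFLOAD_run (int, void *, void *, void **);
extern void GOMP_PLUGIN_target_task_completion (void *);

struct async_run_info
{
  int device;
  void *tgt_fn;
  void *tgt_vars;
  void **args;
  void *async_data;
};

static void *
default_async_run_1 (void *thread_data)
{
  struct async_run_info *info = thread_data;

  /* Run the kernel synchronously on this fresh host thread.  */
  GOMP_OFFLOAD_run (info->device, info->tgt_fn, info->tgt_vars,
                    info->args);

  /* Tell libgomp the target task is done, so it can do the copy-back
     and deallocation.  */
  GOMP_PLUGIN_target_task_completion (info->async_data);

  free (info);
  return NULL;
}

static void
default_async_run (int device, void *tgt_fn, void *tgt_vars, void **args,
                   void *async_data)
{
  struct async_run_info *info = malloc (sizeof (*info));
  pthread_t thread;

  info->device = device;
  info->tgt_fn = tgt_fn;
  info->tgt_vars = tgt_vars;
  info->args = args;
  info->async_data = async_data;

  /* One detached thread per async target region: simple, but costly
     when many regions are in flight.  */
  pthread_create (&thread, NULL, default_async_run_1, info);
  pthread_detach (thread);
}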
> Can't PTX do better than this?

It can.
I found your comment describing this implementation as a hack here (
https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02726.html ) after sending
this on Friday, and thought about things a little bit more. So let me
try again.
This is not an optimal nvptx async implementation. Rather, it's a
proposal to have a poor man's async implementation in the common code,
instead of having libgomp accel ports implement GOMP_OFFLOAD_async_run
as an abort at first.
AFAIU, the purpose of the async functionality is to have jobs executed
concurrently and/or interleaved on the device. While this implementation
does not offer jobs to the device in separate queues, such that the
device could decide on concurrent and interleaved execution itself, it
does present the device with a possibly interleaved job schedule (which
is slightly better than a poor man's async implementation that is simply
synchronous).
In order to have an optimal implementation, one would still need to
implement the GOMP_OFFLOAD_async_run hook, which would bypass this
implementation.
I'm not sure how useful this would be, but I can even imagine using this
once all the accel ports have implemented the GOMP_OFFLOAD_async_run
hook. We could define an environment variable OMP_ASYNC with these
semantics:
- 0: ignore the plugin's GOMP_OFFLOAD_async_run hook, fall back on
  synchronous behaviour
- 1: ignore the plugin's GOMP_OFFLOAD_async_run hook, use the poor man's
  implementation
- 2: use the plugin's GOMP_OFFLOAD_async_run hook
This could be helpful in debugging programs with async behaviour; a
rough sketch of the dispatch follows.
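To be clear, OMP_ASYNC does not exist today; it and the mode names below
are purely hypothetical. The parsing would be trivial:

#include <stdlib.h>

/* Hypothetical OMP_ASYNC modes; this variable is only a proposal.  */
enum async_mode
{
  ASYNC_SYNC = 0,    /* Ignore the plugin hook, run synchronously.  */
  ASYNC_DEFAULT = 1, /* Ignore the plugin hook, use the poor man's
                        thread-based default.  */
  ASYNC_PLUGIN = 2   /* Use the plugin's GOMP_OFFLOAD_async_run hook.  */
};

static enum async_mode
parse_omp_async (void)
{
  const char *env = getenv ("OMP_ASYNC");

  if (env == NULL)
    return ASYNC_PLUGIN;

  switch (env[0])
    {
    case '0':
      return ASYNC_SYNC;
    case '1':
      return ASYNC_DEFAULT;
    default:
      return ASYNC_PLUGIN;
    }
}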
> What I mean is that while we probably need to take the device lock for
> the possible memory transfers and deallocation at the end of the region,
> and thus perform some action on the host in between the end of the async
> target region and data copying/deallocation, can't we have a single
> thread per device instead of one thread per async target region, use
> CUDA async APIs and poll for all the pending async regions together?
> I mean, if we need to take the device lock, then we need to serialize
> the finalization anyway, and reusing the same thread would significantly
> decrease the overhead if there are many async regions.
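To make that concrete, a per-device service thread could look something
like this (a sketch only, using the CUDA driver API; the list handling,
locking and wakeup strategy are simplified, and all names here are
illustrative):

#include <cuda.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

extern void GOMP_PLUGIN_target_task_completion (void *);

struct pending_region
{
  CUevent done;                 /* Recorded after the kernel launch.  */
  void *async_data;             /* Cookie handed back to libgomp.  */
  struct pending_region *next;
};

static struct pending_region *pending;
static pthread_mutex_t pending_lock = PTHREAD_MUTEX_INITIALIZER;

static void *
device_service_thread (void *arg)
{
  (void) arg;
  for (;;)
    {
      pthread_mutex_lock (&pending_lock);
      struct pending_region **p = &pending;
      while (*p != NULL)
        {
          struct pending_region *r = *p;
          if (cuEventQuery (r->done) == CUDA_SUCCESS)
            {
              /* The kernel has finished: unlink the region and hand it
                 back to libgomp for copy-back and deallocation.  Real
                 code should not invoke the completion callback with
                 this lock held.  */
              *p = r->next;
              GOMP_PLUGIN_target_task_completion (r->async_data);
              cuEventDestroy (r->done);
              free (r);
            }
          else
            p = &r->next;
        }
      pthread_mutex_unlock (&pending_lock);

      /* Real code would block on a condition variable or event instead
         of polling on a timer.  */
      usleep (1000);
    }
  return NULL;
}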
As for the poor man's implementation, it is indeed inefficient and could
be improved, but I wonder if it's worth the effort. [ Perhaps, though,
for debugging purposes the ability to control the interleaving in some
way might be useful. ]
I imagine that an efficient nvptx implementation will use CUDA streams,
which are queues on which both kernels and memory transfers can be
enqueued. So rather than calling GOMP_PLUGIN_target_task_completion once
the kernel is done, it would be more efficient to be able to call a
similar function that schedules the data transfers that need to happen,
without assuming that the kernel is already done. However, AFAIU, that
won't take care of deallocation. So I guess the first approach would be
to use CUDA events to poll whether a kernel has completed, and then call
GOMP_PLUGIN_target_task_completion.
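A sketch of that first approach (again CUDA driver API; the grid/block
sizes are placeholders and all names are illustrative):

#include <cuda.h>

struct nvptx_async_region
{
  CUstream stream;
  CUevent done;
  void *async_data;
};

static void
nvptx_async_launch (CUfunction fn, void **kernel_args,
                    struct nvptx_async_region *r, void *async_data)
{
  cuStreamCreate (&r->stream, CU_STREAM_NON_BLOCKING);
  cuEventCreate (&r->done, CU_EVENT_DISABLE_TIMING);
  r->async_data = async_data;

  /* Queue the kernel on the region's own stream.  */
  cuLaunchKernel (fn, 1, 1, 1, 32, 1, 1, 0, r->stream, kernel_args, NULL);

  /* Device-to-host copies for this region could be queued on the same
     stream here with cuMemcpyDtoHAsync; host-side deallocation still
     has to wait for actual completion.  */

  /* The event fires once everything queued on the stream so far has
     completed; a poller (like the service thread sketched above) can
     then call GOMP_PLUGIN_target_task_completion (r->async_data).  */
  cuEventRecord (r->done, r->stream);
}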
> And, if it at least in theory can do better than that, then even if we
> punt on that for now due to time/resource constraints, maybe it would be
> better to do this inside of the plugin where it can be more easily
> replaced later.
I'm trying to argue the other way round: if there is no optimal
implementation in the plugin, let's at least provide a non-optimal but
non-synchronous implementation as the default, and exercise the async
code rather than having tests fail with a plugin abort.
Thanks,
- Tom