On Fri, Oct 27, 2017 at 03:57:28PM +0200, Tom de Vries wrote:
> how about this approach:
> 1 - Move async_run from plugin-hsa.c to default_async_run
> 2 - Implement omp async support for nvptx
> ?
> 
> The first patch moves the GOMP_OFFLOAD_async_run implementation from
> plugin-hsa.c to target.c, making it the default implementation if the plugin
> does not define the GOMP_OFFLOAD_async_run symbol.
> 
> The second patch removes the GOMP_OFFLOAD_async_run symbol from the nvptx
> plugin, activating the default implementation, and makes sure
> GOMP_OFFLOAD_run can be called from a fresh thread.
> 
> I've tested this with libgomp.c/c.exp; the previously failing target-33.c
> and target-34.c now pass, and there are no regressions.
> 
> OK for trunk after complete testing (and adding a function comment for
> default_async_run)?
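
For concreteness, my reading of patch 1 is roughly the following.  This is
just a sketch, not the actual patch: GOMP_OFFLOAD_run,
GOMP_PLUGIN_target_task_completion and GOMP_PLUGIN_fatal are the real
libgomp-plugin.h entry points, struct async_run_info and the function names
are made up here, and in target.c the call into the plugin would go through
the device descriptor's run hook rather than a direct GOMP_OFFLOAD_run call:

  #include <pthread.h>
  #include <stdlib.h>
  #include "libgomp-plugin.h"

  struct async_run_info
  {
    int device;
    void *tgt_fn;
    void *tgt_vars;
    void **args;
    void *async_data;
  };

  /* Thread routine: run the target region synchronously, then tell
     libgomp the async region is done so it can do the copy-back and
     deallocation under the device lock.  */
  static void *
  default_async_run_1 (void *thread_p)
  {
    struct async_run_info *info = thread_p;
    GOMP_OFFLOAD_run (info->device, info->tgt_fn, info->tgt_vars, info->args);
    GOMP_PLUGIN_target_task_completion (info->async_data);
    free (info);
    return NULL;
  }

  /* Default async_run: spawn one detached host thread per async
     target region.  */
  static void
  default_async_run (int device, void *tgt_fn, void *tgt_vars,
                     void **args, void *async_data)
  {
    struct async_run_info *info = malloc (sizeof (*info));
    pthread_t thread;
    if (info == NULL)
      GOMP_PLUGIN_fatal ("malloc failed");
    info->device = device;
    info->tgt_fn = tgt_fn;
    info->tgt_vars = tgt_vars;
    info->args = args;
    info->async_data = async_data;
    if (pthread_create (&thread, NULL, default_async_run_1, info))
      GOMP_PLUGIN_fatal ("pthread_create failed");
    pthread_detach (thread);
  }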

Can't PTX do better than this?  We probably need to take the device lock
for the possible memory transfers and deallocation at the end of the
region, and thus have to perform some action on the host between the end
of the async target region and the data copying/deallocation.  But can't
we have a single thread per device instead of one thread per async target
region, use the CUDA async APIs, and poll for all the pending async
regions together?  If we need to take the device lock, then we need to
serialize the finalization anyway, and reusing the same thread would
significantly decrease the overhead when there are many async regions.
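
Very roughly, the kind of thing I have in mind, using the CUDA driver API.
Only cuEventQuery and GOMP_PLUGIN_target_task_completion are real here;
struct pending_region, struct ptx_device and the first_pending/retire/
wait_for_work helpers are made-up names, and all locking is elided:

  #include <cuda.h>

  /* One record per in-flight async target region; the event is
     recorded on the region's stream right after the kernel launch.  */
  struct pending_region
  {
    CUevent evt;
    void *async_data;
    struct pending_region *next;
  };

  /* One instance of this loop per device, instead of one host thread
     per async target region.  */
  static void *
  nvptx_poll_thread (void *arg)
  {
    struct ptx_device *dev = arg;
    for (;;)
      {
        /* Walk all pending regions for this device and retire those
           whose kernels have finished.  */
        for (struct pending_region *p = first_pending (dev); p;)
          {
            struct pending_region *next = p->next;
            if (cuEventQuery (p->evt) == CUDA_SUCCESS)
              {
                /* Kernel done; let libgomp do the copy-back and
                   deallocation under the device lock.  */
                GOMP_PLUGIN_target_task_completion (p->async_data);
                retire (dev, p);
              }
            p = next;
          }
        wait_for_work (dev);  /* block until new work arrives or a timeout */
      }
    return NULL;
  }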

And if PTX can, at least in theory, do better than that, then even if we
punt on that for now due to time/resource constraints, maybe it would be
better to do this inside the plugin, where it can be more easily replaced
later.

        Jakub
