On Tue, 15 Dec 2020 18:00:36 +0100
Jakub Jelinek <ja...@redhat.com> wrote:

> On Tue, Dec 15, 2020 at 04:49:38PM +0000, Julian Brown wrote:
> > > Do you need to hold the omp_stacks.lock across the entire
> > > offloading? Doesn't that serialize all offloading kernels to the
> > > same device? I mean, can't the lock be taken just shortly at the
> > > start to either acquire the cached stacks or allocate a fresh
> > > stack, and then at the end to put the stack back into the cache?  
> > 
> > I think you're suggesting something like what Alexander mentioned
> > -- a pool of cached stacks blocks in case the single, locked block
> > is contested. Obviously at present kernel launches are serialised
> > on the target anyway, so it's a question of whether having the
> > device wait for the host to unlock the stacks block (i.e. a context
> > switch, FSVO context switch), or allocating a new stacks block, is
> > quicker. I think the numbers posted in the parent email show that
> > memory allocation is so slow that just waiting for the lock wins.
> > I'm wary of adding unnecessary complication, especially if it'll
> > only be exercised in already hard-to-debug cases (i.e. lots of
> > threads)!  
> 
> I'm not suggesting to have multiple stacks, on the contrary.  I've
> suggested to do the caching only if at most one host thread is
> offloading to the device.
> 
> If one uses
> #pragma omp parallel num_threads(3)
> {
>   #pragma omp target
>   ...
> }
> then I don't see what would previously prevent the concurrent
> offloading, yes, we take the device lock during gomp_map_vars and
> again during gomp_unmap_vars, but don't hold it across the offloading
> in between.

I still don't think I quite understand what you're getting at.

We only implement synchronous launches for OpenMP on NVPTX at present,
and those all use the default CUDA driver stream. Only one kernel
executes on the hardware at a time, even if launched from different
host threads. The serialisation isn't due to the device lock being
held, but to the queueing semantics of the underlying API.
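
As an aside, and purely for illustration (not plugin code): a standalone
sketch like the one below, assuming a hypothetical kernel.cubin containing
an entry point kern, shows what I mean -- both host threads queue their
launches on the legacy default stream, so the driver runs them back to
back no matter which thread submitted them.

#include <cuda.h>
#include <pthread.h>
#include <stddef.h>

static CUcontext ctx;
static CUfunction kern;

static void *
launcher (void *arg)
{
  /* Each host thread binds the shared context...  */
  cuCtxSetCurrent (ctx);
  /* ...and launches on stream 0, the legacy default stream, so the two
     launches are queued one behind the other by the driver.  */
  cuLaunchKernel (kern, 1, 1, 1, 32, 1, 1, 0, NULL, NULL, NULL);
  cuCtxSynchronize ();
  return NULL;
}

int
main (void)
{
  CUdevice dev;
  CUmodule mod;
  pthread_t t[2];

  cuInit (0);
  cuDeviceGet (&dev, 0);
  cuCtxCreate (&ctx, 0, dev);
  /* Hypothetical module/kernel names, just for the example.  */
  cuModuleLoad (&mod, "kernel.cubin");
  cuModuleGetFunction (&kern, mod, "kern");

  for (int i = 0; i < 2; i++)
    pthread_create (&t[i], NULL, launcher, NULL);
  for (int i = 0; i < 2; i++)
    pthread_join (t[i], NULL);
  return 0;
}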

> > Does target-side memory allocation call back into the plugin's
> > GOMP_OFFLOAD_alloc? I'm not sure how that works. If not, target-side
> > memory allocation shouldn't be affected, I don't think?  
> 
> Again, I'm not suggesting that it should, but what I'm saying is that
> if a target region ends while some other host tasks are doing target
> regions to the same device concurrently, or if there are async target
> regions in flight, we shouldn't try to cache the stack, but should free
> it right away, because the other target regions might need to malloc
> larger amounts of memory and could fail because of the caching.

I'm assuming you're not suggesting fundamentally changing APIs or
anything to determine if we're launching target regions from multiple
threads at once, but instead that we try to detect the condition
dynamically in the plugin?

So, would the kernel launch look something like this? (Excuse the
pseudo-code-isms!)

void GOMP_OFFLOAD_run (...)
{
  bool used_cache;

  pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
  if (ptx_dev->omp_stacks.usage_count > 0)
    {
      /* Another launch is already using the cached stacks block: wait
         for it, drop the cache, and give this launch a fresh, uncached
         allocation instead.  */
      cuCtxSynchronize ();
      nvptx_stacks_free (ptx_dev);
      ...allocate fresh stack, no caching...
      used_cache = false;
    }
  else
    {
      /* Allocate or re-use the cached stacks block, and then...  */
      ptx_dev->omp_stacks.usage_count++;
      used_cache = true;
    }
  pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);

  /* Launch kernel.  */

  if (used_cache)
    {
      /* Pseudo-code: this would need to run as a host callback once the
         kernel has finished (see the cuStreamAddCallback sketch below).  */
      cuStreamAddCallback (
        pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
        ptx_dev->omp_stacks.usage_count--;
        pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
      );
    }
  else
    {
      pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
      /* Free the uncached stack.  */
      pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
    }
}
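
(For what it's worth, the callback half above would presumably end up
looking roughly like this with the real cuStreamAddCallback interface --
just a sketch against the structures used above, with nvptx_stacks_release
being a made-up name:)

/* Hypothetical helper: drop our reference to the cached stacks block once
   everything queued on the stream before the callback has finished.  */
static void CUDA_CB
nvptx_stacks_release (CUstream stream, CUresult res, void *data)
{
  struct ptx_device *ptx_dev = (struct ptx_device *) data;

  pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
  ptx_dev->omp_stacks.usage_count--;
  pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
}

  /* ...and in GOMP_OFFLOAD_run, after the launch (NULL being the default
     stream the launch went to):  */
  if (used_cache)
    cuStreamAddCallback (NULL, nvptx_stacks_release, ptx_dev, 0);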

That scheme seems rather fragile to me, and would perhaps only offer a
benefit if a previously-cached stacks block was much larger than the one
required for some given later launch. I don't think it would allow any
additional parallelism on the target, either.

Is that sort-of what you meant?

Oh, or perhaps something more like checking cuStreamQuery at the end of
the kernel launch to see if more work (...from other threads) is
outstanding on the same queue? I think that only usefully returns
CUDA_SUCCESS/CUDA_ERROR_NOT_READY, so I'm not sure if that'd help.
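
For reference, I mean something along these lines (sketch only; NULL
being the default stream the launches go to):

  CUresult res = cuStreamQuery (NULL);
  if (res == CUDA_ERROR_NOT_READY)
    {
      /* Other work is still outstanding on the stream -- but that doesn't
         tell us whether it needs the cached stacks block, which is why I'm
         not sure this helps.  */
    }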

Thanks for clarification (& apologies for being slow!),

Julian
