On 10/24/2017 02:55 AM, Jakub Jelinek wrote:

> People from NVidia privately reported an unexpected amount of host2dev
> transfers for #pragma omp target*.

Did they mention which program they were testing?

> The code even had comments like:
>                    /* FIXME: Perhaps add some smarts, like if copying
>                       several adjacent fields from host to target, use some
>                       host buffer to avoid sending each var individually.  */
> and the especially bad example of this was the
>       for (i = 0; i < mapnum; i++)
>         {
>           cur_node.tgt_offset = gomp_map_val (tgt, hostaddrs, i);
>           /* FIXME: see above FIXME comment.  */
>           gomp_copy_host2dev (devicep,
>                               (void *) (tgt->tgt_start + i * sizeof (void *)),
>                               (void *) &cur_node.tgt_offset, sizeof (void *));
>         }
> loop transferring the addresses or firstprivate_int values to the device
> - where we issued mapnum host2dev transfers, each just pointer-sized,
> when we could have just prepared all the pointers in an array and
> host2dev copied them all together.
> 
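
Just to check my understanding, the batched alternative to that loop would
look roughly like this (a rough sketch only; the scratch array and its
management are hypothetical, while gomp_malloc, gomp_map_val and
gomp_copy_host2dev are the existing libgomp helpers):

  /* Build all the device pointer values in a host-side scratch array
     first, then issue a single host2dev copy instead of mapnum
     pointer-sized ones.  */
  void **scratch = gomp_malloc (mapnum * sizeof (void *));
  for (i = 0; i < mapnum; i++)
    scratch[i] = (void *) gomp_map_val (tgt, hostaddrs, i);
  gomp_copy_host2dev (devicep, (void *) tgt->tgt_start,
                      scratch, mapnum * sizeof (void *));
  free (scratch);
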
> The following patch implements coalescing of transfers (only those that are
> copied into the freshly allocated device buffer) into one or multiple larger
> transfers.  The patch doesn't coalesce > 32KB transfers or transfers where
> the gap is 4KB or more.  I guess it would not be too hard to do similar
> coalescing for the dev2host transfers that are from a single device mapping,
> though probably far less important than the more common host2dev transfers.

Why did you choose the 32KB and 4KB limits? I wonder if they have any
impact on firstprivate_int values. If this proves to be effective,
it seems like we should be able to eliminate GOMP_MAP_FIRSTPRIVATE_INT
altogether.

> Tested on x86_64-linux to nvptx-none offloading.

By the way, you can capture all of the calls to the CUDA driver using
nvprof. I usually use this command:

  nvprof --csv --print-gpu-trace --log-file foo.nvprof ./foo.exe

You can then grep the log for the CUDA memcpy entries, which correspond
to the dev2host/host2dev transfers.
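
For instance, with the log file from the command above:

  grep 'CUDA memcpy' foo.nvprof

should print one line per transfer, including its size.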

> Thoughts on this?
> 
> 2017-10-24  Jakub Jelinek  <ja...@redhat.com>
> 
>       * target.c (struct gomp_map_cache): New type.
>       (gomp_cache_add): New function.
>       (gomp_copy_host2dev): Add CACHE argument, if copying into
>       the cached ranges, memcpy into buffer instead of copying
>       into device.
>       (gomp_map_vars_existing, gomp_map_pointer, gomp_map_fields_existing):
>       Add CACHE argument, pass it through to other calls.
>       (gomp_map_vars): Aggregate copies from host to device if small enough
>       and with small enough gaps in between into memcpy into a buffer and
>       fewer host to device copies from the buffer.
>       (gomp_update): Adjust gomp_copy_host2dev caller.
> 
> --- libgomp/target.c.jj       2017-04-20 14:59:08.296263304 +0200
> +++ libgomp/target.c  2017-10-23 19:08:14.348336118 +0200
> @@ -177,10 +177,77 @@ gomp_device_copy (struct gomp_device_des
>      }
>  }
>  
> +struct gomp_map_cache
> +{
> +  void *buf;
> +  struct target_mem_desc *tgt;
> +  size_t *chunks;
> +  long chunk_cnt;
> +  long use_cnt;
> +};
> +

Maybe include a comment here stating that caching is restricted to
chunks of at most 32KB, with no gaps larger than 4KB between variables?
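Something along these lines, perhaps (suggested wording only):

  /* Cache host to device copies of at most 32KB each, merging a copy into
     the previous chunk when it starts within 4KB of that chunk's end.  */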

> +static inline void
> +gomp_cache_add (struct gomp_map_cache *cache, size_t start, size_t len)
> +{
> +  if (len > 32 * 1024 || len == 0)
> +    return;
> +  if (cache->chunk_cnt)
> +    {
> +      if (cache->chunk_cnt < 0)
> +        return;
> +      if (start < cache->chunks[2 * cache->chunk_cnt - 1])
> +        {
> +          cache->chunk_cnt = -1;
> +          return;
> +        }
> +      if (start < cache->chunks[2 * cache->chunk_cnt - 1] + 4 * 1024)
> +        {
> +          cache->chunks[2 * cache->chunk_cnt - 1] = start + len;
> +          cache->use_cnt++;
> +          return;
> +        }
> +      /* If the last chunk is only used by one mapping, discard it,
> +         as it will be one host to device copy anyway and
> +         memcpying it around will only waste cycles.  */
> +      if (cache->use_cnt == 1)
> +        cache->chunk_cnt--;
> +    }
> +  cache->chunks[2 * cache->chunk_cnt] = start;
> +  cache->chunks[2 * cache->chunk_cnt + 1] = start + len;
> +  cache->chunk_cnt++;
> +  cache->use_cnt = 1;
> +}
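
To check my reading of the chunk bookkeeping, a hypothetical sequence
(offsets and sizes made up for illustration, with the cache freshly
initialized so chunk_cnt == 0 and chunks large enough):

  gomp_cache_add (&cache, 0x0, 0x40);    /* chunks[0..1] = {0x0, 0x40}, use_cnt = 1  */
  gomp_cache_add (&cache, 0x100, 0x80);  /* gap < 4KB: chunks[1] grows to 0x180, use_cnt = 2  */
  gomp_cache_add (&cache, 0x2000, 0x40); /* gap >= 4KB: new chunk {0x2000, 0x2040}  */

Is that right?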

I'll need to swap in more state to review the rest of the patch, but I
like the idea.

One other minor optimization would be to change the arguments to offloaded
functions from a single struct to individual arguments. At least for
nvptx, cuLaunchKernel accepts variable arguments for PTX kernels. There
are two advantages to this: 1) at least with nvptx, nvptx_exec wouldn't
need to reserve a block of device memory for the argument struct, and
2) it would eliminate one level of indirection for each offloaded argument
(although SRA probably takes care of the majority of this already).
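
Roughly something like this on the plugin side (just a sketch; fn, the
launch dimensions and arg0/arg1 are placeholders, not the actual
nvptx_exec variables):

  /* Pass each offloaded argument through cuLaunchKernel's kernelParams
     array instead of marshalling them all into a single device-resident
     struct.  */
  void *args[] = { &arg0, &arg1 };
  CUresult r = cuLaunchKernel (fn, grid_x, 1, 1, block_x, 1, 1,
                               0 /* sharedMemBytes */, NULL /* stream */,
                               args, NULL /* extra */);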

Cesar
