Hi Cesar!

On Mon, 13 Feb 2017 08:58:39 -0800, Cesar Philippidis <ce...@codesourcery.com> wrote:
> This patch does the following:
>
> * Adjusts the default num_gangs to utilize more of the GPU hardware.
> * Teaches libgomp to emit a diagnostic when num_workers isn't supported.
>
> [...]
Thanks!

> This patch has been applied to gomp-4_0-branch.

For easier review, I'm quoting here your r245393 commit, with whitespace
changes ignored:

> --- libgomp/plugin/plugin-nvptx.c
> +++ libgomp/plugin/plugin-nvptx.c
> @@ -917,10 +918,15 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
>         seen_zero = 1;
>       }
>
> -  if (seen_zero)
> -    {
> -      /* See if the user provided GOMP_OPENACC_DIM environment
> -	 variable to specify runtime defaults.  */
> +  /* Both reg_granularity and warp_granularity were extracted from
> +     the "Register Allocation Granularity" in Nvidia's CUDA Occupancy
> +     Calculator spreadsheet.  Specifically, this required SM_30+
> +     targets.  */
> +  const int reg_granularity = 256;

That is per warp, so a granularity of 256 / 32 = 8 registers per
thread.  (It would be strange otherwise.)

> +  const int warp_granularity = 4;
> +
> +  /* See if the user provided GOMP_OPENACC_DIM environment variable to
> +     specify runtime defaults.  */
>    static int default_dims[GOMP_DIM_MAX];
>
>    pthread_mutex_lock (&ptx_dev_lock);
> @@ -952,25 +958,30 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
> 	  CUdevice dev = nvptx_thread()->ptx_dev->dev;
> 	  /* 32 is the default for known hardware.
> 	     */
> 	  int gang = 0, worker = 32, vector = 32;
> -	  CUdevice_attribute cu_tpb, cu_ws, cu_mpc, cu_tpm;
> +	  CUdevice_attribute cu_tpb, cu_ws, cu_mpc, cu_tpm, cu_rf, cu_sm;
>
> 	  cu_tpb = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK;
> 	  cu_ws = CU_DEVICE_ATTRIBUTE_WARP_SIZE;
> 	  cu_mpc = CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT;
> 	  cu_tpm = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR;
> +	  cu_rf = CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR;
> +	  cu_sm = CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR;
>
> 	  if (cuDeviceGetAttribute (&block_size, cu_tpb, dev) == CUDA_SUCCESS
> 	      && cuDeviceGetAttribute (&warp_size, cu_ws, dev) == CUDA_SUCCESS
> 	      && cuDeviceGetAttribute (&dev_size, cu_mpc, dev) == CUDA_SUCCESS
> -	      && cuDeviceGetAttribute (&cpu_size, cu_tpm, dev) == CUDA_SUCCESS)
> +	      && cuDeviceGetAttribute (&cpu_size, cu_tpm, dev) == CUDA_SUCCESS
> +	      && cuDeviceGetAttribute (&rf_size, cu_rf, dev) == CUDA_SUCCESS
> +	      && cuDeviceGetAttribute (&sm_size, cu_sm, dev) == CUDA_SUCCESS)

Trying to compile this with CUDA 5.5/331.113, I run into:

    [...]/source-gcc/libgomp/plugin/plugin-nvptx.c: In function 'nvptx_exec':
    [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:970:16: error: 'CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR' undeclared (first use in this function)
           cu_rf = CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR;
                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:970:16: note: each undeclared identifier is reported only once for each function it appears in
    [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:971:16: error: 'CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR' undeclared (first use in this function)
           cu_sm = CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR;
                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For reference, please see the code handling
CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR in the trunk
version of the nvptx_open_device function.
And then: I don't specifically have a problem with discontinuing CUDA
5.5 support and requiring 6.5, for example, but that should be a
conscious decision.

> @@ -980,8 +991,6 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
> 	     matches the hardware configuration.  Logical gangs are
> 	     scheduled onto physical hardware.  To maximize usage, we
> 	     should guess a large number.  */
> -	  if (default_dims[GOMP_DIM_GANG] < 1)
> -	    default_dims[GOMP_DIM_GANG] = gang ? gang : 1024;

That's "bad", because a non-zero "default_dims[GOMP_DIM_GANG]" (also
known as "default_dims[0]") is used to decide whether to enter this
whole code block, and with that assignment removed, every call of the
nvptx_exec function will now re-do all this GOMP_OPENACC_DIM parsing,
the cuDeviceGetAttribute calls, the computations, and so on.  (See
"GOMP_DEBUG=1" output.)

I think this whole code block should be moved into the
nvptx_open_device function, so that it is executed once, when the
device is opened -- after all, these are all per-device attributes.
(So, it's actually conceptually incorrect to have this done only once
in the nvptx_exec function, given that this data is then used in the
same process by/for potentially different hardware devices.)

And one could argue that the GOMP_OPENACC_DIM parsing conceptually
belongs in generic libgomp code instead of the nvptx plugin.  (But
that aspect can be cleaned up later: currently, the nvptx plugin is
the only one supporting/using it.)

> 	  /* The worker size must not exceed the hardware.  */
> 	  if (default_dims[GOMP_DIM_WORKER] < 1
> 	      || (default_dims[GOMP_DIM_WORKER] > worker && gang))
> @@ -998,9 +1007,56 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
> 	}
>       pthread_mutex_unlock (&ptx_dev_lock);
>
> +  int reg_used = -1;  /* Dummy value.  */
> +  cuFuncGetAttribute (&reg_used, CU_FUNC_ATTRIBUTE_NUM_REGS, function);

Why is there a need to assign this "dummy value"?
Now, per my understanding, this "function attribute" must be constant
across all calls of that function.  So, shouldn't this be queried just
once, after "linking" the code (link_ptx function)?  And indeed,
that's what the trunk version of the nvptx plugin's
GOMP_OFFLOAD_load_image function is doing.

> +
> +  int reg_per_warp = ((reg_used * warp_size + reg_granularity - 1)
> +		      / reg_granularity) * reg_granularity;
> +
> +  int threads_per_sm = (rf_size / reg_per_warp / warp_granularity)
> +    * warp_granularity * warp_size;
> +
> +  if (threads_per_sm > cpu_size)
> +    threads_per_sm = cpu_size;

(It's too late right now for me to follow these computations.)

> +
> +  if (seen_zero)
> +    {
>        for (i = 0; i != GOMP_DIM_MAX; i++)
> 	 if (!dims[i])
> +	   {
> +	     if (default_dims[i] > 0)
> 	       dims[i] = default_dims[i];
> +	     else
> +	       switch (i)
> +		 {
> +		 case GOMP_DIM_GANG:
> +		   dims[i] = 2 * threads_per_sm / warp_size * dev_size;
> +		   break;

Please document that constant factor "2"; in fact, a comment on that
whole computation certainly wouldn't hurt.  And, as I suggested
internally, wouldn't a (moderately?) higher constant factor instead of
just "2" actually be better still, to ensure higher occupancy of the
hardware?

> +		 case GOMP_DIM_WORKER:
> +		 case GOMP_DIM_VECTOR:
> +		   dims[i] = warp_size;
> +		   break;

Using "warp_size" for "dims[GOMP_DIM_WORKER]" is conceptually wrong.

> +		 default:
> +		   abort ();
> +		 }
> +	   }
> +    }
> +
> +  /* Check if the accelerator has sufficient hardware resources to
> +     launch the offloaded kernel.  */
> +  if (dims[GOMP_DIM_WORKER] > 1)
> +    {

Per the previous computation, doesn't it always hold that
"dims[GOMP_DIM_WORKER]" is bigger than one?

> +      int threads_per_block = threads_per_sm > block_size
> +	? block_size : threads_per_sm;
> +
> +      threads_per_block /= warp_size;

(Again, given the late hour, I'm not trying to follow these
computations.)
> +
> +      if (dims[GOMP_DIM_WORKER] > threads_per_block)
> +	GOMP_PLUGIN_fatal ("The Nvidia accelerator has insufficient resources "
> +			   "to launch '%s'; recompile the program with "
> +			   "'num_workers = %d' on that offloaded region or "
> +			   "'-fopenacc-dim=-:%d'.\n",
> +			   targ_fn->launch->fn, threads_per_block,
> +			   threads_per_block);
>     }

ACK -- until we come up with a better solution.


Regards,
 Thomas