On 08/01/2018 04:01 PM, Cesar Philippidis wrote:
> On 08/01/2018 03:18 AM, Tom de Vries wrote:
>> On 07/31/2018 04:58 PM, Cesar Philippidis wrote:
>>> The attached patch teaches libgomp how to use the CUDA thread occupancy
>>> calculator built into the CUDA driver. Despite both being based on the
>>> CUDA thread occupancy spreadsheet distributed with CUDA, the built-in
>>> occupancy calculator differs from the occupancy calculator in og8 in two
>>> key ways. First, og8 launches twice the number of gangs as the driver
>>> thread occupancy calculator. This was my attempt at preventing threads
>>> from idling; it operates on a principle similar to running 'make
>>> -jN', where N is twice the number of CPU threads.
>>
>> You're saying the two methods are different, and that the difference
>> between the two methods is a factor two, which is a heuristic you added
>> yourself on top of one of the methods, which implies that in fact the
>> two methods are identical. Is my understanding correct here?
>
> With the exception that og8 multiplies num_gangs by a factor of
> two, those two algorithms are identical, at least with respect to gangs.
>
>>> Second, whereas og8
>>> always attempts to maximize the CUDA block size, the driver may select a
>>> smaller block, which effectively decreases num_workers.
>>>
>>
>> So, do I understand it correctly that using the function
>> cuOccupancyMaxPotentialBlockSize gives us "minimum block size that can
>> achieve the maximum occupancy" or some such and og8 gives us "maximum
>> block size"?
>
> Correct.
>
>>> In terms of performance, there really isn't that much of a difference
>>> between the CUDA driver's occupancy calculator and og8's. However, on
>>> the tests that are impacted, they are generally within a factor of two
>>> from one another, with some tests running faster with the driver
>>> occupancy calculator and others with og8's.
>>>
>>
>> Ack. Well, until we understand that in more detail, going with the
>> driver's occupancy calculator seems the right thing to do.
>>
>>> Unfortunately, support for the CUDA driver API isn't universal; it's
>>> only available in CUDA version 6.5 (or 6050) and newer. In this patch,
>>> I'm exploiting the fact that init_cuda_lib only checks for errors on the
>>> last library function initialized.
>>
>> That sounds incorrect to me. In init_cuda_lib I see:
>> ...
>> # define CUDA_ONE_CALL(call) CUDA_ONE_CALL_1 (call)
>> # define CUDA_ONE_CALL_1(call) \
>> cuda_lib.call = dlsym (h, #call); \
>> if (cuda_lib.call == NULL) \
>> return false;
>> CUDA_CALLS
>> ...
>> so in fact every library function is checked. Have you tested this with
>> pre-6.5 CUDA?
>
> I misread that. You're correct. So far, I've only tested this out with
> CUDA 9.
>
>> I think we need to add and handle:
>> ...
>> CUDA_ONE_CALL_MAYBE_NULL (cuOccupancyMaxPotentialBlockSize)
>> ...
>>
>>> Therefore it guards the usage of
>>>
>>> cuOccupancyMaxPotentialBlockSizeWithFlags
>>>
>>> by checking driver_version.
>>
>> If we allow the cuOccupancyMaxPotentialBlockSize field to be NULL, we
>> can test for NULL, which seems a simpler solution than testing the version.
>>
>>> If the driver occupancy calculator isn't
>>> available, it falls back to the existing defaults. Maybe the og8 thread
>>> occupancy would make a better default for older versions of CUDA, but
>>> that's a patch for another day.
>>>
>>
>> Agreed.
>>
>>> Is this patch OK for trunk?
>>
>> The patch doesn't build in a setup with
>> --enable-offload-targets=nvptx-none and without cuda, that enables usage
>> of plugin/cuda/cuda.h:
>> ...
>> /data/offload-nvptx/src/libgomp/plugin/plugin-nvptx.c:98:16: error:
>> ‘cuOccupancyMaxPotentialBlockSize’ undeclared here (not in a function);
>> did you mean ‘cuOccupancyMaxPotentialBlockSizeWithFlags’?
>> CUDA_ONE_CALL (cuOccupancyMaxPotentialBlockSize) \
>> ...
>>
>>> @@ -1220,11 +1227,39 @@ nvptx_exec (void (*fn), size_t mapnum, void
>>> **hostaddrs, void **devaddrs,
>>>
>>> {
>>> bool default_dim_p[GOMP_DIM_MAX];
>>> + int vectors = nvthd->ptx_dev->default_dims[GOMP_DIM_VECTOR];
>>> + int workers = nvthd->ptx_dev->default_dims[GOMP_DIM_WORKER];
>>> + int gangs = nvthd->ptx_dev->default_dims[GOMP_DIM_GANG];
>>> +
>>> + /* The CUDA driver occupancy calculator is only available on
>>> + CUDA version 6.5 (6050) and newer. */
>>> + if (nvthd->ptx_dev->driver_version >= 6050)
>>> + {
>>> + int grids, blocks;
>>> + CUDA_CALL_ASSERT (cuOccupancyMaxPotentialBlockSize, &grids,
>>> + &blocks, function, NULL, 0,
>>> + dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]);
>>> + GOMP_PLUGIN_debug (0, "cuOccupancyMaxPotentialBlockSize: "
>>> + "grid = %d, block = %d\n", grids, blocks);
>>> +
>>
>>
>>> + if (GOMP_PLUGIN_acc_default_dim (GOMP_DIM_GANG) == 0)
>>
>> You should use gomp_openacc_dims[0].
>>
>>> + gangs = grids * (blocks / warp_size);
>>
>> So, we launch with gangs == grids * workers ? Is that intentional?
>
> Yes. At least that's what I've been using in og8. Setting num_gangs =
> grids alone caused significant slowdowns.
>
Well, what you're saying here is: increasing num_gangs increases
performance.
You don't explain why you multiply by workers specifically.
>>> +
>>> + if (GOMP_PLUGIN_acc_default_dim (GOMP_DIM_WORKER) == 0)
>>> + workers = blocks / vectors;
>>
>> Also, the new default calculation is not nicely separated from the
>> fallback default calculation. I've updated the patch with a cleaner
>> separation, attached and build without cuda but untested.
>
> That looks nicer. Did you want to take over this patch?
No.
Thanks,
- Tom