Hi!

On 2023-12-07T13:43:17+0000, Andrew Stubbs <a...@codesourcery.com> wrote:
> @Thomas, there are questions for you below....
It's been a while that I've been working on this; I'll try to produce
some coherent answers now.

> On 22/11/2023 17:07, Tobias Burnus wrote:
>> Let's start with the patch itself:
>>> --- a/libgomp/target.c
>>> +++ b/libgomp/target.c
>>> ...
>>> +static struct gomp_device_descr *
>>> +get_device_for_page_locked (void)
>>> +{
>>> +  gomp_debug (0, "%s\n",
>>> +              __FUNCTION__);
>>> +
>>> +  struct gomp_device_descr *device;
>>> +#ifdef HAVE_SYNC_BUILTINS
>>> +  device
>>> +    = __atomic_load_n (&device_for_page_locked, MEMMODEL_RELAXED);
>>> +  if (device == (void *) -1)
>>> +    {
>>> +      gomp_debug (0, "  init\n");
>>> +
>>> +      gomp_init_targets_once ();
>>> +
>>> +      device = NULL;
>>> +      for (int i = 0; i < num_devices; ++i)
>>
>> Given that this function just sets a single variable based on whether
>> the page_locked_host_alloc_func function pointer exists, wouldn't it
>> be much simpler to just do all this handling in gomp_target_init?
>
> @Thomas, care to comment on this?

From what I remember, we cannot assume that 'gomp_target_init' has
already been done when we get here; therefore 'gomp_init_targets_once'
is being called here.  We may get to 'get_device_for_page_locked' via
host-side OpenMP, in code that doesn't contain any OpenMP 'target'
offloading things at all (a small example of what I mean is further
below).  Therefore, this was (a) necessary to make that work, and
(b) did seem to be a useful abstraction to me.

>>> +      for (int i = 0; i < num_devices; ++i)
>>> ...
>>> +          /* We consider only the first device of potentially several of the
>>> +             same type as this functionality is not specific to an individual
>>> +             offloading device, but instead relates to the host-side
>>> +             implementation of the respective offloading implementation.  */
>>> +          if (devices[i].target_id != 0)
>>> +            continue;
>>> +
>>> +          if (!devices[i].page_locked_host_alloc_func)
>>> +            continue;
>>> ...
>>> +          if (device)
>>> +            gomp_fatal ("Unclear how %s and %s libgomp plugins may"
>>> +                        " simultaneously provide functionality"
>>> +                        " for page-locked memory",
>>> +                        device->name, devices[i].name);
>>> +          else
>>> +            device = &devices[i];
>>
>> I find this a bit inconsistent: If - let's say - GCN does not provide
>> its own pinning, the code assumes that CUDA pinning is just fine.
>> However, if both support it, CUDA pinning suddenly is not fine for GCN.
>
> I think it means that we need to revisit this code if that situation
> ever occurs. Again, @Thomas?

That's correct.  As you know, I don't like doing half-baked things.  ;-)
However, this did seem like a useful stepping-stone to me, to get such a
thing implemented at all; we do understand that it won't handle all
(future) cases, thus the 'gomp_fatal' to catch that loudly.  Once we are
in the situation where we have multiple ways of allocating large amounts
of pinned memory, we'll have to see how to deal with that.  (We may, of
course, already now work out how that should be done conceptually,
possibly via OpenMP committee/specification work, if necessary?)

(As for the future implementation: maybe *allocate* via one device, and
then *register* the allocation with the other devices.)

>> Additionally, all wording suggests that it does not matter for CUDA
>> for which device access we want to optimize the pinning.  But the code
>> above also fails if I have a system with two Nvidia cards.

Why/how does the code fail in that case?  Assuming I understood the
question correctly, the 'if (devices[i].target_id != 0) continue;' is
meant to handle that case: with two Nvidia cards we get two devices of
the same type, and only the first one (target_id 0) is considered.
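Coming back to the "via host-side OpenMP" point above, here's a minimal
sketch of the kind of code I have in mind -- a pinned allocator used
without any 'target' construct, built with just '-fopenmp'.  (This is
only an illustration, assuming the 'pinned' allocator trait is what
eventually asks a libgomp plugin for page-locked memory.)

/* Host-only OpenMP: request pinned memory via an allocator trait,
   without any 'target' construct anywhere in the program.  */
#include <omp.h>
#include <stdlib.h>

int
main (void)
{
  omp_alloctrait_t traits[] = { { omp_atk_pinned, omp_atv_true } };
  omp_allocator_handle_t al
    = omp_init_allocator (omp_default_mem_space, 1, traits);
  if (al == omp_null_allocator)
    return EXIT_FAILURE;

  /* It is this allocation that may end up asking a libgomp plugin for
     page-locked memory, even though no device code is involved.  */
  double *p = omp_alloc (1024 * sizeof *p, al);

  omp_free (p, al);
  omp_destroy_allocator (al);
  return EXIT_SUCCESS;
}

Nothing in such a program ever runs a 'target' region, so nothing else
guarantees that 'gomp_target_init' has happened before the allocation.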
>> From the wording, it sounds as if just checking whether the
>> device->type is different would do.

Maybe, but I'm not sure I follow what exactly you mean.

>> But all in all, I wonder whether it wouldn't be much simpler to state
>> something like the following (where applicable):
>>
>> The first device that provides pinning support is used; the
>> assumption is that all other devices "of the same kind" or also "of
>> different kinds"? and the host can access this memory without
>> measurable performance penalty compared to a normal page lock, and
>> that having multiple device types or host/device NUMA aware pinning
>> support in the plugin is not available.

If I understood you correctly, that, however, is not correct: if you
(hypothetically) allocate pinned memory via GCN (or even the small
amount you get via the host), then yes, an nvptx device will be able to
access it, but it won't see the performance gains that you'd get if you
had allocated via nvptx.  (You'll need to register existing memory
regions with the nvptx device/CUDA, if I remember correctly offhand;
that is subject to later work.  A rough sketch of what I mean is in the
P.S. below.)

Hopefully that did help?


Grüße
 Thomas


>> NOTE: For OpenMP 6.0's OMP_AVAILABLE_DEVICES environment variable and
>> device-set memory spaces, this might need to be revisited.
>
> This seems reasonable to me, until the user can specify.
>
> (I'm going to go look at the other review points now....)
>
> Andrew
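P.S. Purely for illustration of the "allocate via one device, then
register the allocation with the other devices" idea: a rough sketch of
registering an already-existing host allocation with CUDA, using the
CUDA runtime API's cudaHostRegister.  (This is not what the current
patch does, and the libgomp nvptx plugin itself uses the driver API;
plain malloc here just stands in for "allocated via another device's
plugin".)

/* Rough sketch: take memory allocated elsewhere and register it with
   CUDA so that nvptx transfers can treat it as page-locked.  */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int
main (void)
{
  size_t size = 4 * 1024 * 1024;
  void *buf = malloc (size);
  if (!buf)
    return EXIT_FAILURE;

  /* Register the existing allocation as page-locked for CUDA.  */
  cudaError_t err = cudaHostRegister (buf, size, cudaHostRegisterDefault);
  if (err != cudaSuccess)
    {
      fprintf (stderr, "cudaHostRegister: %s\n", cudaGetErrorString (err));
      free (buf);
      return EXIT_FAILURE;
    }

  /* ... transfers to/from 'buf' now benefit from pinning ... */

  cudaHostUnregister (buf);
  free (buf);
  return EXIT_SUCCESS;
}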