Hi!

On 2023-12-07T13:43:17+0000, Andrew Stubbs <a...@codesourcery.com> wrote:
> @Thomas, there are questions for you below....
It's been a while that I've been working on this; I'll try to produce
some coherent answers now.

> On 22/11/2023 17:07, Tobias Burnus wrote:
>> Let's start with the patch itself:
>>> --- a/libgomp/target.c
>>> +++ b/libgomp/target.c
>>> ...
>>> +static struct gomp_device_descr *
>>> +get_device_for_page_locked (void)
>>> +{
>>> +  gomp_debug (0, "%s\n",
>>> +              __FUNCTION__);
>>> +
>>> +  struct gomp_device_descr *device;
>>> +#ifdef HAVE_SYNC_BUILTINS
>>> +  device
>>> +    = __atomic_load_n (&device_for_page_locked, MEMMODEL_RELAXED);
>>> +  if (device == (void *) -1)
>>> +    {
>>> +      gomp_debug (0, "  init\n");
>>> +
>>> +      gomp_init_targets_once ();
>>> +
>>> +      device = NULL;
>>> +      for (int i = 0; i < num_devices; ++i)
>>
>> Given that this function just sets a single variable based on whether
>> the page_locked_host_alloc_func function pointer exists, wouldn't it
>> be much simpler to just do all this handling in gomp_target_init?
>
> @Thomas, care to comment on this?

From what I remember, we cannot assume that 'gomp_target_init' has
already been done when we get here; therefore 'gomp_init_targets_once'
is being called here.  We may get to 'get_device_for_page_locked' via
host-side OpenMP, in code that doesn't contain any OpenMP 'target'
offloading things at all (a small example of what I mean is further
below).  Therefore, this was (a) necessary to make that work, and
(b) did seem to be a useful abstraction to me.

>>> +      for (int i = 0; i < num_devices; ++i)
>>> ...
>>> +          /* We consider only the first device of potentially several of the
>>> +             same type as this functionality is not specific to an individual
>>> +             offloading device, but instead relates to the host-side
>>> +             implementation of the respective offloading implementation.  */
>>> +          if (devices[i].target_id != 0)
>>> +            continue;
>>> +
>>> +          if (!devices[i].page_locked_host_alloc_func)
>>> +            continue;
>>> ...
>>> +          if (device)
>>> +            gomp_fatal ("Unclear how %s and %s libgomp plugins may"
>>> +                        " simultaneously provide functionality"
>>> +                        " for page-locked memory",
>>> +                        device->name, devices[i].name);
>>> +          else
>>> +            device = &devices[i];
>>
>> I find this a bit inconsistent: If - let's say - GCN does not provide
>> its own pinning, the code assumes that CUDA pinning is just fine.
>> However, if both support it, CUDA pinning suddenly is not fine for GCN.
>
> I think it means that we need to revisit this code if that situation
> ever occurs. Again, @Thomas?

That's correct.  As you know, I don't like doing half-baked things.  ;-)
However, this did seem like a useful stepping-stone to me, to get such a
thing implemented at all; we do understand that it won't handle all
(future) cases, thus the 'gomp_fatal' to catch that loudly.  Once we are
in the situation where we have multiple ways of allocating large amounts
of pinned memory, we'll have to see how to deal with that.  (We may, of
course, already now work out how that should be done conceptually,
possibly via OpenMP committee/specification work, if necessary?)

(As for the future implementation: maybe *allocate* via one device, and
then *register* the allocation with the other devices.)

>> Additionally, all wording suggests that it does not matter for CUDA
>> for which device access we want to optimize the pinning.  But the code
>> above also fails if I have a system with two Nvidia cards.

Why/how does the code fail in that case?  Assuming I understood the
question correctly, the 'if (devices[i].target_id != 0) continue;' is
meant to handle that case: with two Nvidia cards we get two devices of
the same type, and only the first one (target_id 0) is considered.
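Coming back to the "via host-side OpenMP" point above, here's a minimal
sketch of the kind of code I have in mind -- a pinned allocator used
without any 'target' construct, built with just '-fopenmp'.  (This is
only an illustration, assuming the 'pinned' allocator trait is what
eventually asks a libgomp plugin for page-locked memory.)

/* Host-only OpenMP: request pinned memory via an allocator trait,
   without any 'target' construct anywhere in the program.  */
#include <omp.h>
#include <stdlib.h>

int
main (void)
{
  omp_alloctrait_t traits[] = { { omp_atk_pinned, omp_atv_true } };
  omp_allocator_handle_t al
    = omp_init_allocator (omp_default_mem_space, 1, traits);
  if (al == omp_null_allocator)
    return EXIT_FAILURE;

  /* It is this allocation that may end up asking a libgomp plugin for
     page-locked memory, even though no device code is involved.  */
  double *p = omp_alloc (1024 * sizeof *p, al);

  omp_free (p, al);
  omp_destroy_allocator (al);
  return EXIT_SUCCESS;
}

Nothing in such a program ever runs a 'target' region, so nothing else
guarantees that 'gomp_target_init' has happened before the allocation.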
>> From the wording, it sounds as if just checking whether the
>> device->type is different would do.

Maybe, but I'm not sure I follow what exactly you mean.

>> But all in all, I wonder whether it wouldn't be much simpler to state
>> something like the following (where applicable):
>>
>> The first device that provides pinning support is used; the
>> assumption is that all other devices "of the same kind" or also "of
>> different kinds"? and the host can access this memory without
>> measurable performance penalty compared to a normal page lock, and
>> that having multiple device types or host/device NUMA aware pinning
>> support in the plugin is not available.

If I understood you correctly, that, however, is not correct: if you
(hypothetically) allocate pinned memory via GCN (or even the small
amount you get via the host), then yes, an nvptx device will be able to
access it, but it won't see the performance gains that you'd get if you
had allocated via nvptx.  (You'll need to register existing memory
regions with the nvptx device/CUDA, if I remember correctly offhand;
that is subject to later work.  A rough sketch of what I mean is in the
P.S. below.)

Hopefully that did help?


Grüße
 Thomas


>> NOTE: For OpenMP 6.0's OMP_AVAILABLE_DEVICES environment variable and
>> device-set memory spaces, this might need to be revisited.
>
> This seems reasonable to me, until the user can specify.
>
> (I'm going to go look at the other review points now....)
>
> Andrew
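P.S. Purely for illustration of the "allocate via one device, then
register the allocation with the other devices" idea: a rough sketch of
registering an already-existing host allocation with CUDA, using the
CUDA runtime API's cudaHostRegister.  (This is not what the current
patch does, and the libgomp nvptx plugin itself uses the driver API;
plain malloc here just stands in for "allocated via another device's
plugin".)

/* Rough sketch: take memory allocated elsewhere and register it with
   CUDA so that nvptx transfers can treat it as page-locked.  */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int
main (void)
{
  size_t size = 4 * 1024 * 1024;
  void *buf = malloc (size);
  if (!buf)
    return EXIT_FAILURE;

  /* Register the existing allocation as page-locked for CUDA.  */
  cudaError_t err = cudaHostRegister (buf, size, cudaHostRegisterDefault);
  if (err != cudaSuccess)
    {
      fprintf (stderr, "cudaHostRegister: %s\n", cudaGetErrorString (err));
      free (buf);
      return EXIT_FAILURE;
    }

  /* ... transfers to/from 'buf' now benefit from pinning ... */

  cudaHostUnregister (buf);
  free (buf);
  return EXIT_SUCCESS;
}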