On 07/06/2022 13:10, Jakub Jelinek wrote:
> On Tue, Jun 07, 2022 at 12:05:40PM +0100, Andrew Stubbs wrote:
>> Following some feedback from users of the OG11 branch, I think I need to
>> withdraw this patch for now.
>>
>> The memory pinned via the mlock call does not give the expected performance
>> boost. I had not expected it to do much in my test setup, given that the
>> machine has a lot of RAM and my benchmarks are small, but others have tried
>> larger workloads on a variety of machines and architectures.

> I don't understand why there should be any expected performance boost (at
> least not unless the machine starts swapping out pages);
> { omp_atk_pinned, true } is solely about the requirement that the memory
> can't be swapped out.

Memory pinned via the CUDA API seems to take a faster path through the NVidia drivers. That is a black box to me, but it seems a plausible explanation, and the results differ between x86_64 and powerpc hosts (such as the Summit supercomputer).
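
To make that concrete, here is a rough sketch of the two pinning paths being compared. This is not the libgomp implementation, just an illustration using the CUDA runtime API (libgomp's nvptx plugin uses the driver API, but the idea is the same), with error handling abbreviated:

#define _GNU_SOURCE
#include <cuda_runtime.h>
#include <stdlib.h>
#include <sys/mman.h>

#define SIZE (64 * 1024 * 1024)

/* Pinned in the OS sense only: the pages cannot be swapped out, but the
   CUDA driver does not know they are locked, so it still treats them as
   ordinary pageable memory when copying to/from the device.  */
void *
pin_with_mlock (void)
{
  void *p = mmap (NULL, SIZE, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED)
    return NULL;
  if (mlock (p, SIZE) != 0)
    {
      munmap (p, SIZE);
      return NULL;
    }
  return p;
}

/* Pinned via the CUDA API: cudaHostRegister page-locks the buffer *and*
   registers it with the driver, so device transfers can use it directly.  */
void *
pin_with_cuda (void)
{
  void *p = malloc (SIZE);
  if (p == NULL)
    return NULL;
  if (cudaHostRegister (p, SIZE, cudaHostRegisterDefault) != cudaSuccess)
    {
      free (p);
      return NULL;
    }
  return p;
}

As far as I understand it, the driver stages memory it has not registered through its own bounce buffers, which would explain why mlock-only pinning gives no speed-up for transfers.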

>> It seems that it isn't enough for the memory to be pinned; it has to be
>> pinned using the CUDA API to get the performance boost. I had not done this

> A performance boost for what kind of code?
> I don't understand how the CUDA API could be useful (or can be used at all)
> if offloading to NVPTX isn't involved.  The fact that somebody asks for host
> memory allocation with omp_atk_pinned set to true doesn't mean it will be
> in any way related to NVPTX offloading (unless it is in an NVPTX target
> region obviously, but then mlock isn't available, so sure, if there is
> something CUDA can provide for that case, nice).

This is specifically for NVPTX offload, of course, but then that's what our customer is paying for.

The expectation from users is that memory pinning will give the benefits specific to the active device. We can certainly make that happen when there is only one (flavour of) offload device present. I had hoped one implementation could serve all cases, but apparently not.
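
For reference, the user-side request we are talking about looks roughly like this with the OpenMP 5.x allocator API (the size and error handling are illustrative only):

#include <omp.h>
#include <stdio.h>

int
main (void)
{
  /* Ask for host memory that cannot be swapped out.  */
  omp_alloctrait_t traits[] = {
    { omp_atk_pinned, omp_atv_true }
  };
  omp_allocator_handle_t pinned
    = omp_init_allocator (omp_default_mem_space, 1, traits);
  if (pinned == omp_null_allocator)
    {
      fprintf (stderr, "pinned allocator unavailable\n");
      return 1;
    }

  /* With NVPTX offload active, the hope is that this buffer also gets the
     fast host<->device transfer path.  */
  double *buf = omp_alloc (1024 * 1024 * sizeof (double), pinned);
  if (buf == NULL)
    return 1;

  /* ... use buf as the host side of target data transfers ...  */

  omp_free (buf, pinned);
  omp_destroy_allocator (pinned);
  return 0;
}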


>> I don't think libmemkind will resolve this performance issue, although it
>> can certainly be used for host implementations of low-latency memories,
>> etc.

> The reason for libmemkind is primarily its support for HBW memory (though
> admittedly I need to find out what kinds of such memory it does support),
> and the various interleaving options etc. the library has.
> Plus, once we have such support, since it has its own customizable
> allocator, it could be used to allocate larger chunks of memory that can be
> mlocked, and then small user requests for pinned memory could simply be
> served from those chunks.
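
If I understand the suggestion correctly, it amounts to something like the toy sketch below: mlock one large chunk up front and carve small pinned allocations out of it. In practice libmemkind's customizable allocator (or libgomp's own bookkeeping) would do the real management rather than this bump allocator:

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#define POOL_SIZE (16 * 1024 * 1024)

static char *pool_base;
static size_t pool_used;

/* Lock one large chunk up front, paying the mlock cost once.  */
int
pinned_pool_init (void)
{
  void *p = mmap (NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED)
    return -1;
  if (mlock (p, POOL_SIZE) != 0)
    {
      munmap (p, POOL_SIZE);
      return -1;
    }
  pool_base = p;
  return 0;
}

/* Serve small pinned requests from the locked chunk; a real allocator
   would track frees, grow the pool, and be thread-safe.  */
void *
pinned_pool_alloc (size_t size)
{
  size = (size + 15) & ~(size_t) 15;   /* Keep 16-byte alignment.  */
  if (pool_base == NULL || POOL_SIZE - pool_used < size)
    return NULL;                       /* Caller falls back to mmap+mlock.  */
  void *p = pool_base + pool_used;
  pool_used += size;
  return p;
}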

It should be straightforward to switch the no-offload implementation to libmemkind when the time comes (the changes would be contained within config/linux/allocator.c), but I have no plans to do so myself (and no hardware to test it with). I'd prefer that it didn't impede the offload solution in the meantime.

Andrew
