Hi Cory,

On 2023-11-23 08:35, Cordell Bloor wrote:
> On 2023-11-22 03:19, Christian Kastner wrote:
>>> The Linux kernel on Debian is built without HSA_AMD_SVM enabled. That is
>>> the KConfig for "Enable HMM-based shared virtual memory manager", which
>>> is required for xnack+ operation. The xnack feature allows some AMD GPUs
>>> to retry memory accesses that fail due to a page fault, which is used as
>>> a mechanism for migrating managed memory automatically from host to
>>> device. With xnack disabled, page faults in device code are not
>>> recoverable [1].
>> I've rebuilt our kernel with this option enabled, and the message indeed
>> went away. Great!
>>
>> This also required DEVICE_PRIVATE (and that one also suggests
>> HMM_MIRROR). I don't see any downside to these; should we request them
>> from the Kernel Team?
> 
> I suppose the downside would be that more code means more bugs. I'm not
> sure what inclusion criteria is used by the maintainers, but it seems

You linked to [1] in one of your replies. Under "Supported Hardware",
the article states:

> Not all GPUs are supported. Most GFX9 GPUs from the GCN series usually 
> support XNACK, but only APU platforms enabled it by default. On dedicated 
> graphics cards, it’s disabled by the Linux amdgpu kernel driver, possibly due 
> to stability concerns as it’s still an experimental feature.
> 
> For users of GFX10/GFX11 GPUs from the RDNA series, unfortunately, XNACK is 
> no longer supported. Only computing cards from the CDNA series has XNACK 
> support, such as Instinct MI100 and MI200 - and they also belong to the 
> GFX900 series.

I don't think the lack of official support is a problem here; evaluating
this is exactly what we have our CI for. We could build an image with a
kernel that has these options enabled, and see what happens to the tests
there.
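
To make sure such an image really carries the options discussed above, a
check along these lines might be useful. This is only a rough sketch: it
assumes the kernel config is installed as /boot/config-<release>, and the
helper names (parse_kconfig, missing_options) are made up for
illustration, not part of any existing tool.

```python
import os
import re

# Kconfig symbols mentioned in this thread.
REQUIRED = ("HSA_AMD_SVM", "DEVICE_PRIVATE", "HMM_MIRROR")


def parse_kconfig(text):
    """Return {symbol: value} for every CONFIG_* option that is set."""
    options = {}
    for line in text.splitlines():
        # Lines like "# CONFIG_FOO is not set" are deliberately skipped.
        match = re.match(r"^CONFIG_([A-Za-z0-9_]+)=(.*)$", line.strip())
        if match:
            options[match.group(1)] = match.group(2)
    return options


def missing_options(text, required=REQUIRED):
    """List the required symbols that are neither built in (y) nor modular (m)."""
    options = parse_kconfig(text)
    return [name for name in required if options.get(name) not in ("y", "m")]


if __name__ == "__main__":
    # Assumption: the config of the running kernel lives at this path.
    path = "/boot/config-" + os.uname().release
    try:
        with open(path) as handle:
            missing = missing_options(handle.read())
    except OSError as exc:
        print(f"could not read {path}: {exc}")
    else:
        print("missing:", ", ".join(missing) if missing else "none")
```

Running it on the CI image should print "missing: none" if all three
options made it into the build.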

However, unlikely as it may seem, I'd still like to ask: is there any
risk of negatively affecting the graphics side of this? Can this change
somehow break a regular user's video output?

This is far-fetched, but it's not entirely inconceivable that some
external stack might rely on the current behavior.

As a workaround, I was hoping that setting HSA_XNACK=0 would disable the
check, but it didn't work on my end, unfortunately.

Best,
Christian

> [1]: https://niconiconi.neocities.org/tech-notes/xnack-on-amd-gpus/
