Hi Cory,

On 2023-11-23 08:35, Cordell Bloor wrote:
> On 2023-11-22 03:19, Christian Kastner wrote:
>>> The Linux kernel on Debian is built without HSA_AMD_SVM enabled. That is
>>> the KConfig for "Enable HMM-based shared virtual memory manager", which
>>> is required for xnack+ operation. The xnack feature allows some AMD GPUs
>>> to retry memory accesses that fail due to a page fault, which is used as
>>> a mechanism for migrating managed memory automatically from host to
>>> device. With xnack disabled, page faults in device code are not
>>> recoverable [1].
>> I've rebuilt our kernel with this option enabled, and the message indeed
>> went away. Great!
>>
>> This also required DEVICE_PRIVATE (and that one also suggests
>> HMM_MIRROR). I don't see any downside to these; should we request them
>> from the Kernel Team?
>
> I suppose the downside would be that more code means more bugs. I'm not
> sure what inclusion criteria is used by the maintainers, but it seems
you linked to [1] in one of your replies. Under "Supported Hardware", the
article states:

> Not all GPUs are supported. Most GFX9 GPUs from the GCN series usually
> support XNACK, but only APU platforms enabled it by default. On dedicated
> graphics cards, it’s disabled by the Linux amdgpu kernel driver, possibly due
> to stability concerns as it’s still an experimental feature.
>
> For users of GFX10/GFX11 GPUs from the RDNA series, unfortunately, XNACK is
> no longer supported. Only computing cards from the CDNA series has XNACK
> support, such as Instinct MI100 and MI200 - and they also belong to the
> GFX900 series.

I don't think the lack of official support is a problem here; evaluating
this is what we have our CI for. We could build an image with a fixed
kernel and see what happens to the tests there.

However, unlikely as it may seem, I'd still like to ask: is there any risk
of negatively affecting the graphics side of this? Can this change somehow
break a regular user's video output? This is far-fetched, but it's not
entirely inconceivable that some external stack might rely on the current
behavior.

As a workaround, I was hoping that setting HSA_XNACK=0 would disable the
check, but unfortunately it didn't work on my end.

Best,
Christian

> [1]: https://niconiconi.neocities.org/tech-notes/xnack-on-amd-gpus/
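
For anyone who wants to check which of the options mentioned in this thread
a given kernel was built with, here is a minimal sketch. It assumes the
kernel config is installed as /boot/config-<release> and that the options
appear under their usual CONFIG_ prefixed names (CONFIG_HSA_AMD_SVM,
CONFIG_DEVICE_PRIVATE, CONFIG_HMM_MIRROR). The function names are
hypothetical and only for illustration; this is not part of any packaging
or CI tooling.

```python
import platform
from pathlib import Path

# Options named in this thread, with the CONFIG_ prefix used in config files.
OPTIONS = ("CONFIG_HSA_AMD_SVM", "CONFIG_DEVICE_PRIVATE", "CONFIG_HMM_MIRROR")


def parse_kernel_config(text):
    """Map option names to values; '# CONFIG_X is not set' lines map to 'n'."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(" is not set"):
            name = line[2:-len(" is not set")]
            values[name] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, _, value = line.partition("=")
            values[name] = value
    return values


def report(text, options=OPTIONS):
    """Return {option: value}, with 'absent' for options not mentioned at all."""
    values = parse_kernel_config(text)
    return {opt: values.get(opt, "absent") for opt in options}


if __name__ == "__main__":
    # Assumption: the running kernel's config lives at /boot/config-<release>.
    path = Path("/boot") / f"config-{platform.release()}"
    if path.exists():
        for opt, value in report(path.read_text()).items():
            print(f"{opt}: {value}")
    else:
        print(f"no config file found at {path}")
```

An option reported as "n" or "absent" would need to be enabled in a rebuilt
kernel, as described above for HSA_AMD_SVM.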