Hi Christian,
On 2023-11-22 03:19, Christian Kastner wrote:
>> The Linux kernel on Debian is built without HSA_AMD_SVM enabled. That is
>> the Kconfig option for "Enable HMM-based shared virtual memory manager",
>> which is required for xnack+ operation. The xnack feature allows some AMD
>> GPUs to retry memory accesses that fail due to a page fault, which is used
>> as a mechanism for migrating managed memory automatically from host to
>> device. With xnack disabled, page faults in device code are not
>> recoverable [1].
>
> I've rebuilt our kernel with this option enabled, and the message indeed
> went away. Great!
>
> This also required DEVICE_PRIVATE (and that one also suggests
> HMM_MIRROR). I don't see any downside to these; should we request them
> from the Kernel Team?

I suppose the downside would be that more code means more bugs. I'm not
sure what inclusion criteria are used by the maintainers, but it seems
like a reasonable request.
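
For anyone who wants to check how a given kernel was configured before
filing such a request, here is a minimal sketch in Python. It assumes the
kernel config is available as plain text at /boot/config-<release>, which
is where Debian normally installs it; the helper name option_states is
made up for illustration and is not part of any existing tool.

```python
import platform
import re

# Options discussed in this thread, written with the CONFIG_ prefix
# that they carry inside a kernel .config file.
OPTIONS = ("CONFIG_HSA_AMD_SVM", "CONFIG_DEVICE_PRIVATE", "CONFIG_HMM_MIRROR")


def option_states(config_text, options=OPTIONS):
    """Map each option to its value ('y', 'm', ...), 'n', or 'missing'.

    Hypothetical helper: 'n' means the file explicitly says
    '# CONFIG_FOO is not set'; 'missing' means the option does not
    appear in the file at all.
    """
    states = {name: "missing" for name in options}
    for raw in config_text.splitlines():
        line = raw.strip()
        enabled = re.fullmatch(r"(CONFIG_\w+)=(.*)", line)
        if enabled and enabled.group(1) in states:
            states[enabled.group(1)] = enabled.group(2)
            continue
        disabled = re.fullmatch(r"# (CONFIG_\w+) is not set", line)
        if disabled and disabled.group(1) in states:
            states[disabled.group(1)] = "n"
    return states


def main():
    # Assumed location of the running kernel's config on Debian.
    path = f"/boot/config-{platform.release()}"
    try:
        with open(path) as f:
            text = f.read()
    except OSError as err:
        print(f"could not read {path}: {err}")
        return
    for name, state in option_states(text).items():
        print(f"{name}: {state}")


if __name__ == "__main__":
    main()
```

This only reports what the config file says; it does not tell you whether
xnack+ actually works on a particular GPU.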

> That did remind me of another message I've seen in dmesg, repeated a
> few dozen times, when some (but not all) tests are run:
>
>     amdgpu: init_user_pages: Failed to get user pages: -1
>
> rocrand is a good example where these occur.
>
> Despite the failure, I did not observe any negative side effects, but
> the above change also did not solve this. Have you seen this message in
> dmesg as well?

Yes, it can be observed in the logs I captured [1]. I'm not sure what it
means. I'll ask.
Sincerely,
Cory Bloor
[1]: https://lists.debian.org/debian-ai/2023/11/msg00043.html