Hi Christian,

Thank you very much for the detailed feedback and insights. Your comments are clear and incredibly helpful.

On 2025/11/12 16:34, Christian König wrote:
Hi,

On 11/12/25 08:29, Honglei Huang wrote:
Hi all,

This RFC patch series introduces a new mechanism for batch registration of
multiple non-contiguous SVM (Shared Virtual Memory) ranges in a single ioctl
call. The primary goal of this series is to start a discussion about the best
approach to handle scattered user memory allocations in GPU workloads.

Background and Motivation
==========================

Current applications using ROCm/HSA often need to register many scattered
memory buffers (e.g., multiple malloc() allocations) for GPU access. With the
existing AMDKFD_IOC_SVM ioctl, each range must be registered individually,
leading to:
- Blocking issue in some special use cases with many memory ranges
- High system call overhead when dealing with dozens or hundreds of ranges
- Inefficient resource management
- Complexity in userspace applications

Use Case Example
================

Consider a typical ML/HPC workload that allocates 100+ small buffers across
different parts of the address space. Currently, this requires 100+ separate
ioctl calls. The proposed batch interface reduces this to a single call.

Yeah, that's an intentional limitation.

In an IOCTL interface you usually need to guarantee that the operation either 
completes or fails in a transactional manner.

It is possible to implement this, but usually rather tricky if you do multiple 
operations in a single IOCTL. So you really need a good use case to justify the 
added complexity.


You're absolutely right about the transactional complexity. This operation indeed requires proper rollback mechanisms and error handling to maintain atomicity.
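
To make the all-or-nothing requirement concrete, here is a minimal userspace sketch of the rollback pattern. It is only an illustration under stated assumptions: `struct svm_range_desc`, `range_op_fn`, `register_ranges_atomic()` and the `demo_*` hooks are hypothetical names, not the uAPI or the driver code from this series. The hooks stand in for whatever per-range work the driver would do.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range descriptor; not the uAPI struct from this series. */
struct svm_range_desc {
    uint64_t start;
    uint64_t size;
};

/* Hypothetical per-range hook standing in for the real driver work. */
typedef int (*range_op_fn)(const struct svm_range_desc *r, void *ctx);

/*
 * Register all ranges or none. On the first failure, undo the ranges
 * that were already registered, in reverse order, and return the error.
 */
static int register_ranges_atomic(const struct svm_range_desc *ranges,
                                  size_t n, range_op_fn reg,
                                  range_op_fn unreg, void *ctx)
{
    for (size_t i = 0; i < n; i++) {
        int ret = reg(&ranges[i], ctx);

        if (ret) {
            /* Roll back ranges [0, i) so no partial state is left. */
            while (i-- > 0)
                unreg(&ranges[i], ctx);
            return ret;
        }
    }
    return 0;
}

/* Demo hooks: fail once 'fail_at' ranges are live, track the live count. */
struct demo_ctx {
    size_t fail_at;
    size_t live;
};

static int demo_reg(const struct svm_range_desc *r, void *ctx)
{
    struct demo_ctx *c = ctx;

    (void)r;
    if (c->live == c->fail_at)
        return -ENOMEM;
    c->live++;
    return 0;
}

static int demo_unreg(const struct svm_range_desc *r, void *ctx)
{
    struct demo_ctx *c = ctx;

    (void)r;
    c->live--;
    return 0;
}
```

The sketch assumes the undo step itself cannot fail. A real implementation would also have to deal with concurrent VMA changes between the register and undo steps, which is where most of the complexity you mention comes from.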


Paravirtualized environments exacerbate this issue, as KVM's memory backing
is often non-contiguous at the host level. In virtualized environments, guest
physical memory appears contiguous to the VM but is actually scattered across
host memory pages. This fragmentation means that what appears as a single
large allocation in the guest may require multiple discrete SVM registrations
to properly handle the underlying host memory layout, further multiplying the
number of required ioctl calls.

SVM with dynamic migration under KVM is most likely a dead end to begin with.

The only possibility to implement it is with memory pinning which is basically 
userptr.

Or a rather slow client side IOMMU emulation to catch concurrent DMA transfers 
to get the necessary information onto the host side.

Intel calls this approach coIOMMU:
https://www.usenix.org/system/files/atc20-paper236-slides-tian.pdf


This is very helpful context. Your confirmation that memory pinning (userptr-style) is the practical approach helps me understand that what I initially saw as a "workaround" is actually the intended solution for this use case. I'll study coIOMMU to better understand the alternatives and their trade-offs.

Current Implementation - A Workaround Approach
===============================================

This patch series implements a WORKAROUND solution that pins user pages in
memory to enable batch registration. While functional, this approach has
several significant limitations:

**Major Concern: Memory Pinning**
- The implementation uses pin_user_pages_fast() to lock pages in RAM
- This defeats the purpose of SVM's on-demand paging mechanism
- Prevents memory oversubscription and dynamic migration
- May cause memory pressure on systems with limited RAM
- Goes against the fundamental design philosophy of HMM-based SVM

That again is perfectly intentional. Any other mode doesn't really make sense 
with KVM.

**Known Limitations:**
1. Increased memory footprint due to pinned pages
2. Potential for memory fragmentation
3. No support for transparent huge pages in pinned regions
4. Limited interaction with memory cgroups and resource controls
5. Complexity in handling VMA operations and lifecycle management
6. May interfere with NUMA optimization and page migration
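
As a rough illustration of the first limitation, pinning works on whole pages, so a small or unaligned buffer pins every page it touches. The helper below is a sketch under assumptions: `SKETCH_PAGE_SIZE` (4 KiB) and `pages_spanned()` are hypothetical names for illustration, not part of the series.

```c
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages. */
#define SKETCH_PAGE_SIZE 4096ULL

/*
 * Number of pages touched by the byte range [start, start + size).
 * A range that crosses a page boundary pins both pages, however small
 * the range is.
 */
static uint64_t pages_spanned(uint64_t start, uint64_t size)
{
    uint64_t first, last;

    if (size == 0)
        return 0;

    first = start / SKETCH_PAGE_SIZE;
    last = (start + size - 1) / SKETCH_PAGE_SIZE;
    return last - first + 1;
}
```

For example, a 32-byte buffer at address 0x1ff0 straddles a page boundary, so `pages_spanned(0x1ff0, 0x20)` is 2 and 8 KiB would be pinned for 32 bytes of data. Summing this over many ranges gives an upper bound only, since ranges that share a page would be counted twice.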

Why Submit This RFC?
====================

Despite the limitations above, I am submitting this series to:

1. **Start the Discussion**: I want community feedback on whether batch
    registration is a useful feature worth pursuing.

2. **Explore Better Alternatives**: Is there a way to achieve batch
    registration without pinning? Could I extend HMM to better support
    this use case?

There is an ongoing unification project between KFD and KGD; we are currently
looking into the SVM part on a weekly basis.

That said, we probably need a really good justification to add new features to
the KFD interfaces, because this is going to delay the unification.

Regards,
Christian.

Thank you for sharing this critical information. Is there a public discussion forum or mailing list for the KFD/KGD unification where I could follow progress and understand the design direction?

Regarding the use case justification: to be honest, the primary driver for
this feature is indeed KVM/virtualized environments. The scattered allocation
problem exists in native environments too, but the overhead is tolerable
there. However, I do want to raise one consideration for the unified
interface design:

GPU computing in virtualized/cloud environments is growing rapidly: major cloud providers (AWS, Azure) now offer GPU instances, and ROCm in containers/VMs is becoming more common. So while my current use case is specific to KVM, the virtualized GPU workload pattern may become more prevalent.

During the unified interface design, please keep the door open for batch-style operations, provided they don't complicate the core design.

I really appreciate your time and guidance on this.

Regards,
Honglei





3. **Understand Trade-offs**: For some workloads, the performance benefit
    of batch registration might outweigh the drawbacks of pinning. I'd
    like to understand where the balance lies.

Questions for the Community
============================

1. Are there existing mechanisms in HMM or mm that could support batch
    operations without pinning?

2. Would a different approach (e.g., async registration, delayed validation)
    be more acceptable?

Alternative Approaches Considered
==================================

I've considered several alternatives:

A) **Pure HMM approach**: Register ranges without pinning, rely entirely on

B) **Userspace batching library**: Hide multiple ioctls behind a library.
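
Alternative B can be sketched as a thin wrapper that loops over the ranges. This is a minimal sketch, not an existing library: `struct user_range`, `register_one_fn`, `register_many()` and `count_calls()` are hypothetical names, and the callback stands in for the existing per-range ioctl.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range descriptor used only by this sketch. */
struct user_range {
    uint64_t start;
    uint64_t size;
};

/* Stand-in for one per-range registration call (one ioctl per call). */
typedef int (*register_one_fn)(uint64_t start, uint64_t size, void *ctx);

/*
 * Register the ranges one by one and stop at the first error.
 * '*done' reports how many ranges were registered, so the caller can
 * decide how to handle a partial result.
 */
static int register_many(const struct user_range *ranges, size_t n,
                         register_one_fn register_one, void *ctx,
                         size_t *done)
{
    size_t i;
    int ret = 0;

    for (i = 0; i < n; i++) {
        ret = register_one(ranges[i].start, ranges[i].size, ctx);
        if (ret)
            break;
    }
    if (done)
        *done = i;
    return ret;
}

/* Demo callback: counts how many kernel entries would have been made. */
static int count_calls(uint64_t start, uint64_t size, void *ctx)
{
    (void)start;
    (void)size;
    ++*(size_t *)ctx;
    return 0;
}
```

The wrapper simplifies the application code, but with `count_calls()` as the callback the counter still ends up equal to the number of ranges: the system call count is unchanged.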

Patch Series Overview
=====================

Patch 1: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute type
Patch 2: Define data structures for batch SVM range registration
Patch 3: Add new AMDKFD_IOC_SVM_RANGES ioctl command
Patch 4: Implement page pinning mechanism for scattered ranges
Patch 5: Wire up the ioctl handler and attribute processing

Testing
=======

The series has been tested with:
- Multiple scattered malloc() allocations (2-2000+ ranges)
- Various allocation sizes (4 KB to 1 GB+)
- GPU compute workloads using the registered ranges
- Memory pressure scenarios
- OpenCL CTS in KVM guest environment
- HIP catch tests in KVM guest environment
- Some AI applications like Stable Diffusion, ComfyUI, 3B LLM models based
   on HuggingFace transformers

I understand this approach is not ideal and am committed to working on a
better solution based on community feedback. This RFC is the starting point
for that discussion.

Thank you for your time and consideration.

Best regards,
Honglei Huang

---

Honglei Huang (5):
   drm/amdkfd: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute
   drm/amdkfd: Add SVM ranges data structures
   drm/amdkfd: Add AMDKFD_IOC_SVM_RANGES ioctl command
   drm/amdkfd: Add support for pinned user pages in SVM ranges
   drm/amdkfd: Wire up SVM ranges ioctl handler

  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c |  67 +++++++++++
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c     | 232 +++++++++++++++++++++++++++++--
  drivers/gpu/drm/amd/amdkfd/kfd_svm.h     |   3 +
  include/uapi/linux/kfd_ioctl.h           |  52 +++++++-
  4 files changed, 348 insertions(+), 6 deletions(-)

