From: Honglei Huang <[email protected]>

The intent of this series is to make drm_gpusvm more flexible and
give drivers more freedom over how they assemble the MM related and device
side operations.

This RFC implements the direction Matt suggested in [1]:

  - Move struct drm_gpusvm_pages out of struct drm_gpusvm_range.
  - Embed either a struct device or a struct drm_device in struct
    drm_gpusvm_pages.
  - Drop struct drm_device from struct drm_gpusvm.
  - Have the driver's range structure embed one or more struct
    drm_gpusvm_pages in addition to struct drm_gpusvm_range.
  - Refactor a few range-based helpers (drm_gpusvm_range_pages_valid,
    drm_gpusvm_range_get_pages, drm_gpusvm_range_unmap_pages), or
    simply drop them entirely and update drivers to use the
    drm_gpusvm_pages helpers instead.

In essence the series does only two abstractions, plus the xe
adaptation that follows from them:

  - range vs pages: split drm_gpusvm_range (MM / VA range state) from
    drm_gpusvm_pages (device physical related), so the two
    sides can have independent lifetimes and ownership.
  - drm_gpusvm vs drm_device: make drm_gpusvm pure MM level and push
    the device side down onto drm_gpusvm_pages, which is where DMA 
    actually happens.
  - xe is updated to fit the modifications, no functional change intended.

If such changes are acceptable in terms of direction, I have a few questions:

 - Drivers now own drm_gpusvm_pages unmap / free and notifier_seq init. 
   OK to push this fully to drivers, or should some new mechanisms need to add 
   to ensure functions can be completed by the framework?
 - This series drops the three drm_gpusvm_range_* helpers and changes
   drm_gpusvm_get_pages() / drm_gpusvm_init() signatures.
   Do we need to keep thin wrappers for backward compatibility.
 - drm_gpusvm_get_pages() mixes HMM fault and device DMA map. Multi device 
under 
   one SVM calls would repeat the HMM fault. Does it need to modified to Split
   into MM level fault + per pages DMA map?

Patch overview:

  1/5 gpusvm: split MM state flags onto drm_gpusvm_range_flags.
  2/5 gpusvm: embed drm_device into drm_gpusvm_pages; DMA goes
      through it.
  3/5 xe: xe_svm_range owns its drm_gpusvm_pages and its lifecycle.
  4/5 gpusvm: drop pages from drm_gpusvm_range and the range-level
      wrappers.
  5/5 gpusvm: drop drm_device from drm_gpusvm.

tests:
AMDGPU:
  based on amdgpu adaptation patch in [2], but still SVM:DRM = 1:1,
  1:n is on going needs many modifications and testings.

  Tested on gfx943 (MI300X) and gfx906 (MI60) with XNACK on/off:
  - KFD test: 95%+ passed.
  - ROCR test: all passed.
  - HIP catch test: gfx943 (MI300X): 96% passed.
                    gfx906 (MI60): 99% passed.
INTEL XE:
  TODO: We bought some Intel Arc A380, but it seems like this cards 
  don't support hardware fault / SVM, waiting for the new 
  cards B580/B570 to arrive.

links:
[1] 
https://lore.kernel.org/amd-gfx/acRgr7QwdULsn6G2@gsse-cloud1/#:~:text=I%20think%20roughly,drm_gpusvm_pages%0A%20%20helpers%20instead.
[2] 
https://lore.kernel.org/amd-gfx/[email protected]/
Honglei Huang (5):
  drm/gpusvm: split MM state flags out of drm_gpusvm_pages_flags
  drm/gpusvm: embed struct drm_device into drm_gpusvm_pages
  drm/xe: have xe_svm_range embed one drm_gpusvm_pages
  drm/gpusvm: move struct drm_gpusvm_pages out of struct
    drm_gpusvm_range
  drm/gpusvm: let the drm_gpusvm core context purely MM level

 drivers/gpu/drm/drm_gpusvm.c    | 128 +++++++++-----------------------
 drivers/gpu/drm/xe/xe_pt.c      |   2 +-
 drivers/gpu/drm/xe/xe_svm.c     |  37 +++++----
 drivers/gpu/drm/xe/xe_svm.h     |  11 ++-
 drivers/gpu/drm/xe/xe_userptr.c |   1 +
 include/drm/drm_gpusvm.h        |  49 ++++++------
 6 files changed, 95 insertions(+), 133 deletions(-)

-- 
2.34.1

Reply via email to