> -----Original Message-----
> From: Dan Williams <[email protected]>
> Sent: 14 April 2026 09:39
> To: Manish Honap <[email protected]>; Alex Williamson
> <[email protected]>; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; Yishai Hadas <[email protected]>; Shameer Kolothum Thodi
> <[email protected]>; [email protected]; Ankit Agrawal
> <[email protected]>
> Cc: Vikram Sethi <[email protected]>; Neo Jia <[email protected]>; Tarun
> Gupta (SW-GPU) <[email protected]>; Zhi Wang <[email protected]>;
> Krishnakant Jaju <[email protected]>; [email protected];
> [email protected]; [email protected];
> [email protected]; Manish Honap <[email protected]>; Alex Williamson
> <[email protected]>; Jonathan Cameron <[email protected]>
> Subject: Re: [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device
> passthrough support
>
> Forgive me if any of the commentary below was already hashed out in the
> v1 discussion. Your excellent changelog notes make catching up much
> easier, thanks!
>
> mhonap@ wrote:
> > From: Manish Honap <[email protected]>
> >
> > CXL Type-2 accelerators (e.g. CXL.mem-capable GPUs) cannot be passed
> > through to virtual machines with stock vfio-pci because the driver has
> > no concept of HDM decoder management, DPA region exposure, or
> > component register emulation. This series wires all of that into
> > vfio-pci-core behind a new CONFIG_VFIO_CXL_CORE optional module,
> > without requiring a variant driver.
> >
> > When a CXL Device DVSEC (Vendor ID 0x1E98, ID 0x0000) is detected at
> > device open time, the driver:
> >
> > - Probes the HDM Decoder Capability block in the component registers
> >   and allocates a DPA region through the CXL subsystem.
> >   On devices where firmware has already committed a decoder, the
> >   kernel skips allocation and re-uses the committed range.
> >
> > - Builds a kernel-owned shadow of the HDM register block. The VMM
> >   reads and writes this shadow through a dedicated COMP_REGS VFIO
> >   region rather than touching the hardware directly. The kernel
> >   enforces CXL 3.1 bit-field rules: reserved bits, read-only bits,
> >   the COMMIT/COMMITTED latch, and the LOCK→0 reprogram path for
> >   firmware-committed decoders.
> >
> > - Exposes the DPA range as a second VFIO region
> >   (VFIO_REGION_SUBTYPE_CXL) backed by the kernel-assigned HPA. PTEs
> >   are inserted lazily on first page fault and torn down atomically
> >   under memory_lock during FLR.
>
> I assume, or hope this means expose a CXL region as
> VFIO_REGION_SUBTYPE_CXL, as DPA is a device-internal address space that
> VFIO probably does not need to worry about. VFIO likely only needs to
> care about system visible resource.

Fair catch, the wording was incorrect. DPA is only what we hand to the
CXL subsystem when setting up the allocation; guests never see it. I
will fix the cover letter and any comments that repeat the mistake so
that the documentation does not imply VFIO exports DPA directly.

>
> If / when interleaving arrives for CXL accelerators the 1:1 vfio-pci to
> DPA to CXL region HPA association breaks. Ok, to assume 1:1 for now.

Okay. v2 is designed for the single-region, non-interleaved case. I will
revisit the association model in the next round of review and update the
changelog to state this decision.

>
> > - Intercepts writes to the CXL DVSEC configuration-space registers
> >   (Control, Status, Control2, Status2, Lock, Range Base) and replays
>
> Range Base is ignored when global HDM Decoder Control is enabled. I
> would hope that this enabling ditches CXL 1.x legacy wherever possible.

Noted that Range Base is ignored when global HDM Decoder Control is
enabled. I will audit the emulation path and the documentation for any
wording that suggests we depend on legacy Range Base behavior in that
case. In the next review round I will drop the Range Base handling from
the DVSEC emulation code path.

>
> >   them through a per-device vconfig shadow, enforcing RWL/RW1CS/RWO
> >   access semantics and the CONFIG_LOCK one-shot latch.
>
> Linux should have no need to ever trigger CXL register bit locks. That
> is only for firmware to make changes immutable if the firmware has
> requirements that nothing moves for its own purposes.
>
> Now, it makes sense to configure the vCXL device to be locked at setup,
> but I do not currently see the use case for the vBIOS to mutate and lock
> the configuration.
>
> [..]
> > - Includes selftests
>
> Yay!

Thank you! I will keep extending them as the UAPI surface stabilizes.

>
> >   covering device detection, capability parsing,
> >   region enumeration, HDM register emulation, DPA mmap with page-fault
> >   insertion, FLR invalidation, and DVSEC register emulation.
> >
> > The series is applied on top of the cxl/next branch using the base
> > specified at the end of this cover letter plus Alejandro's v23 Type-2
> > device support patches [1].
>
> One of the sticking points of the accelerator series has been how many
> details of the CXL core internal object lifetime leak out.
>
> My hope / thought experiment is that the initial version of this
> enabling only needs to facilitate getting a VMM established CXL region
> into a guest. With that VFIO only needs is the CXL region HPA and MMIO
> layout so that CXL registers can be trapped and non-CXL registers can be
> direct mapped.

Okay, I will investigate this part.

>
> > Series structure
> > ================
> >
> > Patches 1-5 extend the CXL subsystem with the APIs vfio-pci needs.
> >
> > Patches 6-8 add the vfio-pci-core plumbing (UAPI, device state,
> > Kconfig/build).
> >
> > Patches 9-15 implement the core device lifecycle: detection, HDM
> > emulation, media readiness, region management, DPA region, and DVSEC
> > emulation.
> >
> > Patches 16-18 wire everything together at open/close time and
> > populate the VFIO ioctl paths.
> >
> > Patches 19-20 add documentation and selftests.
> >
> > Changes since v1
> > ================
> [..]
> > HDM API simplification (patch 1)
> >
> > v1 exported cxl_get_hdm_reg_info() which returned a raw struct with
> > offset and size fields. v2 replaces it with cxl_get_hdm_info() which
> > uses the cached count already populated by cxl_probe_component_regs()
> > and returns a single struct with all HDM metadata, removing the need
> > for callers to re-read the hardware.
>
> What is the accelerator use case to support multiple CXL regions per
> device?

For this version there isn't one. v2 handles one committed decoder and
one contiguous region, and is restricted to decoder 0. I will think
about adding these aspects.

>
> In other words, it feels ambitious to support that while simultaneously
> kicking the "interleave" question down the road.
> If we are going for initial simplicity that also means single region
> to start.
>
> > cxl_await_range_active() split (patch 4)
> >
> > cxl_await_media_ready() requires a CXLMDEV mailbox register, which
> > Type-2 accelerators may not have. v2 splits out cxl_await_range_active()
> > so the HDM range-active poll can be used independently of the media
> > ready path.
>
> This feels like a detail vfio-pci does not need to worry about. The core
> knows that the device does not have a mailbox and the core knows it
> needs to await range ready when probing HDM. Something is broken if
> vfio-pci needs to duplicate this part of the setup.

Okay, I'll send an RFC to linux-cxl for this and refactor the patches in
the current series accordingly.

>
> > LOCK→0 transition in HDM ctrl write emulation (patch 11)
> >
> > v1 did not handle the case where a guest tries to clear the LOCK bit
> > to reprogram a firmware-committed decoder. v2 allows this transition
> > and re-programs the hardware accordingly.
>
> ? Guest has no ability to manipulate Host HPA mappings. A protocol for a
> guest to work with a host to remap HPA does not sound like a v1
> requirement. This would be equivalent to a guest asking to move a host
> PCI BAR.

Okay, agreed. For the initial support I will drop guest lock programming
in the next revision. If we need HDM remapping in the future, we can
design a separate mechanism for it instead of relying on config-space
writes.

Manish
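
For reference, the write filtering that a register shadow still needs
once the guest lock path is gone (read-only and reserved bits preserved,
write-1-to-clear bits honoured) can be sketched roughly as below. This
is a minimal illustration, not code from the series: struct shadow_reg,
shadow_reg_write() and the mask values used with them are hypothetical
names and do not correspond to the CXL-defined bit layout.

```c
#include <stdint.h>

/*
 * Hypothetical shadow of one 32-bit emulated register.
 * Any bit in neither mask is treated as read-only or reserved
 * and keeps its current shadow value on a guest write.
 */
struct shadow_reg {
	uint32_t val;        /* value the guest currently observes */
	uint32_t rw_mask;    /* bits the guest may set or clear freely */
	uint32_t rw1c_mask;  /* bits the guest clears by writing 1 */
};

/* Apply a guest write to the shadow without touching hardware. */
static void shadow_reg_write(struct shadow_reg *reg, uint32_t wval)
{
	uint32_t v = reg->val;

	/* RW bits take the written value; all other bits are kept. */
	v = (v & ~reg->rw_mask) | (wval & reg->rw_mask);

	/* RW1C bits are cleared where the guest wrote 1, else unchanged. */
	v &= ~(wval & reg->rw1c_mask);

	reg->val = v;
}
```

With val = 0x80000100, rw_mask = 0xff and rw1c_mask = 0x100, a guest
write of 0xffffffff leaves the shadow at 0x800000ff: the low byte takes
the written value, bit 8 is cleared by the write of 1, and the
read-only bit 31 is untouched.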

