On Wed, 2014-02-12 at 19:10 +0100, Benoît Canet wrote:
> Hi Alex,
>
> After the IRC conversation we had a few days ago, I understood that a guest
> IOMMU was not implemented.
>
> I have a real use case for it:
>
> Cisco usnic allows writing MPI applications while driving the network card in
> userspace in order to optimize latency. It's made for compute clusters.
>
> The typical cloud provider doesn't provide bare-metal access, only VMs on top
> of Cisco's hardware, hence VFIO uses the host IOMMU to pass the NIC through to
> the guest and no IOMMU is present in the guest.
>
> Questions: Would writing a performant guest IOMMU implementation be possible?
> How complex does this project look to someone who knows IOMMU issues?
>
> The ideal implementation would forward the IOMMU work to the host hardware
> for speed.
>
> I can devote time to writing the feature if it's doable.
Hi Benoît,

I imagine it's doable, but it's certainly not trivial; beyond that I haven't put much thought into it.

VFIO running in a guest would need an IOMMU that implements both the IOMMU API and IOMMU groups (the sketch at the end of this message shows the ioctl sequence a guest VFIO user would exercise). Whether that comes from an emulated physical IOMMU (like VT-d) or from a new paravirt IOMMU would be for you to decide. VT-d would imply using a PCIe chipset like Q35 and either trying to bandage VT-d onto it or updating Q35 to something that natively supports VT-d. Getting a sufficiently similar PCIe hierarchy between host and guest would also be required.

The current model of putting all guest devices into a single IOMMU domain on the host is likely not what you would want, and might imply a new VFIO IOMMU backend that is better tuned for separate domains, sparse mappings, and low latency. VFIO has a modular IOMMU design, so this isn't architecturally a problem: the VFIO user (QEMU) is able to select which backend to use, and the code is written with support for multiple backends in mind.

A complication you'll have is that the granularity of IOMMU operations through VFIO is at the IOMMU group level, so the guest would not easily be able to split devices that are grouped together on the host between separate users in the guest. That could be modeled as a conventional PCI bridge masking the requester IDs of the devices in the guest, such that host groups are mirrored as guest groups.

There might also be simpler "punch-through" ways to do it. For instance, instead of trying to make it work like it does on the host, we could invent a paravirt VFIO interface where a vfio-pv driver in the guest populates /dev/vfio with slightly modified passthroughs to the host fds. The guest OS might not even really need to be aware of the device.

It's an interesting project and certainly a valid use case. I'd also like to see things like Intel's DPDK move to using VFIO, but the current UIO-based DPDK is often what's used in guests.

Thanks,
Alex
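
For reference, below is a minimal sketch of the guest-side VFIO type1 sequence such a user runs, following the usage example in the kernel's Documentation/vfio.txt. The group number ("26") and device address ("0000:06:0d.0") are placeholders and error handling is mostly omitted; whatever guest IOMMU gets exposed, plus the host-side backend behind it, ultimately has to service these operations:

#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

int main(void)
{
        int container, group, device;
        struct vfio_group_status group_status = { .argsz = sizeof(group_status) };
        struct vfio_iommu_type1_dma_map dma_map = { .argsz = sizeof(dma_map) };

        /* The container holds the IOMMU context */
        container = open("/dev/vfio/vfio", O_RDWR);
        if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
                return 1;
        if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU))
                return 1;

        /* "26" is a placeholder group number */
        group = open("/dev/vfio/26", O_RDWR);
        ioctl(group, VFIO_GROUP_GET_STATUS, &group_status);
        if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE))
                return 1;       /* not all devices in the group are bound to vfio */

        /* Attach the group to the container and enable the type1 IOMMU model */
        ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
        ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

        /* Map 1MB of anonymous memory at IOVA 0 for device DMA */
        dma_map.vaddr = (uint64_t)(uintptr_t)mmap(NULL, 1024 * 1024,
                        PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        dma_map.size = 1024 * 1024;
        dma_map.iova = 0;
        dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
        ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);

        /* "0000:06:0d.0" is a placeholder device address */
        device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");

        return device < 0;
}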