On Wed, May 19, 2021 at 03:08:17PM -0600, Alex Williamson wrote: > > +VFIO_USER_DMA_MAP > > +----------------- > > + > > +Message Format > > +^^^^^^^^^^^^^^ > > + > > ++--------------+------------------------+ > > +| Name | Value | > > ++==============+========================+ > > +| Message ID | <ID> | > > ++--------------+------------------------+ > > +| Command | 2 | > > ++--------------+------------------------+ > > +| Message size | 16 + table size | > > ++--------------+------------------------+ > > +| Flags | Reply bit set in reply | > > ++--------------+------------------------+ > > +| Error | 0/errno | > > ++--------------+------------------------+ > > +| Table | array of table entries | > > ++--------------+------------------------+ > > + > > +This command message is sent by the client to the server to inform it of > > the > > +memory regions the server can access. It must be sent before the server can > > +perform any DMA to the client. It is normally sent directly after the > > version > > +handshake is completed, but may also occur when memory is added to the > > client, > > +or if the client uses a vIOMMU. If the client does not expect the server to > > +perform DMA then it does not need to send to the server VFIO_USER_DMA_MAP > > +commands. If the server does not need to perform DMA then it can ignore > > such > > +commands but it must still reply to them. The table is an array of the > > +following structure: > > + > > +Table entry format > > +^^^^^^^^^^^^^^^^^^ > > + > > ++-------------+--------+-------------+ > > +| Name | Offset | Size | > > ++=============+========+=============+ > > +| Address | 0 | 8 | > > ++-------------+--------+-------------+ > > +| Size | 8 | 8 | > > ++-------------+--------+-------------+ > > +| Offset | 16 | 8 | > > ++-------------+--------+-------------+ > > +| Protections | 24 | 4 | > > ++-------------+--------+-------------+ > > +| Flags | 28 | 4 | > > ++-------------+--------+-------------+ > > +| | +-----+------------+ | > > +| | | Bit | Definition | | > > +| | +=====+============+ | > > +| | | 0 | Mappable | | > > +| | +-----+------------+ | > > ++-------------+--------+-------------+ > > + > > +* *Address* is the base DMA address of the region. > > +* *Size* is the size of the region. > > +* *Offset* is the file offset of the region with respect to the associated > > file > > + descriptor. > > It might help to explicitly state this value is ignored by the server > for non-mappable DMA, otherwise a server might assume a specific value > is required (ex. zero) for such cases.
Generally we say that unused inputs must be zero, but yes, this should be clarified, thanks. > > +* *Protections* are the region's protection attributes as encoded in > > + ``<sys/mman.h>``. > > +* *Flags* contains the following region attributes: > > + > > + * *Mappable* indicates that the region can be mapped via the mmap() > > system > > + call using the file descriptor provided in the message meta-data. > > + > > +This structure is 32 bytes in size, so the message size is: > > +16 + (# of table entries * 32). > > + > > +If a DMA region being added can be directly mapped by the server, an array > > of > > +file descriptors must be sent as part of the message meta-data. Each > > mappable > > +region entry must have a corresponding file descriptor. On AF_UNIX > > sockets, the > > Is this saying that if the client provides table entries where indexes > 1, 3, and 5 are indicated as mappable, then there must be an ancillary > file descriptor array with 3 entries, where fd[0] maps to entry[1], > fd[1]:entry[3], and fd[2]:entry[5], even if fd[0-2] are all the > same file descriptor? Yes. Though we are planning to change these calls to only support single regions which would make this moot. > > +file descriptors must be passed as SCM_RIGHTS type ancillary data. > > Otherwise, > > +if a DMA region cannot be directly mapped by the server, it can be > > accessed by > > +the server using VFIO_USER_DMA_READ and VFIO_USER_DMA_WRITE messages, > > explained > > +in `Read and Write Operations`_. A command to map over an existing region > > must > > +be failed by the server with ``EEXIST`` set in error field in the reply. > > + > > +Adding multiple DMA regions can partially fail. The response does not > > indicate > > +which regions were added and which were not, therefore it is a client > > +implementation detail how to recover from the failure. > > + > > +.. Note:: > > + The server can optionally remove succesfully added DMA regions making > > this > > + operation atomic. > > + The client can recover by attempting to unmap one by one all the DMA > > regions > > + in the VFIO_USER_DMA_MAP command, ignoring failures for regions that do > > not > > + exist. > > What's the benefit of specifying this server behavior as optional? I'm > afraid this unspecified error recovery behavior might actually deter > clients from performing batch mappings. Servers also have little > incentive to do their own cleanup if the client has no way to detect > that behavior. This may well be moot too. > > +VFIO_USER_DMA_UNMAP > > +------------------- > > + > > +This command message is sent by the client to the server to inform it that > > a > > +DMA region, previously made available via a VFIO_USER_DMA_MAP command > > message, > > +is no longer available for DMA. It typically occurs when memory is > > subtracted > > +from the client or if the client uses a vIOMMU. If the client does not > > expect > > +the server to perform DMA then it does not need to send to the server > > +VFIO_USER_DMA_UNMAP commands. If the server does not need to perform DMA > > then > > +it can ignore such commands but it must still reply to them. The table is > > an > > I'm confused why expectation of DMA plays a factor here. For example, > if QEMU unplugs a DIMM and the server has an mmap of the file descriptor > related to that DIMM, does it get to retain the mmap if it doesn't > currently have any DMA queued targeting that address range? Can QEMU > skip sending an unmap if the PCI bus master bit is disabled on the > device preventing further DMA? How can the associated file descriptor > get released? This doesn't feel strongly specified. I thought we'd removed those sentences actually, as they're just confusing. In reality, everything is going to both send and handle map/unmap messages. > Are there any assumptions about address and size of the unmap command > relative to the original map command or is the client freely allowed to > bisect, overlap, or overextend previous mappings? Good question. Filed https://github.com/nutanix/libvfio-user/issues/504 to track this. I actually don't know what clients would like to be able to do in this respect. > > +* *Offset* is the file offset of the region with respect to the associated > > file > > + descriptor. > > +* *Protections* are the region's protection attributes as encoded in > > + ``<sys/mman.h>``. > > +* *Flags* contains the following region attributes: > > + > > + * *VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP* indicates that a dirty page > > bitmap > > + must be populated before unmapping the DMA region. The client must > > provide > > + a ``struct vfio_bitmap`` in the VFIO bitmaps field for each region, > > with > > + the ``vfio_bitmap.pgsize`` and ``vfio_bitmap.size`` fields initialized. > > + > > +* *VFIO Bitmaps* contains one ``struct vfio_bitmap`` per region (explained > > + below) if ``VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP`` is set in Flags. > > How will this be extended when new flags are added to get new data or > new data formats? Note for instance that the kernel struct > vfio_iommu_type1_dma_unmap specifies the data[] as opaque in general and > only specifies it as struct vfio_bitmap for the case where > GET_DIRTY_BITMAP is specified. We're planning to do something like that for the new (actually more like vfio again) map/unmap format. > > +Upon receiving a VFIO_USER_DMA_UNMAP command, if the file descriptor is > > mapped > > +then the server must release all references to that DMA region before > > replying, > > +which includes potentially in flight DMA transactions. Removing a portion > > of a > > +DMA region is possible. > > Ah, maybe this answers my question about unmap vs map, but it also seems > to contradict the description allowing the server to ignore unmap > requests if no DMA is expected when we state here that the server MUST > release references. The text is confusing (which is why I've removed it again). What it's really trying to say is: If there is a server implementation (such as the gpio-pci-idio-16 sample) that never needs to access guest memory at all, then the server can choose to ignore DMA_MAP/UNMAP - so it would never keep any references around in the first place. It's not a useful thing to mention in the spec IMHO. > Is this also a good place to point out that the max messages size of > 4096 is extremely limiting for returning a dirty bitmap for most use > cases? Some discussion of the error codes for such a case might be > relevant here too. It's a silly low default. The only implementation so far reports 65536 for what it's worth. We are also prototyping implementation changes such that this limit can be removed altogether; hopefully that will come in a future spec update. > > +VFIO_USER_VM_INTERRUPT > > + > > +Interrupt info format > > +^^^^^^^^^^^^^^^^^^^^^ > > + > > ++-----------+--------+------+ > > +| Name | Offset | Size | > > ++===========+========+======+ > > +| Sub-index | 16 | 4 | > > ++-----------+--------+------+ > > + > > +* *Sub-index* is relative to the IRQ index, e.g., the vector number used > > in PCI > > + MSI/X type interrupts. > > Sorry if I'm blind, but where is the index itself provided? You were confused because the message makes no sense as it's defined. It's been removed. Thanks for taking a look, much appreciated! regards john
