Ping

On Sun, Feb 09, 2020 at 12:43:35PM -0500, Raphael Norwitz wrote:
> 
> On Thu, Feb 06, 2020 at 03:32:38AM -0500, Michael S. Tsirkin wrote:
> > 
> > On Wed, Jan 15, 2020 at 09:57:06PM -0500, Raphael Norwitz wrote:
> > > The current vhost-user implementation in Qemu imposes a limit on the
> > > maximum number of memory slots exposed to a VM using a vhost-user
> > > device. This change provides a new protocol feature
> > > VHOST_USER_F_CONFIGURE_SLOTS which, when enabled, lifts this limit and
> > > allows a VM with a vhost-user device to expose a configurable number of
> > > memory slots, up to the ACPI defined maximum. Existing backends which
> > > do not support this protocol feature are unaffected.
> > 
> > Hmm ACPI maximum seems to be up to 512 - is this too much to fit in a
> > single message? So can't we just increase the number (after negotiating
> > with remote) and be done with it, instead of add/remove? Or is there
> > another reason to prefer add/remove?
> > 
> 
> As mentioned in my cover letter, we experimented with simply increasing the
> message size and it didn’t work on our setup. We debugged down to the socket
> layer and found that on the receiving end the messages were truncated at
> around 512 bytes, or around 16 memory regions. To support 512 memory regions
> we would need a message size of around 32 <bytes per region> * 512 <regions>
> + 8 <bytes for padding and region count> ~= 16k packet size. That would be 64
> times larger than the next largest message size. We thought it would be
> cleaner and more in line with the rest of the protocol to keep the message
> sizes smaller. In particular, we thought memory regions should be treated
> like the rings, which are sent over one message at a time instead of in one
> large message.
> Whether or not such a large message size can be made to work in our case,
> separate messages will always work on Linux, and most likely all other UNIX
> platforms QEMU is used on.
> 
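To make the size arithmetic above concrete, here is a rough sketch (not code
from the patch) that assumes the 32-byte region descriptor from the vhost-user
spec, an 8-byte count/padding header for the one-shot table, and an 8-byte
padding field plus one descriptor for a per-region message:

#include <stdint.h>
#include <stdio.h>

/* 32-byte region descriptor, field names as in the vhost-user spec. */
typedef struct VhostUserMemoryRegion {
    uint64_t guest_phys_addr;
    uint64_t memory_size;
    uint64_t userspace_addr;
    uint64_t mmap_offset;
} VhostUserMemoryRegion;

int main(void)
{
    /* Hypothetical one-shot SET_MEM_TABLE payload covering 512 regions:
     * 4-byte region count + 4-byte padding + 512 descriptors. */
    size_t one_shot = 8 + 512 * sizeof(VhostUserMemoryRegion);

    /* Per-region ADD_MEM_REG / REM_MEM_REG payload: an 8-byte padding
     * field plus a single descriptor. */
    size_t per_region = 8 + sizeof(VhostUserMemoryRegion);

    printf("one-shot payload for 512 regions: %zu bytes\n", one_shot);   /* 16392, ~16k */
    printf("per-region payload:               %zu bytes\n", per_region); /* 40 */
    return 0;
}

The point is only the ratio: one message per region stays within a few tens of
bytes, while a single table message for 512 regions balloons to roughly 16k.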
> > > 
> > > This feature works by using three new messages,
> > > VHOST_USER_GET_MAX_MEM_SLOTS, VHOST_USER_ADD_MEM_REG and
> > > VHOST_USER_REM_MEM_REG. VHOST_USER_GET_MAX_MEM_SLOTS gets the
> > > number of memory slots the backend is willing to accept when the
> > > backend is initialized. Then, when the memory tables are set or updated,
> > > a series of VHOST_USER_ADD_MEM_REG and VHOST_USER_REM_MEM_REG messages
> > > are sent to transmit the regions to map and/or unmap instead of trying
> > > to send all the regions in one fixed size VHOST_USER_SET_MEM_TABLE
> > > message.
> > > 
> > > The vhost_user struct maintains a shadow state of the VM’s memory
> > > regions. When the memory tables are modified, the
> > > vhost_user_set_mem_table() function compares the new device memory state
> > > to the shadow state and only sends regions which need to be unmapped or
> > > mapped in. The regions which must be unmapped are sent first, followed
> > > by the new regions to be mapped in. After all the messages have been
> > > sent, the shadow state is set to the current virtual device state.
> > > 
> > > The current feature implementation does not work with postcopy migration
> > > and cannot be enabled if the VHOST_USER_PROTOCOL_F_REPLY_ACK feature has
> > > also been negotiated.
> > 
> > Hmm what would it take to lift the restrictions?
> > conflicting features like this makes it very hard for users to make
> > an informed choice what to support.
> > 
> 
> We would need a setup with a backend which supports these features (REPLY_ACK
> and postcopy migration). At first glance it looks like DPDK could work but
> I'm not sure how easy it will be to test postcopy migration with the
> resources we have.
> 
> > > Signed-off-by: Raphael Norwitz <[email protected]>
> > > Signed-off-by: Peter Turschmid <[email protected]>
> > > Suggested-by: Mike Cui <[email protected]>
> > > ---
> > > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > > index af83fdd..fed6d02 100644
> > > --- a/hw/virtio/vhost-user.c
> > > +++ b/hw/virtio/vhost-user.c
> > > @@ -35,11 +35,29 @@
> > >  #include <linux/userfaultfd.h>
> > >  #endif
> > > 
> > > -#define VHOST_MEMORY_MAX_NREGIONS    8
> > > +#define VHOST_MEMORY_LEGACY_NREGIONS    8
> > 
> > Hardly legacy when this is intended to always be used e.g. with
> > postcopy, right?
> > 
> > How about 'BASELINE'?
> > > +    msg->hdr.size = sizeof(msg->payload.mem_reg.padding);
> > > +    msg->hdr.size += sizeof(VhostUserMemoryRegion);
> > > +
> > > +    /*
> > > +     * Send VHOST_USER_REM_MEM_REG for memory regions in our shadow state
> > > +     * which are not found not in the device's memory state.
> > 
> > double negation - could not parse this.
> 
> Apologies - typo here. It should say “Send VHOST_USER_REM_MEM_REG for memory
> regions in our shadow state which are not found in the device's memory
> state.”
> i.e. send messages to remove regions in the shadow state but not in the
> updated device state.
> 
> > > +     */
> > > +    for (i = 0; i < u->num_shadow_regions; ++i) {
> > > +        reg = dev->mem->regions;
> > > +
> > > +        for (j = 0; j < dev->mem->nregions; j++) {
> > > +            reg = dev->mem->regions + j;
> > > +
> > > +            assert((uintptr_t)reg->userspace_addr == reg->userspace_addr);
> > > +            mr = memory_region_from_host((void *)(uintptr_t)reg->userspace_addr,
> > > +                                         &offset);
> > > +            fd = memory_region_get_fd(mr);
> > > +
> > > +            if (reg_equal(&u->shadow_regions[i], reg)) {
> > > +                matching = true;
> > > +                found[j] = true;
> > > +                break;
> > > +            }
> > > +        }
> > > +
> > > +        if (fd > 0 && !matching) {
> > > +            msg->hdr.request = VHOST_USER_REM_MEM_REG;
> > > +            msg->payload.mem_reg.region.userspace_addr = reg->userspace_addr;
> > > +            msg->payload.mem_reg.region.memory_size = reg->memory_size;
> > > +            msg->payload.mem_reg.region.guest_phys_addr =
> > > +                reg->guest_phys_addr;
> > > +            msg->payload.mem_reg.region.mmap_offset = offset;
> > > +
> > > +            if (vhost_user_write(dev, msg, &fd, 1) < 0) {
> > > +                return -1;
> > > +            }
> > > +        }
> > > +    }
> > > +
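For the shape of the whole update path, rather than just the removal pass
quoted above, here is a simplified, self-contained sketch of the shadow-state
diff described in the cover letter. It is not the patch's code: MemRegion,
region_equal(), send_rem_region() and send_add_region() are hypothetical
stand-ins for reg_equal() and the vhost_user_write() calls that emit
VHOST_USER_REM_MEM_REG / VHOST_USER_ADD_MEM_REG.

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* 32-byte region descriptor, field names as in the vhost-user spec. */
typedef struct {
    uint64_t guest_phys_addr;
    uint64_t memory_size;
    uint64_t userspace_addr;
    uint64_t mmap_offset;
} MemRegion;

/* Hypothetical stand-ins for the patch's reg_equal() and the
 * vhost_user_write() calls that send the REM/ADD messages. */
static bool region_equal(const MemRegion *a, const MemRegion *b)
{
    return a->guest_phys_addr == b->guest_phys_addr &&
           a->memory_size == b->memory_size &&
           a->userspace_addr == b->userspace_addr;
}
int send_rem_region(const MemRegion *r);
int send_add_region(const MemRegion *r);

/* Two-pass diff: removals for stale shadow regions go out first, then
 * additions for regions only present in the new state, and finally the
 * shadow copy is overwritten with the new device state. */
int sync_mem_table(MemRegion *shadow, unsigned *nshadow,
                   const MemRegion *now, unsigned nnow)
{
    /* 512 is the ACPI-defined maximum discussed in this thread. */
    bool found[512] = { false };
    assert(nnow <= 512);

    for (unsigned i = 0; i < *nshadow; i++) {
        bool matching = false;

        for (unsigned j = 0; j < nnow; j++) {
            if (region_equal(&shadow[i], &now[j])) {
                matching = true;
                found[j] = true;
                break;
            }
        }
        if (!matching && send_rem_region(&shadow[i]) < 0) {
            return -1;
        }
    }

    for (unsigned j = 0; j < nnow; j++) {
        if (!found[j] && send_add_region(&now[j]) < 0) {
            return -1;
        }
    }

    memcpy(shadow, now, nnow * sizeof(*now));
    *nshadow = nnow;
    return 0;
}

As described in the cover letter, removals are sent before additions, the
found[] bookkeeping lets the second pass send ADD messages only for regions
not already covered by the shadow state, and the shadow copy is only updated
after all messages have been sent.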
