On Mon, Sep 1, 2025 at 3:17 PM Jonah Palmer <[email protected]> wrote:
>
> On 9/1/25 2:57 AM, Eugenio Perez Martin wrote:
> > On Wed, Aug 27, 2025 at 6:56 PM Jonah Palmer <[email protected]> wrote:
> >>
> >> On 8/20/25 3:59 AM, Eugenio Perez Martin wrote:
> >>> On Tue, Aug 19, 2025 at 5:11 PM Jonah Palmer <[email protected]> wrote:
> >>>>
> >>>> On 8/19/25 3:10 AM, Eugenio Perez Martin wrote:
> >>>>> On Mon, Aug 18, 2025 at 4:46 PM Jonah Palmer <[email protected]> wrote:
> >>>>>>
> >>>>>> On 8/18/25 2:51 AM, Eugenio Perez Martin wrote:
> >>>>>>> On Fri, Aug 15, 2025 at 4:50 PM Jonah Palmer <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> On 8/14/25 5:28 AM, Eugenio Perez Martin wrote:
> >>>>>>>>> On Wed, Aug 13, 2025 at 4:06 PM Peter Xu <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Aug 13, 2025 at 11:25:00AM +0200, Eugenio Perez Martin wrote:
> >>>>>>>>>>> On Mon, Aug 11, 2025 at 11:56 PM Peter Xu <[email protected]> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Mon, Aug 11, 2025 at 05:26:05PM -0400, Jonah Palmer wrote:
> >>>>>>>>>>>>> This effort was started to reduce the guest-visible downtime caused
> >>>>>>>>>>>>> by virtio-net/vhost-net/vhost-vDPA during live migration, especially
> >>>>>>>>>>>>> vhost-vDPA.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The downtime contributed by vhost-vDPA, for example, comes not from
> >>>>>>>>>>>>> having to migrate a lot of state but rather from expensive backend
> >>>>>>>>>>>>> control-plane latency like CVQ configurations (e.g. MQ queue pairs,
> >>>>>>>>>>>>> RSS, MAC/VLAN filters, offload settings, MTU, etc.). Doing this
> >>>>>>>>>>>>> requires kernel/HW NIC operations, which dominate its downtime.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> In other words, by migrating the state of virtio-net early (before
> >>>>>>>>>>>>> the stop-and-copy phase), we can also start staging backend
> >>>>>>>>>>>>> configurations, which is the main contributor to downtime when
> >>>>>>>>>>>>> migrating a vhost-vDPA device.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I apologize if this series gives the impression that we're migrating
> >>>>>>>>>>>>> a lot of data here. It's more along the lines of moving
> >>>>>>>>>>>>> control-plane latency out of the stop-and-copy phase.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I see, thanks.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Please add these into the cover letter of the next post. IMHO it's
> >>>>>>>>>>>> extremely important information to explain the real goal of this
> >>>>>>>>>>>> work. I bet it is not expected for most people when reading the
> >>>>>>>>>>>> current cover letter.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Then it could have nothing to do with the iterative phase, am I
> >>>>>>>>>>>> right?
> >>>>>>>>>>>>
> >>>>>>>>>>>> What data does the dest QEMU need to start staging backend
> >>>>>>>>>>>> configurations to the HWs underneath? Does dest QEMU already have
> >>>>>>>>>>>> them in the cmdlines?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Asking this because I want to know whether it can be done completely
> >>>>>>>>>>>> without src QEMU at all, e.g. when dest QEMU starts.
> >>>>>>>>>>>>
> >>>>>>>>>>>> If src QEMU's data is still needed, please also first consider
> >>>>>>>>>>>> providing such a facility using an "early VMSD" if it is ever
> >>>>>>>>>>>> possible: feel free to refer to commit 3b95a71b22827d26178.
> >>>>>>>>>>>
> >>>>>>>>>>> While it works for this series, it does not allow resending the state
> >>>>>>>>>>> when the src device changes. For example, if the number of virtqueues
> >>>>>>>>>>> is modified.
> >>>>>>>>>>
> >>>>>>>>>> Some explanation of "how syncing the number of vqueues helps downtime"
> >>>>>>>>>> would help. Not "it might preheat things", but exactly why, and how
> >>>>>>>>>> that differs when it's pure software, and when hardware will be
> >>>>>>>>>> involved.
> >>>>>>>>>
> >>>>>>>>> Per Nvidia engineers, configuring the vqs (number, size, RSS, etc.)
> >>>>>>>>> takes about ~200ms:
> >>>>>>>>> https://lore.kernel.org/qemu-devel/[email protected]/T/
> >>>>>>>>>
> >>>>>>>>> Adding Dragos here in case he can provide more details. Maybe the
> >>>>>>>>> numbers have changed though.
> >>>>>>>>>
> >>>>>>>>> And I guess the difference with pure SW will always come down to PCI
> >>>>>>>>> communications, which I assume are slower than configuring the host SW
> >>>>>>>>> device in RAM or even CPU cache. But I admit that proper profiling is
> >>>>>>>>> needed before making those claims.
> >>>>>>>>>
> >>>>>>>>> Jonah, can you print the time it takes to configure the vDPA device
> >>>>>>>>> with traces vs the time it takes to enable the dataplane of the
> >>>>>>>>> device? So we can get an idea of how much time we save with this.
> >>>>>>>>
> >>>>>>>> Let me know if this isn't what you're looking for.
> >>>>>>>>
> >>>>>>>> I'm assuming by "configuration time" you mean:
> >>>>>>>>  - Time from device startup (entry to vhost_vdpa_dev_start()) to right
> >>>>>>>>    before we start enabling the vrings (e.g. VHOST_VDPA_SET_VRING_ENABLE
> >>>>>>>>    in vhost_vdpa_net_cvq_load()).
> >>>>>>>>
> >>>>>>>> And by "time taken to enable the dataplane" I'm assuming you mean:
> >>>>>>>>  - Time right before we start enabling the vrings (see above) to right
> >>>>>>>>    after we enable the last vring (at the end of
> >>>>>>>>    vhost_vdpa_net_cvq_load()).
> >>>>>>>>
> >>>>>>>> Guest specs: 128G Mem, SVQ=on, CVQ=on, 8 queue pairs:
> >>>>>>>>
> >>>>>>>> -netdev type=vhost-vdpa,vhostdev=$VHOST_VDPA_0,id=vhost-vdpa0,
> >>>>>>>>   queues=8,x-svq=on
> >>>>>>>>
> >>>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,bootindex=-1,
> >>>>>>>>   romfile=,page-per-vq=on,mac=$VF1_MAC,ctrl_vq=on,mq=on,
> >>>>>>>>   ctrl_vlan=off,vectors=18,host_mtu=9000,
> >>>>>>>>   disable-legacy=on,disable-modern=off
> >>>>>>>>
> >>>>>>>> ---
> >>>>>>>>
> >>>>>>>> Configuration time: ~31s
> >>>>>>>> Dataplane enable time: ~0.14ms
> >>>>>>>
> >>>>>>> I was vague, but yes, that's representative enough! It would be more
> >>>>>>> accurate if the configuration time ended by the time QEMU enables the
> >>>>>>> first queue of the dataplane, though.
> >>>>>>>
> >>>>>>> As Si-Wei mentions, is v->shared->listener_registered == true at the
> >>>>>>> beginning of vhost_vdpa_dev_start?
> >>>>>>>
> >>>>>> Ah, I also realized that the QEMU I was using for measurements was a
> >>>>>> version from before the listener_registered member was introduced.
> >>>>>>
> >>>>>> I retested with the latest changes in QEMU and set x-svq=off, e.g.
> >>>>>> guest specs: 128G Mem, SVQ=off, CVQ=on, 8 queue pairs. I ran the test
> >>>>>> 3 times for measurements.
> >>>>>>
> >>>>>> v->shared->listener_registered == false at the beginning of
> >>>>>> vhost_vdpa_dev_start().
> >>>>>
> >>>>> Let's move the effect of the mem pinning out of the downtime by
> >>>>> registering the listener before the migration. Can you check why it is
> >>>>> not registered at vhost_vdpa_set_owner?
> >>>>
> >>>> Sorry, I was profiling improperly. The listener is registered at
> >>>> vhost_vdpa_set_owner initially and v->shared->listener_registered is set
> >>>> to true, but once we reach the first vhost_vdpa_dev_start call, it shows
> >>>> as false and is re-registered later in the function.
> >>>>
> >>>> Should we always expect listener_registered == true at every
> >>>> vhost_vdpa_dev_start call during startup?
> >>>
> >>> Yes, that leaves all the memory pinning time out of the downtime.
> >>>
> >>>> This is what I traced during startup of a single guest (no migration).
> >>>
> >>> We can trace the destination's QEMU to be more accurate, but it probably
> >>> makes no difference.
> >>>
> >>>> The tracepoint is right at the start of the vhost_vdpa_dev_start
> >>>> function:
> >>>>
> >>>> vhost_vdpa_set_owner() - register memory listener
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>
> >>> This is surprising. Can you trace how listener_registered goes to 0
> >>> again?
> >>>
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>> ...
> >>>> * VQs are now being enabled *
> >>>>
> >>>> I'm also seeing that when the guest is being shut down,
> >>>> dev->vhost_ops->vhost_get_vring_base() is failing in
> >>>> do_vhost_virtqueue_stop():
> >>>>
> >>>> ...
> >>>> [ 114.718429] systemd-shutdown[1]: Syncing filesystems and block
> >>>> devices.
> >>>> [ 114.719255] systemd-shutdown[1]: Powering off.
> >>>> [ 114.719916] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> >>>> [ 114.724826] ACPI: PM: Preparing to enter system sleep state S5
> >>>> [ 114.725593] reboot: Power down
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>> qemu-system-x86_64: vhost VQ 2 ring restore failed: -1: Operation not permitted (1)
> >>>> qemu-system-x86_64: vhost VQ 3 ring restore failed: -1: Operation not permitted (1)
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>> qemu-system-x86_64: vhost VQ 4 ring restore failed: -1: Operation not permitted (1)
> >>>> qemu-system-x86_64: vhost VQ 5 ring restore failed: -1: Operation not permitted (1)
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>> qemu-system-x86_64: vhost VQ 6 ring restore failed: -1: Operation not permitted (1)
> >>>> qemu-system-x86_64: vhost VQ 7 ring restore failed: -1: Operation not permitted (1)
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>> qemu-system-x86_64: vhost VQ 8 ring restore failed: -1: Operation not permitted (1)
> >>>> qemu-system-x86_64: vhost VQ 9 ring restore failed: -1: Operation not permitted (1)
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>> qemu-system-x86_64: vhost VQ 10 ring restore failed: -1: Operation not permitted (1)
> >>>> qemu-system-x86_64: vhost VQ 11 ring restore failed: -1: Operation not permitted (1)
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>> qemu-system-x86_64: vhost VQ 12 ring restore failed: -1: Operation not permitted (1)
> >>>> qemu-system-x86_64: vhost VQ 13 ring restore failed: -1: Operation not permitted (1)
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>> qemu-system-x86_64: vhost VQ 14 ring restore failed: -1: Operation not permitted (1)
> >>>> qemu-system-x86_64: vhost VQ 15 ring restore failed: -1: Operation not permitted (1)
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>>
> >>>> However, when x-svq=on, I don't see these errors on shutdown.
> >>>
> >>> SVQ can mask this error, as it does not need to forward the ring
> >>> restore message to the device. It can just start with 0 and convert
> >>> indexes.
> >>>
> >>> Let's focus on listener_registered first :).
> >>>
> >>>>>> ---
> >>>>>>
> >>>>>> Configuration time: Time from first entry into vhost_vdpa_dev_start()
> >>>>>> to right after QEMU enables the first VQ.
> >>>>>>  - 26.947s, 26.606s, 27.326s
> >>>>>>
> >>>>>> Enable dataplane: Time from right after the first VQ is enabled to
> >>>>>> right after the last VQ is enabled.
> >>>>>>  - 0.081ms, 0.081ms, 0.079ms
> >>
> >> I looked into this a bit more and realized I was being naive in thinking
> >> that the vhost-vDPA device startup path of a single VM would be the same
> >> as that of a destination VM during live migration. This is **not** the
> >> case, and I apologize for the confusion I caused.
> >>
> >> What I described and profiled above is indeed true for the startup of a
> >> single VM / source VM with a vhost-vDPA device. However, this is not
> >> true on the destination side, and its configuration time is drastically
> >> different.
> >>
> >> Under the same specs, but now with a live migration performed between a
> >> source and destination VM (128G Mem, SVQ=off, CVQ=on, 8 queue pairs),
> >> and using the same tracepoints to find the configuration time and enable
> >> dataplane time, these are the measurements I found for the **destination
> >> VM**:
> >>
> >> Configuration time: Time from first entry into vhost_vdpa_dev_start to
> >> right after QEMU enables the first VQ.
> >>  - 268.603ms, 241.515ms, 249.007ms
> >>
> >> Enable dataplane time: Time from right after the first VQ is enabled to
> >> right after the last VQ is enabled.
> >>  - 0.072ms, 0.071ms, 0.070ms
> >>
> >> ---
> >>
> >> For those curious, using the same printouts as I did above, this is what
> >> it actually looks like on the destination side:
> >>
> >> * Destination VM is started *
> >>
> >> vhost_vdpa_set_owner() - register memory listener
> >> vhost_vdpa_reset_device() - unregistering listener
> >>
> >> * Start live migration on source VM *
> >> (qemu) migrate unix:/tmp/lm.sock
> >> ...
> >>
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - register listener
> >
> > That's weird, can you check why the memory listener is not registered
> > at vhost_vdpa_set_owner? Or, if it is registered, why is it not
> > registered by the time vhost_vdpa_dev_start is called? This changes the
> > downtime a lot; more than half of the time is spent on this, so it is
> > worth fixing before continuing.
>
> The memory listener is registered at vhost_vdpa_set_owner, but the
> reason we see v->shared->listener_registered == 0 by the time
> vhost_vdpa_dev_start is called is the vhost_vdpa_reset_device that's
> called shortly after.
>
Ok, I missed the status of this. This first reset is actually avoidable. I
see two routes for this:

1) Do not reset if shared->listener_registered. Maybe we should actually
rename that member, as now it means something like "the device is blank
and ready to be configured". Or maybe dedicate two variables or flags to
it; it is a shame to lose the precision of "listener_registered".

2) Implement the VHOST_BACKEND_F_IOTLB_PERSIST part of Si-Wei's series [1].

I'd greatly prefer option 1, as it does not depend on the backend features
and is more generic. But option 2 will be needed to reduce the SVQ
transition downtime too.

> But this re-registering is relatively quick compared to how long it
> takes during its initialization sequence.
>

That's interesting; I guess it is because the regions are warm. Can you
measure the time it takes, so we can evaluate whether it is worth
comparing against iterative migration?

Thanks!

[1] https://lists.nongnu.org/archive/html/qemu-devel/2023-12/msg00909.html
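To make option 1 concrete, here is a rough sketch of the idea (the helper
name and call site are hypothetical; only vhost_vdpa_reset_device() and
v->shared->listener_registered come from the existing code):

/*
 * Sketch only: skip the initial reset while the memory listener is still
 * registered, so the warm, pinned mappings survive instead of being torn
 * down and re-pinned later.
 */
static int vhost_vdpa_maybe_reset_device(struct vhost_dev *dev)
{
    struct vhost_vdpa *v = dev->opaque;

    if (v->shared->listener_registered) {
        /* Device is still blank and its mappings are pinned: nothing to do. */
        return 0;
    }

    return vhost_vdpa_reset_device(dev);
}

Whether the flag alone is a safe proxy for "the device is blank" is
exactly the renaming question above.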
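And for the measurement, something along these lines around the
re-registration in vhost_vdpa_dev_start() should be enough; the qemu_log()
call is illustrative, and g_get_monotonic_time() reports microseconds:

if (!v->shared->listener_registered) {
    int64_t start_us = g_get_monotonic_time();

    memory_listener_register(&v->shared->listener, dev->vdev->dma_as);
    v->shared->listener_registered = true;

    /* Time spent re-mapping / re-pinning guest memory on the warm path. */
    qemu_log("vhost-vdpa: listener re-register took %" PRId64 " us\n",
             g_get_monotonic_time() - start_us);
}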
