On Mon, Sep 1, 2025 at 3:17 PM Jonah Palmer <[email protected]> wrote:
>
> On 9/1/25 2:57 AM, Eugenio Perez Martin wrote:
> > On Wed, Aug 27, 2025 at 6:56 PM Jonah Palmer <[email protected]> wrote:
> >>
> >> On 8/20/25 3:59 AM, Eugenio Perez Martin wrote:
> >>> On Tue, Aug 19, 2025 at 5:11 PM Jonah Palmer <[email protected]> wrote:
> >>>>
> >>>> On 8/19/25 3:10 AM, Eugenio Perez Martin wrote:
> >>>>> On Mon, Aug 18, 2025 at 4:46 PM Jonah Palmer <[email protected]> wrote:
> >>>>>>
> >>>>>> On 8/18/25 2:51 AM, Eugenio Perez Martin wrote:
> >>>>>>> On Fri, Aug 15, 2025 at 4:50 PM Jonah Palmer <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> On 8/14/25 5:28 AM, Eugenio Perez Martin wrote:
> >>>>>>>>> On Wed, Aug 13, 2025 at 4:06 PM Peter Xu <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Aug 13, 2025 at 11:25:00AM +0200, Eugenio Perez Martin wrote:
> >>>>>>>>>>> On Mon, Aug 11, 2025 at 11:56 PM Peter Xu <[email protected]> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Mon, Aug 11, 2025 at 05:26:05PM -0400, Jonah Palmer wrote:
> >>>>>>>>>>>>> This effort was started to reduce the guest-visible downtime caused
> >>>>>>>>>>>>> by virtio-net/vhost-net/vhost-vDPA during live migration, especially
> >>>>>>>>>>>>> vhost-vDPA.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The downtime contributed by vhost-vDPA, for example, comes not from
> >>>>>>>>>>>>> having to migrate a lot of state but rather from expensive backend
> >>>>>>>>>>>>> control-plane latency like CVQ configurations (e.g. MQ queue pairs,
> >>>>>>>>>>>>> RSS, MAC/VLAN filters, offload settings, MTU, etc.). Doing this
> >>>>>>>>>>>>> requires kernel/HW NIC operations, which dominate its downtime.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> In other words, by migrating the state of virtio-net early (before
> >>>>>>>>>>>>> the stop-and-copy phase), we can also start staging backend
> >>>>>>>>>>>>> configurations, which is the main contributor to downtime when
> >>>>>>>>>>>>> migrating a vhost-vDPA device.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I apologize if this series gives the impression that we're migrating
> >>>>>>>>>>>>> a lot of data here. It's more along the lines of moving
> >>>>>>>>>>>>> control-plane latency out of the stop-and-copy phase.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I see, thanks.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Please add these into the cover letter of the next post. IMHO it's
> >>>>>>>>>>>> extremely important information to explain the real goal of this
> >>>>>>>>>>>> work. I bet it is not expected for most people when reading the
> >>>>>>>>>>>> current cover letter.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Then it could have nothing to do with the iterative phase, am I
> >>>>>>>>>>>> right?
> >>>>>>>>>>>>
> >>>>>>>>>>>> What data does the dest QEMU need to start staging backend
> >>>>>>>>>>>> configurations to the HWs underneath? Does dest QEMU already have
> >>>>>>>>>>>> them in the cmdlines?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Asking this because I want to know whether it can be done completely
> >>>>>>>>>>>> without src QEMU at all, e.g. when dest QEMU starts.
> >>>>>>>>>>>>
> >>>>>>>>>>>> If src QEMU's data is still needed, please also first consider
> >>>>>>>>>>>> providing such a facility using an "early VMSD" if it is ever
> >>>>>>>>>>>> possible: feel free to refer to commit 3b95a71b22827d26178.
> >>>>>>>>>>>
> >>>>>>>>>>> While it works for this series, it does not allow resending the state
> >>>>>>>>>>> when the src device changes. For example, if the number of virtqueues
> >>>>>>>>>>> is modified.
> >>>>>>>>>>
> >>>>>>>>>> Some explanation of "how syncing the number of vqueues helps downtime"
> >>>>>>>>>> would help. Not "it might preheat things", but exactly why, and how
> >>>>>>>>>> that differs when it's pure software, and when hardware will be
> >>>>>>>>>> involved.
> >>>>>>>>>
> >>>>>>>>> Per Nvidia engineers, configuring the vqs (number, size, RSS, etc.)
> >>>>>>>>> takes about ~200ms:
> >>>>>>>>> https://lore.kernel.org/qemu-devel/[email protected]/T/
> >>>>>>>>>
> >>>>>>>>> Adding Dragos here in case he can provide more details. Maybe the
> >>>>>>>>> numbers have changed though.
> >>>>>>>>>
> >>>>>>>>> And I guess the difference with pure SW will always come down to PCI
> >>>>>>>>> communications, which I assume are slower than configuring the host SW
> >>>>>>>>> device in RAM or even CPU cache. But I admit that proper profiling is
> >>>>>>>>> needed before making those claims.
> >>>>>>>>>
> >>>>>>>>> Jonah, can you print the time it takes to configure the vDPA device
> >>>>>>>>> with traces vs the time it takes to enable the dataplane of the
> >>>>>>>>> device? So we can get an idea of how much time we save with this.
> >>>>>>>>
> >>>>>>>> Let me know if this isn't what you're looking for.
> >>>>>>>>
> >>>>>>>> I'm assuming by "configuration time" you mean:
> >>>>>>>>  - Time from device startup (entry to vhost_vdpa_dev_start()) to right
> >>>>>>>>    before we start enabling the vrings (e.g. VHOST_VDPA_SET_VRING_ENABLE
> >>>>>>>>    in vhost_vdpa_net_cvq_load()).
> >>>>>>>>
> >>>>>>>> And by "time taken to enable the dataplane" I'm assuming you mean:
> >>>>>>>>  - Time right before we start enabling the vrings (see above) to right
> >>>>>>>>    after we enable the last vring (at the end of
> >>>>>>>>    vhost_vdpa_net_cvq_load()).
> >>>>>>>>
> >>>>>>>> Guest specs: 128G Mem, SVQ=on, CVQ=on, 8 queue pairs:
> >>>>>>>>
> >>>>>>>> -netdev type=vhost-vdpa,vhostdev=$VHOST_VDPA_0,id=vhost-vdpa0,
> >>>>>>>>   queues=8,x-svq=on
> >>>>>>>>
> >>>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,bootindex=-1,
> >>>>>>>>   romfile=,page-per-vq=on,mac=$VF1_MAC,ctrl_vq=on,mq=on,
> >>>>>>>>   ctrl_vlan=off,vectors=18,host_mtu=9000,
> >>>>>>>>   disable-legacy=on,disable-modern=off
> >>>>>>>>
> >>>>>>>> ---
> >>>>>>>>
> >>>>>>>> Configuration time: ~31s
> >>>>>>>> Dataplane enable time: ~0.14ms
> >>>>>>>
> >>>>>>> I was vague, but yes, that's representative enough! It would be more
> >>>>>>> accurate if the configuration time ended by the time QEMU enables the
> >>>>>>> first queue of the dataplane, though.
> >>>>>>>
> >>>>>>> As Si-Wei mentions, is v->shared->listener_registered == true at the
> >>>>>>> beginning of vhost_vdpa_dev_start?
> >>>>>>>
> >>>>>> Ah, I also realized that the QEMU I was using for measurements was a
> >>>>>> version from before the listener_registered member was introduced.
> >>>>>>
> >>>>>> I retested with the latest changes in QEMU and set x-svq=off, e.g.
> >>>>>> guest specs: 128G Mem, SVQ=off, CVQ=on, 8 queue pairs. I ran the test
> >>>>>> 3 times for measurements.
> >>>>>>
> >>>>>> v->shared->listener_registered == false at the beginning of
> >>>>>> vhost_vdpa_dev_start().
> >>>>>
> >>>>> Let's move the effect of the mem pinning out of the downtime by
> >>>>> registering the listener before the migration. Can you check why it is
> >>>>> not registered at vhost_vdpa_set_owner?
> >>>>
> >>>> Sorry, I was profiling improperly. The listener is registered at
> >>>> vhost_vdpa_set_owner initially and v->shared->listener_registered is set
> >>>> to true, but once we reach the first vhost_vdpa_dev_start call, it shows
> >>>> as false and is re-registered later in the function.
> >>>>
> >>>> Should we always expect listener_registered == true at every
> >>>> vhost_vdpa_dev_start call during startup?
> >>>
> >>> Yes, that leaves all the memory pinning time out of the downtime.
> >>>
> >>>> This is what I traced during startup of a single guest (no migration).
> >>>
> >>> We can trace the destination's QEMU to be more accurate, but it probably
> >>> makes no difference.
> >>>
> >>>> The tracepoint is right at the start of the vhost_vdpa_dev_start
> >>>> function:
> >>>>
> >>>> vhost_vdpa_set_owner() - register memory listener
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>
> >>> This is surprising. Can you trace how listener_registered goes to 0
> >>> again?
> >>>
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>> ...
> >>>> * VQs are now being enabled *
> >>>>
> >>>> I'm also seeing that when the guest is being shut down,
> >>>> dev->vhost_ops->vhost_get_vring_base() is failing in
> >>>> do_vhost_virtqueue_stop():
> >>>>
> >>>> ...
> >>>> [ 114.718429] systemd-shutdown[1]: Syncing filesystems and block
> >>>> devices.
> >>>> [ 114.719255] systemd-shutdown[1]: Powering off.
> >>>> [ 114.719916] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> >>>> [ 114.724826] ACPI: PM: Preparing to enter system sleep state S5
> >>>> [ 114.725593] reboot: Power down
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>> qemu-system-x86_64: vhost VQ 2 ring restore failed: -1: Operation not permitted (1)
> >>>> qemu-system-x86_64: vhost VQ 3 ring restore failed: -1: Operation not permitted (1)
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>> qemu-system-x86_64: vhost VQ 4 ring restore failed: -1: Operation not permitted (1)
> >>>> qemu-system-x86_64: vhost VQ 5 ring restore failed: -1: Operation not permitted (1)
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>> qemu-system-x86_64: vhost VQ 6 ring restore failed: -1: Operation not permitted (1)
> >>>> qemu-system-x86_64: vhost VQ 7 ring restore failed: -1: Operation not permitted (1)
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>> qemu-system-x86_64: vhost VQ 8 ring restore failed: -1: Operation not permitted (1)
> >>>> qemu-system-x86_64: vhost VQ 9 ring restore failed: -1: Operation not permitted (1)
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>> qemu-system-x86_64: vhost VQ 10 ring restore failed: -1: Operation not permitted (1)
> >>>> qemu-system-x86_64: vhost VQ 11 ring restore failed: -1: Operation not permitted (1)
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>> qemu-system-x86_64: vhost VQ 12 ring restore failed: -1: Operation not permitted (1)
> >>>> qemu-system-x86_64: vhost VQ 13 ring restore failed: -1: Operation not permitted (1)
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>> qemu-system-x86_64: vhost VQ 14 ring restore failed: -1: Operation not permitted (1)
> >>>> qemu-system-x86_64: vhost VQ 15 ring restore failed: -1: Operation not permitted (1)
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>>
> >>>> However, when x-svq=on, I don't see these errors on shutdown.
> >>>
> >>> SVQ can mask this error, as it does not need to forward the ring
> >>> restore message to the device. It can just start with 0 and convert
> >>> indexes.
> >>>
> >>> Let's focus on listener_registered first :).
> >>>
> >>>>>> ---
> >>>>>>
> >>>>>> Configuration time: Time from first entry into vhost_vdpa_dev_start()
> >>>>>> to right after QEMU enables the first VQ.
> >>>>>>  - 26.947s, 26.606s, 27.326s
> >>>>>>
> >>>>>> Enable dataplane: Time from right after the first VQ is enabled to
> >>>>>> right after the last VQ is enabled.
> >>>>>>  - 0.081ms, 0.081ms, 0.079ms
> >>
> >> I looked into this a bit more and realized I was being naive in thinking
> >> that the vhost-vDPA device startup path of a single VM would be the same
> >> as that of a destination VM during live migration. This is **not** the
> >> case, and I apologize for the confusion I caused.
> >>
> >> What I described and profiled above is indeed true for the startup of a
> >> single VM / source VM with a vhost-vDPA device. However, this is not
> >> true on the destination side, and its configuration time is drastically
> >> different.
> >>
> >> Under the same specs, but now with a live migration performed between a
> >> source and destination VM (128G Mem, SVQ=off, CVQ=on, 8 queue pairs),
> >> and using the same tracepoints to find the configuration time and enable
> >> dataplane time, these are the measurements I found for the **destination
> >> VM**:
> >>
> >> Configuration time: Time from first entry into vhost_vdpa_dev_start to
> >> right after QEMU enables the first VQ.
> >>  - 268.603ms, 241.515ms, 249.007ms
> >>
> >> Enable dataplane time: Time from right after the first VQ is enabled to
> >> right after the last VQ is enabled.
> >>  - 0.072ms, 0.071ms, 0.070ms
> >>
> >> ---
> >>
> >> For those curious, using the same printouts as I did above, this is what
> >> it actually looks like on the destination side:
> >>
> >> * Destination VM is started *
> >>
> >> vhost_vdpa_set_owner() - register memory listener
> >> vhost_vdpa_reset_device() - unregistering listener
> >>
> >> * Start live migration on source VM *
> >> (qemu) migrate unix:/tmp/lm.sock
> >> ...
> >>
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - register listener
> >
> > That's weird, can you check why the memory listener is not registered
> > at vhost_vdpa_set_owner? Or, if it is registered, why is it not
> > registered by the time vhost_vdpa_dev_start is called? This changes the
> > downtime a lot; more than half of the time is spent on this, so it is
> > worth fixing before continuing.
>
> The memory listener is registered at vhost_vdpa_set_owner, but the
> reason we see v->shared->listener_registered == 0 by the time
> vhost_vdpa_dev_start is called is the vhost_vdpa_reset_device that's
> called shortly after.
>
Ok, I missed the status of this. This first reset is actually avoidable. I
see two routes for this:

1) Do not reset if shared->listener_registered. Maybe we should actually
rename that member, as now it means something like "the device is blank
and ready to be configured". Or maybe dedicate two variables or flags to
it; it is a shame to lose the precision of "listener_registered".

2) Implement the VHOST_BACKEND_F_IOTLB_PERSIST part of Si-Wei's series [1].

I'd greatly prefer option 1, as it does not depend on the backend features
and is more generic. But option 2 will be needed to reduce the SVQ
transition downtime too.

> But this re-registering is relatively quick compared to how long it
> takes during its initialization sequence.
>

That's interesting; I guess it is because the regions are warm. Can you
measure the time it takes, so we can evaluate whether it is worth
comparing against iterative migration?

Thanks!

[1] https://lists.nongnu.org/archive/html/qemu-devel/2023-12/msg00909.html
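To make option 1 concrete, here is a rough sketch of the idea (the helper
name and call site are hypothetical; only vhost_vdpa_reset_device() and
v->shared->listener_registered come from the existing code):

/*
 * Sketch only: skip the initial reset while the memory listener is still
 * registered, so the warm, pinned mappings survive instead of being torn
 * down and re-pinned later.
 */
static int vhost_vdpa_maybe_reset_device(struct vhost_dev *dev)
{
    struct vhost_vdpa *v = dev->opaque;

    if (v->shared->listener_registered) {
        /* Device is still blank and its mappings are pinned: nothing to do. */
        return 0;
    }

    return vhost_vdpa_reset_device(dev);
}

Whether the flag alone is a safe proxy for "the device is blank" is
exactly the renaming question above.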
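And for the measurement, something along these lines around the
re-registration in vhost_vdpa_dev_start() should be enough; the qemu_log()
call is illustrative, and g_get_monotonic_time() reports microseconds:

if (!v->shared->listener_registered) {
    int64_t start_us = g_get_monotonic_time();

    memory_listener_register(&v->shared->listener, dev->vdev->dma_as);
    v->shared->listener_registered = true;

    /* Time spent re-mapping / re-pinning guest memory on the warm path. */
    qemu_log("vhost-vdpa: listener re-register took %" PRId64 " us\n",
             g_get_monotonic_time() - start_us);
}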
