On Mon, May 06, 2024 at 06:26:46PM +0200, Maciej S. Szmigiero wrote:
> On 29.04.2024 17:09, Peter Xu wrote:
> > On Fri, Apr 26, 2024 at 07:34:09PM +0200, Maciej S. Szmigiero wrote:
> > > On 24.04.2024 00:35, Peter Xu wrote:
> > > > On Wed, Apr 24, 2024 at 12:25:08AM +0200, Maciej S. Szmigiero wrote:
> > > > > On 24.04.2024 00:20, Peter Xu wrote:
> > > > > > On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > On 19.04.2024 17:31, Peter Xu wrote:
> > > > > > > > On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. BerrangΓ©
> > > > > > > > wrote:
> > > > > > > > > On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:
> > > > > > > > > > On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S.
> > > > > > > > > > Szmigiero wrote:
> > > > > > > > > > > I think one of the reasons for these results is that
> > > > > > > > > > > mixed (RAM + device
> > > > > > > > > > > state) multifd channels participate in the RAM sync
> > > > > > > > > > > process
> > > > > > > > > > > (MULTIFD_FLAG_SYNC) whereas device state dedicated
> > > > > > > > > > > channels don't.
> > > > > > > > > >
> > > > > > > > > > Firstly, I'm wondering whether we can have better names for
> > > > > > > > > > these new
> > > > > > > > > > hooks. Currently (only comment on the async* stuff):
> > > > > > > > > >
> > > > > > > > > > - complete_precopy_async
> > > > > > > > > > - complete_precopy
> > > > > > > > > > - complete_precopy_async_wait
> > > > > > > > > >
> > > > > > > > > > But perhaps better:
> > > > > > > > > >
> > > > > > > > > > - complete_precopy_begin
> > > > > > > > > > - complete_precopy
> > > > > > > > > > - complete_precopy_end
> > > > > > > > > >
> > > > > > > > > > ?
> > > > > > > > > >
> > > > > > > > > > As I don't see why the device must do something with async
> > > > > > > > > > in such hook.
> > > > > > > > > > To me it's more like you're splitting one process into
> > > > > > > > > > multiple, then
> > > > > > > > > > begin/end sounds more generic.
> > > > > > > > > >
> > > > > > > > > > Then, if with that in mind, IIUC we can already split
> > > > > > > > > > ram_save_complete()
> > > > > > > > > > into >1 phases too. For example, I would be curious whether
> > > > > > > > > > the performance
> > > > > > > > > > will go back to normal if we offloading
> > > > > > > > > > multifd_send_sync_main() into the
> > > > > > > > > > complete_precopy_end(), because we really only need one
> > > > > > > > > > shot of that, and I
> > > > > > > > > > am quite surprised it already greatly affects VFIO dumping
> > > > > > > > > > its own things.
> > > > > > > > > >
> > > > > > > > > > I would even ask one step further as what Dan was asking:
> > > > > > > > > > have you thought
> > > > > > > > > > about dumping VFIO states via multifd even during
> > > > > > > > > > iterations? Would that
> > > > > > > > > > help even more than this series (which IIUC only helps
> > > > > > > > > > during the blackout
> > > > > > > > > > phase)?
> > > > > > > > >
> > > > > > > > > To dump during RAM iteration, the VFIO device will need to
> > > > > > > > > have
> > > > > > > > > dirty tracking and iterate on its state, because the guest
> > > > > > > > > CPUs
> > > > > > > > > will still be running potentially changing VFIO state. That
> > > > > > > > > seems
> > > > > > > > > impractical in the general case.
> > > > > > > >
> > > > > > > > We already do such interations in vfio_save_iterate()?
> > > > > > > >
> > > > > > > > My understanding is the recent VFIO work is based on the fact
> > > > > > > > that the VFIO
> > > > > > > > device can track device state changes more or less (besides
> > > > > > > > being able to
> > > > > > > > save/load full states). E.g. I still remember in our QE tests
> > > > > > > > some old
> > > > > > > > devices report much more dirty pages than expected during the
> > > > > > > > iterations
> > > > > > > > when we were looking into such issue that a huge amount of
> > > > > > > > dirty pages
> > > > > > > > reported. But newer models seem to have fixed that and report
> > > > > > > > much less.
> > > > > > > >
> > > > > > > > That issue was about GPU not NICs, though, and IIUC a major
> > > > > > > > portion of such
> > > > > > > > tracking used to be for GPU vRAMs. So maybe I was mixing up
> > > > > > > > these, and
> > > > > > > > maybe they work differently.
> > > > > > >
> > > > > > > The device which this series was developed against (Mellanox
> > > > > > > ConnectX-7)
> > > > > > > is already transferring its live state before the VM gets stopped
> > > > > > > (via
> > > > > > > save_live_iterate SaveVMHandler).
> > > > > > >
> > > > > > > It's just that in addition to the live state it has more than 400
> > > > > > > MiB
> > > > > > > of state that cannot be transferred while the VM is still running.
> > > > > > > And that fact hurts a lot with respect to the migration downtime.
> > > > > > >
> > > > > > > AFAIK it's a very similar story for (some) GPUs.
> > > > > >
> > > > > > So during iteration phase VFIO cannot yet leverage the multifd
> > > > > > channels
> > > > > > when with this series, am I right?
> > > > >
> > > > > That's right.
> > > > >
> > > > > > Is it possible to extend that use case too?
> > > > >
> > > > > I guess so, but since this phase (iteration while the VM is still
> > > > > running)Β doesn't impact downtime it is much less critical.
> > > >
> > > > But it affects the bandwidth, e.g. even with multifd enabled, the device
> > > > iteration data will still bottleneck at ~15Gbps on a common system setup
> > > > the best case, even if the hosts are 100Gbps direct connected. Would
> > > > that
> > > > be a concern in the future too, or it's known problem and it won't be
> > > > fixed
> > > > anyway?
> > >
> > > I think any improvements to the migration performance are good, even if
> > > they don't impact downtime.
> > >
> > > It's just that this patch set focuses on the downtime phase as the more
> > > critical thing.
> > >
> > > After this gets improved there's no reason why not to look at improving
> > > performance of the VM live phase too if it brings sensible improvements.
> > >
> > > > I remember Avihai used to have plan to look into similar issues, I hope
> > > > this is exactly what he is looking for. Otherwise changing migration
> > > > protocol from time to time is cumbersome; we always need to provide a
> > > > flag
> > > > to make sure old systems migrates in the old ways, new systems run the
> > > > new
> > > > ways, and for such a relatively major change I'd want to double check on
> > > > how far away we can support offload VFIO iterations data to multifd.
> > >
> > > The device state transfer is indicated by a new flag in the multifd
> > > header (MULTIFD_FLAG_DEVICE_STATE).
> > >
> > > If we are to use multifd channels for VM live phase transfers these
> > > could simply re-use the same flag type.
> >
> > Right, and that's also my major purpose of such request to consider both
> > issues.
> >
> > If supporting iterators can be easy on top of this, I am thinking whether
> > we should do this in one shot. The problem is even if the flag type can be
> > reused, old/new qemu binaries may not be compatible and may not migrate
> > well when:
> >
> > - The old qemu only supports the downtime optimizations
> > - The new qemu supports both downtime + iteration optimizations
>
> I think the situation here will be the same as with any new flag
> affecting the migration wire protocol - if the old version of QEMU
> doesn't support that flag then it has to be kept at its backward-compatible
> setting for migration to succeed.
>
> > IIUC, at least the device threads are currently created only at the end of
> > migration when switching over for the downtime-only optimization (aka, this
> > series). Then it means it won't be compatible with a new QEMU as the
> > threads there will need to be created before iteration starts to take
> > iteration data. So I believe we'll need yet another flag to tune the
> > behavior of such, one for each optimizations (downtime v.s. data during
> > iterations). If they work mostly similarly, I want to avoid two flags.
> > It'll be chaos for user to see such similar flags and they'll be pretty
> > confusing.
>
> The VFIO loading threads are created from vfio_load_setup(), which is
> called at the very beginning of the migration, so they should be already
> there.
>
> However, they aren't currently prepared to receive VM live phase data.
>
> > If possible, I wish we can spend some time looking into that if they're so
> > close, and if it's low hanging fruit when on top of this series, maybe we
> > can consider doing that in one shot.
>
> I'm still trying to figure out the complete explanation why dedicated
> device state channels improve downtime as there was a bunch of holidays
> last week here.
No rush. I am not sure whether it'll reduce downtime, but it may improve
total migration time when multiple devices are used.
>
> I will have a look later what would it take to add VM live phase multifd
> device state transfer support and also how invasive it would be as I
> think it's better to keep the number of code conflicts in a patch set
> to a manageable size as it reduces the chance of accidentally
> introducing regressions when forward-porting the patch set to the git master.
Yes it makes sense. It'll be good to look one step further in this case,
then:
- If it's easy to add support then we do in one batch, or
- If it's not easy to add support, but if we can find a compatible way so
that ABI can be transparent when adding that later, it'll be also nice, or
- If we have solid clue it should be a major separate work, and we must
need a new flag, then we at least know we should simply split the
effort due to that complexity
The hope is option (1)/(2) would work out.
I hope Avihai can also chim in here (or please reach him out) because I
remember he used to consider proposing such a whole solution, but maybe I
just misunderstood. I suppose no harm to check with him.
Thanks,
--
Peter Xu