On Mon, Sep 08, 2025 at 07:38:45PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> On 08.09.25 18:35, Peter Xu wrote:
> > On Fri, Sep 05, 2025 at 04:50:34PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> > > diff --git a/qapi/migration.json b/qapi/migration.json
> > > index 2387c21e9c..992a5b1e2b 100644
> > > --- a/qapi/migration.json
> > > +++ b/qapi/migration.json
> > > @@ -517,6 +517,12 @@
> > >   #     each RAM page.  Requires a migration URI that supports seeking,
> > >   #     such as a file.  (since 9.0)
> > >   #
> > > +# @local-tap: Migrate TAPs locally, keeping backend alive. Open file
> > > +#     descriptors and TAP-related state are migrated. Only may be
> > > +#     used when migration channel is unix socket. For target device
> > > +#     also @local-incoming option must be specified (since 10.2)
> > > +#     (since 10.2)
> > 
> > IMHO we should move this into a per-device property, at least we need one
> > there to still control the device behavior; we had a similar discussion
> > recently on iterable virtio-net.
> > 
> > But maybe this one is slightly special?  Maybe the tap device needs to at
> > least know whether, in this specific migration, we want to pass over the FD
> > or not (e.g. local upgrade, or remote _real_ migration)?
> > 
> > If that's the case, we may consider providing a generic migration
> > capability, like cap-fd-passing.  Nowadays since Fabiano's moving migration
> > capabilities all over to migration parameters, this one can start with a
> > parameter instead of a capability.  The problem with migration capability
> > is (at least) that it can't be ON by default on any machine types.. meanwhile
> > it simply looks identical to parameters except it's always bool.
> > 
> > The high level rationale is that we should never add a per-device cap flag
> > into migration framework.
> > 
> 
> Hmm.
> 
> 1. Yes, we need to distinguish whether it is a _real_ migration or a local one.
> And setting a special property for each device (which supports fd-migration)
> to turn on passing FD to the channel seems uncomfortable and error-prone.
> 
> 2. Initially, I decided on separate "local-tap" and "local-vhost-user-blk"
> capabilities just to simplify further testing/debugging in a real environment:
> the possibility to enable only "half of the magic" helps.
> 
> So, granularity makes sense, but having a local-XXX capability for each device
> class looks bad.
> 
> Maybe having a generic cap-fd-passing, together with the possibility to disable
> it on a per-device basis (like migrate-fd=false), is a good compromise.
> 
> 
> Another question is whether we need the "local-incoming" option for the target
> device.
>
> Initially I added this because I thought: oh, I need to distinguish at
> initialization time: do I need to call open(), or wait for an incoming fd.
> 
> Now I see that I can just postpone this decision up to "start" point, where
> 
> - either I already have fd from incoming migration
> - or have nothing: in this case, let's call open()
> 
> -
> 
> I'll try to go with one "fd-passing" capability, as any kind of granularity
> may be added later on demand.
> 
> 
> Hmm2. Probably we can avoid even adding such a capability, and just check
> whether the migration channel supports fd passing or not? Seems too implicit
> for me.

If we want to expose a feature internally, IIUC we can use QAPI "features"
like this:

https://lore.kernel.org/all/[email protected]/

But I'm not yet sure whether it's useful..

In this case the "capability" itself should almost always be present when
using unix sockets..  The problem is, IIUC we're not trying to describe a
capability, but a choice the user made.

For example, when unix socket is the transport, we can still decide to not
use fd passing even if it's fully supported in the current QEMU binary for
any devices that are involved, for any of these reasons: (1) it could be a
unix socket to a proxy daemon (of a container?) where fd passing isn't
supported in the daemon, or (2) as you mentioned above, for debugging
purposes when we want to triage whether a bug is related to fd-passing.
Maybe more.

The per-device granularity you mentioned also makes sense to me.

One use case: imagine we have a QEMU that (1) supports tap local
migration, but (2) doesn't yet support virtio-blk local migration.  Then we
want to be able to enable the fd-passing for tap/virtio-net, but not for
virtio-blk (even if the src QEMU in the context might support both)?

IOW, it makes sense to me to have two layers of controls here:

  (a) New migration parameter, "migrate-fds" (or any better name..).

      When set, it enables all devices that support fd-passing to migrate
      the fds directly.  OTOH, when not set, even if all devices enabled
      fd-passing, it should still do a full migration.  This one is the
      user knob saying "I want to migrate with fd migrated".

      This should imply unix sockets for sure as the transport, and should
      fail upfront if it's not a unix socket.

      We should also auto-select this with cpr migrations..  so that in any
      code path (whenever such a path exists?) the fds can be migrated over
      either the cpr channel or the main channel.

  (b) New device parameter, "migrate-fds" (or any better name..).

      When set, the device will declare support for migrating fds "whenever
      the migration applies", aka, when above (a) is selected first.

      Taking tap device as example here, setting it ON here means "please
      enable fd-passing whenever the user enables this migration option".
      So in tap code, it should migrate fd if both (a) and (b) are ON.
      When migrating to e.g. old QEMUs, here (b) should be OFF even if (a)
      is ON.
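
To make the two layers in (a) and (b) concrete, here is a rough sketch of the
decision logic in C.  Everything in it is hypothetical and only for
illustration: the type names, the field names, and the helper functions are
assumptions, not existing QEMU code, and "migrate-fds" is just the placeholder
name used above.

```c
#include <stdbool.h>

/* Hypothetical transport kinds, for illustration only. */
typedef enum {
    CHANNEL_UNIX_SOCKET,
    CHANNEL_TCP,
    CHANNEL_FILE,
} ChannelKind;

/* Layer (a): the user's per-migration choice (assumed name). */
typedef struct {
    bool migrate_fds;
    ChannelKind channel;
} MigrationConfig;

/* Layer (b): the per-device opt-in (assumed name). */
typedef struct {
    bool migrate_fds;
} DeviceConfig;

/*
 * Layer (a) sanity check: asking for fd migration over anything other
 * than a unix socket should fail upfront.
 */
static bool migration_config_valid(const MigrationConfig *m)
{
    return !m->migrate_fds || m->channel == CHANNEL_UNIX_SOCKET;
}

/*
 * A device passes its fd only when both (a) and (b) are ON and the
 * migration configuration is valid; otherwise it does a full migration.
 */
static bool device_should_pass_fd(const MigrationConfig *m,
                                  const DeviceConfig *d)
{
    return migration_config_valid(m) && m->migrate_fds && d->migrate_fds;
}
```

With this shape, migrating to an old QEMU is the case where the device-level
flag is OFF while the migration-level one may still be ON, and the device
falls back to a full migration.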

Would the above make sense?

-- 
Peter Xu

