On Tue, Sep 09, 2025 at 10:36:16AM -0400, Steven Sistare wrote:
> On 9/5/2025 12:48 PM, Peter Xu wrote:
> > Add Vladimir and Dan.
> > 
> > On Thu, Aug 14, 2025 at 10:17:14AM -0700, Steve Sistare wrote:
> > > This patch series adds the live migration cpr-exec mode.
> > > 
> > > The new user-visible interfaces are:
> > >    * cpr-exec (MigMode migration parameter)
> > >    * cpr-exec-command (migration parameter)
> > > 
> > > cpr-exec mode is similar in most respects to cpr-transfer mode, with the
> > > primary difference being that old QEMU directly exec's new QEMU.  The user
> > > specifies the command to exec new QEMU in the migration parameter
> > > cpr-exec-command.
> > > 
> > > Why?
> > > 
> > > In a containerized QEMU environment, cpr-exec reuses an existing QEMU
> > > container and its assigned resources.  By contrast, cpr-transfer mode
> > > requires a new container to be created on the same host as the target of
> > > the CPR operation.  Resources must be reserved for the new container,
> > > while the old container still reserves resources until the operation
> > > completes.  Avoiding overcommitment requires extra work in the management
> > > layer.
> > 
> > Can we spell out what are these resources?
> > 
> > CPR definitely relies on completely shared memory.  That's already not a
> > concern.
> > 
> > CPR resolves resources that are bound to devices like VFIO by passing over
> > FDs; these are not overcommitted either.
> > 
> > Is it accounting QEMU/KVM process overhead?  That would really be trivial,
> > IMHO, but maybe something else?
> 
> Accounting is one issue, and it is not trivial.  Another is arranging
> exclusive use of a set of CPUs, the same set for the old and new container,
> concurrently.  Another is avoiding namespace conflicts, the kind that make
> localhost migration difficult.
> 
> > > This is one reason why a cloud provider may prefer cpr-exec.  A second
> > > reason is that the container may include agents with their own
> > > connections to the outside world, and such connections remain intact if
> > > the container is reused.
> > 
> > We discussed this one.  Personally I still cannot understand why this
> > is a concern if the agents can be trivially started as a new instance.  But
> > I admit I may not know the whole picture.  To me, the above point is more
> > persuasive, but I'll need to understand which part is overcommitted in a
> > way that can be a problem.
> 
> Agents can be restarted, but that would sever the connection to the outside
> world.  With cpr-transfer or any local migration, you would need agents
> outside of old and new containers that persist.
> 
> With cpr-exec, connections can be preserved without requiring the end user
> to reconnect, and can be done trivially, by preserving chardevs.  With that
> support in qemu, the management layer does nothing extra to preserve them.
> chardev support is not part of this series but is part of my vision,
> and makes exec mode even more compelling.
> 
> Management layers have a lot of code and complexity to manage live migration,
> resources, and connections.  It requires modification to support cpr-transfer.
> All that can be bypassed with exec mode.  Less complexity, less maintenance,
> and fewer points of failure.  I know this because I implemented exec mode in
> OCI at Oracle, and we use it in production.

I wonder how this part works in Vladimir's use case.

> > After all, cloud hosts should reserve some extra memory anyway to support
> > dynamic resource allocations at any time (e.g., when live migration
> > starts, KVM pgtables can drastically increase if huge pages are enabled,
> > for PAGE_SIZE trackings).  I assumed the overcommitted portion should be
> > less than that, and when it's also temporary (src QEMU will release all
> > resources after live upgrade) it looks manageable.
> > > How?
> > > 
> > > cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag,
> > > and by sending the unique name and value of each descriptor to new QEMU
> > > via CPR state.
> > > 
> > > CPR state cannot be sent over the normal migration channel, because 
> > > devices
> > > and backends are created prior to reading the channel, so this mode sends
> > > CPR state over a second migration channel that is not visible to the user.
> > > New QEMU reads the second channel prior to creating devices or backends.
> > > 
> > > The exec itself is trivial.  After writing to the migration channels, the
> > > migration code calls a new main-loop hook to perform the exec.
> > > 
> > > Example:
> > > 
> > > In this example, we simply restart the same version of QEMU, but in
> > > a real scenario one would use a new QEMU binary path in cpr-exec-command.
> > > 
> > >    # qemu-kvm -monitor stdio
> > >    -object memory-backend-memfd,id=ram0,size=1G
> > >    -machine memory-backend=ram0 -machine aux-ram-share=on ...
> > > 
> > >    QEMU 10.1.50 monitor - type 'help' for more information
> > >    (qemu) info status
> > >    VM status: running
> > >    (qemu) migrate_set_parameter mode cpr-exec
> > >    (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming file:vm.state
> > >    (qemu) migrate -d file:vm.state
> > >    (qemu) QEMU 10.1.50 monitor - type 'help' for more information
> > >    (qemu) info status
> > >    VM status: running
> > > 
> > > Steve Sistare (9):
> > >    migration: multi-mode notifier
> > >    migration: add cpr_walk_fd
> > >    oslib: qemu_clear_cloexec
> > >    vl: helper to request exec
> > >    migration: cpr-exec-command parameter
> > >    migration: cpr-exec save and load
> > >    migration: cpr-exec mode
> > >    migration: cpr-exec docs
> > >    vfio: cpr-exec mode
> > 
> > The other thing is, as Vladimir is working on (looks like) a cleaner way of
> > passing FDs fully relying on unix sockets, I want to understand better on
> > the relationships of his work and the exec model.
> 
> His work is based on my work -- the ability to embed a file descriptor in a
> migration stream with a VMSTATE_FD declaration -- so it is compatible.
> 
> The cpr-exec series preserves VMSTATE_FD across exec by remembering the fd
> integer and embedding that in the data stream.  See the changes in
> vmstate-types.c in [PATCH V3 7/9] migration: cpr-exec mode.
> 
> Thus cpr-exec will still preserve tap devices via Vladimir's code.
> 
> > I still personally think we should always stick with unix sockets, but I'm
> > open to be convinced on above limitations.  If exec is better than
> > cpr-transfer in any way, the hope is more people can and should adopt it.
> 
> Various people and companies have expressed interest in CPR and want to
> explore cpr-exec.  Vladimir was one, he chose transfer instead, and that is
> fine, but give people the option.  And Oracle continues to use cpr-exec mode.

How does cpr-exec guarantee everything will go smoothly with no failure
after the exec?  Essentially, this is Vladimir's question 1.  Feel free to
answer there, because there's also question 2 (which we covered somewhat
before, but maybe not as much).

The other thing I don't remember if we discussed is how cpr-exec manages
device hotplug.  Say, what happens if devices were hot-plugged (via QMP)
and then a cpr-exec migration happens?

Does the cpr-exec cmdline need to convert all QMP hot-plugged devices into
cmdline options and append them?  How do we guarantee that the src/dst device
topologies match exactly with the new cmdline?

> 
> There is no downside to supporting cpr-exec mode.  It is astonishing how much
> code is shared by the cpr-transfer and cpr-exec modes.  Most of the code in
> this series is factored into specific cpr-exec files and functions, code that
> will never run for any other reason.  There are very few conditionals in
> common code that do something different for exec mode.
> 
> > We also have no answer yet on how cpr-exec can work in container
> > environments where seccomp forbids exec.  I guess that's a no-go.  It's
> > definitely a downside instead.  Better mention that in the cover letter.
> 
> The key is limiting the contents of the container, so exec only has a limited
> and known safe set of things to target.  I'll add that to the cover letter.

Thanks.

-- 
Peter Xu

