On Tue, Sep 09, 2025 at 10:36:16AM -0400, Steven Sistare wrote:
> On 9/5/2025 12:48 PM, Peter Xu wrote:
> > Add Vladimir and Dan.
> >
> > On Thu, Aug 14, 2025 at 10:17:14AM -0700, Steve Sistare wrote:
> > > This patch series adds the live migration cpr-exec mode.
> > >
> > > The new user-visible interfaces are:
> > >   * cpr-exec (MigMode migration parameter)
> > >   * cpr-exec-command (migration parameter)
> > >
> > > cpr-exec mode is similar in most respects to cpr-transfer mode, with the
> > > primary difference being that old QEMU directly exec's new QEMU.  The user
> > > specifies the command to exec new QEMU in the migration parameter
> > > cpr-exec-command.
> > >
> > > Why?
> > >
> > > In a containerized QEMU environment, cpr-exec reuses an existing QEMU
> > > container and its assigned resources.  By contrast, cpr-transfer mode
> > > requires a new container to be created on the same host as the target of
> > > the CPR operation.  Resources must be reserved for the new container,
> > > while the old container still reserves resources until the operation
> > > completes.  Avoiding overcommitment requires extra work in the management
> > > layer.
> >
> > Can we spell out what these resources are?
> >
> > CPR definitely relies on completely shared memory.  That's already not a
> > concern.
> >
> > CPR resolves resources that are bound to devices like VFIO by passing over
> > FDs, so these are not overcommitted either.
> >
> > Is it accounting for QEMU/KVM process overhead?  That would really be
> > trivial, IMHO, but maybe something else?
>
> Accounting is one issue, and it is not trivial.  Another is arranging
> exclusive use of a set of CPUs, the same set for the old and new container,
> concurrently.  Another is avoiding namespace conflicts, the kind that make
> localhost migration difficult.
>
> > > This is one reason why a cloud provider may prefer cpr-exec.
> > > A second reason is that the container may include agents with their own
> > > connections to the outside world, and such connections remain intact if
> > > the container is reused.
> >
> > We discussed this one.  Personally I still cannot understand why this is a
> > concern if the agents can be trivially started as a new instance.  But I
> > admit I may not know the whole picture.  To me, the above point is more
> > persuasive, but I'll need to understand which part is overcommitted enough
> > to be a problem.
>
> Agents can be restarted, but that would sever the connection to the outside
> world.  With cpr-transfer or any local migration, you would need agents
> outside of the old and new containers that persist.
>
> With cpr-exec, connections can be preserved without requiring the end user
> to reconnect, and can be done trivially, by preserving chardevs.  With that
> support in qemu, the management layer does nothing extra to preserve them.
> chardev support is not part of this series but is part of my vision, and
> makes exec mode even more compelling.
>
> Management layers have a lot of code and complexity to manage live
> migration, resources, and connections.  It requires modification to support
> cpr-transfer.  All of that can be bypassed with exec mode.  Less complexity,
> less maintenance, and fewer points of failure.  I know this because I
> implemented exec mode in OCI at Oracle, and we use it in production.
I wonder how this part works in Vladimir's use case.

> > After all, cloud hosts should reserve some extra memory anyway to keep
> > dynamic resource allocations working at all times (e.g., when live
> > migration starts, KVM pgtables can drastically increase if huge pages are
> > enabled, for PAGE_SIZE tracking).  I assumed the over-commit portion
> > should be less than that, and when it's also temporary (src QEMU will
> > release all resources after live upgrade) it looks manageable.
> >
> > > How?
> > >
> > > cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag,
> > > and by sending the unique name and value of each descriptor to new QEMU
> > > via CPR state.
> > >
> > > CPR state cannot be sent over the normal migration channel, because
> > > devices and backends are created prior to reading the channel, so this
> > > mode sends CPR state over a second migration channel that is not visible
> > > to the user.  New QEMU reads the second channel prior to creating
> > > devices or backends.
> > >
> > > The exec itself is trivial.  After writing to the migration channels,
> > > the migration code calls a new main-loop hook to perform the exec.
> > >
> > > Example:
> > >
> > > In this example, we simply restart the same version of QEMU, but in
> > > a real scenario one would use a new QEMU binary path in cpr-exec-command.
> > >
> > >   # qemu-kvm -monitor stdio
> > >       -object memory-backend-memfd,id=ram0,size=1G
> > >       -machine memory-backend=ram0 -machine aux-ram-share=on ...
> > >
> > >   QEMU 10.1.50 monitor - type 'help' for more information
> > >   (qemu) info status
> > >   VM status: running
> > >   (qemu) migrate_set_parameter mode cpr-exec
> > >   (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming file:vm.state
> > >   (qemu) migrate -d file:vm.state
> > >   (qemu) QEMU 10.1.50 monitor - type 'help' for more information
> > >   (qemu) info status
> > >   VM status: running
> > >
> > > Steve Sistare (9):
> > >   migration: multi-mode notifier
> > >   migration: add cpr_walk_fd
> > >   oslib: qemu_clear_cloexec
> > >   vl: helper to request exec
> > >   migration: cpr-exec-command parameter
> > >   migration: cpr-exec save and load
> > >   migration: cpr-exec mode
> > >   migration: cpr-exec docs
> > >   vfio: cpr-exec mode
> >
> > The other thing is, as Vladimir is working on (what looks like) a cleaner
> > way of passing FDs fully relying on unix sockets, I want to better
> > understand the relationship between his work and the exec model.
>
> His work is based on my work -- the ability to embed a file descriptor in a
> migration stream with a VMSTATE_FD declaration -- so it is compatible.
>
> The cpr-exec series preserves VMSTATE_FD across exec by remembering the fd
> integer and embedding that in the data stream.  See the changes in
> vmstate-types.c in [PATCH V3 7/9] migration: cpr-exec mode.
>
> Thus cpr-exec will still preserve tap devices via Vladimir's code.
>
> > I still personally think we should always stick with unix sockets, but
> > I'm open to being convinced on the above limitations.  If exec is better
> > than cpr-transfer in any way, the hope is more people can and should
> > adopt it.
>
> Various people and companies have expressed interest in CPR and want to
> explore cpr-exec.  Vladimir was one; he chose transfer instead, and that is
> fine, but give people the option.  And Oracle continues to use cpr-exec
> mode.

How does cpr-exec guarantee everything will go smoothly, with no failure,
after the exec?  Essentially, this is Vladimir's question 1.  Feel free to
answer there, because there's also question 2 (which we used to cover
somewhat, but maybe not as much).

The other thing I don't remember if we discussed is how cpr-exec manages
device hotplug.
Say, what happens if devices are hot-plugged (via QMP) and then a cpr-exec
migration happens?  Does the cpr-exec command line need to convert all QMP
hot-plugged devices into command line options and append them?  How do we
guarantee the src/dst device topologies match exactly with the new command
line?

> There is no downside to supporting cpr-exec mode.  It is astonishing how
> much code is shared by the cpr-transfer and cpr-exec modes.  Most of the
> code in this series is factored into specific cpr-exec files and functions,
> code that will never run for any other reason.  There are very few
> conditionals in common code that do something different for exec mode.
>
> > We also have no answer yet on how cpr-exec can work in a container world
> > with seccomp forbidding exec.  I guess that's a no-go.  It's definitely a
> > downside instead.  Better mention that in the cover letter.
>
> The key is limiting the contents of the container, so exec only has a
> limited and known safe set of things to target.  I'll add that to the
> cover letter.

Thanks.

-- 
Peter Xu
