On 5/27/2024 1:45 PM, Peter Xu wrote:
On Tue, May 21, 2024 at 07:46:12AM -0400, Steven Sistare wrote:
I understand, thanks. If I can help with any of your todo list,
just ask - steve
Thanks for offering the help, Steve. Started looking at this today, then I
found that I'm missing something high-level. Let me ask here, and let me
apologize in advance for throwing multiple questions at you.
IIUC the whole idea of this patchset is to allow efficient QEMU upgrade, in
this case not host kernel but QEMU-only, and/or upper.
Is there any justification on why the complexity is needed here? It looks
to me this one is more involved than cpr-reboot, so I'm thinking how much
we can get from the complexity, and whether it's worthwhile. 1000+ LOC is
the min support, and if we even expect more to come, that's really
important, IMHO.
For example, what's the major motivation of this whole work? Is that more
on performance, or is it more for supporting the special devices like VFIO
which we used to not support, or something else? I can't find them in
whatever cover letter I can find, including this one.
Firstly, regarding performance, IMHO it would always be nice to share even
some very basic downtime measurements comparing the new exec mode vs. the
existing migration-based ways of upgrading the QEMU binary. Do you perhaps
have some numbers on hand from when you started working on this feature
years ago? Or maybe some old links on the list would help too, as I haven't
followed this work since the start.
On VFIO, IIUC you started out this project without VFIO migration being
there. Now we have VFIO migration so not sure how much it would work for
the upgrade use case. Even with current VFIO migration, we may not want to
migrate device states for a local upgrade I suppose, as that can be a lot
depending on the type of device assigned. However it'll be nice to discuss
this too if this is the major purpose of the series.
I think one other challenge on QEMU upgrade with VFIO devices is that the
dest QEMU won't be able to open the VFIO device when the src QEMU is still
using it as the owner. IIUC this is a similar condition where QEMU wants
to have proper ownership transfer of a shared block device, and AFAIR right
now we resolved that issue using some form of file lock on the image file.
In this case it won't easily apply to a VFIO dev fd, but maybe we still
have other approaches, not sure whether you investigated any. E.g. could
the VFIO handle be passed over using unix scm rights? I think this might
remove one dependency on exec, which can cause quite some difference
vs. a generic migration (in that regard, cpr-reboot is still a pretty
generic migration).
You also mentioned vhost/tap, is that also a major goal of this series in
the follow up patchsets? Is this a problem only because this solution will
do exec? Can it work if either the exec()ed qemu or the dst qemu creates the
vhost/tap fds at boot?
Meanwhile, could you elaborate a bit on the implication on chardevs? From
what I read in the doc update it looks like a major part of work in the
future, but I don't yet understand the issue. Is it also relevant to the
exec() approach?
In all cases, some of such discussion would be really appreciated. And if
you used to consider other approaches to solve this problem it'll be great
to mention how you chose this way. Considering this work contains too many
things, it'll be nice if such discussion can start with the fundamentals,
e.g. on why exec() is a must.
The main goal of cpr-exec is providing a fast and reliable way to update
qemu. cpr-reboot is not fast enough or general enough. It requires the
guest to support suspend and resume for all devices, and that takes seconds.
If one actually reboots the host, that adds more seconds, depending on
system services. cpr-exec takes 0.1 secs, and works every time, unlike
live migration, which can fail to converge on a busy system. Live migration
also consumes more system and network resources. cpr-exec seamlessly
preserves client connections by preserving chardevs, and overall provides
a much nicer user experience.
Chardevs are preserved by keeping their fd open across the exec, and
remembering the value of the fd in precreate vmstate so that new qemu
can associate the fd with the chardev rather than opening a new one.
The approach of preserving open file descriptors is very general and applicable
to all kinds of devices, regardless of whether they support live migration
in hardware. Device fd's are preserved using the same mechanism as for
chardevs.
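
To make the fd-preservation idea concrete, here is a minimal sketch in
Python rather than QEMU's C. It is not the code from the series. The pipe
stands in for a chardev connection, the JSON blob stands in for precreate
vmstate, and the name "chardev0" is hypothetical. A child process stands in
for the exec'd new qemu so that the example can run end to end.

```python
import json
import os
import subprocess
import sys

# Stand-in for "new qemu": it looks up the fd number recorded in the saved
# state and reuses that fd instead of opening a new connection.
NEW_PROCESS = r"""
import json, os, sys
state = json.loads(sys.argv[1])
fd = state["chardev0"]          # hypothetical chardev name
os.write(fd, b"hello from new process")
"""


def preserve_and_exec():
    # The pipe stands in for a chardev's open connection.
    read_end, write_end = os.pipe()

    # Clear close-on-exec so the fd survives into the new program.
    os.set_inheritable(write_end, True)

    # Remember the fd value (assumption: JSON instead of real vmstate).
    state = json.dumps({"chardev0": write_end})

    # pass_fds keeps the fd open under the same number in the child.
    subprocess.run(
        [sys.executable, "-c", NEW_PROCESS, state],
        pass_fds=(write_end,),
        check=True,
    )

    os.close(write_end)
    data = os.read(read_end, 64)
    os.close(read_end)
    return data


if __name__ == "__main__":
    print(preserve_and_exec())
```

The point of the sketch is that the peer on the other end of the connection
never sees a close or reconnect. Only the process holding the fd changes.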
Devices that support live migration in hardware do not like to live migrate
in place to the same node. It is not what they are designed for, and some
implementations will flat out fail because the source and target interfaces
are the same.
For vhost/tap, sometimes the management layer opens the dev and passes an
fd to qemu, and sometimes qemu opens the dev. The upcoming vhost/tap support
allows both. For the case where qemu opens the dev, the fd is preserved
using the same mechanism as for chardevs.
The fundamental requirements of this work are:
- precreate vmstate
- preserve open file descriptors
Direct exec from old to new qemu is not a hard requirement. However,
it is simple, with few complications, and works with Oracle's cloud
containers, so it is the method I am most interested in finishing first.
I believe everything could also be made to work by using SCM_RIGHTS to
send fd's to a new qemu process that is started by some external means.
It would be requested with MIG_MODE_CPR_SCM (or some better name), and
would co-exist with MIG_MODE_CPR_EXEC.
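
As a rough illustration of the SCM_RIGHTS alternative, the sketch below
passes an open fd between two ends of a unix socket using Python's
socket.send_fds() and socket.recv_fds() (Python 3.9+). Both ends live in one
process here only to keep the example short. The "devfd" label and the pipe
standing in for a device fd are assumptions, and nothing here reflects how
a MIG_MODE_CPR_SCM mode would actually be implemented.

```python
import os
import socket


def pass_fd_over_unix_socket():
    # One end plays old qemu, the other plays new qemu.
    old_side, new_side = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

    # The pipe's read end stands in for a device or chardev fd.
    read_end, write_end = os.pipe()

    # Old side: send the fd as SCM_RIGHTS ancillary data, tagged with a name
    # so the receiver knows which object it belongs to.
    socket.send_fds(old_side, [b"devfd"], [read_end])
    os.close(read_end)  # the in-flight copy keeps the open file alive

    # New side: receive the name and a new fd for the same open file.
    # The fd number may differ from the sender's, so it is matched by name.
    name, fds, _flags, _addr = socket.recv_fds(new_side, 16, 1)
    received = fds[0]

    # Show that the received fd still refers to the original pipe.
    os.write(write_end, b"still open")
    data = os.read(received, 64)

    for fd in (received, write_end):
        os.close(fd)
    old_side.close()
    new_side.close()
    return name, data


if __name__ == "__main__":
    print(pass_fd_over_unix_socket())
```

Unlike the exec case, the receiver cannot rely on the fd number being
unchanged, so whatever state records the fd has to be updated on arrival.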
- Steve