* Steven Sistare (steven.sist...@oracle.com) wrote:
> On 5/18/2021 5:57 AM, Dr. David Alan Gilbert wrote:
> > * Steven Sistare (steven.sist...@oracle.com) wrote:
> >> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
> >>> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
> >>>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
> >>>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
> >>>>>> Provide the cprsave and cprload commands for live update. These save
> >>>>>> and restore VM state, with minimal guest pause time, so that qemu may
> >>>>>> be updated to a new version in between.
> >>>>>>
> >>>>>> cprsave stops the VM and saves vmstate to an ordinary file. It
> >>>>>> supports two modes: restart and reboot. For restart, cprsave exec's
> >>>>>> the qemu binary (or /usr/bin/qemu-exec if it exists) with the same
> >>>>>> argv. qemu restarts in a paused state and waits for the cprload
> >>>>>> command.
> >>>>>
> >>>>> I think cprsave/cprload could be generalized by using QMP to stash the
> >>>>> file descriptors. The 'getfd' QMP command already exists and QEMU code
> >>>>> already opens fds passed using this mechanism.
> >>>>>
> >>>>> I haven't checked but it may be possible to drop some patches by
> >>>>> reusing QEMU's monitor file descriptor passing since the code already
> >>>>> knows how to open from 'getfd' fds.
> >>>>>
> >>>>> The reason why using QMP is interesting is because it eliminates the
> >>>>> need for execve(2). QEMU may be unable to execute a program due to
> >>>>> chroot, seccomp, etc.
> >>>>>
> >>>>> QMP would enable cprsave/cprload to work both with and without
> >>>>> execve(2).
> >>>>>
> >>>>> One tricky thing with this approach might be startup ordering: how to
> >>>>> get fds via the QMP monitor in the new process before processing the
> >>>>> entire command-line.
> >>>>
> >>>> Early on I experimented with a similar approach. Old qemu passed
> >>>> descriptors to an escrow process and exited; new qemu started and
> >>>> retrieved the descriptors from escrow. vfio mostly worked after I
> >>>> hacked the kernel to suppress the original-pid owner check. I suspect
> >>>> my recent vfio extensions would smooth the rough edges.
> >>>
> >>> I wonder about the reason for VFIO's pid limitation, maybe because it
> >>> pins pages from the original process?
> >>
> >> The dma unmap code verifies that the requesting task is the same as the
> >> task that mapped the pages. We could add an ioctl that passes ownership
> >> to a new task. We would also need to fix locked memory accounting, which
> >> is associated with the mm of the original task.
> >
> >>> Is this VFIO pid limitation the main reason why you chose to make QEMU
> >>> execve(2) the new binary?
> >>
> >> That is one. Plus, re-attaching to named shared memory for pc.ram causes
> >> the vfio conflict errors I mentioned in the previous email. We would need
> >> to suppress redundant dma map calls, but allow legitimate dma maps and
> >> unmaps in response to the ongoing address space changes and diff
> >> callbacks caused by some drivers. It would be messy and fragile. In
> >> general, it felt like I was working against vfio rather than with it.
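Both the 'getfd' approach and the escrow idea rest on the same primitive:
passing a descriptor as SCM_RIGHTS ancillary data over a UNIX-domain
socket, which leaves the receiver holding its own descriptor for the same
open file. A minimal receive-side sketch in plain C, independent of QEMU's
actual helpers and with error handling trimmed:

    /* Receive one fd over a connected AF_UNIX socket, the SCM_RIGHTS
     * mechanism underlying QMP 'getfd'.  'sock' is assumed connected. */
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    static int recv_fd(int sock)
    {
        char token;
        struct iovec iov = { .iov_base = &token, .iov_len = sizeof(token) };
        union {
            struct cmsghdr hdr;
            char buf[CMSG_SPACE(sizeof(int))];
        } ctrl;
        struct msghdr msg = {
            .msg_iov = &iov,
            .msg_iovlen = 1,
            .msg_control = ctrl.buf,
            .msg_controllen = sizeof(ctrl.buf),
        };
        struct cmsghdr *cmsg;
        int fd;

        if (recvmsg(sock, &msg, 0) <= 0) {
            return -1;
        }
        cmsg = CMSG_FIRSTHDR(&msg);
        if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
            cmsg->cmsg_type != SCM_RIGHTS) {
            return -1;
        }
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
        return fd;   /* a fresh descriptor for the same open file */
    }

The kernel installs a duplicate of the sender's descriptor in the
receiver, which is why an escrow process could outlive the original qemu
and hand the fds back after restart.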
> >
> > OK the weirdness of vfio helps explain a bit about why you're doing it
> > this way; can you help separate some difference between restart and
> > reboot for me though:
> >
> > In 'reboot' mode, where the guest must do suspend in its drivers, how
> > much of these vfio requirements are needed? I guess the memfd use
> > for the anonymous areas isn't any use for reboot mode.
>
> Correct. For reboot no special vfio support or fiddling is needed.
>
> > You mention cprsave calls VFIO_DMA_UNMAP_FLAG_VADDR - after that does
> > vfio still care about the currently-anonymous areas?
>
> Yes, for restart mode. The physical pages behind the anonymous memory
> remain pinned and are targets for ongoing DMA. Post-exec qemu needs a way
> to find those same pages.
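To make that concrete: the VFIO_DMA_UNMAP_FLAG_VADDR /
VFIO_DMA_MAP_FLAG_VADDR interfaces (merged in Linux 5.12) let qemu
invalidate its virtual addresses before the exec while the iova mappings
and pinned pages stay live, then register the new addresses afterwards.
A rough sketch, with illustrative function names and 'container' standing
for the preserved VFIO container fd:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Before exec: invalidate the vaddr but keep the iova mapping and
     * the pinned pages, so device DMA carries on untouched. */
    static int cpr_invalidate_vaddr(int container, uint64_t iova,
                                    uint64_t size)
    {
        struct vfio_iommu_type1_dma_unmap unmap = {
            .argsz = sizeof(unmap),
            .flags = VFIO_DMA_UNMAP_FLAG_VADDR,
            .iova  = iova,
            .size  = size,
        };
        return ioctl(container, VFIO_IOMMU_UNMAP_DMA, &unmap);
    }

    /* After exec: supply the new process's vaddr for the same iova
     * range.  Nothing is re-pinned; the interface forbids combining
     * VADDR with the READ/WRITE flags since the original protection
     * is retained. */
    static int cpr_update_vaddr(int container, uint64_t iova,
                                uint64_t size, void *new_vaddr)
    {
        struct vfio_iommu_type1_dma_map map = {
            .argsz = sizeof(map),
            .flags = VFIO_DMA_MAP_FLAG_VADDR,
            .vaddr = (uintptr_t)new_vaddr,
            .iova  = iova,
            .size  = size,
        };
        return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
    }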
Is it possible with vfio to map it into multiple processes simultaneously,
or does it have to be only one at a time? Are you saying that you have no
way to shut off DMA, and thus you can never know it's safe to terminate
the source process?

> >> Another big reason is a requirement to preserve anonymous memory for
> >> legacy qemu updates (via code injection which I briefly mentioned in
> >> KVM forum). If we extend cpr to allow updates without exec, I still
> >> need the exec option.
> >
> > Can you explain what that code injection mechanism is for those of us
> > who didn't see that?
>
> Sure. Here is slide 12 from the talk. It relies on mmap(MADV_DOEXEC)
> which was not accepted upstream.

In this series, without MADV_DOEXEC, how do you guarantee the same HVA in
source and destination - or is that not necessary?

> -----------------------------------------------------------------------------
> Legacy Live Update
>
> * Update legacy qemu process to latest version
>   - Inject code into legacy qemu process to perform cprsave: vmsave.so
>     . Access qemu data structures and globals
>       - eg ram_list, savevm_state, chardevs, vhost_devices
>       - dlopen does not resolve them, must get addresses via symbol lookup.
>     . Delete some vmstate handlers, register new ones (eg vfio)
>     . Call MADV_DOEXEC on guest memory. Find devices, preserve fd
> * Hot patch a monitor function to dlopen vmsave.so, call entry point
>   - write patch to /proc/pid/mem
>   - Call the monitor function via monitor socket
> * Send cprload to update qemu
> * vmsave.so has binary dependency on qemu data structures and variables
>   - Build vmsave-ver.so per legacy version
>   - Indexed by qemu's gcc build-id
> -----------------------------------------------------------------------------

That's hairy! At that point isn't it easier to recompile a patched qemu
against the original sources and ptrace something in to mmap the new qemu?

Dave

> - Steve
>
> >>>> However, the main issue is that guest ram must be backed by named
> >>>> shared memory, and we would need to add code to support shared memory
> >>>> for all the secondary memory objects. That makes it less interesting
> >>>> for us at this time; we care about updating legacy qemu instances
> >>>> with anonymous guest memory.
> >>>
> >>> Thanks for explaining this more in the other sub-thread. The secondary
> >>> memory objects you mentioned are relatively small so I don't think
> >>> saving them in the traditional way is a problem.
> >>>
> >>> Two approaches for zero-copy memory migration fit into QEMU's existing
> >>> migration infrastructure:
> >>>
> >>> - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
> >>>   etc) so they are not saved into the savevm file. The existing
> >>>   --object memory-backend-file syntax can be used.
> >>>
> >>> - Extending the live migration protocol to detect when file descriptor
> >>>   passing is available (i.e. UNIX domain socket migration) and using
> >>>   that for memory-backend-* objects that have fds.
> >>>
> >>> Either of these approaches would handle RAM with existing
> >>> savevm/migrate commands.
> >>
> >> Yes, but the vfio issues would still need to be solved, and we would
> >> need new command line options to back existing and future secondary
> >> memory objects with named shared memory.
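This is also where the HVA question above seems to land for restart mode:
without MADV_DOEXEC the anonymous areas are backed by memfd, so the fd,
kept open across the exec, is what lets new qemu find the same pages. A
sketch of the idea, with hypothetical helper names rather than the series'
actual functions:

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static void *alloc_preservable_ram(const char *name, size_t size,
                                       int *fd_out)
    {
        /* flags = 0: no MFD_CLOEXEC, so the fd survives the exec */
        int fd = memfd_create(name, 0);
        if (fd < 0 || ftruncate(fd, size) < 0) {
            return MAP_FAILED;
        }
        *fd_out = fd;
        return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                    fd, 0);
    }

    /* After exec, new qemu learns the fd number (e.g. from saved state)
     * and maps the same physical pages, possibly at a different HVA. */
    static void *reattach_ram(int fd, size_t size)
    {
        return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                    fd, 0);
    }

The fresh mapping may land at a different HVA; presumably the vaddr-update
ioctls sketched earlier absorb that, rather than identical addresses being
required across the exec.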
> >>> The remaining issue is how to migrate VFIO and other file descriptors
> >>> that cannot be reopened by the new process. As mentioned, QEMU already
> >>> has file descriptor passing support in the QMP monitor and support for
> >>> opening passed file descriptors (see qemu_open_internal(),
> >>> monitor_fd_param(), and socket_get_fd()).
> >>>
> >>> The advantage of integrating live update functionality into the
> >>> existing savevm/migrate commands is that it will work in more use cases
> >>> with less code duplication/maintenance/bitrot prevention than the
> >>> special-case cprsave command in this patch series.
> >>>
> >>> Maybe there is a fundamental technical reason why live update needs to
> >>> be different from QEMU's existing migration commands but I haven't
> >>> figured it out yet.
> >>
> >> vfio and anonymous memory.
> >>
> >> Regarding code duplication, I did consider whether to extend the
> >> migration syntax and implementation versus creating something new. Those
> >> functions handle stuff like bdrv snapshot, aio, and migration which are
> >> n/a for the cpr use case, and the cpr functions handle state that is n/a
> >> for the migration case. I judged that handling both in the same
> >> functions would be less readable and maintainable. After feedback during
> >> the V1 review, I simplified the cprsave code by calling
> >> qemu_save_device_state, as Xen does, thus eliminating any interaction
> >> with the migration code.
> >>
> >> Regarding bit rot, I still need to add a cpr test to the test suite,
> >> when the review is more complete and folks agree on the final form of
> >> the functionality.
> >>
> >> I do like the idea of supporting update without exec, but as a future
> >> project, and not at the expense of dropping update with exec.
> >>
> >> - Steve
> >>
>
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
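For reference, the simplified cprsave path described above (stop the VM,
then write device state only, as the Xen support does) has roughly this
shape; the wrapper, error handling, and header paths here are illustrative
rather than the series' actual code:

    #include "qemu/osdep.h"
    #include "sysemu/runstate.h"
    #include "migration/savevm.h"

    static int cpr_do_save(QEMUFile *f)
    {
        int ret;

        vm_stop(RUN_STATE_SAVE_VM);        /* start of the brief pause */
        ret = qemu_save_device_state(f);   /* devices only; guest RAM is
                                            * preserved in place instead */
        if (ret < 0) {
            return ret;
        }
        /* restart mode would now exec the new binary, which waits,
         * paused, for cprload */
        return 0;
    }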