On Fri, Oct 06, 2023 at 12:37:15PM +0100, Joao Martins wrote:
> I added the statistics mainly for observability (e.g. you would grep the
> libvirt logs for a non-developer and they can understand how the downtime
> is explained).  I wasn't specifically thinking about management apps using
> this, just broad access to the metrics.
>
> One can get the same level of observability with a BPF/dtrace/systemtap
> script, albeit in a less obvious way.

Makes sense.
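Just to make the observability point concrete, below is a rough sketch of the
kind of per-phase switchover timestamps being discussed.  This is illustration
only: the checkpoint list and names like migration_downtime_checkpoint() are
made up, not taken from this series or from QEMU, and plain
clock_gettime()/stderr plumbing stands in for whatever trace event or stats
field would carry the numbers in real code.

/*
 * Rough sketch only, not code from this series or from QEMU: the names
 * (migration_downtime_checkpoint(), the checkpoint list) are made up to
 * illustrate the kind of per-phase downtime breakdown being discussed.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

enum {
    CHECKPOINT_STOP_VM,      /* guest quiesced on the source */
    CHECKPOINT_DEVICE_SAVE,  /* device state (pre_save + vmstate) written */
    CHECKPOINT_DEVICE_LOAD,  /* device state (vmstate + post_load) applied */
    CHECKPOINT_RESUME_VM,    /* guest running again on the destination */
    CHECKPOINT_MAX,
};

static const char *const checkpoint_names[CHECKPOINT_MAX] = {
    "stop-vm", "device-save", "device-load", "resume-vm",
};

static int64_t checkpoint_ns[CHECKPOINT_MAX];

static int64_t clock_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Record a timestamp at each switchover phase; the deltas between
 * consecutive checkpoints are what a user (or a grep through the libvirt
 * logs) would read to see where the downtime actually went.
 */
static void migration_downtime_checkpoint(int which)
{
    checkpoint_ns[which] = clock_ns();
    if (which > 0) {
        fprintf(stderr, "downtime: %s -> %s: %" PRId64 " us\n",
                checkpoint_names[which - 1], checkpoint_names[which],
                (checkpoint_ns[which] - checkpoint_ns[which - 1]) / 1000);
    }
}

int main(void)
{
    migration_downtime_checkpoint(CHECKPOINT_STOP_VM);
    /* ... vm stop, device save and device load would happen in between ... */
    migration_downtime_checkpoint(CHECKPOINT_DEVICE_SAVE);
    migration_downtime_checkpoint(CHECKPOINT_DEVICE_LOAD);
    migration_downtime_checkpoint(CHECKPOINT_RESUME_VM);
    return 0;
}

Whether something like that ends up as trace events, as migration
statistics, or only as points that a systemtap/bpftrace script hooks into
is secondary; the useful part is having a timestamp per phase that can be
diffed.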
> With respect to motivation: I am doing migration with VFs and sometimes
> vhost-net, and the downtime/switchover is the only thing that is either
> non-deterministic or not captured in the migration math.  There are some
> things that aren't accounted for (e.g. vhost with enough queues will give
> you high downtimes),

Will this be something relevant to the loading of the queues?  There used
to be work on greatly reducing downtime, especially for virtio scenarios
over multiple queues (and IIRC even one queue also benefits from that); it
wasn't merged, probably because it didn't get enough review:

https://lore.kernel.org/r/[email protected]

Though personally I think that's a direction that's good to keep exploring
at least; maybe a slight enhancement to that series will work for us.

> and algorithmically not really possible to account for, as one needs to
> account for every possible instruction when we quiesce the guest (or at
> least that's my understanding).
>
> Just having these metrics helps the developer *and* the user see why such
> downtime is high, and maybe opens up a window for fixes/bug-reports or
> shows where to improve.
>
> Furthermore, hopefully these tracepoints or stats could be a starting
> point for developers to understand how much downtime is spent in a
> particular device in QEMU (as a follow-up to this series),

Yes, I was actually expecting that when reading the cover letter. :)  This
also makes sense.

One thing worth mentioning is that the real downtime measured can, IMHO,
differ between src/dst, because "pre_save" and "post_load" may not really
be doing similar things.  IIUC it can happen that some device sends fast
but loads slowly; I'm not sure whether there's a reversed use case.  Maybe
we want to capture that on both sides in some metrics?

> or allow implementing bounds-check limits on switchover in a way that
> doesn't violate downtime-limit SLAs (I have a small set of patches for
> this).

I assume that decision will always be synchronized between src/dst in some
way, or guaranteed to be the same.  But I can wait to read the series
first.

Thanks,

--
Peter Xu
