On Mon, Feb 27, 2023 at 09:14:44AM -0700, Alex Williamson wrote:

> But we have no requirement to send all init_bytes before stop-copy.
> This is a hack to achieve a theoretical benefit that a driver might be
> able to improve the latency on the target by completing another
> iteration.

I think this is another half-step at this point..

The goal is to not stop the VM until the target VFIO driver has
completed loading initial_bytes. This signals that the time-consuming
pre-setup is completed in the device and we don't have to use downtime
to do that work.

We've measured this in our devices and the time-shift can be
significant, on the order of seconds removed from the downtime period.
Stopping the VM before this pre-setup is done simply extends the
stopped-VM downtime.

Really what we want is to have the far side acknowledge that
initial_bytes has completed loading.

As a reminder, what mlx5 is doing here with precopy is time-shifting
work, not data. We want to put expensive work (i.e. time) into the
period when the VM is still running and have less downtime.

This challenges the assumption built into qemu that all data takes
equal time and that it can estimate downtime simply by scaling the
estimated data. We have a data-size independent time component to deal
with as well.

Jason
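
P.S. A toy illustration of the last point, in case it helps. This is
not qemu code: the function names, the 2 second setup cost and the
sizes are all made up for the example. It only shows how an estimate
that scales remaining data by bandwidth misses a fixed, data-size
independent cost when the VM is stopped before initial_bytes has
finished loading on the target.

```python
# Hypothetical model, not taken from qemu or any driver.

def downtime_data_only(pending_bytes, bandwidth_bytes_per_s):
    """Estimate that assumes all data costs equal time."""
    return pending_bytes / bandwidth_bytes_per_s


def downtime_with_setup(pending_bytes, bandwidth_bytes_per_s,
                        setup_seconds, initial_bytes_loaded):
    """Same estimate plus a fixed device pre-setup cost.

    The fixed cost lands in the downtime window only if the VM is
    stopped before the target has finished loading initial_bytes.
    """
    fixed = 0.0 if initial_bytes_loaded else setup_seconds
    return fixed + pending_bytes / bandwidth_bytes_per_s


MiB = 1 << 20
GiB = 1 << 30

pending = 64 * MiB   # assumed data left for stop-copy
bandwidth = 1 * GiB  # assumed link rate in bytes per second
setup = 2.0          # assumed pre-setup time in seconds

naive = downtime_data_only(pending, bandwidth)
stopped_early = downtime_with_setup(pending, bandwidth, setup, False)
stopped_after_ack = downtime_with_setup(pending, bandwidth, setup, True)

print(naive, stopped_early, stopped_after_ack)
```

With these made-up numbers the data-only estimate and the "stopped
after the target acknowledged initial_bytes" case agree, while stopping
early adds the whole setup time to the downtime.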
