> On Jan 2, 2020, at 3:07 PM, Stefan Hajnoczi <[email protected]> wrote:
>
> On Thu, Dec 26, 2019 at 05:40:22PM +0800, 张海斌 wrote:
>> Stefan Hajnoczi <[email protected]> wrote on Fri, Mar 29, 2019 at 1:08 AM:
>>>
>>> On Thu, Mar 28, 2019 at 05:53:34PM +0800, 张海斌 wrote:
>>>> hi, stefan
>>>>
>>>> I have faced the same problem you wrote about in
>>>> https://lists.gnu.org/archive/html/qemu-devel/2016-08/msg04025.html
>>>>
>>>> Reproduce as follows:
>>>> 1. Clone the qemu code from https://git.qemu.org/git/qemu.git, add some
>>>> debug information and compile
>>>> 2. Start a new VM
>>>> 3. In the VM, use fio randwrite to put pressure on the disk
>>>> 4. Live migrate
>>>>
>>>> The log shows:
>>>> [2019-03-28 15:10:40.206] /data/qemu/cpus.c:1086: enter do_vm_stop
>>>> [2019-03-28 15:10:40.212] /data/qemu/cpus.c:1097: call bdrv_drain_all
>>>> [2019-03-28 15:10:40.989] /data/qemu/cpus.c:1099: call replay_disable_events
>>>> [2019-03-28 15:10:40.989] /data/qemu/cpus.c:1101: call bdrv_flush_all
>>>> [2019-03-28 15:10:41.004] /data/qemu/cpus.c:1104: done do_vm_stop
>>>>
>>>> Calling bdrv_drain_all() costs 792 milliseconds.
>>>> I added a bdrv_drain_all() at the start of do_vm_stop() before
>>>> pause_all_vcpus(), but it doesn't help.
>>>> Is there any way to improve the live-migration downtime caused by
>>>> bdrv_drain_all()?
>
> I believe there were ideas about throttling storage controller devices
> during the later phases of live migration to reduce the number of
> pending I/Os.
>
> In other words, if QEMU's virtio-blk/scsi emulation code reduces the
> queue depth as live migration nears the handover point, bdrv_drain_all()
> should become cheaper because fewer I/O requests will be in-flight.
>
> A simple solution would reduce the queue depth during live migration
> (e.g. queue depth 1). A smart solution would look at I/O request
> latency to decide what queue depth is acceptable. For example, if
> requests are taking 4 ms to complete then we might allow 2 or 3 requests
> to achieve a ~10 ms bdrv_drain_all() downtime target.
>
> As far as I know this has not been implemented.
>
> Do you want to try implementing this?
>
> Stefan
It is a really hard problem to solve. Ultimately, if guarantees are needed about the blackout period, I don't see any viable solution other than aborting all pending storage commands.

Starting with a "go to QD=1 mode" approach is probably sensible. Vhost-based backends could even do that off the "you need to log" message, given that these are only used during migration.

Having a "you are taking too long, abort everything" command might be something worth looking into, especially if we can *safely* replay the aborted commands on the other side. (That may be backend-dependent.)

F.
