On 31.03.2020 at 18:18, Dietmar Maurer wrote:
> > > Looks like bdrv_parent_drained_poll_single() calls
> > > blk_root_drained_poll(), which returns true in my case (in_flight > 5).
> >
> > Can you identify which BlockBackend this is? Specifically, whether it's
> > the one attached to a guest device or whether it belongs to the block
> > job.
>
> This can trigger from various different places, but the simplest case is
> when it's called from drive_backup_prepare():
>
>     bdrv_drained_begin(bs);
>
> which is the backup source drive.
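For context on what that poll actually checks: roughly speaking, and
paraphrased from the block layer code rather than quoted verbatim,
bdrv_drained_begin() keeps running the event loop until no parent of the
node reports in-flight activity any more, and for a BlockBackend parent
that check is simply blk->in_flight:

/* rough sketch, simplified from block/io.c and block/block-backend.c,
 * not the literal source */
void bdrv_drained_begin(BlockDriverState *bs)
{
    /* notify the parents and the driver that draining begins ... */

    /* ... then poll the AioContext until nothing reports activity */
    BDRV_POLL_WHILE(bs, bdrv_drain_poll(bs, false, NULL, false));
}

/* one of the parent callbacks consulted by bdrv_drain_poll() */
static bool blk_root_drained_poll(BdrvChild *child)
{
    BlockBackend *blk = child->opaque;

    /* returning true means "keep polling" */
    return !!blk->in_flight;
}

So with in_flight > 5 on that BlockBackend, the drain loop keeps
spinning until those requests complete.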
I mean the BlockBackend for which blk_root_drained_poll() is called.

> > Maybe have a look at the job coroutine, too. You can probably find it
> > most easily in the 'jobs' list, and then print the coroutine backtrace
> > for job->co.
>
> This is in drive_backup_prepare(), before the job gets created.

Oh, I see. Then it can't be the job BlockBackend, of course.

> > > Looks like I am losing poll events somewhere?
> >
> > I don't think we've lost any event if in_flight > 0. It means that
> > something is still supposedly active. Maybe the job deadlocked.
>
> This is a simple call to bdrv_drained_begin(bs) (before we set up the
> job).
>
> Is really nobody else able to reproduce this (somebody already tried to
> reproduce)?

I can get hangs, but that's for job_completed(), not for starting the
job. Also, my hangs have a non-empty bs->tracked_requests, so it looks
like a different case to me.

In my case, the hanging request looks like this:

(gdb) qemu coroutine 0x556e055750e0
#0  0x0000556e03999150 in qemu_coroutine_switch (from_=from_@entry=0x556e055750e0, to_=to_@entry=0x7fd34bbeb5b8, action=action@entry=COROUTINE_YIELD) at util/coroutine-ucontext.c:218
#1  0x0000556e03997e31 in qemu_coroutine_yield () at util/qemu-coroutine.c:193
#2  0x0000556e0397fc88 in thread_pool_submit_co (pool=0x7fd33c003120, func=func@entry=0x556e038d59a0 <handle_aiocb_rw>, arg=arg@entry=0x7fd2d2b96440) at util/thread-pool.c:289
#3  0x0000556e038d511d in raw_thread_pool_submit (bs=bs@entry=0x556e04e459b0, func=func@entry=0x556e038d59a0 <handle_aiocb_rw>, arg=arg@entry=0x7fd2d2b96440) at block/file-posix.c:1894
#4  0x0000556e038d58c3 in raw_co_prw (bs=0x556e04e459b0, offset=230957056, bytes=4096, qiov=0x7fd33c006fe0, type=1) at block/file-posix.c:1941

Checking the thread pool request:

(gdb) p *((ThreadPool*)0x7fd33c003120).head.lh_first
$9 = {common = {aiocb_info = 0x556e03f43f80 <thread_pool_aiocb_info>, bs = 0x0, cb = 0x556e0397f670 <thread_pool_co_cb>, opaque = 0x7fd2d2b96400, refcnt = 1}, pool = 0x7fd33c003120, func = 0x556e038d59a0 <handle_aiocb_rw>, arg = 0x7fd2d2b96440, state = THREAD_DONE, ret = 0, reqs = {tqe_next = 0x0, tqe_circ = {tql_next = 0x0, tql_prev = 0x0}}, all = {le_next = 0x0, le_prev = 0x7fd33c0031d0}}

So apparently the request is THREAD_DONE, but the coroutine was never
reentered. I saw one case where ctx.bh_list was empty, but I also have a
case where a BH sits there scheduled and apparently just doesn't get
run:

(gdb) p *((ThreadPool*)0x7fd33c003120).ctx.bh_list.slh_first
$13 = {ctx = 0x556e04e41a10, cb = 0x556e0397f8e0 <thread_pool_completion_bh>, opaque = 0x7fd33c003120, next = {sle_next = 0x0}, flags = 3}

Stefan, I wonder if this is related to the recent changes to the BH
implementation.

Kevin
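P.S.: For anyone following along, the path that is supposed to re-enter
the coroutine looks roughly like this; simplified and paraphrased from
util/thread-pool.c, not the literal source:

/* (1) submitter: runs in a coroutine and yields until it is woken up */
int coroutine_fn thread_pool_submit_co(ThreadPool *pool,
                                       ThreadPoolFunc *func, void *arg)
{
    ThreadPoolCo tpc = { .co = qemu_coroutine_self(), .ret = -EINPROGRESS };

    thread_pool_submit_aio(pool, func, arg, thread_pool_co_cb, &tpc);
    while (tpc.ret == -EINPROGRESS) {
        qemu_coroutine_yield();        /* <-- where the backtrace sits */
    }
    return tpc.ret;
}

/* (2) worker thread: after running the request, it marks the element
 *     DONE and schedules the completion BH in the pool's AioContext */
static void *worker_thread(void *opaque)
{
    ...
    req->ret = req->func(req->arg);
    req->state = THREAD_DONE;
    qemu_bh_schedule(pool->completion_bh);
    ...
}

/* (3) completion BH (thread_pool_completion_bh) runs the callback for
 *     every DONE element; for thread_pool_submit_co() that callback is
 *     thread_pool_co_cb(), which wakes the coroutine again */
static void thread_pool_co_cb(void *opaque, int ret)
{
    ThreadPoolCo *co = opaque;

    co->ret = ret;
    aio_co_wake(co->co);
}

If step (3) never happens because the BH doesn't run, the element stays
in the list as THREAD_DONE and the coroutine is never woken, which is
exactly what the output above shows.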