On 31.03.2020 at 18:18, Dietmar Maurer wrote:
> > > Looks like bdrv_parent_drained_poll_single() calls
> > > blk_root_drained_poll(), which returns true in my case (in_flight > 5).
> >
> > Can you identify which BlockBackend this is? Specifically, whether it's
> > the one attached to a guest device or whether it belongs to the block
> > job.
>
> This can trigger from various different places, but the simplest case is
> when it's called from drive_backup_prepare():
>
>     bdrv_drained_begin(bs);
>
> which is the backup source drive.
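For context on what that poll actually checks: roughly speaking, and
paraphrased from the block layer code rather than quoted verbatim,
bdrv_drained_begin() keeps running the event loop until no parent of the
node reports in-flight activity any more, and for a BlockBackend parent
that check is simply blk->in_flight:

/* rough sketch, simplified from block/io.c and block/block-backend.c,
 * not the literal source */
void bdrv_drained_begin(BlockDriverState *bs)
{
    /* notify the parents and the driver that draining begins ... */

    /* ... then poll the AioContext until nothing reports activity */
    BDRV_POLL_WHILE(bs, bdrv_drain_poll(bs, false, NULL, false));
}

/* one of the parent callbacks consulted by bdrv_drain_poll() */
static bool blk_root_drained_poll(BdrvChild *child)
{
    BlockBackend *blk = child->opaque;

    /* returning true means "keep polling" */
    return !!blk->in_flight;
}

So with in_flight > 5 on that BlockBackend, the drain loop keeps
spinning until those requests complete.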
I mean the BlockBackend for which blk_root_drained_poll() is called.

> > Maybe have a look at the job coroutine, too. You can probably find it
> > most easily in the 'jobs' list, and then print the coroutine backtrace
> > for job->co.
>
> This is in drive_backup_prepare(), before the job gets created.

Oh, I see. Then it can't be the job BlockBackend, of course.

> > > Looks like I am losing poll events somewhere?
> >
> > I don't think we've lost any event if in_flight > 0. It means that
> > something is still supposedly active. Maybe the job deadlocked.
>
> This is a simple call to bdrv_drained_begin(bs) (before we set up the
> job).
>
> Is really nobody else able to reproduce this (somebody already tried to
> reproduce)?

I can get hangs, but that's for job_completed(), not for starting the
job. Also, my hangs have a non-empty bs->tracked_requests, so it looks
like a different case to me.

In my case, the hanging request looks like this:

(gdb) qemu coroutine 0x556e055750e0
#0  0x0000556e03999150 in qemu_coroutine_switch (from_=from_@entry=0x556e055750e0, to_=to_@entry=0x7fd34bbeb5b8, action=action@entry=COROUTINE_YIELD) at util/coroutine-ucontext.c:218
#1  0x0000556e03997e31 in qemu_coroutine_yield () at util/qemu-coroutine.c:193
#2  0x0000556e0397fc88 in thread_pool_submit_co (pool=0x7fd33c003120, func=func@entry=0x556e038d59a0 <handle_aiocb_rw>, arg=arg@entry=0x7fd2d2b96440) at util/thread-pool.c:289
#3  0x0000556e038d511d in raw_thread_pool_submit (bs=bs@entry=0x556e04e459b0, func=func@entry=0x556e038d59a0 <handle_aiocb_rw>, arg=arg@entry=0x7fd2d2b96440) at block/file-posix.c:1894
#4  0x0000556e038d58c3 in raw_co_prw (bs=0x556e04e459b0, offset=230957056, bytes=4096, qiov=0x7fd33c006fe0, type=1) at block/file-posix.c:1941

Checking the thread pool request:

(gdb) p *((ThreadPool*)0x7fd33c003120).head.lh_first
$9 = {common = {aiocb_info = 0x556e03f43f80 <thread_pool_aiocb_info>, bs = 0x0, cb = 0x556e0397f670 <thread_pool_co_cb>, opaque = 0x7fd2d2b96400, refcnt = 1}, pool = 0x7fd33c003120, func = 0x556e038d59a0 <handle_aiocb_rw>, arg = 0x7fd2d2b96440, state = THREAD_DONE, ret = 0, reqs = {tqe_next = 0x0, tqe_circ = {tql_next = 0x0, tql_prev = 0x0}}, all = {le_next = 0x0, le_prev = 0x7fd33c0031d0}}

So apparently the request is THREAD_DONE, but the coroutine was never
reentered. I saw one case where ctx.bh_list was empty, but I also have a
case where a BH sits there scheduled and apparently just doesn't get
run:

(gdb) p *((ThreadPool*)0x7fd33c003120).ctx.bh_list.slh_first
$13 = {ctx = 0x556e04e41a10, cb = 0x556e0397f8e0 <thread_pool_completion_bh>, opaque = 0x7fd33c003120, next = {sle_next = 0x0}, flags = 3}

Stefan, I wonder if this is related to the recent changes to the BH
implementation.

Kevin
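P.S.: For anyone following along, the path that is supposed to re-enter
the coroutine looks roughly like this; simplified and paraphrased from
util/thread-pool.c, not the literal source:

/* (1) submitter: runs in a coroutine and yields until it is woken up */
int coroutine_fn thread_pool_submit_co(ThreadPool *pool,
                                       ThreadPoolFunc *func, void *arg)
{
    ThreadPoolCo tpc = { .co = qemu_coroutine_self(), .ret = -EINPROGRESS };

    thread_pool_submit_aio(pool, func, arg, thread_pool_co_cb, &tpc);
    while (tpc.ret == -EINPROGRESS) {
        qemu_coroutine_yield();        /* <-- where the backtrace sits */
    }
    return tpc.ret;
}

/* (2) worker thread: after running the request, it marks the element
 *     DONE and schedules the completion BH in the pool's AioContext */
static void *worker_thread(void *opaque)
{
    ...
    req->ret = req->func(req->arg);
    req->state = THREAD_DONE;
    qemu_bh_schedule(pool->completion_bh);
    ...
}

/* (3) completion BH (thread_pool_completion_bh) runs the callback for
 *     every DONE element; for thread_pool_submit_co() that callback is
 *     thread_pool_co_cb(), which wakes the coroutine again */
static void thread_pool_co_cb(void *opaque, int ret)
{
    ThreadPoolCo *co = opaque;

    co->ret = ret;
    aio_co_wake(co->co);
}

If step (3) never happens because the BH doesn't run, the element stays
in the list as THREAD_DONE and the coroutine is never woken, which is
exactly what the output above shows.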