On Sun, 2019-04-28 at 16:14 +0800, Ming Lei wrote:
> Just like aio/io_uring, we need to grab 2 refcount for queuing one
> request, one is for submission, another is for completion.
>
> If the request isn't queued from plug code path, the refcount grabbed
> in generic_make_request() serves for submission. In theory, this
> refcount should have been released after the submission (async run queue)
> is done. blk_freeze_queue() works with blk_sync_queue() together
> for avoiding race between cleanup queue and IO submission, given async
> run queue activities are canceled because hctx->run_work is scheduled with
> the refcount held, so it is fine to not hold the refcount when
> running the run queue work function for dispatch IO.
>
> However, if a request is staged in the plug list and finally queued
> from the plug code path, the submission-side refcount is actually missing.
> And we may start to run the queue after the queue has been removed, because
> the queue's kobject refcount isn't guaranteed to be grabbed in the flushing
> plug list context, so a kernel oops is triggered; see the following race:
>
> blk_mq_flush_plug_list():
> blk_mq_sched_insert_requests()
> insert requests to sw queue or scheduler queue
> blk_mq_run_hw_queue
>
> Because of concurrent run-queue activity, all requests inserted above may
> have completed before the above blk_mq_run_hw_queue() is called. The queue
> can then be freed during the above blk_mq_run_hw_queue().
>
> Fix the issue by grabbing .q_usage_counter before calling
> blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This is
> safe because the queue is guaranteed to be alive before inserting requests.
>
> Cc: Dongli Zhang <[email protected]>
> Cc: James Smart <[email protected]>
> Cc: Bart Van Assche <[email protected]>
> Cc: [email protected],
> Cc: Martin K . Petersen <[email protected]>,
> Cc: Christoph Hellwig <[email protected]>,
> Cc: James E . J . Bottomley <[email protected]>,
> Reviewed-by: Bart Van Assche <[email protected]>
> Reviewed-by: Johannes Thumshirn <[email protected]>
> Reviewed-by: Hannes Reinecke <[email protected]>
> Tested-by: James Smart <[email protected]>
> Signed-off-by: Ming Lei <[email protected]>
I added my "Reviewed-by" tag to a previous version of this patch but not
to this version. Several of the "Reviewed-by" tags above should probably
be removed.
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index aa6bc5c02643..dfe83e7935d6 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -414,6 +414,13 @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
> {
> struct elevator_queue *e;
>
> + /*
> +	 * blk_mq_sched_insert_requests() is called from the flush plug
> +	 * context only, and holds one usage counter to prevent the queue
> +	 * from being released.
> + */
> + percpu_ref_get(&hctx->queue->q_usage_counter);
> +
> e = hctx->queue->elevator;
> if (e && e->type->ops.insert_requests)
> e->type->ops.insert_requests(hctx, list, false);
> @@ -432,6 +439,8 @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
> }
>
> blk_mq_run_hw_queue(hctx, run_queue_async);
> +
> + percpu_ref_put(&hctx->queue->q_usage_counter);
> }
I think that 'hctx' can disappear if all requests queued by this function
finish just before blk_mq_run_hw_queue() returns and if the number of hardware
queues is changed from another thread. Shouldn't the request queue pointer be
stored in a local variable instead of reading hctx->queue twice?
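Something along these lines (an untested sketch against this patch, just to
illustrate the suggestion; the hunk context is abbreviated):

```diff
@@ ... @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
 {
 	struct elevator_queue *e;
+	/* Cache the queue pointer; do not dereference hctx after run_hw_queue. */
+	struct request_queue *q = hctx->queue;

-	percpu_ref_get(&hctx->queue->q_usage_counter);
+	percpu_ref_get(&q->q_usage_counter);
@@ ... @@
 	blk_mq_run_hw_queue(hctx, run_queue_async);

-	percpu_ref_put(&hctx->queue->q_usage_counter);
+	percpu_ref_put(&q->q_usage_counter);
 }
```

That way the final percpu_ref_put() only touches the request queue, which is
kept alive by the reference taken at the top, and never reads through hctx
after the hardware queue may have gone away.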
Bart.