On Wed, Oct 25, 2017 at 03:37:40PM -0700, Cong Wang wrote:
> On Tue, Oct 24, 2017 at 6:43 PM, David Miller <da...@davemloft.net> wrote:
> > From: Cong Wang <xiyou.wangc...@gmail.com>
> > Date: Mon, 23 Oct 2017 15:02:49 -0700
> >
> >> Recently, the RCU callbacks used in TC filters and TC actions keep
> >> drawing my attention, they introduce at least 4 race condition bugs:
> >
> > Like Eric, I think doing a full RCU sync on every delete is too big
> > a pill to swallow. This is a major control plane performance
> > regression.
> >
> > Please find another reasonable way to fix this.
> >
> 
> Alright... I finally find a way to make everyone happy.
> 
> My solution is introducing a workqueue for tc filters
> and let each RCU callback defer the work to this
> workqueue. I solve the flush_workqueue() deadlock
> by queuing another work in the same workqueue
> at the end, so the execution order should be as same
> as it is now. The ugly part is now tcf_block_put() which
> looks like below:
> 
> 
> static void tcf_block_put_final(struct work_struct *work)
> {
> 	struct tcf_block *block = container_of(work, struct tcf_block, work);
> 	struct tcf_chain *chain, *tmp;
> 
> 	/* At this point, all the chains should have refcnt == 1. */
> 	rtnl_lock();
> 	list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
> 		tcf_chain_put(chain);
> 	rtnl_unlock();
> 	kfree(block);
> }
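For whatever it is worth, here is a minimal sketch of the deferral
pattern I understand the above to be describing: the RCU callback does
nothing but hand the object off to a dedicated workqueue, so the real
teardown runs later in process context, and per-object ordering can be
preserved by making that workqueue ordered.  The workqueue name, its
setup function, and the example filter structure below are guesses of
mine; only tcf_queue_work() itself appears in the patch above.

#include <linux/init.h>
#include <linux/workqueue.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

/* Assumed dedicated, ordered workqueue for tc filter teardown. */
static struct workqueue_struct *tc_filter_wq;

/* Would be called from tc setup code, e.g. a module/subsys init. */
static int __init tcf_wq_init(void)
{
	/* Ordered: at most one work item runs at a time, in queueing order. */
	tc_filter_wq = alloc_ordered_workqueue("tc_filter_workqueue", 0);
	return tc_filter_wq ? 0 : -ENOMEM;
}

/* Presumably what tcf_queue_work() in the patch boils down to. */
static bool tcf_queue_work(struct work_struct *work)
{
	return queue_work(tc_filter_wq, work);
}

/* Hypothetical object, just to show the call_rcu() -> workqueue handoff. */
struct some_filter {
	struct rcu_head rcu;
	struct work_struct work;
	/* ... filter state ... */
};

static void some_filter_work_fn(struct work_struct *work)
{
	struct some_filter *f = container_of(work, struct some_filter, work);

	/* Process context: safe to take RTNL, drop refcounts, and free. */
	kfree(f);
}

static void some_filter_rcu_cb(struct rcu_head *head)
{
	struct some_filter *f = container_of(head, struct some_filter, rcu);

	INIT_WORK(&f->work, some_filter_work_fn);
	tcf_queue_work(&f->work);
}

/* The teardown path would then do: call_rcu(&f->rcu, some_filter_rcu_cb); */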
I am guessing that tcf_chain_put() sometimes does a call_rcu(), and the
callback function in turn calls schedule_work(), and that
tcf_block_put_deferred() is the workqueue handler function.

> static void tcf_block_put_deferred(struct work_struct *work)
> {
> 	struct tcf_block *block = container_of(work, struct tcf_block, work);
> 	struct tcf_chain *chain;
> 
> 	rtnl_lock();
> 	/* Hold a refcnt for all chains, except 0, in case they are gone. */
> 	list_for_each_entry(chain, &block->chain_list, list)
> 		if (chain->index)
> 			tcf_chain_hold(chain);
> 
> 	/* No race on the list, because no chain could be destroyed. */
> 	list_for_each_entry(chain, &block->chain_list, list)
> 		tcf_chain_flush(chain);
> 
> 	INIT_WORK(&block->work, tcf_block_put_final);
> 	/* Wait for RCU callbacks to release the reference count and make
> 	 * sure their works have been queued before this.
> 	 */
> 	rcu_barrier();

This one can take a while...  Though in recent kernels it will often be
a bit faster than synchronize_rcu().

Note that rcu_barrier() only waits for callbacks posted via call_rcu()
before the rcu_barrier() starts, if that matters.

> 	tcf_queue_work(&block->work);
> 	rtnl_unlock();
> }

And it looks like tcf_block_put_deferred() queues itself as work as well.
Or maybe instead?

> void tcf_block_put(struct tcf_block *block)
> {
> 	if (!block)
> 		return;
> 
> 	INIT_WORK(&block->work, tcf_block_put_deferred);
> 	/* Wait for existing RCU callbacks to cool down, make sure their works
> 	 * have been queued before this. We can not flush pending works here
> 	 * because we are holding the RTNL lock.
> 	 */
> 	rcu_barrier();
> 	tcf_queue_work(&block->work);
> }
> 
> 
> Paul, does this make any sense to you? ;)

I would be surprised if I fully understand the problem to be solved, but
my current guess is that the constraints are as follows:

1.	Things removed must be processed in order.

2.	Things removed must not be processed until a grace period
	has elapsed.

3.	Things being processed after a grace period should not be
	processed concurrently with each other or with subsequent
	removals.

4.	A given removal is not finalized until its reference count
	reaches zero.

5.	RTNL might not be held when the reference count reaches zero.

Or did I lose the thread somewhere?

							Thanx, Paul