Hi,

On Wed, 2017-11-29 at 08:14 -0800, Alexander Duyck wrote:
> On Wed, Nov 29, 2017 at 6:25 AM, Paolo Abeni <pab...@redhat.com> wrote:
> > Currently deleting a qdisc with a large number of children and filters
> > can take a lot of time:
> >
> > tc qdisc add dev lo root htb
> > for I in `seq 1 1000`; do
> >         tc class add dev lo parent 1: classid 1:$I htb rate 100kbit
> >         tc qdisc add dev lo parent 1:$I handle $((I + 1)): htb
> >         for J in `seq 1 10`; do
> >                 tc filter add dev lo parent $((I + 1)): u32 match ip src 1.1.1.$J
> >         done
> > done
> > time tc qdisc del dev lo root
> >
> > real    0m54.764s
> > user    0m0.023s
> > sys     0m0.000s
> >
> > This is due to the multiple rcu_barrier() calls, one for each tcf_block
> > freed, invoked with the rtnl lock held. Most other network related
> > tasks will block in this timespan.
> >
> > This change implements bulk free of tcf_block() at destroy() time:
> > when deleting a qdisc hierarchy, the to-be-deleted blocks are queued
> > in a rtnl_lock protected list, and a single rcu_barrier is invoked
> > for each burst.
> >
> > The burst is flushed after the deletion of the topmost qdisc of the
> > destroyed hierarchy and all the queued blocks are deleted with a
> > single delayed work call.
> >
> > After this change, the previous example gives:
> >
> > real    0m0.193s
> > user    0m0.000s
> > sys     0m0.016s
> >
> > Signed-off-by: Paolo Abeni <pab...@redhat.com>
>
> While I agree there is an issue I don't think this is being approached
> quite the right way. The question I have is why aren't we using the
> standard RCU approach for this and simply using either call_rcu or
> kfree_rcu to free the object?
>
> > ---
> > This patch adds 2 new list_head fields to tcf_block, that could be
> > replaced with a single pointer, open coding single linked list
> > manipulation in cls_api.c, if a lower memory impact is required.
> > ---
> >  include/net/pkt_cls.h     |  1 +
> >  include/net/sch_generic.h |  5 +++
> >  net/sched/cls_api.c       | 78 +++++++++++++++++++++++++++++++++++------------
> >  net/sched/sch_api.c       |  1 +
> >  net/sched/sch_generic.c   | 17 +++++++++++
> >  net/sched/sch_ingress.c   |  1 +
> >  6 files changed, 83 insertions(+), 20 deletions(-)
> >
> > diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
> > index 0105445cab83..12777cfae77c 100644
> > --- a/include/net/pkt_cls.h
> > +++ b/include/net/pkt_cls.h
> > @@ -45,6 +45,7 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
> >  void tcf_block_put(struct tcf_block *block);
> >  void tcf_block_put_ext(struct tcf_block *block, struct Qdisc *q,
> >                         struct tcf_block_ext_info *ei);
> > +void tcf_flush_blocks(void);
> >
> >  static inline struct Qdisc *tcf_block_q(struct tcf_block *block)
> >  {
> > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> > index 65d0d25f2648..99846ee644a8 100644
> > --- a/include/net/sch_generic.h
> > +++ b/include/net/sch_generic.h
> > @@ -71,6 +71,7 @@ struct Qdisc {
> >                                       * qdisc_tree_decrease_qlen() should stop.
> >                                       */
> >  #define TCQ_F_INVISIBLE         0x80 /* invisible by default in dump */
> > +#define TCQ_F_DELETING          0x100
> >         u32                     limit;
> >         const struct Qdisc_ops  *ops;
> >         struct qdisc_size_table __rcu *stab;
> > @@ -279,6 +280,10 @@ struct tcf_block {
> >         struct Qdisc *q;
> >         struct list_head cb_list;
> >         struct work_struct work;
> > +
> > +       /* TODO: use a single list, to avoid wasting too much memory */
> > +       struct list_head del_list;
> > +       struct list_head del_head;
> >  };
> >
>
> This is just adding yet another layer of deferred freeing. We already
> have RCU why don't we just use that?
>
> >  static inline void qdisc_cb_private_validate(const struct sk_buff *skb, int sz)
> > diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> > index 7d97f612c9b9..446b16c1f532 100644
> > --- a/net/sched/cls_api.c
> > +++ b/net/sched/cls_api.c
> > @@ -37,6 +37,61 @@ static LIST_HEAD(tcf_proto_base);
> >  /* Protects list of registered TC modules. It is pure SMP lock. */
> >  static DEFINE_RWLOCK(cls_mod_lock);
> >
> > +/* List of tcf_blocks queued for deletion. Bulk freeing them we avoid the
> > + * rcu_barrier() storm at root_qdisc->destroy() time
> > + */
> > +static LIST_HEAD(queued_blocks);
> > +
> > +static void queue_for_deletion(struct tcf_block *block)
> > +{
> > +       if (WARN_ON(!list_empty(&block->del_list)))
> > +               return;
> > +
> > +       ASSERT_RTNL();
> > +       list_add(&block->del_list, &queued_blocks);
> > +}
> > +
> > +static void flush_blocks(struct work_struct *work)
> > +{
> > +       struct tcf_block *block, *tmp_block;
> > +       struct tcf_chain *chain, *tmp;
> > +       struct list_head *head;
> > +
> > +       head = &container_of(work, struct tcf_block, work)->del_head;
> > +       rtnl_lock();
> > +       list_for_each_entry(block, head, del_list)
> > +               /* Only chain 0 should be still here. */
> > +               list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
> > +                       tcf_chain_put(chain);
> > +       rtnl_unlock();
> > +
> > +       list_for_each_entry_safe(block, tmp_block, head, del_list)
> > +               kfree(block);
> > +}
> > +
> > +void tcf_flush_blocks(void)
> > +{
> > +       struct tcf_block *head;
> > +       LIST_HEAD(flush_burst);
> > +
> > +       ASSERT_RTNL();
> > +       if (list_empty(&queued_blocks))
> > +               return;
> > +
> > +       head = list_first_entry(&queued_blocks, struct tcf_block, del_list);
> > +       INIT_LIST_HEAD(&head->del_head);
> > +       list_splice_init(&queued_blocks, &head->del_head);
> > +       INIT_WORK(&head->work, flush_blocks);
> > +
> > +       /* Wait for existing RCU callbacks to cool down, make sure their works
> > +        * have been queued before this. We can not flush pending works here
> > +        * because we are holding the RTNL lock.
> > +        */
> > +       rcu_barrier();
> > +       schedule_work(&head->work);
> > +}
> > +EXPORT_SYMBOL_GPL(tcf_flush_blocks);
> > +
> >  /* Find classifier type by string name */
> >
> >  static const struct tcf_proto_ops *tcf_proto_lookup_ops(const char *kind)
> > @@ -288,6 +343,7 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
> >                 return -ENOMEM;
> >         INIT_LIST_HEAD(&block->chain_list);
> >         INIT_LIST_HEAD(&block->cb_list);
> > +       INIT_LIST_HEAD(&block->del_list);
> >
> >         /* Create chain 0 by default, it has to be always present. */
> >         chain = tcf_chain_create(block, 0);
> > @@ -330,19 +386,6 @@ int tcf_block_get(struct tcf_block **p_block,
> >  }
> >  EXPORT_SYMBOL(tcf_block_get);
> >
> > -static void tcf_block_put_final(struct work_struct *work)
> > -{
> > -       struct tcf_block *block = container_of(work, struct tcf_block, work);
> > -       struct tcf_chain *chain, *tmp;
> > -
> > -       rtnl_lock();
> > -       /* Only chain 0 should be still here. */
> > -       list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
> > -               tcf_chain_put(chain);
>
> So it seems like the heart of the problem is right here. The
> tcf_chain_put call is updating the reference count and I would assume
> that is the only bit that really needs you to still be holding onto
> the rtnl_lock.
>
> The question I would have is if there is anything accessing the
> reference count or manipulating the list itself without holding the
> rtnl lock? If not you could look at converting this whole thing from
> using a list to an rculist and it seems like it would save you a lot
> of trouble. As far as I can tell the only thing you would then really
> have to worry about would be the freeing of the chain itself which
> would happen in an rcu callback instead of with the rtnl lock held.
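(For reference, the direction suggested above would look roughly like the
minimal sketch below, assuming all refcount and list writers really do run
under the rtnl lock - which is exactly the open question: chain_list becomes
an rculist and the final kfree() moves into an RCU callback, so no
rcu_barrier() is needed on the destroy path. This is only an illustration,
not actual cls_api.c code; the struct layout is simplified and the helpers
tcf_chain_free_rcu()/tcf_chain_put_rcu()/tcf_chain_lookup_rcu() are
hypothetical.)

#include <linux/rculist.h>
#include <linux/rtnetlink.h>
#include <linux/slab.h>
#include <linux/types.h>

/* Simplified chain: an rcu_head replaces the deferred-work machinery. */
struct tcf_chain {
	struct list_head list;		/* linked into the block's chain_list */
	u32 index;
	unsigned int refcnt;
	struct rcu_head rcu;
};

/* Runs after a grace period; no reader can still see the chain. */
static void tcf_chain_free_rcu(struct rcu_head *head)
{
	kfree(container_of(head, struct tcf_chain, rcu));
}

/* Writer side: refcount and list updates stay under the rtnl lock, only
 * the memory release is deferred, via call_rcu() instead of
 * rcu_barrier() + work.
 */
static void tcf_chain_put_rcu(struct tcf_chain *chain)
{
	ASSERT_RTNL();
	if (--chain->refcnt == 0) {
		list_del_rcu(&chain->list);
		call_rcu(&chain->rcu, tcf_chain_free_rcu);
	}
}

/* Reader side: callers walk the list under rcu_read_lock() instead of
 * relying on the rtnl lock; chain_list stands in for block->chain_list.
 */
static struct tcf_chain *tcf_chain_lookup_rcu(struct list_head *chain_list,
					      u32 index)
{
	struct tcf_chain *chain;

	list_for_each_entry_rcu(chain, chain_list, list) {
		if (chain->index == index)
			return chain;
	}
	return NULL;
}

With this scheme the list and refcount updates remain serialized by the rtnl
lock, while lockless readers are protected by the grace period that precedes
the deferred kfree().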
Thank you for the feedback!

Big fat disclaimer: I'm not 100% sure I fully understood the locking
schema currently in use (hence the RFC tag), so please Jamal/Cong/Jiri
correct me if/where needed.

AFAIU, there is some more context information: tcf_block_put_final()
and the like must be invoked after the related filters are freed.
Filters themselves must be destroyed synchronously at namespace
deletion time, but must respect the RCU grace period elsewhere - see
the comments in tcf_exts_get_net(). The rcu_barrier() enforces the
proper ordering in the latter case, and the rtnl_lock protects against
concurrent access in the former.

Cheers,

Paolo
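(For context, the constraint referenced above is roughly the pattern sketched
below. tcf_exts_get_net(), tcf_exts_put_net() and tcf_exts_destroy() are the
real helpers; struct my_filter and the surrounding functions are made up to
keep the example small and only approximate what classifiers such as cls_u32
do on delete.)

#include <linux/rcupdate.h>
#include <linux/rtnetlink.h>
#include <linux/slab.h>
#include <linux/workqueue.h>
#include <net/pkt_cls.h>

/* Hypothetical filter, standing in for e.g. a cls_u32 knode. */
struct my_filter {
	struct tcf_exts exts;
	struct rcu_head rcu;
	struct work_struct work;
};

/* Common teardown; callers must hold the rtnl lock.  tcf_exts_put_net()
 * is a no-op on the netns-dismantle path, where no reference was taken.
 */
static void my_filter_destroy(struct my_filter *f)
{
	tcf_exts_destroy(&f->exts);
	tcf_exts_put_net(&f->exts);
	kfree(f);
}

static void my_filter_free_work(struct work_struct *work)
{
	struct my_filter *f = container_of(work, struct my_filter, work);

	rtnl_lock();
	my_filter_destroy(f);
	rtnl_unlock();
}

/* Grace period elapsed: defer the rtnl-protected cleanup to a work item.
 * Queueing this work is what the rcu_barrier() on the block destroy path
 * waits for before the tcf_block itself may go away.
 */
static void my_filter_free_rcu(struct rcu_head *head)
{
	struct my_filter *f = container_of(head, struct my_filter, rcu);

	INIT_WORK(&f->work, my_filter_free_work);
	schedule_work(&f->work);
}

/* Delete path: pin the netns and free via RCU when possible; otherwise the
 * namespace is being dismantled and the filter must be destroyed
 * synchronously (the caller already holds the rtnl lock in that case).
 */
static void my_filter_delete(struct my_filter *f)
{
	if (tcf_exts_get_net(&f->exts))
		call_rcu(&f->rcu, my_filter_free_rcu);
	else
		my_filter_destroy(f);
}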