Re: [Patch net-next] net_sched: fix RTNL deadlock again caused by request_module()

Vlad Buslov Mon, 18 Jan 2021 06:33:56 -0800

On Sun 17 Jan 2021 at 17:15, Jamal Hadi Salim <j...@mojatatu.com> wrote:
> On 2021-01-16 7:56 p.m., Cong Wang wrote:
>> From: Cong Wang <cong.w...@bytedance.com>
>> tcf_action_init_1() loads tc action modules automatically with
>> request_module() after parsing the tc action names, and it drops RTNL
>> lock and re-holds it before and after request_module(). This causes a
>> lot of troubles, as discovered by syzbot, because we can be in the
>> middle of batch initializations when we create an array of tc actions.
>> One of the problem is deadlock:
>> CPU 0                                        CPU 1
>> rtnl_lock();
>> for (...) {
>>    tcf_action_init_1();
>>      -> rtnl_unlock();
>>      -> request_module();
>>                              rtnl_lock();
>>                              for (...) {
>>                                tcf_action_init_1();
>>                                  -> tcf_idr_check_alloc();
>>                                 // Insert one action into idr,
>>                                 // but it is not committed until
>>                                 // tcf_idr_insert_many(), then drop
>>                                 // the RTNL lock in the _next_
>>                                 // iteration
>>                                 -> rtnl_unlock();
>>      -> rtnl_lock();
>>      -> a_o->init();
>>        -> tcf_idr_check_alloc();
>>        // Now waiting for the same index
>>        // to be committed
>>                                  -> request_module();
>>                                  -> rtnl_lock()
>>                                  // Now waiting for RTNL lock
>>                              }
>>                              rtnl_unlock();
>> }
>> rtnl_unlock();
>> This is not easy to solve, we can move the request_module() before
>> this loop and pre-load all the modules we need for this netlink
>> message and then do the rest initializations. So the loop breaks down
>> to two now:
>>          for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
>>                  struct tc_action_ops *a_o;
>>                  a_o = tc_action_load_ops(name, tb[i]...);
>>                  ops[i - 1] = a_o;
>>          }
>>          for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
>>                  act = tcf_action_init_1(ops[i - 1]...);
>>          }
>> Although this looks serious, it only has been reported by syzbot, so it
>> seems hard to trigger this by humans. And given the size of this patch,
>> I'd suggest to make it to net-next and not to backport to stable.
>> This patch has been tested by syzbot and tested with tdc.py by me.
>> 
>
> LGTM.
> Initially i was worried about performance impact but i found nothing
> observable. We need to add a tdc test for batch (I can share how i did
> batch testing at next meet).
>
> Tested-by: Jamal Hadi Salim <j...@mojatatu.com>
> Acked-by: Jamal Hadi Salim <j...@mojatatu.com>
>
> cheers,
> jamal


Hi,

Thanks for adding me to the thread!
I ran our performance tests with the patch applied and didn't observe
any regression.

Regards,
Vlad

Re: [Patch net-next] net_sched: fix RTNL deadlock again caused by request_module()

Reply via email to