On 2020-06-27 12:55, Cong Wang wrote:
On Fri, Jun 26, 2020 at 5:32 PM Sean Tranchetti
<stran...@codeaurora.org> wrote:
A potential deadlock can occur during registering or unregistering a
new
generic netlink family between the main nl_table_lock and the cb_lock
where
each thread wants the lock held by the other, as demonstrated below.
1) Thread 1 is performing a netlink_bind() operation on a socket. As
part
of this call, it will call netlink_lock_table(), incrementing the
nl_table_users count to 1.
2) Thread 2 is registering (or unregistering) a genl_family via the
genl_(un)register_family() API. The cb_lock semaphore will be taken
for
writing.
3) Thread 1 will call genl_bind() as part of the bind operation to
handle
subscribing to GENL multicast groups at the request of the user. It
will
attempt to take the cb_lock semaphore for reading, but it will fail
and
be scheduled away, waiting for Thread 2 to finish the write.
4) Thread 2 will call netlink_table_grab() during the (un)registration
call. However, as Thread 1 has incremented nl_table_users, it will
not
be able to proceed, and both threads will be stuck waiting for the
other.
To avoid this scenario, the locks should be acquired in the same order
by
both threads. Since both the register and unregister functions need to
take
the nl_table_lock in their processing, it makes sense to explicitly
acquire
them before they lock the genl_mutex and the cb_lock. In
unregistering, no
other change is needed aside from this locking change.
Like the kernel test robot reported, you can not call genl_lock_all
while
holding netlink_table_grab() which is effectively a write lock.
To me, it seems genl_bind() can be just removed as there is no one
in-tree uses family->mcast_bind(). Can you test the attached patch?
It seems sufficient to fix this deadlock.
Thanks.
Thanks Cong. Yes, removing the genl_bind()/genl_unbind() functions
eliminates the
potential for this deadlock. Adding Johannes here to comment on removing
these,
as the family->mcast_bind() capability added by commit c380d9a7afff
("genetlink: pass multicast bind/unbind to families") would be lost.