On Wed, 20 May 2020 00:23:48 +0200 Thomas Gleixner <t...@linutronix.de> wrote:
> Stephen Hemminger <step...@networkplumber.org> writes:
> > On Tue, 19 May 2020 23:45:23 +0200
> > "Ahmed S. Darwish" <a.darw...@linutronix.de> wrote:
> >
> >> Sequence counters write paths are critical sections that must never be
> >> preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
> >>
> >> Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
> >> netdev name retrieval.") handled a deadlock, observed with
> >> CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
> >> infinitely spinning: it got scheduled after the seqcount write side
> >> blocked inside its own critical section.
> >>
> >> To fix that deadlock, among other issues, the commit added a
> >> cond_resched() inside the read side section. While this will get the
> >> non-preemptible kernel eventually unstuck, the seqcount reader is fully
> >> exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
> >>
> >> The fix is also still broken: if the seqcount reader belongs to a
> >> real-time scheduling policy, it can spin forever and the kernel will
> >> livelock.
> >>
> >> Disabling preemption over the seqcount write side critical section will
> >> not work: inside it are a number of GFP_KERNEL allocations and mutex
> >> locking through the drivers/base/ :: device_rename() call chain.
> >>
> >> From all the above, replace the seqcount with a rwsem.
> >>
> >> Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and
> >> netdev name retrieval.)
> >> Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
> >> Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to
> >> return an interface name)
> >> Cc: <sta...@vger.kernel.org>
> >> Signed-off-by: Ahmed S. Darwish <a.darw...@linutronix.de>
> >> Reviewed-by: Sebastian Andrzej Siewior <bige...@linutronix.de>
> >
> > Have your performance tested this with 1000's of network devices?
>
> No. We did not. -ENOTESTCASE

Please try, it isn't that hard..

# time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done

real	0m17.002s
user	0m1.064s
sys	0m0.375s
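
The loop above only times device creation. As a rough sketch, and not something taken from this thread, the same idea could be extended to time renames as well, since renaming is the path the patch changes. The function names and the IP override variable below are hypothetical. The real ip tool needs root, and the device names assume no existing dummyN or renamedN interfaces.

```shell
#!/bin/sh
# Sketch only: function names and the IP override are hypothetical,
# not part of the original mail. Run as root when IP is the real "ip".
IP=${IP:-ip}

# Create dummy0 .. dummy(N-1), like the loop in the mail.
add_dummies() (
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        "$IP" link add dev "dummy$i" type dummy || exit 1
        i=$((i + 1))
    done
)

# Rename every device: dummyK becomes renamedK.
rename_dummies() (
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        "$IP" link set dev "dummy$i" name "renamed$i" || exit 1
        i=$((i + 1))
    done
)

# Delete the renamed devices again.
del_dummies() (
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        "$IP" link del dev "renamed$i" || exit 1
        i=$((i + 1))
    done
)
```

With the functions loaded, each phase can be timed separately, for example "time add_dummies 1000", then "time rename_dummies 1000", then "time del_dummies 1000". Setting IP=echo prints the commands instead of running them.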