Stephen Hemminger <step...@networkplumber.org> writes:

> On Tue, 19 May 2020 23:45:23 +0200
> "Ahmed S. Darwish" <a.darw...@linutronix.de> wrote:
>
>> Sequence counters write paths are critical sections that must never be
>> preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
>>
>> Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
>> netdev name retrieval.") handled a deadlock, observed with
>> CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
>> infinitely spinning: it got scheduled after the seqcount write side
>> blocked inside its own critical section.
>>
>> To fix that deadlock, among other issues, the commit added a
>> cond_resched() inside the read side section. While this will get the
>> non-preemptible kernel eventually unstuck, the seqcount reader is fully
>> exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
>>
>> The fix is also still broken: if the seqcount reader belongs to a
>> real-time scheduling policy, it can spin forever and the kernel will
>> livelock.
>>
>> Disabling preemption over the seqcount write side critical section will
>> not work: inside it are a number of GFP_KERNEL allocations and mutex
>> locking through the drivers/base/ :: device_rename() call chain.
>>
>> From all the above, replace the seqcount with a rwsem.
>>
>> Fixes: 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
>> netdev name retrieval.")
>> Fixes: 30e6c9fa93cf ("net: devnet_rename_seq should be a seqcount")
>> Fixes: c91f6df2db49 ("sockopt: Change getsockopt() of SO_BINDTODEVICE to
>> return an interface name")
>> Cc: <sta...@vger.kernel.org>
>> Signed-off-by: Ahmed S. Darwish <a.darw...@linutronix.de>
>> Reviewed-by: Sebastian Andrzej Siewior <bige...@linutronix.de>
>
> Have you performance tested this with 1000's of network devices?
No. We did not. -ENOTESTCASE

> The reason seqcount logic was done here was to achieve scalability,
> and a semaphore does not scale as well.

That still does not make the livelock magically go away. Just make a
reader with real-time priority preempt the writer and the system stops
dead. The net result is performance <= 0.

This was observed on RT kernels without a special 1000's of network
devices test case.

Just for the record: this is not an RT-specific problem. You can
reproduce it w/o an RT kernel as well. Just run the reader with a
real-time scheduling policy.

As much as you hate it from a performance POV, the only sane rule of
programming is: correctness first. And this code clearly violates that
rule.

Thanks,

        tglx
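
For reference, a minimal sketch of the two read-side patterns contrasted
above. The names carrying a _sketch suffix are hypothetical stand-ins;
the real code is the devnet_rename_seq handling in net/core/dev.c, which
this sketch only approximates.

    #include <linux/seqlock.h>
    #include <linux/rwsem.h>
    #include <linux/sched.h>
    #include <linux/string.h>
    #include <linux/netdevice.h>

    static seqcount_t rename_seq_sketch = SEQCNT_ZERO(rename_seq_sketch);
    static DECLARE_RWSEM(rename_sem_sketch);

    /*
     * Old pattern: lockless reader that retries while the writer is
     * inside its (sleeping!) critical section.  The cond_resched()
     * eventually unsticks a CONFIG_PREEMPTION=n kernel, but a reader
     * running with a real-time scheduling policy never yields the CPU
     * to the preempted writer and spins here forever: livelock.
     */
    static void get_name_seqcount_sketch(char *name,
                                         const struct net_device *dev)
    {
            unsigned int seq;

    retry:
            seq = raw_seqcount_begin(&rename_seq_sketch);
            strscpy(name, dev->name, IFNAMSIZ);
            if (read_seqcount_retry(&rename_seq_sketch, seq)) {
                    cond_resched();
                    goto retry;
            }
    }

    /*
     * New pattern: a plain sleeping lock.  The writer may block
     * (GFP_KERNEL allocations, device_rename()) without starving
     * readers, and an RT reader simply sleeps until the writer is done.
     */
    static void get_name_rwsem_sketch(char *name,
                                      const struct net_device *dev)
    {
            down_read(&rename_sem_sketch);
            strscpy(name, dev->name, IFNAMSIZ);
            up_read(&rename_sem_sketch);
    }

The rwsem gives up the lockless read fast path, but the reader is
guaranteed forward progress once the writer releases the lock, which is
the correctness-first point made above.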