On 06/08/2020 12:17, Rasmus Villemoes wrote:
> On 06/08/2020 01.34, Stephen Hemminger wrote:
>> On Wed, 5 Aug 2020 16:25:23 +0200
>> Rasmus Villemoes <rasmus.villem...@prevas.dk> wrote:
>>
>>> Hi,
>>>
>>> We're seeing occasional lockups on an embedded board (running an -rt
>>> kernel), which I believe I've tracked down to the
>>>
>>>         if (!rtnl_trylock())
>>>                 return restart_syscall();
>>>
>>> in net/bridge/br_sysfs_br.c. The problem is that some SCHED_FIFO task
>>> writes a "1" to the /sys/class/net/foo/bridge/flush file, while some
>>> lower-priority SCHED_FIFO task happens to hold rtnl_lock(). When that
>>> happens, the higher-priority task is stuck in an eternal ERESTARTNOINTR
>>> loop, and the lower-priority task never gets runtime and thus cannot
>>> release the lock.
>>>
>>> I've written a script that rather quickly reproduces this both on our
>>> target and my desktop machine (pinning everything on one CPU to emulate
>>> the uni-processor board), see below. Also, with this hacky patch
>>
>> There is a reason for the trylock, it works around a priority inversion.
>
> Can you elaborate? It seems to me that it _causes_ a priority inversion
> since priority inheritance doesn't have a chance to kick in.
>
>> The real problem is expecting a SCHED_FIFO task to be safe with this
>> kind of network operation.
>
> Maybe. But ignoring the SCHED_FIFO/rt-prio stuff, it also seems a bit
> odd to do what is essentially a busy-loop - yes, the restart_syscall()
> allows signals to be delivered (including allowing the process to get
> killed), but in the absence of any signals, the pattern essentially
> boils down to
>
>         while (!rtnl_trylock())
>                 ;
>
> So even for regular tasks, this seems to needlessly hog the cpu.
>
> I tried this
>
> diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c
> index 0318a69888d4..e40e264f9b16 100644
> --- a/net/bridge/br_sysfs_br.c
> +++ b/net/bridge/br_sysfs_br.c
> @@ -44,8 +44,8 @@ static ssize_t store_bridge_parm(struct device *d,
>  	if (endp == buf)
>  		return -EINVAL;
>
> -	if (!rtnl_trylock())
> -		return restart_syscall();
> +	if (rtnl_lock_interruptible())
> +		return -ERESTARTNOINTR;
>
>  	err = (*set)(br, val);
>  	if (!err)
>
> with the obvious definition of rtnl_lock_interruptible(), and it makes
> the problem go away. Isn't it better to sleep waiting for the lock (and
> with -rt, giving proper priority boost) or a signal to arrive rather
> than busy-looping back and forth between syscall entry point and the
> trylock()?
>
> I see quite a lot of
>
>         if (mutex_lock_interruptible(...))
>                 return -ERESTARTSYS;
>
> but for the rtnl_mutex, I see the trylock...restart_syscall pattern
> being used in a couple of places. So there must be something special
> about the rtnl_mutex?
>
> Thanks,
> Rasmus
>
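(The mail above does not show the "obvious definition". A minimal sketch of
what rtnl_lock_interruptible() could look like -- assuming it would sit next
to rtnl_lock()/rtnl_trylock() in net/core/rtnetlink.c, where the static
rtnl_mutex is visible -- might be:

        /* Illustration only, not the patch that was actually tested. */
        int rtnl_lock_interruptible(void)
        {
                /* 0 on success, -EINTR if a signal arrived while sleeping */
                return mutex_lock_interruptible(&rtnl_mutex);
        }

The point is that sleeping on the mutex lets -rt priority inheritance boost
whichever task currently holds rtnl, instead of busy-looping around the
trylock.)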
Hi Rasmus,

I haven't tested anything, but git history (and some grepping) points to
deadlocks when sysfs entries are being changed under rtnl. For example check:

  af38f2989572704a846a5577b5ab3b1e2885cbfb
and
  336ca57c3b4e2b58ea3273e6d978ab3dfa387b4c

This is a common usage pattern throughout net/; the bridge is not the only
case, and there are more commits which talk about deadlocks.

Again, I haven't verified anything, but it seems that on device delete
(with rtnl held) the sysfs delete would wait for current readers, while
those readers might themselves be stuck waiting on rtnl, and we can
deadlock.

Cheers,
 Nik
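(To make the scenario concrete, here is a simplified sketch of the two paths
Nik describes; the exact call chains are an assumption for illustration, not
taken from the referenced commits:

        /*
         * Task A: deleting the bridge device, rtnl already held
         *
         *   rtnl_lock()
         *   unregister_netdevice()
         *     device_del() -> sysfs/kernfs removal
         *       waits for all active users of the bridge's sysfs
         *       attributes to return
         *
         * Task B: writing /sys/class/net/foo/bridge/flush
         *
         *   store_bridge_parm()   <- active sysfs user
         *     rtnl_lock()         <- would sleep forever: A holds rtnl
         *                            and is waiting for B to finish
         *
         * With rtnl_trylock() + restart_syscall(), B backs out of the
         * sysfs file instead of sleeping on rtnl, so A's removal can
         * drain the readers and release rtnl, and B then retries the
         * syscall.
         */

This would explain why the trylock pattern, rather than a plain sleeping
rtnl_lock(), shows up in sysfs handlers that need rtnl.)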