On Tue, May 23, 2017 at 12:32 PM, Nithin Sujir <nsu...@tintri.com> wrote: > Hi, > We're encountering a problem in 4.4 LTS where, rarely, the bond link state > is not updated when the slave link changes. > > I've traced the issue to the arp monitor unable to get the rtnl lock. The > sequence resulting in failure is as below. > > bond_loadbalance_arp_mon() periodically called, if slave link is _down_, it > checks if the slave is sending/receiving packets. If it is, it sets flags to > be processed later down the function for bond link update. However, it sets > the slave->link right away. > > if (slave->link != BOND_LINK_UP) { > if (bond_time_in_interval(bond, trans_start, 1) && > bond_time_in_interval(bond, slave->last_rx, 1)) > { > > slave->link = BOND_LINK_UP; > slave_state_changed = 1; > > > Later down the function, it tries to get the rtnl_lock. If it doesn't get > it, it rearms and returns. > > if (do_failover || slave_state_changed) { > if (!rtnl_trylock()) > goto re_arm; <-- returns here > > if (slave_state_changed) { > bond_slave_state_change(bond); > > This is the problem. The next time this function is called, the slave->link > is already marked UP. And we will never update the bond link state to UP. > > Changing the rtnl_trylock() -> rtnl_lock() _does_ fix the issue. > > Is this the right way to fix it? If it is, I can submit this formally. > > What are the guidelines around using rtnl_lock() vs rtnl_trylock()? Some > places are using rtnl_lock() and other rtnl_trylock(). Sorry, I couldn't > find much via a google search or in Documentation/. > > Thanks, > Nithin. > > -------------------- > > diff --git a/drivers/net/bonding/bond_main.c > b/drivers/net/bonding/bond_main.c > index 5dca77e..1f60503 100644 > --- a/drivers/net/bonding/bond_main.c > +++ b/drivers/net/bonding/bond_main.c > @@ -2614,8 +2614,7 @@ static void bond_loadbalance_arp_mon(struct > work_struct *work) > rcu_read_unlock(); > > if (do_failover || slave_state_changed) { > - if (!rtnl_trylock()) > - goto re_arm; > + rtnl_lock();
Nitin, you can't do this. The tryRTNL code is to prevent deadlock during work-cancellation during bond_close(). > > if (slave_state_changed) { > bond_slave_state_change(bond); > >