From: Jay Vosburgh <jay.vosbu...@canonical.com>
Date: Tue, 07 Nov 2017 19:50:07 +0900

> The bonding miimon logic has a flaw, in that a failure of the
> rtnl_trylock can cause a slave to become permanently stuck in
> BOND_LINK_FAIL state.
>
> The sequence of events to cause this is as follows:
>
> 1) bond_miimon_inspect finds that a slave's link is down, and so
> calls bond_propose_link_state, setting slave->new_link_state to
> BOND_LINK_FAIL, then sets slave->new_link to BOND_LINK_DOWN and returns
> non-zero.
>
> 2) In bond_mii_monitor, the rtnl_trylock fails, and the timer is
> rescheduled. No change is committed.
>
> 3) bond_miimon_inspect is called again, but this time the slave
> from step 1 has recovered. slave->new_link is reset to NOCHANGE, and, as
> slave->link was never changed, the switch enters the BOND_LINK_UP case,
> and does nothing. The pending BOND_LINK_FAIL state from step 1 remains
> pending, as new_link_state is not reset.
>
> 4) The state from step 3 persists until another slave changes link
> state and causes bond_miimon_inspect to return non-zero. At this point,
> the BOND_LINK_FAIL state change on the slave from steps 1-3 is committed,
> and the slave will remain stuck in BOND_LINK_FAIL state even though it
> is actually link up.
>
> The remedy for this is to initialize new_link_state on each entry
> to bond_miimon_inspect, as is already done with new_link.
>
> Reported-by: Alex Sidorenko <alexandre.sidore...@hpe.com>
> Reviewed-by: Jarod Wilson <ja...@redhat.com>
> Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com>
> Fixes: fb9eb899a6dc ("bonding: handle link transition from FAIL to UP
> correctly")

Applied and queued up for -stable.

As discussed with some others here at netdev... rtnl_trylock() really needs
to be re-evaluated if not removed completely.
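
For readers following along, the four steps in the quoted description can be reproduced with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: struct slave_model, inspect(), commit() and run_scenario() are hypothetical names, only the BOND_LINK_UP branch of the inspect switch is modelled, and the apply_fix flag stands in for the proposed initialization of new_link_state on each entry.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the BOND_LINK_* values named in the report. */
enum link_state { LINK_UP, LINK_FAIL, LINK_DOWN, LINK_NOCHANGE };

/* Hypothetical model of the per-slave fields mentioned in the report. */
struct slave_model {
    enum link_state link;           /* committed state */
    enum link_state new_link;       /* proposed final state */
    enum link_state new_link_state; /* proposed intermediate state */
};

/* Models one inspection pass over a single slave.
 * Returns non-zero when a change is waiting to be committed. */
static int inspect(struct slave_model *s, bool carrier_ok, bool apply_fix)
{
    s->new_link = LINK_NOCHANGE;        /* already reset on every pass */
    if (apply_fix)
        s->new_link_state = s->link;    /* the remedy: reset this too */

    switch (s->link) {
    case LINK_UP:
        if (carrier_ok)
            return 0;                   /* step 3: link is fine, do nothing */
        s->new_link_state = LINK_FAIL;  /* step 1: propose FAIL ... */
        s->new_link = LINK_DOWN;        /* ... then DOWN */
        return 1;
    default:
        return 0;                       /* other states are not modelled */
    }
}

/* Models the commit that only runs once the lock has been taken. */
static void commit(struct slave_model *s)
{
    if (s->new_link_state != LINK_NOCHANGE)
        s->link = s->new_link_state;
    if (s->new_link == LINK_DOWN)
        s->link = LINK_DOWN;
}

/* Replays steps 1-4 and returns the slave's committed state afterwards. */
static enum link_state run_scenario(bool apply_fix)
{
    struct slave_model a = { LINK_UP, LINK_NOCHANGE, LINK_NOCHANGE };

    /* Step 1: link down is detected, a change is proposed. */
    int pending = inspect(&a, false, apply_fix);
    assert(pending != 0);

    /* Step 2: the trylock fails, so commit() is NOT called here. */

    /* Step 3: the link has recovered by the next pass. */
    inspect(&a, true, apply_fix);

    /* Step 4: another slave's change triggers a commit for all slaves. */
    commit(&a);
    return a.link;
}
```

In this model, run_scenario(false) leaves the slave in LINK_FAIL although its carrier is up, because the stale proposal from step 1 is committed in step 4. run_scenario(true) leaves it in LINK_UP, since the proposal is overwritten with the current state on the step 3 pass.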