On Tue, Feb 12, 2019 at 09:54:55PM +0100, Heiner Kallweit wrote:
> On 12.02.2019 17:30, Russell King - ARM Linux admin wrote:
> > On Tue, Feb 12, 2019 at 07:51:05AM +0100, Heiner Kallweit wrote:
> >> On 12.02.2019 04:58, Andrew Lunn wrote:
> >>> That change means we don't check the PHY device if it caused an
> >>> interrupt when its state is less than UP.
> >>>
> >>> What i'm seeing is that the PHY is interrupting pretty early on after
> >>> a reboot when the previous boot had the interface up.
> >>>
> >> So this means that when going down for reboot the interrupts are not
> >> properly masked / disabled? Because (at least for net-next) we enable
> >> interrupts in phy_start() only.
> >
> [..]
> > In looking at this, I came across this chunk of code:
> >
> > static inline bool __phy_is_started(struct phy_device *phydev)
> > {
> >         WARN_ON(!mutex_is_locked(&phydev->lock));
> >
> >         return phydev->state >= PHY_UP;
> > }
> >
> > /**
> >  * phy_is_started - Convenience function to check whether PHY is started
> >  * @phydev: The phy_device struct
> >  */
> > static inline bool phy_is_started(struct phy_device *phydev)
> > {
> >         bool started;
> >
> >         mutex_lock(&phydev->lock);
> >         started = __phy_is_started(phydev);
> >         mutex_unlock(&phydev->lock);
> >
> >         return started;
> > }
> >
> > which looks to me like over-complication. The mutex locking there is
> > completely pointless - what are you trying to achieve with it?
> >
> > Let's go through this. The above is exactly equivalent to:
> >
> > bool phy_is_started(phydev)
> > {
> >         int state;
> >
> >         mutex_lock(&phydev->lock);
> >         state = phydev->state;
> >         mutex_unlock(&phydev->lock);
> >
> >         return state >= PHY_UP;
> > }
> >
> > since when we do the test is irrelevant. Architectures that Linux
> > runs on are single-copy atomic, which means that reading phydev->state
> > itself is an atomic operation. So, the mutex locking around that
> > doesn't add to the atomicity of the entire operation.
> >
> > How, depending on what you do with the rest of this function depends
> > whether the entire operation is safe or not. For example, let's take
> > this code at the end of phy_state_machine():
> >
> >         if (phy_polling_mode(phydev) && phy_is_started(phydev))
> >                 phy_queue_state_machine(phydev, PHY_STATE_TIME);
> >
> > state = PHY_UP
> >         thread 0                        thread 1
> >                                         phy_disconnect()
> >                                         +-phy_is_started()
> >         phy_is_started()                |
> >                                         `-phy_stop()
> >                                           +-phydev->state = PHY_HALTED
> >                                           `-phy_stop_machine()
> >                                             `-cancel_delayed_work_sync()
> >         phy_queue_state_machine()
> >         `-mod_delayed_work()
> >
> > At this point, the phydev->state_queue() has been added back onto the
> > system workqueue despite phy_stop_machine() having been called and
> > cancel_delayed_work_sync() called on it.
> >
> > The original code in 4.20 did not have this race condition.
> >
> > Basically, the lock inside phy_is_started() does nothing useful, and
> > I'd say is dangerously misleading.
> >
> Then idea would be to first remove the locking from phy_is_started()
> and in a second step do the following to prevent the described race
> (phy_stop() takes phydev->lock too).
>
> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> index c1ed03800..69dc64a4d 100644
> --- a/drivers/net/phy/phy.c
> +++ b/drivers/net/phy/phy.c
> @@ -957,8 +957,10 @@ void phy_state_machine(struct work_struct *work)
>           * state machine would be pointless and possibly error prone when
>           * called from phy_disconnect() synchronously.
>           */
> +        mutex_lock(&phydev->lock);
>          if (phy_polling_mode(phydev) && phy_is_started(phydev))
>                  phy_queue_state_machine(phydev, PHY_STATE_TIME);
> +        mutex_unlock(&phydev->lock);
>  }
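
For illustration only, the two orderings discussed above can be replayed
in a small userspace model. This is a sketch under stated assumptions,
not phylib code: it uses a pthread mutex instead of the kernel mutex, a
plain flag stands in for the delayed work item, and every model_* name
(and both *_interleaving() helpers) is made up for this example. It
replays the interleavings single-threaded, so the outcome is
deterministic.

```c
/* Userspace model only: none of these names exist in drivers/net/phy. */
#include <pthread.h>
#include <stdbool.h>

enum model_state { MODEL_HALTED, MODEL_UP };

struct model_phy {
	pthread_mutex_t lock;
	enum model_state state;
	bool work_queued;	/* stands in for the delayed work item */
};

static void model_init(struct model_phy *p)
{
	pthread_mutex_init(&p->lock, NULL);
	p->state = MODEL_UP;
	p->work_queued = false;
}

/* Locked read: the answer may already be stale when this returns. */
static bool model_is_started(struct model_phy *p)
{
	bool started;

	pthread_mutex_lock(&p->lock);
	started = p->state >= MODEL_UP;
	pthread_mutex_unlock(&p->lock);
	return started;
}

/* Stop path: change state under the lock, then cancel pending work. */
static void model_stop(struct model_phy *p)
{
	pthread_mutex_lock(&p->lock);
	p->state = MODEL_HALTED;
	pthread_mutex_unlock(&p->lock);
	p->work_queued = false;
}

/* Requeue with the check and the queueing in one critical section. */
static void model_requeue_locked(struct model_phy *p)
{
	pthread_mutex_lock(&p->lock);
	if (p->state >= MODEL_UP)
		p->work_queued = true;
	pthread_mutex_unlock(&p->lock);
}

/* Replays the diagram: the stop lands between the check and the queueing. */
bool racy_interleaving(void)
{
	struct model_phy p;
	bool started;

	model_init(&p);
	/* "thread 0" samples the state and sees UP */
	started = model_is_started(&p);
	/* "thread 1" halts and cancels the work */
	model_stop(&p);
	/* "thread 0" acts on the stale answer and requeues anyway */
	if (started)
		p.work_queued = true;
	return p.work_queued;
}

/* With one critical section the stop can only run before or after it. */
bool fixed_interleaving(void)
{
	struct model_phy a, b;

	/* requeue first, then the cancel in the stop path removes it */
	model_init(&a);
	model_requeue_locked(&a);
	model_stop(&a);

	/* stop first, then the check sees HALTED and does not queue */
	model_init(&b);
	model_stop(&b);
	model_requeue_locked(&b);

	return a.work_queued || b.work_queued;
}
```

In this model racy_interleaving() leaves the work queued after the stop,
while fixed_interleaving() does not in either ordering. Note that the
model's locked requeue reads the state directly; in the kernel the same
shape only works once phy_is_started() no longer takes phydev->lock
itself, otherwise the caller would deadlock.
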

Yep, that approach would certainly be better. I didn't exhaustively
audit the 5.0-rc code though.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up