On 09/06/2017 04:14 PM, Florian Fainelli wrote:
On 09/06/2017 03:51 PM, David Daney wrote:
[...]

Consider instead the case of a Marvell phy with no interrupts connected
on a v4.9.43 kernel, single CPU:


   0)               |                 phy_disconnect() {
   0)               |                   phy_stop_machine() {
   0)               |                     cancel_delayed_work_sync() {
   0) + 23.986 us   |                     } /* cancel_delayed_work_sync */
   0)               |                     phy_state_machine() {
   0)               |                       phy_start_aneg_priv() {

Thanks for providing the trace, I think I have an idea of what's going
on, see below.

   0)               |                         marvell_config_aneg() {
   0) ! 240.538 us  |                         } /* marvell_config_aneg */
   0) ! 244.971 us  |                       } /* phy_start_aneg_priv */
   0)               |                       queue_delayed_work_on() {
   0) + 18.016 us   |                       } /* queue_delayed_work_on */
   0) ! 268.184 us  |                     } /* phy_state_machine */
   0) ! 297.394 us  |                   } /* phy_stop_machine */
   0)               |                   phy_detach() {
   0)               |                     phy_suspend() {
   0)               |                       phy_ethtool_get_wol() {
   0)   0.677 us    |                       } /* phy_ethtool_get_wol */
   0)               |                       genphy_suspend() {
   0) + 71.250 us   |                       } /* genphy_suspend */
   0) + 74.197 us   |                     } /* phy_suspend */
   0) + 80.302 us   |                   } /* phy_detach */
   0) ! 380.072 us  |                 } /* phy_disconnect */
.
.
.
   0)               |  process_one_work() {
   0)               |    find_worker_executing_work() {
   0)   0.688 us    |    } /* find_worker_executing_work */
   0)               |    set_work_pool_and_clear_pending() {
   0)   0.734 us    |    } /* set_work_pool_and_clear_pending */
   0)               |    phy_state_machine() {
   0)               |      genphy_read_status() {
   0) ! 205.721 us  |      } /* genphy_read_status */
   0)               |      netif_carrier_off() {
   0)               |        do_page_fault() {


The do_page_fault() at the end indicates the NULL pointer dereference.

That added call to phy_state_machine() turns the polling back on
unconditionally for a phy that should be disconnected.  How is that
correct?

It is not fundamentally correct and I don't think there was any
objection to that to begin with. In fact there is a bug/inefficiency
here in that if we have entered the PHY state machine with PHY_HALTED we
should not re-schedule it period, only applicable to PHY_POLL cases
*and* properly calling phy_stop() followed by phy_disconnect().

What I now think is happening in your case is the following:

phy_stop() was not called, so nothing does set phydev->state to
PHY_HALTED in the first place so we have:

phy_disconnect()
-> phy_stop_machine()
        -> cancel_delayed_work_sync() OK
                phydev->state is probably RUNNING so we have:
                -> phydev->state = PHY_UP
        phy_state_machine() is called synchronously
        -> PHY_UP -> needs_aneg = true
        -> phy_restart_aneg()
        -> queue_delayed_work_sync()
-> phydev->adjust_link = NULL
-> phy_deatch() -> boom

Can you confirm whether the driver you are using does call phy_stop()
prior to phy_disconnect()?

There is no call to phy_stop().

I can add this to the ethernet drivers, but I wonder if it should be called by the code code when doing phy_disconnect(), if it was not already stopped.

If that is the case then this whole theory
falls apart, if not, then this needs fixing in both the driver and PHYLIB.

Thanks


Reply via email to