Hangbin Liu <[email protected]> wrote:

>The current ad_churn_machine implementation only transitions the
>actor/partner churn state to churned or none after the churn timer expires.
>However, IEEE 802.1AX-2014 specifies that a port should enter the none
>state immediately once the actor’s port state enters synchronization.
>
>Another issue is that if the churn timer expires while the churn machine is
>not in the monitor state (e.g. already in churn), the state may remain
>stuck indefinitely with no further transitions. This becomes visible in
>multi-aggregator scenarios. For example:
>
>Ports 1 and 2 are in aggregator 1 (active)
>Ports 3 and 4 are in aggregator 2 (backup)
>
>Ports 1 and 2 should be in none
>Ports 3 and 4 should be in churned
>
>If a failover occurs due to port 2 link down/up, aggregator 2 becomes active.
>Under the current implementation, the resulting states may look like:
>
>agg 1 (backup): port 1 -> none, port 2 -> churned
>agg 2 (active): ports 3,4 keep in churned.
>
>The root cause is that ad_churn_machine() only clears the
>AD_PORT_CHURNED flag and starts a timer. When a churned port becomes active,
>its RX state becomes AD_RX_CURRENT, preventing the churn flag from being set
>again, leaving no way to retrigger the timer. Fixing this solely in
>ad_rx_machine() is insufficient.
>
>This patch rewrites ad_churn_machine according to IEEE 802.1AX-2014
>(Figures 6-23 and 6-24), ensuring correct churn detection, state transitions,
>and timer behavior. With new implementation, there is no need to set
>AD_PORT_CHURNED in ad_rx_machine().
>
>Fixes: 14c9551a32eb ("bonding: Implement port churn-machine (AD standard 
>43.4.17).")
>Reported-by: Liang Li <[email protected]>
>Tested-by: Liang Li <[email protected]>
>Signed-off-by: Hangbin Liu <[email protected]>

        I missed this last time it was posted, but reading it now I
think the functional change looks good, but I question the usefulness of
including the 25 line ASCII art version of the state diagram.

        The standard is publicly available, so a comment saying that the
state machine logic conforms to IEEE 802.1AX-2014 figures 6-23 and 6-24
should be sufficient.  Anyone seriously checking the code against the
standard will need to read the relevant text, so they'll be looking it
up anyway.

        -J

>---
> drivers/net/bonding/bond_3ad.c | 96 +++++++++++++++++++++++++---------
> 1 file changed, 71 insertions(+), 25 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
>index c47f6a69fd2a..68258d61fd1c 100644
>--- a/drivers/net/bonding/bond_3ad.c
>+++ b/drivers/net/bonding/bond_3ad.c
>@@ -44,7 +44,6 @@
> #define AD_PORT_STANDBY         0x80
> #define AD_PORT_SELECTED        0x100
> #define AD_PORT_MOVED           0x200
>-#define AD_PORT_CHURNED         (AD_PORT_ACTOR_CHURN | AD_PORT_PARTNER_CHURN)
> 
> /* Port Key definitions
>  * key is determined according to the link speed, duplex and
>@@ -1254,7 +1253,6 @@ static void ad_rx_machine(struct lacpdu *lacpdu, struct 
>port *port)
>       /* first, check if port was reinitialized */
>       if (port->sm_vars & AD_PORT_BEGIN) {
>               port->sm_rx_state = AD_RX_INITIALIZE;
>-              port->sm_vars |= AD_PORT_CHURNED;
>       /* check if port is not enabled */
>       } else if (!(port->sm_vars & AD_PORT_BEGIN) && !port->is_enabled)
>               port->sm_rx_state = AD_RX_PORT_DISABLED;
>@@ -1262,8 +1260,6 @@ static void ad_rx_machine(struct lacpdu *lacpdu, struct 
>port *port)
>       else if (lacpdu && ((port->sm_rx_state == AD_RX_EXPIRED) ||
>                (port->sm_rx_state == AD_RX_DEFAULTED) ||
>                (port->sm_rx_state == AD_RX_CURRENT))) {
>-              if (port->sm_rx_state != AD_RX_CURRENT)
>-                      port->sm_vars |= AD_PORT_CHURNED;
>               port->sm_rx_timer_counter = 0;
>               port->sm_rx_state = AD_RX_CURRENT;
>       } else {
>@@ -1347,7 +1343,6 @@ static void ad_rx_machine(struct lacpdu *lacpdu, struct 
>port *port)
>                       port->partner_oper.port_state |= 
> LACP_STATE_LACP_TIMEOUT;
>                       port->sm_rx_timer_counter = 
> __ad_timer_to_ticks(AD_CURRENT_WHILE_TIMER, (u16)(AD_SHORT_TIMEOUT));
>                       port->actor_oper_port_state |= LACP_STATE_EXPIRED;
>-                      port->sm_vars |= AD_PORT_CHURNED;
>                       break;
>               case AD_RX_DEFAULTED:
>                       __update_default_selected(port);
>@@ -1379,11 +1374,41 @@ static void ad_rx_machine(struct lacpdu *lacpdu, 
>struct port *port)
>  * ad_churn_machine - handle port churn's state machine
>  * @port: the port we're looking at
>  *
>+ * IEEE 802.1AX-2014 Figure 6-23 - Actor Churn Detection machine state diagram
>+ *
>+ *                                                     BEGIN || (! 
>port_enabled)
>+ *                                                               |
>+ *                                      (3)                (1)   v
>+ *   +----------------------+     ActorPort.Sync     
>+-------------------------+
>+ *   |    NO_ACTOR_CHURN    | <--------------------- |   ACTOR_CHURN_MONITOR  
> |
>+ *   |======================|                        
>|=========================|
>+ *   | actor_churn = FALSE; |    ! ActorPort.Sync    | actor_churn = FALSE;   
> |
>+ *   |                      | ---------------------> | Start 
>actor_churn_timer |
>+ *   +----------------------+           (4)          
>+-------------------------+
>+ *             ^                                                 |
>+ *             |                                                 |
>+ *             |                                      actor_churn_timer 
>expired
>+ *             |                                                 |
>+ *       ActorPort.Sync                                          |  (2)
>+ *             |              +--------------------+             |
>+ *        (3)  |              |   ACTOR_CHURN      |             |
>+ *             |              |====================|             |
>+ *             +------------- | actor_churn = True | <-----------+
>+ *                            |                    |
>+ *                            +--------------------+
>+ *
>+ * Similar for the Figure 6-24 - Partner Churn Detection machine state diagram
>+ *
>+ * We don’t need to check actor_churn, because it can only be true when the
>+ * state is ACTOR_CHURN.
>  */
> static void ad_churn_machine(struct port *port)
> {
>-      if (port->sm_vars & AD_PORT_CHURNED) {
>-              port->sm_vars &= ~AD_PORT_CHURNED;
>+      bool partner_synced = port->partner_oper.port_state & 
>LACP_STATE_SYNCHRONIZATION;
>+      bool actor_synced = port->actor_oper_port_state & 
>LACP_STATE_SYNCHRONIZATION;
>+
>+      /* ---- 1. begin or port not enabled ---- */
>+      if ((port->sm_vars & AD_PORT_BEGIN) || !port->is_enabled) {
>               port->sm_churn_actor_state = AD_CHURN_MONITOR;
>               port->sm_churn_partner_state = AD_CHURN_MONITOR;
>               port->sm_churn_actor_timer_counter =
>@@ -1392,25 +1417,46 @@ static void ad_churn_machine(struct port *port)
>                        __ad_timer_to_ticks(AD_PARTNER_CHURN_TIMER, 0);
>               return;
>       }
>-      if (port->sm_churn_actor_timer_counter &&
>-          !(--port->sm_churn_actor_timer_counter) &&
>-          port->sm_churn_actor_state == AD_CHURN_MONITOR) {
>-              if (port->actor_oper_port_state & LACP_STATE_SYNCHRONIZATION) {
>-                      port->sm_churn_actor_state = AD_NO_CHURN;
>-              } else {
>-                      port->churn_actor_count++;
>-                      port->sm_churn_actor_state = AD_CHURN;
>-              }
>+
>+      if (port->sm_churn_actor_timer_counter)
>+              port->sm_churn_actor_timer_counter--;
>+
>+      if (port->sm_churn_partner_timer_counter)
>+              port->sm_churn_partner_timer_counter--;
>+
>+      /* ---- 2. timer expired, enter CHURN ---- */
>+      if (port->sm_churn_actor_state == AD_CHURN_MONITOR &&
>+          !port->sm_churn_actor_timer_counter) {
>+              port->sm_churn_actor_state = AD_CHURN;
>+              port->churn_actor_count++;
>       }
>-      if (port->sm_churn_partner_timer_counter &&
>-          !(--port->sm_churn_partner_timer_counter) &&
>-          port->sm_churn_partner_state == AD_CHURN_MONITOR) {
>-              if (port->partner_oper.port_state & LACP_STATE_SYNCHRONIZATION) 
>{
>-                      port->sm_churn_partner_state = AD_NO_CHURN;
>-              } else {
>-                      port->churn_partner_count++;
>-                      port->sm_churn_partner_state = AD_CHURN;
>-              }
>+
>+      if (port->sm_churn_partner_state == AD_CHURN_MONITOR &&
>+          !port->sm_churn_partner_timer_counter) {
>+              port->sm_churn_partner_state = AD_CHURN;
>+              port->churn_partner_count++;
>+      }
>+
>+      /* ---- 3. CHURN_MONITOR/CHURN + sync -> NO_CHURN ---- */
>+      if ((port->sm_churn_actor_state == AD_CHURN_MONITOR ||
>+           port->sm_churn_actor_state == AD_CHURN) && actor_synced)
>+              port->sm_churn_actor_state = AD_NO_CHURN;
>+
>+      if ((port->sm_churn_partner_state == AD_CHURN_MONITOR ||
>+           port->sm_churn_partner_state == AD_CHURN) && partner_synced)
>+              port->sm_churn_partner_state = AD_NO_CHURN;
>+
>+      /* ---- 4. NO_CHURN + !sync -> MONITOR ---- */
>+      if (port->sm_churn_actor_state == AD_NO_CHURN && !actor_synced) {
>+              port->sm_churn_actor_state = AD_CHURN_MONITOR;
>+              port->sm_churn_actor_timer_counter =
>+                      __ad_timer_to_ticks(AD_ACTOR_CHURN_TIMER, 0);
>+      }
>+
>+      if (port->sm_churn_partner_state == AD_NO_CHURN && !partner_synced) {
>+              port->sm_churn_partner_state = AD_CHURN_MONITOR;
>+              port->sm_churn_partner_timer_counter =
>+                      __ad_timer_to_ticks(AD_PARTNER_CHURN_TIMER, 0);
>       }
> }
> 
>-- 
>2.50.1
>

---
        -Jay Vosburgh, [email protected]

Reply via email to