From: Tuong Lien <tuong.t.l...@dektech.com.au> Date: Mon, 11 Feb 2019 13:29:43 +0700
> When a link endpoint is re-created (e.g. after a node reboot or > interface reset), the link session number is varied by random, the peer > endpoint will be synced with this new session number before the link is > re-established. > > However, there is a shortcoming in this mechanism that can lead to the > link never re-established or faced with a failure then. It happens when > the peer endpoint is ready in ESTABLISHING state, the 'peer_session' as > well as the 'in_session' flag have been set, but suddenly this link > endpoint leaves. When it comes back with a random session number, there > are two situations possible: > > 1/ If the random session number is larger than (or equal to) the > previous one, the peer endpoint will be updated with this new session > upon receipt of a RESET_MSG from this endpoint, and the link can be re- > established as normal. Otherwise, all the RESET_MSGs from this endpoint > will be rejected by the peer. In turn, when this link endpoint receives > one ACTIVATE_MSG from the peer, it will move to ESTABLISHED and start > to send STATE_MSGs, but again these messages will be dropped by the > peer due to wrong session. > The peer link endpoint can still become ESTABLISHED after receiving a > traffic message from this endpoint (e.g. a BCAST_PROTOCOL or > NAME_DISTRIBUTOR), but since all the STATE_MSGs are invalid, the link > will be forced down sooner or later! > > Even in case the random session number is larger than the previous one, > it can be that the ACTIVATE_MSG from the peer arrives first, and this > link endpoint moves quickly to ESTABLISHED without sending out any > RESET_MSG yet. Consequently, the peer link will not be updated with the > new session number, and the same link failure scenario as above will > happen. > > 2/ Another situation can be that, the peer link endpoint was reset due > to any reasons in the meantime, its link state was set to RESET from > ESTABLISHING but still in session, i.e. the 'in_session' flag is not > reset... > Now, if the random session number from this endpoint is less than the > previous one, all the RESET_MSGs from this endpoint will be rejected by > the peer. In the other direction, when this link endpoint receives a > RESET_MSG from the peer, it moves to ESTABLISHING and starts to send > ACTIVATE_MSGs, but all these messages will be rejected by the peer too. > As a result, the link cannot be re-established but gets stuck with this > link endpoint in state ESTABLISHING and the peer in RESET! > > Solution: > =========== > > This link endpoint should not go directly to ESTABLISHED when getting > ACTIVATE_MSG from the peer which may belong to the old session if the > link was re-created. To ensure the session to be correct before the > link is re-established, the peer endpoint in ESTABLISHING state will > send back the last session number in ACTIVATE_MSG for a verification at > this endpoint. Then, if needed, a new and more appropriate session > number will be regenerated to force a re-synch first. > > In addition, when a link in ESTABLISHING state is reset, its state will > move to RESET according to the link FSM, along with resetting the > 'in_session' flag (and the other data) as a normal link reset, it will > also be deleted if requested. > > The solution is backward compatible. > > Acked-by: Jon Maloy <jon.ma...@ericsson.com> > Acked-by: Ying Xue <ying....@windriver.com> > Signed-off-by: Tuong Lien <tuong.t.l...@dektech.com.au> Applied.