https://issues.apache.org/bugzilla/show_bug.cgi?id=45261





--- Comment #1 from Robert Newson <[EMAIL PROTECTED]>  2008-06-25 13:50:40 PST ---

So, I understand this better now and have a proposed fix.

Here's the procedure to reproduce the problem:

1) Start four nodes.
2) Wait for a view installation with four members.
3) Kill two non-coordinator nodes in quick succession (within a second or two
of each other).

From this point onwards, until it is killed, the coordinator oscillates
between two states. It recognizes that the state is inconsistent because it
receives heartbeats from the other surviving node, and the UniqueId of that
node's view does not match the coordinator's. It then forces an election,
which fails because the coordinator believes an election is already running.
This cycle repeats forever.

When the first node crashes, memberDisappeared() is called on the coordinator.
It then starts sending messages as part of an election. One of those sends
throws a ChannelException with a connection timeout (it was attempting to send
to the second node, which had just crashed). The coordinator never handles this
case, leaving the 'election in progress' flag set. Forever.
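
The failure mode described above can be sketched in isolation. This is only an
illustration, not the Tribes code: the class StuckElectionSketch, its field
electionInProgressId, and the send() and startElection() methods are all
hypothetical names, and a plain Exception stands in for ChannelException.

```java
import java.util.List;

/** Hypothetical stand-in for the coordinator; NOT the real NonBlockingCoordinator. */
class StuckElectionSketch {
    /** Non-null means "an election is in progress" (plays the role of suggestedviewId). */
    private Object electionInProgressId = null;

    /** Hypothetical send: fails if the target member has just crashed. */
    private void send(String member, List<String> crashed) throws Exception {
        if (crashed.contains(member)) {
            throw new Exception("connection timeout sending to " + member);
        }
    }

    /**
     * Tries to run an election.
     * Returns false if it is refused because one is believed to be running already.
     */
    boolean startElection(List<String> members, List<String> crashed, boolean clearOnFailure) {
        if (electionInProgressId != null) {
            return false; // refused: flag is still set
        }
        electionInProgressId = new Object();
        try {
            for (String m : members) {
                send(m, crashed);
            }
            electionInProgressId = null; // election finished normally
        } catch (Exception x) {
            if (clearOnFailure) {
                electionInProgressId = null; // the one-line fix, modelled here
            }
            // Without the reset, the flag stays set and every later election is refused.
        }
        return true;
    }
}
```

With clearOnFailure set to false, a first call that hits a crashed member
leaves the flag set, and every later call to startElection() is refused. With
it set to true, the next call is allowed to start a new election.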

Clearing suggestedviewId when the ChannelException is thrown is my proposed fix:

@@ -500,6 +500,7 @@ public class NonBlockingCoordinator extends ChannelInterceptorBase {
                 processCoordMessage(cmsg, msg.getAddress());
             }catch ( ChannelException x ) {
                 log.error("Error processing coordination message. Could be fatal.",x);
+                suggestedviewId = null;
             }

This should probably only be done under some circumstances, so it isn't
obviously a safe patch. Hopefully the author will have a better fix!
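
One possible way to narrow "some circumstances", offered only as a sketch: clear
the flag only if it still belongs to the election that just failed, so that a
late failure from an old election cannot wipe out a newer one. Everything here
is an assumption for illustration (the class GuardedClearSketch and the
begin(), abandon() and electionRunning() methods are hypothetical), and it is
not known whether this matches how the real coordinator should behave.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical guard around an "election in progress" marker; not Tribes code. */
class GuardedClearSketch {
    /** Null means no election is running; otherwise holds the running election's id. */
    private final AtomicReference<Object> electionId = new AtomicReference<>();

    /** Starts an election only if none is running. */
    boolean begin(Object id) {
        return electionId.compareAndSet(null, id);
    }

    /**
     * Clears the marker only if it still refers to the election that failed.
     * compareAndSet compares by reference identity, so the caller must pass
     * the same object it gave to begin().
     */
    boolean abandon(Object failedId) {
        return electionId.compareAndSet(failedId, null);
    }

    boolean electionRunning() {
        return electionId.get() != null;
    }
}
```

In the catch block, a caller would pass the id of the election whose send
failed to abandon(). A stale failure then leaves a newer election's marker
untouched.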

