https://issues.apache.org/bugzilla/show_bug.cgi?id=45261
--- Comment #1 from Robert Newson <[EMAIL PROTECTED]> 2008-06-25 13:50:40 PST ---
So, I understand this better now and have a proposed fix.

Here's the procedure to reproduce the problem:

1) Start four nodes.
2) See a view installation with four members.
3) Kill two non-coordinator nodes in quick succession (within a second or two).

From this point onwards, until it is killed, the coordinator oscillates between two states. It recognizes that the state is inconsistent, because it receives heartbeats from the other node and the UniqueId of its view does not match the coordinator's. It then forces an election, which fails because the coordinator believes an election is already running. This cycle repeats forever.

When the first node crashes, memberDisappeared() is called on the coordinator, which then starts sending messages as part of an election. A method throws here with a connection timeout (it was attempting to send to the second node, which had just crashed). The coordinator never handles this case, so the 'election in progress' flag stays on forever.

Clearing suggestedviewId when the ChannelException is thrown is the fix:

@@ -500,6 +500,7 @@ public class NonBlockingCoordinator extends ChannelInterceptorBase {
                 processCoordMessage(cmsg, msg.getAddress());
             }catch ( ChannelException x ) {
                 log.error("Error processing coordination message. Could be fatal.",x);
+                suggestedviewId = null;
             }

This probably should only be done under some circumstances, so this isn't obviously a safe patch. Hopefully the author will have a better fix!
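
To make the failure mode above concrete, here is a minimal standalone sketch of the stuck flag. It is not the Tribes code: the class ElectionFlagSketch, the field pendingViewId (standing in for the role suggestedviewId plays in the patch) and the simulated send() are all hypothetical names invented for illustration, and IOException stands in for ChannelException.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for the coordinator; not the real NonBlockingCoordinator. */
class ElectionFlagSketch {
    /** Non-null means "an election is in progress" (assumed role of suggestedviewId). */
    private String pendingViewId = null;
    private final boolean clearOnFailure;
    private int electionsStarted = 0;

    ElectionFlagSketch(boolean clearOnFailure) {
        this.clearOnFailure = clearOnFailure;
    }

    /** Simulated send: fails for members that have crashed. */
    private void send(String member, List<String> crashed) throws IOException {
        if (crashed.contains(member)) {
            throw new IOException("connect timeout: " + member);
        }
    }

    /** Returns true if the election completed, false if it was refused or failed. */
    boolean startElection(List<String> members, List<String> crashed) {
        if (pendingViewId != null) {
            return false; // an election is believed to be running already
        }
        pendingViewId = "view-" + (++electionsStarted);
        try {
            for (String member : members) {
                send(member, crashed);
            }
            pendingViewId = null; // election finished normally
            return true;
        } catch (IOException x) {
            if (clearOnFailure) {
                pendingViewId = null; // the proposed fix: reset the flag on failure
            }
            return false;
        }
    }

    boolean electionInProgress() {
        return pendingViewId != null;
    }

    public static void main(String[] args) {
        for (boolean fix : new boolean[] {false, true}) {
            ElectionFlagSketch coord = new ElectionFlagSketch(fix);
            // First election: D crashed just after C, so sending to D times out.
            coord.startElection(Arrays.asList("B", "D"), Arrays.asList("D"));
            // Forced re-election after a mismatching heartbeat from B.
            boolean retried = coord.startElection(Arrays.asList("B"),
                    Collections.<String>emptyList());
            System.out.println("clearOnFailure=" + fix + " retry succeeded=" + retried);
        }
    }
}
```

Without the reset, every later forced election is refused because the flag from the failed attempt is never cleared, which matches the endless cycle described above. As noted, an unconditional reset may not be safe in the real interceptor; the sketch only shows why the flag has to be cleared on some failure path.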