Hey,

I am encountering an issue which looks a lot like 
https://issues.apache.org/jira/browse/SOLR-6763.

However, it seems that the fix for that does not address the entire problem. 
That fix only works if we hit the zkClient.getChildren() call before the 
reconnect logic has finished reconnecting us to ZooKeeper (I can reproduce 
scenarios in 4.10.4 where it doesn’t). If the reconnect has already happened, 
we won’t get the session timeout exception.
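
To illustrate what I mean, here is a rough sketch written against the plain 
ZooKeeper client (not the actual Solr code; the method and path are made up). 
A check like this only notices the problem if the call itself fails with a 
session expiry; once the client has reconnected, the call succeeds and nothing 
signals that the election state is stale:

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;

    public class WaitForPeersSketch {
        // Only detects the expired session if getChildren() runs while the
        // session is still invalid; after the reconnect the call succeeds
        // and the stale election state goes unnoticed.
        static boolean peersVisible(ZooKeeper zk, String shardPath, int expected)
                throws KeeperException, InterruptedException {
            try {
                return zk.getChildren(shardPath, false).size() >= expected;
            } catch (KeeperException.SessionExpiredException e) {
                // This is the only window in which the SOLR-6763-style fix fires.
                throw e;
            }
        }
    }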

The specific problem I am seeing is slightly different from SOLR-6763, but the 
root cause appears to be the same. The issue I am seeing is this: during 
startup the collections are registered and there is one 
coreZkRegister-1-thread-* per collection. The elections are started on this 
thread, the /collections/<name>/leader_elect ZNodes are created, and then the 
thread blocks waiting for the peers to become available. While the thread is 
blocked, the ZooKeeper session times out.
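
Roughly, the per-collection startup sequence looks like this (a sketch only, 
written against the plain ZooKeeper API; the paths and the replica check are 
simplified stand-ins for what the election code actually does):

    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ElectionStartupSketch {
        static void joinElectionAndWait(ZooKeeper zk, String collection, int expectedReplicas)
                throws KeeperException, InterruptedException {
            String electionPath = "/collections/" + collection + "/leader_elect/shard1/election";

            // Ephemeral, sequential election node: it disappears if the session expires.
            zk.create(electionPath + "/n_", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

            // Block waiting for the other replicas to show up. If the ZooKeeper
            // session times out during this wait, the ephemeral node above is
            // already gone by the time the thread wakes up.
            String peersPath = "/collections/" + collection + "/shard1/replicas"; // made-up path
            List<String> peers = zk.getChildren(peersPath, false);
            while (peers.size() < expectedReplicas) {
                Thread.sleep(1000);
                peers = zk.getChildren(peersPath, false);
            }
        }
    }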

Once we finish blocking, the reconnect logic calls register() for each 
collection, which restarts the election process (serially this time). At a 
later point, we can have two threads trying to register the same collection.
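
In other words, the reconnect path ends up looking something like this 
(hypothetical shape, just to show the overlap; not the real Solr interfaces):

    import java.util.List;

    public class ReconnectSketch {
        interface Registrar { void register(String coreName); }

        // The reconnect callback re-registers every core serially, while the
        // original coreZkRegister threads may still be sitting in their
        // wait-for-peers loop for the same collections.
        static void onReconnect(List<String> coreNames, Registrar registrar) {
            for (String core : coreNames) {
                registrar.register(core); // second registration for the same core
            }
        }
    }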

This is incorrect, because the coreZkRegister-1-thread-* threads are assuming 
they are the leader with no verification from ZooKeeper. The ephemeral 
leader_elect nodes they created were removed when the session timed out. If 
another host started in the interim (or at any point after that, actually), it 
would see no leader and would attempt to become leader of the shard itself. 
This leads to some interesting race conditions, where you can end up with two 
leaders for a shard.
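
Even a minimal re-check against the live ZK state would expose this 
(illustrative only):

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;

    public class LeaderCheckSketch {
        // After the block, the ephemeral election node the thread created may
        // no longer exist, so a leadership claim needs to be re-validated
        // against ZooKeeper rather than assumed.
        static boolean electionNodeStillExists(ZooKeeper zk, String electionNodePath)
                throws KeeperException, InterruptedException {
            return zk.exists(electionNodePath, false) != null;
        }
    }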

It seems like a more complete fix would be to actually close the 
ElectionContext upon reconnect. This would break us out of the wait-for-peers 
loop and stop the threads from processing the rest of the leadership logic. 
The reconnection logic would then continue to call register() again for each 
collection, and if the ZK state indicates it should be leader, it can re-run 
the leadership logic.
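
Something along these lines is what I have in mind (a sketch only; the names 
here are illustrative, not the actual Solr classes or signatures):

    import java.io.Closeable;
    import java.io.IOException;
    import java.util.Map;

    public class ReconnectFixSketch {
        static void onReconnect(Map<String, Closeable> electionContexts, Runnable reRegisterAll) {
            // 1. Close the stale contexts first: this breaks the original
            //    coreZkRegister threads out of their wait-for-peers loop and
            //    stops them from running the rest of the leadership logic.
            for (Closeable ctx : electionContexts.values()) {
                try {
                    ctx.close();
                } catch (IOException e) {
                    // Best effort: a failed close should not block re-registration.
                }
            }
            // 2. Then re-register each collection; if the live ZK state says
            //    this node should be leader, the election runs again from scratch.
            reRegisterAll.run();
        }
    }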

I have a patch in testing that does this and, I think, addresses the problem.

What is the general process for this? I didn’t want to reopen a closed Jira 
item. Should I create a new one so the issue and the proposed fix can be 
discussed?

Thanks.

Mike.

