On 2/27/2018 10:57 AM, James Keeney wrote:
> *1 - ZK ensemble not accepting return of node*
> Currently, when a ZK node in the ensemble goes down the ensemble is able to
> do what it should do and keeps working. However when I bring the 3rd node
> back online the other two nodes reject connection requests from the 3rd
> node until I restart the nodes. The sequence is:
>
> 1. Bring 3rd node back on line
> 2. Restart follower in existing ensemble
> 3. Restart leader in existing ensemble
>
> When this is done the third node happily becomes part of the ensemble.

From what I understand, restarting the other nodes should not be
required. If everything is configured properly, I don't think that
should be happening, but I don't have deep ZK knowledge.

> *2 - Solr nodes unable to connect*
> When setting up the cluster for the first time the ensemble rejects the
> solr connection requests until the ZK on the ZK ensemble members is
> restarted.

<snip>

> However, we have also seen that if we have a problem with one of the Solr
> nodes that requires restarting more than one node we have to restart ZK to
> reconnect the nodes with the ensemble again.

These problems sound very weird too. I wish I had some idea, but
without logs showing what kind of errors are encountered, I can't tell
what's happening.

None of these problems are in Solr code. Solr uses the ZooKeeper client
code without modification. All the ZK communication is done in ZK code,
initialized with the zkHost string and a few other config bits (like
zkClientTimeout) provided to Solr at startup.

If you want to share the Solr log and the ZK server logs covering the
timeframe when the problems happen, maybe we can find something useful
and at least point you towards the problem. Even then, you may have to
talk to the ZooKeeper mailing list for real help, and they'll want the
same logs.

Are you informing Solr about all three of your ZK hosts when you start
it up? That is a requirement. If the zkHost string you send to Solr
doesn't list all your servers, then the ZK client inside Solr will not
be able to fail over correctly.

The version of ZK that Solr includes is not able to dynamically change
the servers that it talks to, and the version of ZK that *does* have
dynamic reconfiguration is still in beta. Solr is not going to include
ZK 3.5.x until they put out a stable release. I don't know when they're
going to do that. It could be soon, or it could be several months out.
The ZK project does NOT make frequent releases.

Thanks,
Shawn
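
To illustrate the zkHost requirement mentioned above, here is a minimal
sketch in Python. It is not Solr or ZooKeeper code. The hostnames, the
/solr chroot, and both helper functions are hypothetical, made up for
illustration. It assumes the usual zkHost layout: a comma-separated list
of host:port pairs with an optional chroot appended once at the end.

```python
def build_zk_host(servers, chroot=""):
    """Join every ZK server into one zkHost string (hypothetical helper).

    The optional chroot is appended once, after the last server.
    """
    if not servers:
        raise ValueError("at least one ZooKeeper server is required")
    if chroot and not chroot.startswith("/"):
        raise ValueError("chroot must start with '/'")
    entries = []
    for server in servers:
        host, sep, port = server.partition(":")
        # 2181 is ZooKeeper's default client port.
        entries.append(f"{host}:{port if sep else 2181}")
    return ",".join(entries) + chroot


def missing_servers(zk_host, ensemble):
    """Return the ensemble members that zk_host does not list."""
    hosts_part = zk_host.split("/", 1)[0]  # drop the chroot, if any
    listed = set(hosts_part.split(","))
    return [server for server in ensemble if server not in listed]


# Hypothetical three-node ensemble.
ensemble = [
    "zk1.example.com:2181",
    "zk2.example.com:2181",
    "zk3.example.com:2181",
]
zk_host = build_zk_host(ensemble, chroot="/solr")
```

A string built this way is what would be handed to Solr at startup, for
example through the -z option of bin/solr or the zkHost setting.
missing_servers() is only a sanity check you could run against whatever
string you are currently passing, to see whether any ensemble member
was left out.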