I believe I found the issue: it was caused by a script that runs when starting the zk instance. The script also cleared all the Solr configuration data from zk, causing Solr to stop working.
However, a new issue has come up. When using static IPs for the zookeeper ensemble, it works perfectly and SOLR can reconnect to another live zookeeper. But when using host names (DNS names) for the ensemble, it seems SOLR resolves each host name to an IP once and caches it forever. So when doing rolling updates to the zookeeper, SOLR eventually died because it couldn't connect to the previous IP, even though the same name now points to the new IP.

Any suggestions?

Thanks
Sean

On 6/16/17, 3:34 PM, "Xie, Sean" <sean....@finra.org> wrote:

Solr is configured with the zookeeper ensemble as mentioned below. I will provide logs at a later time.

From: Shawn Heisey <apa...@elyograg.org>
Date: Friday, Jun 16, 2017, 12:27 PM
To: solr-user@lucene.apache.org <solr-user@lucene.apache.org>
Subject: [EXTERNAL] Re: Live update the zookeeper for SOLR

On 6/16/2017 9:05 AM, Xie, Sean wrote:
> Is there a way to keep SOLR alive when zookeeper instances (3 instance
> ensemble) are rolling updated one at a time? It seems SOLR cluster use
> one of the zookeeper instance and when the communication is broken in
> between, it won’t be able to reconnect to another zookeeper instance
> and keep itself alive. A service restart is need in this situation.
> Any way to keep the service alive all the time?

Have you informed Solr about all three of the ZK hosts? You need a zkHost like this, with an optional chroot:

zkHost="server1.example.com:2181,server2.example.com:2181,server3.example.com:2181/chroot"

There are more zkHost examples and a better description of the string format in this javadoc:

https://lucene.apache.org/solr/6_6_0/solr-solrj/org/apache/solr/client/solrj/impl/CloudSolrClient.html#CloudSolrClient-java.lang.String-

If Solr hasn't been explicitly informed about all the hosts in the ensemble, then it cannot connect to surviving hosts.
I've never heard of the problem you described happening as long as Zookeeper quorum is maintained and Solr is properly configured. If you can show that a correctly configured Solr 6.6 server loses connection to ZK when one of the ZK servers is taken down, that's a bug, and we need an issue in Jira with documentation of the problem.

Thanks,
Shawn
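
For anyone checking their own configuration, the zkHost format described in the thread (comma-separated host:port entries followed by one optional chroot) can be illustrated with a small Python sketch. This is only an illustration of the string layout, not how Solr or ZooKeeper parses it internally. The function name parse_zk_host is hypothetical, the fallback to port 2181 (ZooKeeper's usual client port) is an assumption of this sketch, and IPv6 literals are not handled.

```python
def parse_zk_host(zk_host):
    """Split a zkHost string into ([(host, port), ...], chroot or None).

    Hypothetical helper for illustration only; not part of Solr or SolrJ.
    """
    # The chroot, if present, starts at the first "/" and applies to the
    # whole ensemble, so it appears once at the end of the string.
    slash = zk_host.find("/")
    if slash == -1:
        servers_part, chroot = zk_host, None
    else:
        servers_part, chroot = zk_host[:slash], zk_host[slash:]

    servers = []
    for entry in servers_part.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if not sep:
            # Assumption: fall back to 2181 when no port is written.
            host, port = entry, "2181"
        servers.append((host, int(port)))
    return servers, chroot


if __name__ == "__main__":
    servers, chroot = parse_zk_host(
        "server1.example.com:2181,server2.example.com:2181,"
        "server3.example.com:2181/chroot"
    )
    print(servers)
    print(chroot)
    if len(servers) < 3:
        print("warning: fewer hosts listed than the three-node ensemble")
```

A zkHost that lists only one of the three servers would parse to a single entry here, which is the misconfiguration the reply above warns about.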