Hi,

Multiple 6.5.1 clouds / collections went down this weekend around the same time; they all share the same ZK quorum. The nodes stayed up but did not rejoin the cluster (they did not find or connect to ZK again).
This is what the log told us:

2017-05-06 18:58:34.893 WARN (zkCallback-5-thread-9-processing-n:idx6.example.org:8983_solr) [ ] o.a.s.c.c.ConnectionManager Watcher org.apache.solr.common.cloud.ConnectionManager@4f97bdad name: ZooKeeperConnection Watcher:89.188.14.10:2181,89.188.14.11:2181,89.188.14.12:2181/solr_collection_search got event WatchedEvent state:Disconnected type:None path:null path: null type: None
2017-05-06 18:58:34.893 WARN (zkCallback-5-thread-9-processing-n:idx6.example.org:8983_solr) [ ] o.a.s.c.c.ConnectionManager zkClient has disconnected
2017-05-06 18:58:35.001 WARN (zkCallback-9-thread-5-processing-n:idx6.example.org:8983_solr x:search_shard2_replica3 s:shard2 c:search r:core_node6-EventThread) [c:search s:shard2 r:core_node6 x:search_shard2_replica3] o.a.s.c.c.ConnectionManager Watcher org.apache.solr.common.cloud.ConnectionManager@c226cc name: ZooKeeperConnection Watcher:89.188.14.10:2181,89.188.14.11:2181,89.188.14.12:2181/solr_collection_search got event WatchedEvent state:Disconnected type:None path:null path: null type: None
2017-05-06 18:58:35.010 WARN (zkCallback-9-thread-5-processing-n:idx6.example.org:8983_solr x:search_shard2_replica3 s:shard2 c:search r:core_node6-EventThread) [c:search s:shard2 r:core_node6 x:search_shard2_replica3] o.a.s.c.c.ConnectionManager zkClient has disconnected
2017-05-06 18:58:45.360 WARN (zkCallback-5-thread-8-processing-n:idx6.example.org:8983_solr) [ ] o.a.s.c.c.ConnectionManager Watcher org.apache.solr.common.cloud.ConnectionManager@4f97bdad name: ZooKeeperConnection Watcher:89.188.14.10:2181,89.188.14.11:2181,89.188.14.12:2181/solr_collection_search got event WatchedEvent state:Expired type:None path:null path: null type: None
2017-05-06 18:58:45.360 WARN (zkCallback-5-thread-8-processing-n:idx6.example.org:8983_solr) [ ] o.a.s.c.c.ConnectionManager Our previous ZooKeeper session was expired. Attempting to reconnect to recover relationship with ZooKeeper...
2017-05-06 18:58:45.380 WARN (OverseerStateUpdate-97740792370385619-idx6.example.org:8983_solr-n_0000000558) [ ] o.a.s.c.Overseer Solr cannot talk to ZK, exiting Overseer main queue loop
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer/queue
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
    at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:339)
    at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:336)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
    at org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:336)
    at org.apache.solr.cloud.DistributedQueue.fetchZkChildren(DistributedQueue.java:308)
    at org.apache.solr.cloud.DistributedQueue.firstChild(DistributedQueue.java:285)
    at org.apache.solr.cloud.DistributedQueue.firstElement(DistributedQueue.java:393)
    at org.apache.solr.cloud.DistributedQueue.peek(DistributedQueue.java:159)
    at org.apache.solr.cloud.DistributedQueue.peek(DistributedQueue.java:137)
    at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:180)
    at java.lang.Thread.run(Thread.java:745)
2017-05-06 18:58:45.381 WARN (zkCallback-5-thread-8-processing-n:idx6.example.org:8983_solr) [ ] o.a.s.c.c.DefaultConnectionStrategy Connection expired - starting a new one...
2017-05-06 18:58:45.382 ERROR (OverseerExitThread) [ ] o.a.s.c.Overseer could not read the data
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
    at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:356)
    at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:353)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
    at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:353)
    at org.apache.solr.cloud.Overseer$ClusterStateUpdater.checkIfIamStillLeader(Overseer.java:287)
    at java.lang.Thread.run(Thread.java:745)
2017-05-06 18:58:46.453 WARN (zkCallback-9-thread-5-processing-n:idx6.example.org:8983_solr x:search_shard2_replica3 s:shard2 c:search r:core_node6-EventThread) [c:search s:shard2 r:core_node6 x:search_shard2_replica3] o.a.s.c.c.ConnectionManager Watcher org.apache.solr.common.cloud.ConnectionManager@c226cc name: ZooKeeperConnection Watcher:89.188.14.10:2181,89.188.14.11:2181,89.188.14.12:2181/solr_collection_search got event WatchedEvent state:Expired type:None path:null path: null type: None
2017-05-06 18:58:46.453 WARN (zkCallback-9-thread-5-processing-n:idx6.example.org:8983_solr x:search_shard2_replica3 s:shard2 c:search r:core_node6-EventThread) [c:search s:shard2 r:core_node6 x:search_shard2_replica3] o.a.s.c.c.ConnectionManager Our previous ZooKeeper session was expired. Attempting to reconnect to recover relationship with ZooKeeper...
2017-05-06 18:58:46.460 WARN (zkCallback-9-thread-5-processing-n:idx6.example.org:8983_solr x:search_shard2_replica3 s:shard2 c:search r:core_node6-EventThread) [c:search s:shard2 r:core_node6 x:search_shard2_replica3] o.a.s.c.c.DefaultConnectionStrategy Connection expired - starting a new one...
2017-05-06 18:58:53.599 ERROR (zkCallback-5-thread-8-processing-n:idx6.example.org:8983_solr) [ ] o.a.s.c.ZkController :org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /live_nodes/idx6.example.org:8983_solr
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
    at org.apache.solr.common.cloud.SolrZkClient$10.execute(SolrZkClient.java:526)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
    at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:523)
    at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:466)
    at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:453)
    at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:430)
    at org.apache.solr.cloud.ZkController.createEphemeralLiveNode(ZkController.java:823)
    at org.apache.solr.cloud.ZkController.access$600(ZkController.java:120)
    at org.apache.solr.cloud.ZkController$1.command(ZkController.java:340)
    at org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:168)
    at org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:57)
    at org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:142)
    at org.apache.solr.common.cloud.SolrZkClient$3.lambda$process$0(SolrZkClient.java:268)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2017-05-06 18:58:53.599 ERROR (zkCallback-5-thread-8-processing-n:idx6.example.org:8983_solr) [ ] o.a.s.c.c.DefaultConnectionStrategy Reconnect to ZooKeeper failed:org.apache.solr.common.cloud.ZooKeeperException:
    at org.apache.solr.cloud.ZkController$1.command(ZkController.java:392)
    at org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:168)
    at org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:57)
    at org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:142)
    at org.apache.solr.common.cloud.SolrZkClient$3.lambda$process$0(SolrZkClient.java:268)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /live_nodes/idx6.example.org:8983_solr
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
    at org.apache.solr.common.cloud.SolrZkClient$10.execute(SolrZkClient.java:526)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
    at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:523)
    at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:466)
    at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:453)
    at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:430)
    at org.apache.solr.cloud.ZkController.createEphemeralLiveNode(ZkController.java:823)
    at org.apache.solr.cloud.ZkController.access$600(ZkController.java:120)
    at org.apache.solr.cloud.ZkController$1.command(ZkController.java:340)
    ... 10 more
2017-05-06 18:58:53.600 WARN (zkCallback-5-thread-8-processing-n:idx6.example.org:8983_solr) [ ] o.a.s.c.c.DefaultConnectionStrategy Reconnect to ZooKeeper failed
2017-05-06 18:58:57.052 ERROR (qtp1873653341-14807) [ ] o.a.s.h.RequestHandlerBase org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /collections/search/state.json
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
    at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:356)
    at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:353)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
    at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:353)
    at org.apache.solr.common.cloud.ZkStateReader.fetchCollectionState(ZkStateReader.java:1110)
    at org.apache.solr.common.cloud.ZkStateReader.forceUpdateCollection(ZkStateReader.java:321)
    at org.apache.solr.handler.admin.PrepRecoveryOp.execute(PrepRecoveryOp.java:102)
    at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:370)
    at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:388)
    at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:174)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
    at org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:748)
    at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:729)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:510)

After that we occasionally see:

2017-05-06 18:58:59.079 ERROR (qtp1873653341-14989) [ ] o.a.s.s.HttpSolrCall null:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /collections/search/state.json

We executed a hard Solr restart to get stuff back up. Is this a known issue?

Thanks,
Markus