Hi everyone, I ran into a situation in which a portion of the nodes in a Solr cluster were stuck in the GONE state (no ephemeral znode under /live_nodes) despite the Solr processes being alive and connected to the ZooKeeper tier. For context this a cluster running Solr 7.7.1 and ZooKeeper 3.4.13 (being monitored by Exhibitor 1.7.1).
The events that led to this situation, in the order that they occurred: * We have three ZooKeeper nodes: A, B, and C. A is the leader of the ensemble. * ZooKeeper A becomes partitioned from ZooKeeper B and C and the Solr tier. * Some Solr nodes log “zkclient has disconnected” warnings and ZooKeeper A expires some Solr client sessions due to timeouts. The partition between Zookeeper A and the Solr tier ends and Solr nodes that were connected to ZooKeeper A attempt to renew their sessions but are told their sessions have expired. [1] * Note that I’m simplifying: some nodes that were connected to ZooKeeper A were able to move their sessions to ZooKeeper B/C before their session expired. [2] * ZooKeeper A realizes it is not synced with ZooKeeper B and C and closes connections with Solr nodes and, apparently, remains partitioned from B/C. * ZooKeeper B and C eventually elect ZooKeeper B as the leader and start accepting writes requests as they form a quorum. * Solr nodes previously connected to ZooKeeper that had their sessions expire now connect to ZooKeeper B and C, they successfully publish their state as DOWN, and then attempt to write to /live_nodes to signal that they’re reconnected to ZooKeeper. * The writes of the ephemeral znodes to /live_nodes fail with NodeExists exceptions [3]. The failed writes are logged on ZooKeeper B. [4] * It looks like a failure mode of “leader becomes partitioned and ephemeral znode deletions are not processed by followers” is documented on ZOOKEEPER-2348<https://jira.apache.org/jira/browse/ZOOKEEPER-2348>. * ZooKeeper A eventually rejoins the ensemble and the /live_nodes entries that expired after the initial partition are removed when session expirations are reprocessed on the new leader (ZooKeeper B) [5] * The Solr nodes whose attempts at writing to /live_nodes failed never try again and remain in the GONE state for 6+ hours. Restarting the GONE Solr nodes made them reconnect to ZooKeeper and everything recovered just fine after that My questions are: * Is the sequence of events I’ve laid out plausible? Is there anything that stands out as being “wrong” that I should look more in to? * If not, is there anything I can/should be doing to prevent something like happening again or at least reducing the probability of its occurrence? Or is this just a failure scenario that’s not being handled at the moment and it merits a ticket? I wasn’t able to find any mailing list thread or ticket that discussed a similar issue though I’m also not confident my search was exhaustive.☺ Apologies if this is documented somewhere. Thanks! Marko [1] Log of ZooKeeper A expiring a Solr node session and not allowing the node to renew it: 2019-04-13 06:58:23.997 [SessionTracker] INFO org.apache.zookeeper.server.ZooKeeperServer – Expiring session 0x7000014eb60267d, timeout of 15000ms exceeded … 2019-04-13 06:58:24.015 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO org.apache.zookeeper.server.ZooKeeperServer – Client attempting to renew session 0x7000014eb60267d at /XXX.XXX.XXX.XXX:54792 2019-04-13 06:58:24.015 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO org.apache.zookeeper.server.ZooKeeperServer – Invalid session 0x7000014eb60267d for client /XXX.XXX.XXX.XXX:54792, probably expired … 2019-04-13 06:58:24.017 [ProcessThread(sid:9 cport:-1):] INFO org.apache.zookeeper.server.PrepRequestProcessor – Processed session termination for sessionid: 0x7000014eb60267d [2] Solr node moving its session from ZooKeeper A to ZooKeeper B/C: Solr: 2019-04-13 06:58:15.134000 [main-SendThread(zookeeper03.localdomain:2181)] WARN org.apache.zookeeper.ClientCnxn [] u2013 Client session timed out, have not heard from server in 10005ms for sessionid 0x900000676eb264b ZooKeeper A: 2019-04-13 06:58:24.011 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] WARN org.apache.zookeeper.server.NIOServerCnxn – Unable to read additional data from client sessionid 0x900000676eb264b, likely client has closed socket 2019-04-13 06:58:24.011 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO org.apache.zookeeper.server.NIOServerCnxn – Closed socket connection for client /XXX.XXX.XXX.XXX:55750 which had sessionid 0x900000676eb264b … 2019-04-13 06:58:24.018 [ProcessThread(sid:9 cport:-1):] INFO org.apache.zookeeper.server.PrepRequestProcessor – Got user-level KeeperException when processing sessionid:0x900000676eb264b type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:null Error:KeeperErrorCode = Session moved ZooKeeper B: 2019-04-13 06:58:16.068 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO org.apache.zookeeper.server.ZooKeeperServer – Client attempting to renew session 0x900000676eb264b at /XXX.XXX.XXX.XXX:32836 2019-04-13 06:58:16.068 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO org.apache.zookeeper.server.quorum.Learner – Revalidating client: 0x900000676eb264b [3] Example log lines + stack trace of /live_nodes write failure: 2019-04-13 06:58:24.022000 [main-SendThread(zookeeper03.localdomain:2181)] WARN org.apache.zookeeper.ClientCnxn [] u2013 Unable to reconnect to ZooKeeper service, session ≈ has expired 2019-04-13 06:59:01.082000 [zkConnectionManagerCallback-6-thread-1] INFO org.apache.solr.cloud.ZkController [] u2013 ZooKeeper session re-connected ... refreshing core states after session expiration. 2019-04-13 06:59:01.101000 [zkConnectionManagerCallback-6-thread-1] INFO org.apache.solr.cloud.ZkController [] u2013 Publish node=XXX.XXX.XXX.XXX:8983_solr as DOWN 2019-04-13 06:59:01.105000 [zkConnectionManagerCallback-6-thread-1] INFO org.apache.solr.cloud.ZkController [] u2013 Register node as live in ZooKeeper:/live_nodes/XXX.XXX.XXX.XXX:8983_solr 2019-04-13 06:59:01.108000 [zkConnectionManagerCallback-6-thread-1] ERROR org.apache.solr.cloud.ZkController [] u2013 :org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists at org.apache.zookeeper.KeeperException.create(KeeperException.java:122) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1015) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:919) at org.apache.solr.common.cloud.SolrZkClient.lambda$multi$9(SolrZkClient.java:604) at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:71) at org.apache.solr.common.cloud.SolrZkClient.multi(SolrZkClient.java:604) at org.apache.solr.cloud.ZkController.createEphemeralLiveNode(ZkController.java:1089) at org.apache.solr.cloud.ZkController.access$500(ZkController.java:149) at org.apache.solr.cloud.ZkController$1.command(ZkController.java:376) at org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:177) at org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:58) at org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:151) at org.apache.solr.common.cloud.SolrZkClient$3.lambda$process$0(SolrZkClient.java:285) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [4] Example log lines + stack trace of /live_nodes write failure in zk logs: 2019-04-13 06:59:01.405 [ProcessThread(sid:8 cport:-1):] INFO org.apache.zookeeper.server.PrepRequestProcessor – Got user-level KeeperException when processing sessionid:0x7006dfdb0d00004 type:multi cxid:0x1a zxid:0x2700000035 txntype:-1 reqpath:n/a aborting remaining multi ops. Error Path:/live_nodes/10.17.129.123:8983_solr Error:KeeperErrorCode = NodeExists for /live_nodes/XXX.XXX.XXX.XXXx:8983_solr [5] ZooKeeper B (new leader after ensemble partitions) processing old session expirations. Note that sessionId 0x7000014eb60267d is the same session that ZooKeeper A expired in [1]. 2019-04-13 06:59:16.071 [SessionTracker] INFO org.apache.zookeeper.server.ZooKeeperServer – Expiring session 0x7000014eb60267d, timeout of 15000ms exceeded … 2019-04-13 06:59:16.074 [ProcessThread(sid:8 cport:-1):] INFO org.apache.zookeeper.server.PrepRequestProcessor – Processed session termination for sessionid: 0x7000014eb60267d