*TL;DR: a way to track replica state using EPHEMERAL ZooKeeper nodes that disappear automatically when a node goes down.*
Hi,

When running a cluster with many collections and replicas per node, processing of DOWNNODE messages takes longer. In a public cloud setup, a node that went down can come back before that processing is finished. When that happens, replicas are marked DOWN by the DOWNNODE processing while they are marked ACTIVE by the restarting node, and depending on how the two operations interleave, some replicas then stay DOWN forever (where "forever" means until the node is restarted again). We had to put K8s init containers in place to add a delay before nodes restart. This slows down rolling restarts, deployments and node crash recovery, so it is not a desirable long-term solution.

What do you think of a change that avoids the need for a DOWNNODE message altogether:

- Each replica state is captured as an *EPHEMERAL* node in ZooKeeper
- No such node implicitly means the replica state is DOWN
- If the node is present, it contains an encoding of the actual state (DOWN, ACTIVE, RECOVERING, RECOVERY_FAILED)
- When a node goes down (or when its ZK session expires), all its replica state nodes automatically vanish

This change is similar to the Per Replica State (PRS) implementation (starting point <https://github.com/apache/solr/blob/main/solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/PerReplicaStatesOps.java#L99C17-L99C17> in the code) but different:

- EPHEMERAL rather than PERSISTENT ZooKeeper nodes
- No duplicate replica state nodes (and no node version to pick the right one)
- DOWNNODE is not needed (if all collections are tracked this way)
- All replica states need to be republished after a ZooKeeper session expiration, since they will have disappeared

What do you think? Esp. Noble and Ishan, the authors of PRS. I have no detailed design and no code, just sharing an idea to solve a real issue we're facing.

Ilan
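To make the proposed semantics concrete, here is a minimal, self-contained sketch (plain Python, not Solr or ZooKeeper client code; the `StateStore` and `Session` names are hypothetical) modeling the two key properties: a missing state entry implicitly means DOWN, and closing a session (node crash or ZK session expiry) erases every state that session published, so no DOWNNODE-style cleanup message is needed.

```python
# Model of the proposal: replica state as session-scoped ("ephemeral") entries.
# Absence of an entry means DOWN; session loss wipes all of that node's entries.

DOWN = "DOWN"  # implicit state when no znode exists for the replica


class StateStore:
    """Stands in for the ZooKeeper ensemble: tracks which ephemeral
    state nodes exist and which session owns each of them."""

    def __init__(self):
        self._states = {}  # replica path -> encoded state (ACTIVE, RECOVERING, ...)
        self._owner = {}   # replica path -> id of the session that created it

    def replica_state(self, path):
        # No node implicitly means the replica state is DOWN.
        return self._states.get(path, DOWN)

    def _publish(self, session_id, path, state):
        self._states[path] = state
        self._owner[path] = session_id

    def _expire(self, session_id):
        # Ephemeral nodes vanish together with the session that created them.
        for path in [p for p, sid in self._owner.items() if sid == session_id]:
            del self._states[path]
            del self._owner[path]


class Session:
    """Stands in for one Solr node's ZooKeeper session."""

    _next_id = 0

    def __init__(self, store):
        self.store = store
        self.session_id = Session._next_id
        Session._next_id += 1

    def publish(self, path, state):
        self.store._publish(self.session_id, path, state)

    def close(self):
        # Node going down / session expiring: all its replica states disappear.
        self.store._expire(self.session_id)
```

A quick walk-through: after `node.publish("/collections/c1/shard1/r1", "ACTIVE")` a reader sees ACTIVE; after `node.close()` the same lookup returns DOWN with no explicit down-marking step. It also makes the listed drawback visible: after a session expiration the node must republish all its states, since the store has forgotten them.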