*TL;DR: a way to track replica state using EPHEMERAL nodes that disappear
automatically when a node goes down.*

Hi,

When running a cluster with many collections and many replicas per node,
processing a DOWNNODE message takes a long time, since the state of every
replica hosted on that node has to be updated.
In a public cloud setup, a node that went down can come back up before
that processing has finished. When that happens, replicas are marked DOWN
by the DOWNNODE processing while they are concurrently marked ACTIVE by
the restarting node, and depending on how the two operations interleave,
some replicas then stay DOWN forever (where "forever" means until the
node is restarted again).
We had to put K8s init containers in place to add a delay before nodes
restart. This slows down rolling restarts, deployments and node crash
recovery, so it is not a desirable long-term solution.

What do you think of a change that avoids the need for a DOWNNODE message
altogether:
- Each replica's state is captured as an *EPHEMERAL* node in Zookeeper
- The absence of such a node implicitly means the replica's state is DOWN
- If the node is present, it contains an encoding of the actual state
(DOWN, ACTIVE, RECOVERING, RECOVERY_FAILED)
- When a node goes down (or when its ZK session expires), all its replica
state nodes automatically vanish (a rough sketch of what this could look
like follows below).
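
To make the idea concrete, here is a minimal sketch of the publish and
read paths against the raw ZooKeeper client. The znode layout, class and
method names are all invented for illustration; a real implementation
would go through SolrZkClient, handle ACLs, parent node creation,
retries, etc.

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// One EPHEMERAL znode per replica, its payload holding the current state.
class EphemeralReplicaStatePublisher {
  private final ZooKeeper zk;

  EphemeralReplicaStatePublisher(ZooKeeper zk) {
    this.zk = zk;
  }

  // Invented layout: /collections/<collection>/replica_states/<replica>
  static String statePath(String collection, String replica) {
    return "/collections/" + collection + "/replica_states/" + replica;
  }

  // Publish ACTIVE, RECOVERING, etc. The znode is tied to this node's
  // ZK session and vanishes automatically when the node goes down.
  void publish(String collection, String replica, String state)
      throws KeeperException, InterruptedException {
    String path = statePath(collection, replica);
    byte[] data = state.getBytes(StandardCharsets.UTF_8);
    try {
      zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    } catch (KeeperException.NodeExistsException e) {
      // Already created in this session: overwrite the state in place.
      zk.setData(path, data, -1);
    }
  }

  // Read path: no znode implicitly means DOWN.
  String readState(String collection, String replica)
      throws KeeperException, InterruptedException {
    try {
      byte[] data = zk.getData(statePath(collection, replica), false, null);
      return new String(data, StandardCharsets.UTF_8);
    } catch (KeeperException.NoNodeException e) {
      return "DOWN";
    }
  }
}

Since the state lives in the znode payload and is overwritten in place,
there is only ever one state node per replica.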

This change is similar to the Per Replica State implementation (starting
point
<https://github.com/apache/solr/blob/main/solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/PerReplicaStatesOps.java#L99C17-L99C17>
in the code) but differs in a few ways:
- EPHEMERAL rather than PERSISTENT Zookeeper nodes
- No duplicate replica state nodes (and no node version to pick the right
one)
- DOWNNODE not needed (if all collections are tracked in that way).
- Need to republish all replica states after a Zookeeper session
expiration, since the ephemeral nodes will have disappeared (see the
sketch after this list)
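
For that last point, a rough sketch of what republishing could look
like, reusing the hypothetical EphemeralReplicaStatePublisher from the
sketch above. Again, all names are invented; in Solr this would
presumably hook into the existing reconnect machinery rather than a raw
Watcher.

import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// After a session expiration all our ephemeral state znodes are gone, so
// the node reconnects with a fresh session and replays the last state it
// published for each of its replicas, tracked in a local map.
class SessionExpiryRepublisher implements Watcher {
  private static final int SESSION_TIMEOUT_MS = 30_000;

  private final String zkConnectString;
  // "collection/replica" -> last state this node published
  private final Map<String, String> lastStates = new ConcurrentHashMap<>();
  private volatile ZooKeeper zk;

  SessionExpiryRepublisher(String zkConnectString) throws IOException {
    this.zkConnectString = zkConnectString;
    this.zk = new ZooKeeper(zkConnectString, SESSION_TIMEOUT_MS, this);
  }

  // Called by the node whenever one of its replicas changes state.
  void stateChanged(String collection, String replica, String state)
      throws KeeperException, InterruptedException {
    lastStates.put(collection + "/" + replica, state);
    new EphemeralReplicaStatePublisher(zk).publish(collection, replica, state);
  }

  @Override
  public void process(WatchedEvent event) {
    if (event.getState() == Event.KeeperState.Expired) {
      try {
        // Fresh session; the old ephemeral znodes are already gone.
        zk = new ZooKeeper(zkConnectString, SESSION_TIMEOUT_MS, this);
        EphemeralReplicaStatePublisher publisher =
            new EphemeralReplicaStatePublisher(zk);
        for (Map.Entry<String, String> e : lastStates.entrySet()) {
          String[] key = e.getKey().split("/", 2);
          publisher.publish(key[0], key[1], e.getValue());
        }
      } catch (Exception e) {
        // A real implementation would retry with backoff.
        throw new IllegalStateException(e);
      }
    }
  }
}

Session expirations should be rare, so paying one znode create per hosted
replica to rebuild the state seems acceptable.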

What do you think? Especially Noble and Ishan, the authors of PRS.
I have no detailed design and no actual code (the snippets above are just
illustrative), just sharing an idea to solve a real issue we're facing.

Ilan
