patsonluk opened a new pull request, #2673: URL: https://github.com/apache/lucene-solr/pull/2673
## Description

In our prod environment, certain data nodes have "ghost replicas": the core directory and the `core.properties` file exist, but there is no data dir. A replica with the same name actually resides on a different node, as defined in `state.json`. Such ghost replicas can cause a `DOWN` replica state to be published even though the real replica (with the same name) is healthy on another node.

More details of the issue can be found in https://app.shortcut.com/fullstory/story/217252/investigate-replica-that-failed-to-come-up-as-active-during-restart-deployment#activity-217734

## Solution

We do not yet know the exact cause of these ghost replicas (probably some migration hiccup during c82 creation?), but they appear to be a rare occurrence now (8 replicas in c82). This PR therefore adds a new exception, `InconsistentClusterStateException`, which is thrown from `checkStateInZk` if the node name of a replica defined in `state.json` differs from the current node that is trying to spin up the core. The exception interrupts core creation, so a `DOWN` state is no longer published.

For now, we will NOT provide a cleanup in the Solr code, as this seems to be an edge case and cleanup (i.e. unloading the core and removing the physical directory) could be risky. Note that we will probably still need to clean up those ghost replicas later on, perhaps by manually purging them.
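To illustrate the check described under Solution, here is a minimal, self-contained sketch. It is not the code in this PR and does not use Solr's classes: the `GhostReplicaCheck` class, the `checkReplicaNode` method, the plain `Map` standing in for the replica-to-node mapping in `state.json`, and the node names are all hypothetical. Only the exception name is taken from the description. The sketch covers only the node-mismatch case and ignores a replica that is missing from the map.

```java
import java.util.Map;

public class GhostReplicaCheck {

    /** Hypothetical stand-in for the exception named in this PR. */
    static class InconsistentClusterStateException extends RuntimeException {
        InconsistentClusterStateException(String message) {
            super(message);
        }
    }

    /**
     * Simplified model of the node-name check.
     * replicaToNode is an assumed replica-name to node-name mapping,
     * standing in for what would be read from state.json.
     */
    static void checkReplicaNode(Map<String, String> replicaToNode,
                                 String replicaName,
                                 String currentNode) {
        String expectedNode = replicaToNode.get(replicaName);
        // Abort only when state.json places this replica on a different node.
        if (expectedNode != null && !expectedNode.equals(currentNode)) {
            throw new InconsistentClusterStateException(
                "Replica " + replicaName + " belongs to node " + expectedNode
                    + " but is being loaded on node " + currentNode);
        }
    }

    public static void main(String[] args) {
        // Illustrative data only; these names are made up.
        Map<String, String> state = Map.of("core_node1", "nodeA:8983_solr");

        // The real replica on its own node passes the check.
        checkReplicaNode(state, "core_node1", "nodeA:8983_solr");

        // A ghost replica on another node is rejected before any state is published.
        try {
            checkReplicaNode(state, "core_node1", "nodeB:8983_solr");
        } catch (InconsistentClusterStateException e) {
            System.out.println("Core creation aborted: " + e.getMessage());
        }
    }
}
```

Throwing before any state is published is the point of the design: the caller aborts core creation instead of marking the healthy replica on the other node as `DOWN`.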