HoustonPutman commented on a change in pull request #1387: SOLR-14210: Include
replica health in healtcheck handler
URL: https://github.com/apache/lucene-solr/pull/1387#discussion_r401708980
##########
File path:
solr/core/src/java/org/apache/solr/handler/admin/HealthCheckHandler.java
##########
@@ -88,15 +95,42 @@ public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throw
return;
}
- // Set status to true if this node is in live_nodes
- if (clusterState.getLiveNodes().contains(cores.getZkController().getNodeName())) {
- rsp.add(STATUS, OK);
- } else {
+ // Fail if not in live_nodes
+ if (!clusterState.getLiveNodes().contains(cores.getZkController().getNodeName())) {
rsp.add(STATUS, FAILURE);
rsp.setException(new SolrException(SolrException.ErrorCode.SERVICE_UNAVAILABLE, "Host Unavailable: Not in live nodes as per zk"));
+ return;
}
- rsp.setHttpCaching(false);
+ // Optionally require that all cores on this node are active if param 'failWhenRecovering=true'
+ if (req.getParams().getBool(PARAM_REQUIRE_HEALTHY_CORES, false)) {
+ List<String> unhealthyCores = findUnhealthyCores(clusterState, cores.getNodeConfig().getNodeName());
+ if (unhealthyCores.size() > 0) {
+ rsp.add(STATUS, FAILURE);
+ rsp.setException(new SolrException(SolrException.ErrorCode.SERVICE_UNAVAILABLE,
+ "Replica(s) " + unhealthyCores + " are currently initializing or recovering"));
+ return;
+ }
+ rsp.add("MESSAGE", "All cores are healthy");
+ }
+
+ // All lights green, report healthy
+ rsp.add(STATUS, OK);
+ }
+
+ /**
+ * Find replicas DOWN or RECOVERING
+ * @param clusterState clusterstate from ZK
+ * @param nodeName this node name
+ * @return list of core names that are either DOWN ore RECOVERING on 'nodeName'
+ */
+ static List<String> findUnhealthyCores(ClusterState clusterState, String nodeName) {
+ return clusterState.getCollectionsMap().values().stream()
Review comment:
> If a shard split is currently running (could be long running), on a node
being restarted, the split would be aborted but when the node comes up again I
believe the overseer might try again??
I understand your decision on the active shards better now. As long as the
overseer does retry the split, we should be fine. And if we need to in the
future, we can add another parameter to fail on inactive slices as well.
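
To make the "another parameter" idea concrete, here is a rough sketch of how the filter could take such a flag. Everything in it is an assumption for illustration: `ReplicaInfo`, `SliceState`, `ReplicaState` and the `failOnInactiveSlices` flag are simplified stand-ins, not the real Solr classes or an existing parameter, and the actual method in this PR works on `ClusterState` rather than a flat list.

```java
import java.util.List;
import java.util.stream.Collectors;

class HealthCheckSketch {
  // Hypothetical, simplified stand-ins for Solr's slice and replica states.
  enum SliceState { ACTIVE, INACTIVE }
  enum ReplicaState { ACTIVE, DOWN, RECOVERING }

  // Hypothetical flattened view of one replica: its core, node, slice state and own state.
  static final class ReplicaInfo {
    final String coreName;
    final String nodeName;
    final SliceState sliceState;
    final ReplicaState state;

    ReplicaInfo(String coreName, String nodeName, SliceState sliceState, ReplicaState state) {
      this.coreName = coreName;
      this.nodeName = nodeName;
      this.sliceState = sliceState;
      this.state = state;
    }
  }

  // Returns core names on 'nodeName' that are DOWN or RECOVERING.
  // Replicas in non-active slices are ignored unless failOnInactiveSlices is true.
  static List<String> findUnhealthyCores(List<ReplicaInfo> replicas, String nodeName,
                                         boolean failOnInactiveSlices) {
    return replicas.stream()
        .filter(r -> nodeName.equals(r.nodeName))
        .filter(r -> failOnInactiveSlices || r.sliceState == SliceState.ACTIVE)
        .filter(r -> r.state == ReplicaState.DOWN || r.state == ReplicaState.RECOVERING)
        .map(r -> r.coreName)
        .collect(Collectors.toList());
  }
}
```

With the flag off, a DOWN replica in an inactive slice (for example a parent shard after a split) would not fail the health check; with it on, it would. Defaulting to off would keep the behavior discussed above.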