HoustonPutman commented on a change in pull request #1387: SOLR-14210: Include replica health in healtcheck handler
URL: https://github.com/apache/lucene-solr/pull/1387#discussion_r401691874
##########
File path: solr/core/src/java/org/apache/solr/handler/admin/HealthCheckHandler.java
##########
@@ -88,15 +95,42 @@ public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throw
       return;
     }
 
-    // Set status to true if this node is in live_nodes
-    if (clusterState.getLiveNodes().contains(cores.getZkController().getNodeName())) {
-      rsp.add(STATUS, OK);
-    } else {
+    // Fail if not in live_nodes
+    if (!clusterState.getLiveNodes().contains(cores.getZkController().getNodeName())) {
       rsp.add(STATUS, FAILURE);
       rsp.setException(new SolrException(SolrException.ErrorCode.SERVICE_UNAVAILABLE, "Host Unavailable: Not in live nodes as per zk"));
+      return;
     }
 
-    rsp.setHttpCaching(false);
+    // Optionally require that all cores on this node are active if param 'failWhenRecovering=true'
+    if (req.getParams().getBool(PARAM_REQUIRE_HEALTHY_CORES, false)) {
+      List<String> unhealthyCores = findUnhealthyCores(clusterState, cores.getNodeConfig().getNodeName());
+      if (unhealthyCores.size() > 0) {
+        rsp.add(STATUS, FAILURE);
+        rsp.setException(new SolrException(SolrException.ErrorCode.SERVICE_UNAVAILABLE,
+            "Replica(s) " + unhealthyCores + " are currently initializing or recovering"));
+        return;
+      }
+      rsp.add("MESSAGE", "All cores are healthy");
+    }
+
+    // All lights green, report healthy
+    rsp.add(STATUS, OK);
+  }
+
+  /**
+   * Find replicas DOWN or RECOVERING
+   * @param clusterState clusterstate from ZK
+   * @param nodeName this node name
+   * @return list of core names that are either DOWN ore RECOVERING on 'nodeName'
+   */
+  static List<String> findUnhealthyCores(ClusterState clusterState, String nodeName) {
+    return clusterState.getCollectionsMap().values().stream()

Review comment:
> > Also if the clusterState thinks that cores live on this node, but the core directories do not exist, then I think that this handler should respond not healthy.
>
> Yes, we can add an extra check that for each replica in clusterstate for node, we check that it exists locally?

I think the logic as it stands now should work, because the cluster state will report the replica as "DOWN" (in my experience, though we can also add tests around this). My comment was meant to endorse the current approach over the other one you mentioned:

> The alternative is to instead iterate cores on current node, and consult with clusterState their overall state.
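
To make the approach under discussion concrete, here is a minimal sketch of filtering by the cluster state's view of a node. `ReplicaInfo`, `State`, and the flat list are hypothetical stand-ins for illustration only; they are not Solr's `ClusterState` or `Replica` classes, and this is not the body of `findUnhealthyCores` from the PR. It assumes, as stated above, that a replica whose core is missing locally shows up as DOWN in the cluster state.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class HealthSketch {
  // Hypothetical, simplified replica states (not Solr's Replica.State).
  public enum State { ACTIVE, DOWN, RECOVERING }

  // Hypothetical flattened view of one replica as recorded in the cluster state.
  public static final class ReplicaInfo {
    final String coreName;
    final String nodeName;
    final State state;

    public ReplicaInfo(String coreName, String nodeName, State state) {
      this.coreName = coreName;
      this.nodeName = nodeName;
      this.state = state;
    }
  }

  // Start from the cluster state: keep replicas assigned to nodeName
  // that are DOWN or RECOVERING, and return their core names.
  public static List<String> findUnhealthyCores(List<ReplicaInfo> clusterState, String nodeName) {
    return clusterState.stream()
        .filter(r -> r.nodeName.equals(nodeName))
        .filter(r -> r.state == State.DOWN || r.state == State.RECOVERING)
        .map(r -> r.coreName)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<ReplicaInfo> clusterState = Arrays.asList(
        new ReplicaInfo("core_a", "node1", State.ACTIVE),
        new ReplicaInfo("core_b", "node1", State.DOWN),       // e.g. core directory missing locally
        new ReplicaInfo("core_c", "node1", State.RECOVERING),
        new ReplicaInfo("core_d", "node2", State.DOWN));      // other node, ignored
    System.out.println(findUnhealthyCores(clusterState, "node1")); // prints [core_b, core_c]
  }
}
```

Because the sketch iterates what the cluster state assigns to the node, a replica that is missing locally is still considered. Iterating only the locally loaded cores would never visit it.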