shalinmangar commented on a change in pull request #1387: SOLR-14210: Include
replica health in healthcheck handler
URL: https://github.com/apache/lucene-solr/pull/1387#discussion_r403453516
##########
File path:
solr/core/src/java/org/apache/solr/handler/admin/HealthCheckHandler.java
##########
@@ -88,15 +98,45 @@ public void handleRequestBody(SolrQueryRequest req,
SolrQueryResponse rsp) throw
return;
}
- // Set status to true if this node is in live_nodes
- if
(clusterState.getLiveNodes().contains(cores.getZkController().getNodeName())) {
- rsp.add(STATUS, OK);
- } else {
+ // Fail if not in live_nodes
+ if
(!clusterState.getLiveNodes().contains(cores.getZkController().getNodeName())) {
rsp.add(STATUS, FAILURE);
rsp.setException(new
SolrException(SolrException.ErrorCode.SERVICE_UNAVAILABLE, "Host Unavailable:
Not in live nodes as per zk"));
+ return;
}
- rsp.setHttpCaching(false);
+ // Optionally require that all cores on this node are active if param
'requireHealthyCores=true'
+ if (req.getParams().getBool(PARAM_REQUIRE_HEALTHY_CORES, false)) {
+ Collection<CloudDescriptor> coreDescriptors = cores.getCores().stream()
+ .map(c ->
c.getCoreDescriptor().getCloudDescriptor()).collect(Collectors.toList());
+ List<String> unhealthyCores = findUnhealthyCores(coreDescriptors,
clusterState);
+ if (unhealthyCores.size() > 0) {
+ rsp.add(STATUS, FAILURE);
+ rsp.setException(new
SolrException(SolrException.ErrorCode.SERVICE_UNAVAILABLE,
+ "Replica(s) " + unhealthyCores + " are currently
initializing or recovering"));
+ return;
+ }
+ rsp.add("message", "All cores are healthy");
+ }
+
+ // All lights green, report healthy
+ rsp.add(STATUS, OK);
+ }
+
+ /**
+ * Find replicas DOWN or RECOVERING, or replicas in clusterstate that do not
exist on local node.
+ * We first find local cores which are either not registered or unhealthy,
and check each of these against
+ * the clusterstate, and return a list of unhealthy replicas that are part
of an active shard for an existing collection
+ * @param cores list of core descriptors to iterate
+ * @param clusterState clusterstate from ZK
+ * @return list of core names that are either DOWN or RECOVERING on
'nodeName'
+ */
+ static List<String> findUnhealthyCores(Collection<CloudDescriptor> cores,
ClusterState clusterState) {
Review comment:
> When the core directory no longer exists (say for example the disk was
wiped before starting the node), the clusterState will register the missing
replica(s) as DOWN and the logs will error saying that those cores cannot be
found. Will those missing cores still be returned within the cores.getCores()
call?

No. If the core directories themselves have been wiped, the node cannot
return those cores from `cores.getCores()`.

> but if they aren't included as I suspect, then this will return healthy
even when there are replicas in the clusterState scheduled on the node that are
not healthy.

Yes, but that should be okay? The node itself is in fact healthy.

> Maybe this just requires a fix in a different part of solr to auto-delete
replicas that have cores that are missing on startup.

Is this a common case, i.e. wiping disks and putting the nodes back in
rotation? It is more common to have nodes with cores that are not in the
cluster state, and those cores are unloaded automatically when the nodes come
back up. Also, there is the node lost trigger, which can be used to delete
replicas from the cluster state for nodes that go away for a long time.
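
To make the scenario under discussion concrete, here is a minimal sketch in plain Java. It does not use Solr's classes: `ReplicaInfo`, `unhealthyLocalCores`, `replicasMissingLocally`, and the string states are hypothetical stand-ins for illustration, not the handler's actual implementation. It only shows why a check that starts from the locally loaded cores cannot see a replica whose core directory was wiped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical model, not Solr code: illustrates the gap discussed above.
class HealthCheckGapSketch {

  /** Hypothetical stand-in for one replica entry in the cluster state. */
  static final class ReplicaInfo {
    final String coreName;
    final String nodeName;
    final String state; // assumed values: "ACTIVE", "DOWN", "RECOVERING"

    ReplicaInfo(String coreName, String nodeName, String state) {
      this.coreName = coreName;
      this.nodeName = nodeName;
      this.state = state;
    }
  }

  /** Local-first check: only cores that are loaded on this node are examined. */
  static List<String> unhealthyLocalCores(Set<String> localCores,
                                          List<ReplicaInfo> clusterState,
                                          String nodeName) {
    List<String> unhealthy = new ArrayList<>();
    for (ReplicaInfo r : clusterState) {
      boolean onThisNode = r.nodeName.equals(nodeName);
      boolean loadedLocally = localCores.contains(r.coreName);
      if (onThisNode && loadedLocally && !r.state.equals("ACTIVE")) {
        unhealthy.add(r.coreName);
      }
    }
    return unhealthy;
  }

  /** Cluster-state-first check: replicas assigned to this node with no local core. */
  static List<String> replicasMissingLocally(Set<String> localCores,
                                             List<ReplicaInfo> clusterState,
                                             String nodeName) {
    List<String> missing = new ArrayList<>();
    for (ReplicaInfo r : clusterState) {
      if (r.nodeName.equals(nodeName) && !localCores.contains(r.coreName)) {
        missing.add(r.coreName);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    // The node still has one core on disk; the other core directory was wiped.
    Set<String> localCores = Set.of("coll_shard1_replica_n1");
    List<ReplicaInfo> clusterState = List.of(
        new ReplicaInfo("coll_shard1_replica_n1", "node1", "ACTIVE"),
        new ReplicaInfo("coll_shard2_replica_n2", "node1", "DOWN"));

    // The local-first check finds nothing, so the node would report healthy.
    System.out.println(unhealthyLocalCores(localCores, clusterState, "node1"));
    // Only a cluster-state-first view surfaces the wiped replica.
    System.out.println(replicasMissingLocally(localCores, clusterState, "node1"));
  }
}
```

Whether the second view belongs in the health check is exactly the open question in this thread: it describes the state of the cluster rather than the health of the node itself.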