janhoy commented on a change in pull request #1387: SOLR-14210: Include replica 
health in healthcheck handler
URL: https://github.com/apache/lucene-solr/pull/1387#discussion_r403291043
 
 

 ##########
 File path: solr/core/src/java/org/apache/solr/handler/admin/HealthCheckHandler.java
 ##########
 @@ -88,15 +98,45 @@ public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throw
       return;
     }
 
-    // Set status to true if this node is in live_nodes
-    if (clusterState.getLiveNodes().contains(cores.getZkController().getNodeName())) {
-      rsp.add(STATUS, OK);
-    } else {
+    // Fail if not in live_nodes
+    if (!clusterState.getLiveNodes().contains(cores.getZkController().getNodeName())) {
       rsp.add(STATUS, FAILURE);
       rsp.setException(new SolrException(SolrException.ErrorCode.SERVICE_UNAVAILABLE, "Host Unavailable: Not in live nodes as per zk"));
+      return;
     }
 
-    rsp.setHttpCaching(false);
+    // Optionally require that all cores on this node are active if param 'requireHealthyCores=true'
+    if (req.getParams().getBool(PARAM_REQUIRE_HEALTHY_CORES, false)) {
+      Collection<CloudDescriptor> coreDescriptors = cores.getCores().stream()
+          .map(c -> c.getCoreDescriptor().getCloudDescriptor()).collect(Collectors.toList());
+      List<String> unhealthyCores = findUnhealthyCores(coreDescriptors, clusterState);
+      if (unhealthyCores.size() > 0) {
+          rsp.add(STATUS, FAILURE);
+          rsp.setException(new SolrException(SolrException.ErrorCode.SERVICE_UNAVAILABLE,
+                  "Replica(s) " + unhealthyCores + " are currently initializing or recovering"));
+          return;
+      }
+      rsp.add("message", "All cores are healthy");
+    }
+
+    // All lights green, report healthy
+    rsp.add(STATUS, OK);
+  }
+
+  /**
+   * Find replicas DOWN or RECOVERING, or replicas in clusterstate that do not exist on local node.
+   * We first find local cores which are either not registered or unhealthy, and check each of these against
+   * the clusterstate, and return a list of unhealthy replicas that are part of an active shard for an existing collection
+   * @param cores list of core descriptors to iterate
+   * @param clusterState clusterstate from ZK
+   * @return list of core names that are either DOWN or RECOVERING on 'nodeName'
+   */
+  static List<String> findUnhealthyCores(Collection<CloudDescriptor> cores, ClusterState clusterState) {
 
 Review comment:
   > I think it might be helpful to name the fallback `unknown:<collectionName>_<shardId>` or `code-loading:<collectionName>_<shardId>` to distinguish...
   
   I have changed my mind and now agree with shalin that it is enough to
   return a count and, if that count is > 0, to include "N out of M cores are
   still not healthy" in the error message. That avoids the confusion and gives
   the caller a short, clear state. Imagine a node with 3000 cores that has just
   been started but has not yet recovered: the list of RECOVERING cores would be
   huge :) If you really need to know which cores are unhealthy, there are ways
   to find that out elsewhere.
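   
   To make the suggestion concrete, here is a rough sketch of the count-based message. It is deliberately independent of Solr: the `HealthSummary` class, its `State` enum and the `describe` helper are hypothetical names used only for illustration, not the handler's actual code or Solr's `Replica.State`.

```java
import java.util.Arrays;
import java.util.List;

class HealthSummary {
    /** Hypothetical stand-in for a replica state (assumption, not a Solr type). */
    enum State { ACTIVE, RECOVERING, DOWN }

    /** Counts every core whose state is anything other than ACTIVE. */
    static long countUnhealthy(List<State> states) {
        return states.stream().filter(s -> s != State.ACTIVE).count();
    }

    /** Builds a short message with counts only, never a list of core names. */
    static String describe(List<State> states) {
        long unhealthy = countUnhealthy(states);
        if (unhealthy == 0) {
            return "All cores are healthy";
        }
        return unhealthy + " out of " + states.size() + " cores are still not healthy";
    }

    public static void main(String[] args) {
        List<State> states = Arrays.asList(State.ACTIVE, State.RECOVERING, State.DOWN);
        System.out.println(describe(states));
    }
}
```

   The message length stays constant whether the node hosts 3 cores or 3000, which is the main point of returning a count.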
