shalinmangar commented on a change in pull request #1387: SOLR-14210: Include 
replica health in healtcheck handler
URL: https://github.com/apache/lucene-solr/pull/1387#discussion_r402372168
 
 

 ##########
 File path: solr/core/src/java/org/apache/solr/handler/admin/HealthCheckHandler.java
 ##########
 @@ -88,15 +96,46 @@ public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throw
       return;
     }
 
-    // Set status to true if this node is in live_nodes
-    if (clusterState.getLiveNodes().contains(cores.getZkController().getNodeName())) {
-      rsp.add(STATUS, OK);
-    } else {
+    // Fail if not in live_nodes
+    if (!clusterState.getLiveNodes().contains(cores.getZkController().getNodeName())) {
       rsp.add(STATUS, FAILURE);
       rsp.setException(new SolrException(SolrException.ErrorCode.SERVICE_UNAVAILABLE, "Host Unavailable: Not in live nodes as per zk"));
+      return;
     }
 
-    rsp.setHttpCaching(false);
+    // Optionally require that all cores on this node are active if param 'requireHealthyCores=true'
+    if (req.getParams().getBool(PARAM_REQUIRE_HEALTHY_CORES, false)) {
+      List<String> unhealthyCores = findUnhealthyCores(clusterState,
+              cores.getNodeConfig().getNodeName(),
+              cores.getAllCoreNames());
+      if (unhealthyCores.size() > 0) {
+          rsp.add(STATUS, FAILURE);
+          rsp.setException(new SolrException(SolrException.ErrorCode.SERVICE_UNAVAILABLE,
+                  "Replica(s) " + unhealthyCores + " are currently initializing or recovering"));
+          return;
+      }
+      rsp.add("message", "All cores are healthy");
+    }
+
+    // All lights green, report healthy
+    rsp.add(STATUS, OK);
+  }
+
+  /**
+   * Find replicas DOWN or RECOVERING, or replicas in clusterstate that do not exist on local node
+   * @param clusterState clusterstate from ZK
+   * @param nodeName this node name
+   * @param allCoreNames list of all core names on current node
+   * @return list of core names that are either DOWN or RECOVERING on 'nodeName'
+   */
+  static List<String> findUnhealthyCores(ClusterState clusterState, String nodeName, Collection<String> allCoreNames) {
+    return clusterState.getCollectionsMap().values().stream()
 
 Review comment:
   The cluster state is a shell object that holds individual collection states, 
each of which lives in its own znode. Each node watches only the collection 
states for which it hosts a replica. The remaining collections exist as lazy 
references that are populated by a live read. The `getCollectionsMap()` method 
calls `CollectionRef.get()` for every collection, so it triggers a live read 
from ZK for each lazy reference. A lazy reference can optionally cache the 
fetched state for 2 seconds (if you call `CollectionRef.get(true)`), but that 
is still too short an interval to help a health check.
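
   To illustrate the cost, here is a minimal, self-contained sketch of the 
watched-versus-lazy distinction. It does not use Solr classes: 
`ToyClusterState`, `LazyRef`, `getAll()` and `getOne()` are hypothetical 
stand-ins, and the `liveReads` counter stands in for round trips to ZK. It only 
shows why resolving every collection is more expensive than looking up the 
collections that local cores belong to.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Toy model only: these types are hypothetical and are NOT Solr's ClusterState/CollectionRef.
class LazyCollectionRefSketch {

  /** Stand-in for a lazy reference: every get() performs a simulated live read. */
  static final class LazyRef {
    private final Supplier<String> liveRead;
    LazyRef(Supplier<String> liveRead) { this.liveRead = liveRead; }
    String get() { return liveRead.get(); }
  }

  /** Shell object holding watched (already local) states and lazy references. */
  static final class ToyClusterState {
    private final Map<String, String> watched = new LinkedHashMap<>();
    private final Map<String, LazyRef> lazy = new LinkedHashMap<>();
    int liveReads = 0; // counts simulated round trips to ZK

    void addWatched(String name, String state) { watched.put(name, state); }

    void addLazy(String name, String state) {
      lazy.put(name, new LazyRef(() -> { liveReads++; return state; }));
    }

    /** Resolves every collection, so each lazy reference costs one live read. */
    Map<String, String> getAll() {
      Map<String, String> all = new LinkedHashMap<>(watched);
      lazy.forEach((name, ref) -> all.put(name, ref.get()));
      return all;
    }

    /** Resolves a single collection; watched collections cost no live read. */
    String getOne(String name) {
      String state = watched.get(name);
      if (state != null) return state;
      LazyRef ref = lazy.get(name);
      return ref == null ? null : ref.get();
    }
  }

  /** One watched collection (hosted locally) and two lazy ones (hosted elsewhere). */
  static ToyClusterState example() {
    ToyClusterState cs = new ToyClusterState();
    cs.addWatched("local1", "state-of-local1");
    cs.addLazy("remoteA", "state-of-remoteA");
    cs.addLazy("remoteB", "state-of-remoteB");
    return cs;
  }

  public static void main(String[] args) {
    ToyClusterState cs = example();
    cs.getOne("local1");
    System.out.println("live reads after local lookup: " + cs.liveReads);
    cs.getAll();
    System.out.println("live reads after resolving all: " + cs.liveReads);
  }
}
```

   In this toy model the per-collection lookup touches only state that is 
already watched, which is the property a frequently polled health check would 
want.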
   
   > I want to exclude replicas of inactive shards from the check. The only 
place I could find that info was in Slice inside Clusterstate.
   
   It's more code but it is a good idea for sure. Your idea of skipping 
recovery_failed cores from the health check is also sound.
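
   As a rough sketch of the filtering discussed here (again with toy types, not 
Solr's `Slice` or `Replica`): `ToyReplica`, `ShardState` and `ReplicaState` are 
hypothetical, and the rule "skip inactive shards and recovery_failed replicas" 
is my reading of the proposal rather than the PR's final code.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;

// Toy model only: hypothetical types, not Solr's Slice/Replica API.
class UnhealthyCoreFilterSketch {
  enum ShardState { ACTIVE, INACTIVE }
  enum ReplicaState { ACTIVE, DOWN, RECOVERING, RECOVERY_FAILED }

  static final class ToyReplica {
    final String coreName;
    final String nodeName;
    final ShardState shardState;
    final ReplicaState state;

    ToyReplica(String coreName, String nodeName, ShardState shardState, ReplicaState state) {
      this.coreName = coreName;
      this.nodeName = nodeName;
      this.shardState = shardState;
      this.state = state;
    }
  }

  /** Returns core names on nodeName that are DOWN or RECOVERING in an active shard. */
  static List<String> findUnhealthyCores(Collection<ToyReplica> replicas, String nodeName) {
    return replicas.stream()
        .filter(r -> r.nodeName.equals(nodeName))          // only replicas on this node
        .filter(r -> r.shardState == ShardState.ACTIVE)    // skip replicas of inactive shards
        .filter(r -> r.state == ReplicaState.DOWN
            || r.state == ReplicaState.RECOVERING)         // RECOVERY_FAILED is not reported
        .map(r -> r.coreName)
        .collect(Collectors.toList());
  }

  /** Sample data covering each rule in the filter. */
  static List<ToyReplica> exampleReplicas() {
    return Arrays.asList(
        new ToyReplica("c1", "n1", ShardState.ACTIVE, ReplicaState.ACTIVE),
        new ToyReplica("c2", "n1", ShardState.ACTIVE, ReplicaState.RECOVERING),
        new ToyReplica("c3", "n1", ShardState.INACTIVE, ReplicaState.DOWN),
        new ToyReplica("c4", "n1", ShardState.ACTIVE, ReplicaState.RECOVERY_FAILED),
        new ToyReplica("c5", "n2", ShardState.ACTIVE, ReplicaState.DOWN));
  }

  public static void main(String[] args) {
    System.out.println(findUnhealthyCores(exampleReplicas(), "n1"));
  }
}
```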
   
   Thanks for taking this up!
