[ https://issues.apache.org/jira/browse/SOLR-14210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022893#comment-17022893 ]
Jan Høydahl commented on SOLR-14210: ------------------------------------ bq. But I'm not sure if I understand how "the deep check assures that requests will likely succeed" would work. I guess your check for all cores on same node is in OK state qualifies as "deep". The node health must be node level. Not sure about readiness - some collections may be ready while other collections are recovering, would you then tell k8s to send traffic to the pod or not? And what is the exact check k8s does before restarting a node in a rolling restart? Is there a separate probe for when a certain pod/node can be restarted? For Solr it could be that a node hosts the only live replica for a shard since the other replica on another pod/node is currently recovering, but how would k8s know that? If we had a probe /admin/info/health?safeToRestart then that could check clusterstate first? > Introduce Node-level status handler for replicas > ------------------------------------------------ > > Key: SOLR-14210 > URL: https://issues.apache.org/jira/browse/SOLR-14210 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Affects Versions: master (9.0), 8.5 > Reporter: Houston Putman > Priority: Major > > h2. Background > As was brought up in SOLR-13055, in order to run Solr in a more cloud-native > way, we need some additional features around node-level healthchecks. > {quote}Like in Kubernetes we need 'liveliness' and 'readiness' probe > explained in > [https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/n] > determine if a node is live and ready to serve live traffic. > {quote} > > However there are issues around kubernetes managing it's own rolling > restarts. With the current healthcheck setup, it's easy to envision a > scenario in which Solr reports itself as "healthy" when all of its replicas > are actually recovering. Therefore kubernetes, seeing a healthy pod would > then go and restart the next Solr node. This can happen until all replicas > are "recovering" and none are healthy. (maybe the last one restarted will be > "down", but still there are no "active" replicas) > h2. Proposal > I propose we make an additional healthcheck handler that returns whether all > replicas hosted by that Solr node are healthy and "active". That way we will > be able to use the [default kubernetes rolling restart > logic|https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#update-strategies] > with Solr. > To add on to [Jan's point > here|https://issues.apache.org/jira/browse/SOLR-13055?focusedCommentId=16716559&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16716559], > this handler should be more friendly for other Content-Types and should use > bettter HTTP response statuses. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org