[ https://issues.apache.org/jira/browse/SOLR-14210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023039#comment-17023039 ]
Houston Putman commented on SOLR-14210:
---------------------------------------

[~shalin], that's a good idea. No need for more handlers; one handler that checks health at different levels is exactly what we want.

[~janhoy], unfortunately k8s does not separate "is this ready for traffic?" from "is this ready for the next pod to be restarted?". It would be wonderful if there were a restartableProbe, but it will probably be hard to get that into k8s anytime soon. Luckily there is a Service parameter, {{publishNotReadyAddresses: true}}, that allows the Service objects exposing your Solr clouds to send requests to pods that aren't ready. So this lets us treat the "readyProbe" much more like a "restartableProbe". It's something of a misuse of kubernetes, but you do what you have to do.

As for restarting multiple nodes that don't host the same shards at the same time, that would be awesome. I did some research into it, and I don't think we would be able to handle it with StatefulSets alone. We could probably build something into the [solr operator|https://github.com/bloomberg/solr-operator], but k8s doesn't really let you control parallel StatefulSet pod upgrades very well. I think the only way it could be done is to use the [on-delete update strategy|https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#on-delete] and have the solr-operator (or any controller for a solr CRD) smartly pick which pods it can delete in parallel without taking down multiple replicas of any shard. I do agree it would be awesome to add functionality to this handler that could eventually make those parallel operations easier. Rough sketches of both ideas follow.
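First, a minimal sketch of the Service trick, written in Go since that's what the [solr operator|https://github.com/bloomberg/solr-operator] is written in. The package, function, names, labels, and port are placeholders for illustration, not what the operator actually generates:

{code:go}
package controller // package and identifier names are arbitrary for this sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildCommonService builds a Service that keeps routing traffic to Solr
// pods even while they fail the readiness probe, which is what lets the
// readiness probe be repurposed as a "restartableProbe".
func buildCommonService(cloudName string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: cloudName + "-solrcloud-common"},
		Spec: corev1.ServiceSpec{
			Selector: map[string]string{"solr-cloud": cloudName},
			Ports: []corev1.ServicePort{
				{Name: "solr-client", Port: 8983},
			},
			// The key setting: pods that are not Ready stay in the
			// Service's endpoints anyway.
			PublishNotReadyAddresses: true,
		},
	}
}
{code}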
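Second, a rough sketch of the kind of batch selection an OnDelete-based controller could do. Here {{shardsByPod}} stands in for whatever cluster-state lookup the operator would actually perform (e.g. via the Collections API):

{code:go}
package controller

// choosePodsToDelete greedily picks a batch of outdated pods that can be
// deleted in parallel without taking down two replicas of the same shard.
// shardsByPod maps a pod name to the shards it hosts replicas of.
func choosePodsToDelete(outdatedPods []string, shardsByPod map[string][]string) []string {
	touched := make(map[string]bool) // shards losing a replica in this batch
	var batch []string
	for _, pod := range outdatedPods {
		conflict := false
		for _, shard := range shardsByPod[pod] {
			if touched[shard] {
				conflict = true
				break
			}
		}
		if conflict {
			continue // leave this pod for a later batch
		}
		for _, shard := range shardsByPod[pod] {
			touched[shard] = true
		}
		batch = append(batch, pod)
	}
	return batch
}
{code}

A real controller would also need to confirm that the replicas staying up are actually "active" before deleting anything, which is exactly the information the handler proposed on this issue would expose.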
> Introduce Node-level status handler for replicas
> ------------------------------------------------
>
>                 Key: SOLR-14210
>                 URL: https://issues.apache.org/jira/browse/SOLR-14210
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public (Default Security Level. Issues are Public)
>    Affects Versions: master (9.0), 8.5
>            Reporter: Houston Putman
>            Priority: Major
>
> h2. Background
> As was brought up in SOLR-13055, in order to run Solr in a more cloud-native way, we need some additional features around node-level healthchecks.
> {quote}Like in Kubernetes we need 'liveness' and 'readiness' probes, explained in [https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/], to determine if a node is live and ready to serve live traffic.
> {quote}
> However, there are issues around kubernetes managing its own rolling restarts. With the current healthcheck setup, it's easy to envision a scenario in which Solr reports itself as "healthy" while all of its replicas are actually recovering. Kubernetes, seeing a healthy pod, would then go and restart the next Solr node. This can repeat until all replicas are "recovering" and none are healthy. (Maybe the last one restarted will be "down", but still there are no "active" replicas.)
> h2. Proposal
> I propose we add a healthcheck handler that returns whether all replicas hosted by that Solr node are healthy and "active". That way we will be able to use the [default kubernetes rolling restart logic|https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#update-strategies] with Solr.
> To add on to [Jan's point here|https://issues.apache.org/jira/browse/SOLR-13055?focusedCommentId=16716559&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16716559], this handler should be more friendly to other Content-Types and should use better HTTP response statuses.
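To make that last point concrete: the friendliest contract for a probe is one where the HTTP status alone is enough, e.g. 200 when every replica on the node is "active" and 503 otherwise, so callers never have to parse a body. Below is a sketch of the probe side under that assumption; the endpoint path is made up, since the handler's real path is still to be decided on this issue:

{code:go}
package main

import (
	"fmt"
	"net/http"
	"time"
)

// nodeReady checks the node the way a readiness probe would: only the
// status code matters. 200 = every replica on the node is active,
// 503 = at least one replica is down or recovering.
func nodeReady(baseURL string) (bool, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	// "/solr/admin/replicas/status" is a hypothetical path for illustration.
	resp, err := client.Get(baseURL + "/solr/admin/replicas/status")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusOK:
		return true, nil
	case http.StatusServiceUnavailable:
		return false, nil
	default:
		return false, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
}

func main() {
	ready, err := nodeReady("http://localhost:8983")
	fmt.Println(ready, err)
}
{code}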