[
https://issues.apache.org/jira/browse/SOLR-14210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023039#comment-17023039
]
Houston Putman commented on SOLR-14210:
---------------------------------------
[~shalin], that's a good idea. No need for more handlers. One handler that
checks health at different levels is exactly what we want.
[~janhoy], unfortunately k8s does not separate "is this ready for traffic?"
and "is this ready for the next pod to be restarted?". It would be wonderful if
there was a restartableProbe, but it will probably be hard to get that into k8s
anytime soon. Luckily there is a Service parameter,
{{publishNotReadyAddresses: true}}, that allows the Service objects
exposing your Solr clouds to send requests to pods that aren't ready. This
lets us treat the readinessProbe much more like a "restartableProbe". It's kind
of a misuse of Kubernetes, but you do what you have to do.
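
As a rough illustration of that Service setting, here is a minimal sketch that builds such a manifest as a Python dict and prints it as JSON (which kubectl accepts). The function name, the Service name, the selector label, and the port name are hypothetical; 8983 is assumed as the Solr port. Only the {{publishNotReadyAddresses}} field comes from the discussion above.

```python
import json


def solr_service_manifest(name: str, selector: dict, port: int = 8983) -> dict:
    """Build a Service manifest that keeps routing to pods that are not ready.

    All names passed in are illustrative; the port default is an assumption.
    """
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "selector": selector,
            # Keep not-ready pods in the Service endpoints, so the readiness
            # probe no longer decides whether a pod receives traffic.
            "publishNotReadyAddresses": True,
            "ports": [{"name": "solr-client", "port": port, "targetPort": port}],
        },
    }


if __name__ == "__main__":
    manifest = solr_service_manifest("example-solrcloud", {"app": "example-solr"})
    print(json.dumps(manifest, indent=2))
```

With this in place the readiness probe only gates the rolling restart, not request routing, so Solr itself still has to route around replicas that are recovering.
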
As for restarting multiple nodes at the same time that don't host the same
shards, that would be awesome. I did some research into it and I don't think we
would be able to handle it with just StatefulSets. We could probably build
something into the [solr operator|https://github.com/bloomberg/solr-operator],
but k8s doesn't really let you control parallel StatefulSet pod upgrades very
well. I think the only way it could be done is to use the [on-delete update
strategy|https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#on-delete]
and have the solr-operator (or any controller for a Solr CRD) smartly pick
which pods it can delete in parallel without taking down multiple replicas of
any shard. I do agree it would be awesome to add functionality in this handler
that could eventually make those parallel operations easier.
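
To make the "smartly pick which pods it can delete in parallel" idea concrete, here is a minimal greedy sketch. It is not how the solr-operator works; the function name and the input shape (a map from pod name to the set of (collection, shard) pairs it hosts) are assumptions, and a real controller would also need to confirm that the remaining replicas of each shard are "active" before deleting anything.

```python
from typing import Dict, List, Set, Tuple

# A shard is identified by (collection name, shard name).
Shard = Tuple[str, str]


def pick_parallel_restart_batch(
    shards_by_pod: Dict[str, Set[Shard]],
    pending: List[str],
) -> List[str]:
    """Greedily pick pods so that no two picked pods host the same shard.

    Pods are considered in the order given by `pending`; a pod is skipped
    if it shares any shard with a pod already in the batch.
    """
    batch: List[str] = []
    touched: Set[Shard] = set()
    for pod in pending:
        shards = shards_by_pod.get(pod, set())
        if shards.isdisjoint(touched):
            batch.append(pod)
            touched |= shards
    return batch


if __name__ == "__main__":
    # Hypothetical placement of replicas across four pods.
    placement = {
        "solr-0": {("c1", "shard1"), ("c1", "shard2")},
        "solr-1": {("c1", "shard1")},
        "solr-2": {("c1", "shard3")},
        "solr-3": {("c1", "shard2"), ("c1", "shard3")},
    }
    print(pick_parallel_restart_batch(placement, ["solr-0", "solr-1", "solr-2", "solr-3"]))
    # prints ['solr-0', 'solr-2']
```

The greedy pass is order-dependent and not guaranteed to find the largest possible batch, but it never takes down more than one replica of any shard in a single round.
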
> Introduce Node-level status handler for replicas
> ------------------------------------------------
>
> Key: SOLR-14210
> URL: https://issues.apache.org/jira/browse/SOLR-14210
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Affects Versions: master (9.0), 8.5
> Reporter: Houston Putman
> Priority: Major
>
> h2. Background
> As was brought up in SOLR-13055, in order to run Solr in a more cloud-native
> way, we need some additional features around node-level healthchecks.
> {quote}Like in Kubernetes we need 'liveness' and 'readiness' probe
> explained in
> [https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/]
> to determine if a node is live and ready to serve live traffic.
> {quote}
>
> However there are issues around Kubernetes managing its own rolling
> restarts. With the current healthcheck setup, it's easy to envision a
> scenario in which Solr reports itself as "healthy" when all of its replicas
> are actually recovering. Kubernetes, seeing a healthy pod, would
> then go and restart the next Solr node. This can continue until all replicas
> are "recovering" and none are healthy (the last one restarted may be
> "down", but there are still no "active" replicas).
> h2. Proposal
> I propose we make an additional healthcheck handler that returns whether all
> replicas hosted by that Solr node are healthy and "active". That way we will
> be able to use the [default kubernetes rolling restart
> logic|https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#update-strategies]
> with Solr.
> To add on to [Jan's point
> here|https://issues.apache.org/jira/browse/SOLR-13055?focusedCommentId=16716559&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16716559],
> this handler should be more friendly for other Content-Types and should use
> better HTTP response statuses.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)