[ https://issues.apache.org/jira/browse/SOLR-14210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023039#comment-17023039 ]
Houston Putman commented on SOLR-14210:
---------------------------------------

[~shalin], that's a good idea. No need for more handlers; one handler that checks health at different levels is exactly what we want.

[~janhoy], unfortunately k8s does not separate "is this ready for traffic?" from "is this ready for the next pod to be restarted?". It would be wonderful if there were a restartableProbe, but it will probably be hard to get that into k8s anytime soon. Luckily there is a Service parameter, {{publishNotReadyAddresses: true}}, that allows the Service objects exposing your Solr clouds to send requests to pods that aren't ready. So this lets us treat the "readyProbe" much more like a "restartableProbe". It's something of a misuse of kubernetes, but you do what you have to do.

As for restarting multiple nodes that don't host the same shards at the same time, that would be awesome. I did some research into it, and I don't think we would be able to handle it with StatefulSets alone. We could probably build something into the [solr operator|https://github.com/bloomberg/solr-operator], but k8s doesn't really let you control parallel StatefulSet pod upgrades very well. I think the only way it could be done is to use the [on-delete update strategy|https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#on-delete] and have the solr-operator (or any controller for a solr CRD) smartly pick which pods it can delete in parallel without taking down multiple replicas of any shard. I do agree it would be awesome to add functionality to this handler that could eventually make those parallel operations easier. Rough sketches of both ideas follow.
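First, a minimal sketch of the Service trick, written in Go since that's what the [solr operator|https://github.com/bloomberg/solr-operator] is written in. The package, function, names, labels, and port are placeholders for illustration, not what the operator actually generates:

{code:go}
package controller // package and identifier names are arbitrary for this sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildCommonService builds a Service that keeps routing traffic to Solr
// pods even while they fail the readiness probe, which is what lets the
// readiness probe be repurposed as a "restartableProbe".
func buildCommonService(cloudName string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: cloudName + "-solrcloud-common"},
		Spec: corev1.ServiceSpec{
			Selector: map[string]string{"solr-cloud": cloudName},
			Ports: []corev1.ServicePort{
				{Name: "solr-client", Port: 8983},
			},
			// The key setting: pods that are not Ready stay in the
			// Service's endpoints anyway.
			PublishNotReadyAddresses: true,
		},
	}
}
{code}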
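Second, a rough sketch of the kind of batch selection an OnDelete-based controller could do. Here {{shardsByPod}} stands in for whatever cluster-state lookup the operator would actually perform (e.g. via the Collections API):

{code:go}
package controller

// choosePodsToDelete greedily picks a batch of outdated pods that can be
// deleted in parallel without taking down two replicas of the same shard.
// shardsByPod maps a pod name to the shards it hosts replicas of.
func choosePodsToDelete(outdatedPods []string, shardsByPod map[string][]string) []string {
	touched := make(map[string]bool) // shards losing a replica in this batch
	var batch []string
	for _, pod := range outdatedPods {
		conflict := false
		for _, shard := range shardsByPod[pod] {
			if touched[shard] {
				conflict = true
				break
			}
		}
		if conflict {
			continue // leave this pod for a later batch
		}
		for _, shard := range shardsByPod[pod] {
			touched[shard] = true
		}
		batch = append(batch, pod)
	}
	return batch
}
{code}

A real controller would also need to confirm that the replicas staying up are actually "active" before deleting anything, which is exactly the information the handler proposed on this issue would expose.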
> Introduce Node-level status handler for replicas
> ------------------------------------------------
>
>                 Key: SOLR-14210
>                 URL: https://issues.apache.org/jira/browse/SOLR-14210
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public (Default Security Level. Issues are Public)
>    Affects Versions: master (9.0), 8.5
>            Reporter: Houston Putman
>            Priority: Major
>
> h2. Background
> As was brought up in SOLR-13055, in order to run Solr in a more cloud-native way, we need some additional features around node-level healthchecks.
> {quote}Like in Kubernetes we need 'liveness' and 'readiness' probes, explained in [https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/], to determine if a node is live and ready to serve live traffic.
> {quote}
> However, there are issues around kubernetes managing its own rolling restarts. With the current healthcheck setup, it's easy to envision a scenario in which Solr reports itself as "healthy" while all of its replicas are actually recovering. Kubernetes, seeing a healthy pod, would then go and restart the next Solr node. This can repeat until all replicas are "recovering" and none are healthy. (Maybe the last one restarted will be "down", but still there are no "active" replicas.)
> h2. Proposal
> I propose we add a healthcheck handler that returns whether all replicas hosted by that Solr node are healthy and "active". That way we will be able to use the [default kubernetes rolling restart logic|https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#update-strategies] with Solr.
> To add on to [Jan's point here|https://issues.apache.org/jira/browse/SOLR-13055?focusedCommentId=16716559&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16716559], this handler should be more friendly to other Content-Types and should use better HTTP response statuses.
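To make that last point concrete: the friendliest contract for a probe is one where the HTTP status alone is enough, e.g. 200 when every replica on the node is "active" and 503 otherwise, so callers never have to parse a body. Below is a sketch of the probe side under that assumption; the endpoint path is made up, since the handler's real path is still to be decided on this issue:

{code:go}
package main

import (
	"fmt"
	"net/http"
	"time"
)

// nodeReady checks the node the way a readiness probe would: only the
// status code matters. 200 = every replica on the node is active,
// 503 = at least one replica is down or recovering.
func nodeReady(baseURL string) (bool, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	// "/solr/admin/replicas/status" is a hypothetical path for illustration.
	resp, err := client.Get(baseURL + "/solr/admin/replicas/status")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusOK:
		return true, nil
	case http.StatusServiceUnavailable:
		return false, nil
	default:
		return false, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
}

func main() {
	ready, err := nodeReady("http://localhost:8983")
	fmt.Println(ready, err)
}
{code}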