Looks like there's room for improvement. I too would want the desired state to be reflected in ZK first, before attempting to make it happen. Remove the node's live_nodes entry first, then mark each local replica as state=DOWN, then close down everything else.
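
To make that ordering concrete, here is a minimal sketch in Python. It is not Solr code: `FakeClusterState`, `graceful_shutdown`, and the `close_everything` callback are hypothetical names, and the cluster state is a plain in-memory object rather than ZooKeeper. It only shows that the shared state is updated before any local teardown begins.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class FakeClusterState:
    """In-memory stand-in for the state kept in ZK (hypothetical, not a Solr API)."""
    live_nodes: Set[str] = field(default_factory=set)
    replica_states: Dict[str, str] = field(default_factory=dict)  # replica -> state


def graceful_shutdown(
    node: str,
    local_replicas: List[str],
    zk: FakeClusterState,
    close_everything: Callable[[], None],
) -> List[str]:
    """Publish the desired state first, then tear down. Returns the steps taken."""
    steps = []

    # 1. Remove the node from live_nodes so clients stop routing to it.
    zk.live_nodes.discard(node)
    steps.append("live_nodes removed")

    # 2. Mark every replica hosted on this node as DOWN.
    for replica in local_replicas:
        zk.replica_states[replica] = "DOWN"
    steps.append("replicas DOWN")

    # 3. Only now close cores, executors, and the servlet container.
    close_everything()
    steps.append("closed")
    return steps
```

With this ordering, whatever `close_everything` does runs only after other nodes and clients could already see the node as gone and its replicas as DOWN.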

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

On Wed, Mar 29, 2023 at 9:16 AM Jan Høydahl <jan....@cominvent.com> wrote:

> Hi,
>
> Trying to prevent traffic being sent to a Solr node that is going to shut
> down, to avoid interruption of service as seen from various clients.
> First part of the puzzle is signaling to any (external) load balancer to
> stop sending requests to the node.
> The other part is having SolrJ understand that the node is being stopped,
> and not routing internal requests to cores on the node.
>
> Does anyone have a good command of the Shutdown logic in Solr?
> My understanding is a bit sparse, but here's what I can see in the code:
>
> 1. bin/solr stop will send a STOP command to Jetty's STOP_PORT with
>    (not-so-secret) stop key
> 2. Jetty starts the shutdown process, destroying all servlets and filters,
>    including Solr's dispatchFilter
> 3. Solr is notified about the shutdown through a callback in
>    CoreContainerProvider.
> 4. CoreContainerProvider#close() is called which calls CC#shutdown
> 5. CC shuts down every core on the node and then calls zkController#preClose
> 6. ZkController#preClose removes ephemeral live_nodes/myNode and then
>    publishes down state in state.json
> 7. Wait for shutdown of executors etc. and let Jetty exit
>
> I could have got it wrong though.
>
> I was hoping that a Solr node would first publish itself as "not ready" in
> ZK before rejecting requests, but seems as this is all reversed, since
> shutdown is initiated by Jetty?
> So could we instead register our own shutdown-port in Solr, and let our
> bin/solr script trigger that one? There we could orchestrate the shutdown
> as we want:
>
> 1. Remove live_nodes znode in ZK
> 2. Publish itself as not ready on api/node/health handler (or a new
>    api/node/ready?)
> 3. Sleep for a few seconds (or longer with an optional &shutdownDelay
>    argument to our shutdown endpoint)
> 4. trigger server.stop() to take down Jetty and kill the servlet
>
> I filed https://issues.apache.org/jira/browse/SOLR-16722 to discuss a
> technical solution.
> The primary goal is to drain traffic right before shutting a node down,
> but it could also be designed as a generic Readiness Probe
> <https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-readiness-probes>
> modeled from Kubernetes?
> I'm also aware that any solr client should be prepared to hit a dead node
> due to network/power events, and retry. But it won't hurt to be graceful
> whenever we can.
>
> Happy to hear your thoughts. Is this a made-up problem?
>
> Jan