bq: Would a clean shutdown result in the node to be flagged as down in the cluster state straight away?

It should, if it's truly clean. HOWEVER, a "clean shutdown" is
unfortunately not just a "bin/solr stop" because of the timeout Shawn
mentioned, see SOLR-9371. It's a simple edit to make the timeout much
longer, but the real fix should poll.

The "smoking gun" would be a correlation between the node not being
marked as down in state.json and a message about "forcefully killing ....."
when you stop the instance with bin/solr. After only 5 seconds, that
script forcefully kills the Solr instance, which would _not_ flag the
replicas it hosts as down. After an interval, you should see the node
disappear from the live_nodes znode though.

The problem of course is that part of graceful shutdown is each replica
updating the associated state.json, and after a forced kill they don't
get a chance. ZK tracks the Solr instance's session through periodic
heartbeats and, if the session times out, removes the associated znode
in live_nodes. Solr code checks both state.json and live_nodes to
know whether a node is truly functioning; being absent from live_nodes
trumps whatever state is in state.json.

Best,
Erick

On Sat, Oct 22, 2016 at 1:00 AM, Hendrik Haddorp <hendrik.hadd...@gmx.net> wrote:
> Thanks, that was what I was hoping for I just didn't see any indication for
> that in the normal log output.
>
> The reason for asking is that I have a SolrCloud 6.2.1 setup and when ripple
> restarting the nodes I sometimes get errors. So far I have seen two
> different things:
> 1) The node starts up again and is able to receive new replicas but all
> existing replicas are broken.
> 2) All nodes come up and no problems are seen in the cluster status but the
> admin UI on one node claims that a file for one config set is missing.
> Restarting the node resolves the issue.
>
> This looked to me like the node is not going down cleanly. Would a clean
> shutdown result in the node to be flagged as down in the cluster state
> straight away? So far the ZooKeeper data gets only updated once the node
> comes up again and reports itself as down before the recovery starts.
>
> On 21.10.2016 15:01, Shawn Heisey wrote:
>>
>> On 10/21/2016 6:56 AM, Hendrik Haddorp wrote:
>>>
>>> I'm running solrcloud in foreground mode (-f). Does it make a
>>> difference for Solr if I stop it by pressing ctrl-c, sending it a
>>> SIGTERM or using "solr stop"?
>>
>> All of those should produce the same result in the end -- Solr's
>> shutdown hook will be called and a graceful shutdown will commence.
>>
>> Note that in the case of the "bin/solr stop" command, the default is to
>> only wait five seconds for graceful shutdown before proceeding to a
>> forced kill, which for a typical install, means that forced kills become
>> the norm rather than the exception. We have an issue to increase the
>> max timeout, but it hasn't been done yet.
>>
>> I strongly recommend anyone going into production should edit the script
>> to increase the timeout. For the shell script I would do at least 60
>> seconds. The Windows script just does a pause, not an intelligent wait,
>> so going that high probably isn't advisable on Windows.
>>
>> Thanks,
>> Shawn
>>
>
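
To illustrate the "real fix should poll" remark in the reply above, here is a
minimal sketch of a stop routine that polls for exit instead of sleeping for a
fixed interval. It is not the bin/solr code: the function name
stop_with_polling and its three callbacks are hypothetical, and the 60-second
default simply echoes the value suggested in the quoted message.

```python
import time
from typing import Callable


def stop_with_polling(
    request_stop: Callable[[], None],
    is_running: Callable[[], bool],
    force_kill: Callable[[], None],
    timeout: float = 60.0,
    poll_interval: float = 0.5,
) -> str:
    """Ask a process to stop, poll until it exits, force-kill only on timeout.

    All three callbacks are hypothetical hooks supplied by the caller.
    Returns "graceful" if the process exited by itself, "forced" otherwise.
    """
    request_stop()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not is_running():
            # The process finished its own shutdown work, so return at once
            # instead of waiting out the rest of the timeout.
            return "graceful"
        time.sleep(poll_interval)
    # Final check so a process that exited during the last sleep is not killed.
    if not is_running():
        return "graceful"
    force_kill()
    return "forced"
```

With a subprocess.Popen object named proc, the callbacks could be wired up as
proc.terminate, lambda: proc.poll() is None, and proc.kill. A long timeout
costs nothing in the common case because the loop returns as soon as the
process is gone.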
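
The precedence rule from the reply above (absence from live_nodes trumps
whatever state.json says) can be sketched as follows. This is an
illustration, not Solr's actual code: effective_state is a hypothetical
helper, and the replica is assumed to be a plain dict carrying only the
"node_name" and "state" fields.

```python
from typing import Dict, Set


def effective_state(replica: Dict[str, str], live_nodes: Set[str]) -> str:
    """Return the state a client should act on for one replica.

    `replica` is a simplified, assumed stand-in for a state.json replica
    entry; `live_nodes` holds the node names currently registered under
    the live_nodes znode.
    """
    if replica["node_name"] not in live_nodes:
        # The node's entry is gone, so the recorded state is stale:
        # treat the replica as down regardless of what was written.
        return "down"
    return replica["state"]
```

A replica recorded as "active" on a node that was force-killed would
therefore be treated as down once the node's entry disappears from
live_nodes, even though state.json was never updated.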