Shawn Heisey-4 wrote
> What are you trying to achieve with your restart? Can you just reload
> the collections one by one instead?

We restart whenever we update a handler, the schema, or solrconfig for our cores. I've tried simply shutting down both nodes, updating both, and restarting both. With 1,000 replicas, though, both nodes take a while to spin up each replica and figure out its state relative to SolrCloud, and they spend a lot of time trying to talk with one another. Inevitably something fails and retries in 2 seconds, then 4, then 8, and soon the retries are out to 512 seconds. It doesn't seem that SolrCloud can handle a restart with a lot of cores without some careful orchestration.

I've tried the relatively foolproof/safe approach of:

1) Unload all cores from node A (thus forcing leadership to node B)
2) Shut down, update, and restart node A
3) Re-create all cores on node A as replicas
4) Repeat 1-3 but for node B

The thing is, creating the cores takes a long time - a couple of seconds per core. Keep in mind that nothing else is going on while doing this - no new content to synchronize and no searches being performed. But even at 2-3 seconds per core, cycling through both sets of 1,000 replicas is a fairly long process: roughly 33 to 50 minutes per node.

When I do the above, clusterstate.json appears to be kept up to date and reflects the replicas that have been created. I would expect this, given we're talking about whether or not the replica exists.

What I was then trying to do is find a way to update both nodes without going through the full unload/re-create process. Avoiding the leader election process seemed to be the key to a faster restart. What I was hoping to achieve was:

1) Shift leadership to all replicas on node B
2) Shut down, update, and restart node A
3) Repeat 1-2 but swap A/B

However, there doesn't appear to be a way to force leadership to or from a particular replica.

My next approach was to simply shut down a node and wait for the other node to pick up all leaders, checking progress by fetching clusterstate.json:

1) Shut down node A
2) Wait for the leader election process to play out (leaders shift to node B)
3) Update and restart A
4) Repeat 1-3 but swap A/B

With step 2, though, clusterstate.json doesn't seem to update to reflect the leader election process that I can see playing out in the log. I use http://solrhost/solr/zookeeper?path=%2Fclusterstate.json&detail=true to get clusterstate.json.

In the end, this isn't much better or faster than my first approach (unload and create), because the leader election process still takes a couple of seconds per replica.

So basically three issues - and maybe I need to focus on the "right" problem:

1) Pulling the plug on SolrCloud and restarting with ~1,000 cores is iffy - many collections never start
2) There's no way to force election off of or to a node for an orchestrated restart
3) clusterstate.json doesn't appear to be updated (frequently) when it comes to capturing leadership

--
View this message in context: http://lucene.472066.n3.nabble.com/clusterstate-json-does-not-reflect-current-state-of-down-versus-active-tp4131266p4131470.html
Sent from the Solr - User mailing list archive at Nabble.com.
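
For reference, the "wait for leaders to shift" check in step 2 of the shutdown procedure above could be sketched roughly as follows. This is only an illustration, not something taken from Solr itself: it assumes clusterstate.json has already been fetched and parsed into a dict, and that it has the collection -> "shards" -> "replicas" layout with "node_name" and a string "leader": "true" flag on the leading replica. The function name, sample data, and node names are all hypothetical.

```python
import json


def leaders_not_on(cluster_state, node_name):
    """List (collection, shard) pairs whose leader is missing or not on node_name.

    Assumes the layout collection -> "shards" -> shard -> "replicas" -> replica,
    where the leading replica carries "leader": "true" (a string, by assumption).
    """
    pending = []
    for collection, info in cluster_state.items():
        for shard, shard_info in info.get("shards", {}).items():
            replicas = shard_info.get("replicas", {}).values()
            leader_nodes = [
                r.get("node_name") for r in replicas if r.get("leader") == "true"
            ]
            # Exactly one leader, and it must live on the surviving node.
            if leader_nodes != [node_name]:
                pending.append((collection, shard))
    return sorted(pending)


# Hypothetical two-collection state: coll1 has already failed over to node B,
# coll2 still lists its (down) replica on node A as the leader.
SAMPLE = json.loads("""
{
  "coll1": {"shards": {"shard1": {"replicas": {
    "core_node1": {"state": "down", "node_name": "nodeA:8983_solr"},
    "core_node2": {"state": "active", "node_name": "nodeB:8983_solr", "leader": "true"}
  }}}},
  "coll2": {"shards": {"shard1": {"replicas": {
    "core_node1": {"state": "down", "node_name": "nodeA:8983_solr", "leader": "true"},
    "core_node2": {"state": "active", "node_name": "nodeB:8983_solr"}
  }}}}
}
""")

print(leaders_not_on(SAMPLE, "nodeB:8983_solr"))
```

An orchestration script would poll until this list is empty before restarting node A. That only helps, of course, if clusterstate.json is actually updated as leaders change, which is the third issue listed above.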