Re: clusterstate.json does not reflect current state of down versus active

Rich Mayfield Wed, 16 Apr 2014 07:03:47 -0700

Shawn Heisey-4 wrote
> What are you trying to achieve with your restart?  Can you just reload
> the collections one by one instead?

We restart when we update a handler, schema, or solrconfig for our cores.

I’ve tried just shutting down both nodes, updating both, and restarting
both. With 1,000 replicas, though, both nodes take a while to spin up each
replica, figure out its state relative to SolrCloud, and spend a lot of time
trying to talk with one another. Inevitably something fails, retries in 2
seconds, then 4, 8, and soon the retries go out to 512 seconds. It doesn’t
seem that SolrCloud can handle a restart with a lot of cores without some
careful orchestration.
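
To put numbers on that retry pattern, here is a minimal Python sketch of the delays. It assumes the delay simply doubles from 2 seconds and stops at 512; that is only what the logs suggest, not something confirmed against the Solr source, and backoff_delays is a made-up helper name.

```python
def backoff_delays(first=2, cap=512):
    """Return the retry delays in seconds, doubling from `first` up to `cap`.

    Assumption: plain doubling with a hard cap, as observed in the logs.
    """
    delays = []
    delay = first
    while delay <= cap:
        delays.append(delay)
        delay *= 2
    return delays


delays = backoff_delays()
print(delays)       # the individual waits, in seconds
print(sum(delays))  # total time spent waiting if every retry fails
```

Under that assumption, a replica that fails every attempt waits about 17 minutes in total before the last retry.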

I've tried the relatively foolproof/safe approach of:

1) Unload all cores from node A (thus forcing leadership to node B)
2) Shut down, update, and restart node A
3) Re-create all cores in node A as replicas
4) Repeat 1-3 but for node B
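
Steps 1 and 3 go through the CoreAdmin API. A rough Python sketch of how the request URLs can be built is below. The host, core, collection, and shard names are placeholders, and the helper names are hypothetical; only the UNLOAD and CREATE actions and their core, name, collection, and shard parameters come from CoreAdmin itself.

```python
from urllib.parse import urlencode


def core_admin_url(base_url, **params):
    """Build a CoreAdmin request URL under base_url (e.g. http://host:8983/solr)."""
    return f"{base_url}/admin/cores?{urlencode(params)}"


def unload_url(base_url, core):
    """Step 1: unload a core so its shard's leadership moves to the other node."""
    return core_admin_url(base_url, action="UNLOAD", core=core)


def create_url(base_url, core, collection, shard):
    """Step 3: re-create the core as a replica of an existing shard."""
    return core_admin_url(
        base_url, action="CREATE", name=core, collection=collection, shard=shard
    )


# Placeholder names, for illustration only.
node_a = "http://nodeA:8983/solr"
print(unload_url(node_a, "coll1_shard1_replica1"))
print(create_url(node_a, "coll1_shard1_replica1", "coll1", "shard1"))
```

Each URL would then be requested once per core, which is where the per-core cost adds up.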

The thing is, creating the cores takes a long time - a couple seconds per
core. Keep in mind that nothing is going on while doing this - no new
content to synchronize and no searches are being performed. But even with
2-3 seconds per core we're talking about a fairly long process to cycle
through both sets of 1,000 replicas.
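
For a sense of scale, a back-of-the-envelope calculation in Python. It counts only the create step at 2-3 seconds per core and ignores unload and restart time, so it is a lower bound; cycle_seconds is a made-up helper.

```python
def cycle_seconds(replicas_per_node, nodes, seconds_per_core):
    """Total time to re-create every replica on every node, one at a time."""
    return replicas_per_node * nodes * seconds_per_core


low = cycle_seconds(1000, 2, 2)   # best case: 2 seconds per core
high = cycle_seconds(1000, 2, 3)  # worst case: 3 seconds per core
print(f"{low / 60:.0f} to {high / 60:.0f} minutes")
```

That works out to somewhere between roughly one and one and two-thirds hours for the full cycle.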

When I do the above, clusterstate.json appears to be kept up to date and
reflects the nodes that have been created. I would expect this given we’re
talking about whether or not the replica exists.

What I was then trying to do is find a way to update both nodes without
going through the full unload/re-create process. Avoiding the leader
election process seemed to be key in a faster restart.

What I was hoping to achieve was:

1) Shift leadership to all replicas on node B
2) Shut down, update, and restart node A.
3) Repeat 1-2 but swap A/B

However there doesn’t appear to be a way to force leadership to/from a
particular replica.

The next approach was to merely shut down a node and wait for the other node
to pick up all leaders, checking progress by fetching clusterstate.json.

1) Shut down node A
2) Wait for leader election process to play out (leaders shift to node B)
3) Update and restart A
4) Repeat 1-3 but swap A/B

With step 2 though, clusterstate.json doesn’t seem to update and reflect the
leader election process that I can see play out in the log. I use
http://solrhost/solr/zookeeper?path=%2Fclusterstate.json&detail=true to get
clusterstate.json. In the end, this isn’t that much better or faster than my
first approach (unload and create) because the leader election process still
takes a couple seconds per replica.
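
A sketch of the check for step 2, in Python: count how many leaders each node holds according to clusterstate.json. It assumes the Solr 4.x layout (collection -> shards -> replicas, with "leader": "true" on the leading replica), and it assumes the /zookeeper servlet wraps the file as a JSON string under znode.data. Both are assumptions to verify against your own output, and the function names are hypothetical.

```python
import json
from collections import Counter


def cluster_state_from_response(body):
    """Extract clusterstate.json from a /zookeeper?detail=true response body.

    Assumption: the file content is a JSON string stored under znode.data.
    """
    return json.loads(json.loads(body)["znode"]["data"])


def leaders_per_node(cluster_state):
    """Count the replicas marked as leader on each node.

    Assumption: collection -> "shards" -> "replicas", where the leader
    replica carries "leader": "true" and every replica has a "node_name".
    """
    counts = Counter()
    for collection in cluster_state.values():
        for shard in collection.get("shards", {}).values():
            for replica in shard.get("replicas", {}).values():
                if replica.get("leader") == "true":
                    counts[replica.get("node_name")] += 1
    return counts


# Made-up sample data in the assumed layout.
sample = {
    "coll1": {"shards": {"shard1": {"replicas": {
        "core_node1": {"node_name": "nodeA:8983_solr", "state": "down"},
        "core_node2": {"node_name": "nodeB:8983_solr", "state": "active",
                       "leader": "true"},
    }}}}
}
print(leaders_per_node(sample))
```

Step 2 would be finished once node A's count drops to zero and node B's count equals the number of shards.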

So basically three issues - and maybe I need help focusing on the “right”
problem:

1) Pulling the plug on SolrCloud and restarting with ~1,000 cores is iffy -
many collections never start
2) There’s no way to force election off of or to a node for an orchestrated
restart
3) clusterstate.json doesn’t appear to be updated (frequently) when it comes
to capturing leadership

--
View this message in context:
http://lucene.472066.n3.nabble.com/clusterstate-json-does-not-reflect-current-state-of-down-versus-active-tp4131266p4131470.html
Sent from the Solr - User mailing list archive at Nabble.com.
