On 4/15/2014 8:58 AM, Rich Mayfield wrote:
> I am trying to orchestrate a fast restart of a SolrCloud (4.7.1). I was
> hoping that clusterstate.json would reflect the up/down state of each
> core as well as whether or not a given core was leader.
>
> clusterstate.json is not kept up to date with what I see going on in my
> logs though - I see the leader election process play out. I would expect
> that "state" would show "down" immediately for replicas on the node that I
> have shut down.
>
> Eventually, after about 30 minutes, all of the leader election processes
> complete and clusterstate.json gets updated to the true state for each
> replica.
>
> Why does it take so long for clusterstate.json to reflect the correct
> state? Is there a better way to determine the state of the system?
>
> (In my case, each node has upwards of 1,000 1-shard collections. There are
> two nodes in the cluster - each collection has 2 replicas.)
First, I'll admit that my experience with SolrCloud is not as extensive as my experience with non-cloud installs. I do have a SolrCloud (4.2.1) install, but it's the smallest possible redundant setup -- three servers, two run Solr and ZooKeeper, the third runs ZooKeeper only.

What are you trying to achieve with your restart? Can you just reload the collections one by one instead? (There's a rough sketch of that below.)

Assuming that reloading isn't going to work for some reason (rebooting for OS updates is one possibility), we need to determine why it takes so long for a node to stabilize. Here's a bunch of info about performance problems with Solr. I wrote it, so we can discuss it in depth if you like:

http://wiki.apache.org/solr/SolrPerformanceProblems

I have three suspicions about the root of your problem. It is likely to be one of them, but it could be a combination of any or all of them. Because this happens at startup, I don't think it's likely that you're dealing with a GC problem caused by a very large heap.

1) The system is replaying 1000 transaction logs (possibly large, one for each core) at startup, and possibly also initiating index recovery using replication.

2) You don't have enough RAM to cache your index effectively.

3) Your Java heap is too small.

If your ZooKeeper ensemble does not use separate disks from your Solr data (or separate servers), there could also be an issue with ZooKeeper client timeouts that's completely separate from any other problems.

I haven't addressed the fact that your cluster state doesn't update quickly. That might be a bug, but if we can deal with the slow startup/stabilization first, then we can see whether there's anything left to deal with on the cluster state.

Thanks,
Shawn
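P.S. Here's a rough sketch of the reload-one-by-one idea, using the Collections API RELOAD action over HTTP. It's untested, and the host, port, timeout, and collection names are placeholders, so adjust them for your setup:

    # Reload each collection in turn via the Collections API, instead of
    # restarting the whole node. Host/port and the collection list are
    # placeholders -- substitute your own ~1000 names here.
    import requests

    SOLR_URL = "http://localhost:8983/solr"
    collections = ["collection1", "collection2"]  # fill in your collection names

    for name in collections:
        resp = requests.get(
            SOLR_URL + "/admin/collections",
            params={"action": "RELOAD", "name": name, "wt": "json"},
            timeout=300,
        )
        resp.raise_for_status()
        print("reloaded " + name)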
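And since you're trying to watch the up/down state of each replica, you could read /clusterstate.json straight from ZooKeeper rather than going through Solr. This sketch assumes the Python kazoo client and a ZooKeeper server on localhost; it just dumps the recorded state and leader flag for every replica, which should show you exactly what the cluster state says at any given moment (including how far it lags behind your logs). The JSON layout I'm walking (collection -> shards -> replicas, each with a "state" field) is from memory, so double-check it against your own clusterstate.json:

    # Fetch /clusterstate.json from ZooKeeper and print each replica's
    # recorded state. Assumes the kazoo client library (pip install kazoo)
    # and a ZooKeeper server reachable at localhost:2181.
    import json
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="localhost:2181")
    zk.start()
    data, _stat = zk.get("/clusterstate.json")
    zk.stop()
    clusterstate = json.loads(data.decode("utf-8"))

    for coll_name, coll in clusterstate.items():
        for shard_name, shard in coll.get("shards", {}).items():
            for replica_name, replica in shard.get("replicas", {}).items():
                leader = "(leader)" if replica.get("leader") == "true" else ""
                print(coll_name, shard_name, replica_name,
                      replica.get("state"), leader)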