On 8/20/2013 10:52 PM, dmarini wrote:
I'm running a Solr 4.3 cloud in a 3-machine setup that has the following
configuration:
each machine runs 3 ZooKeepers on different ports
each machine runs a Jetty instance PER ZooKeeper
Essentially, this gives us the ability to host 3 isolated clouds across the
3 machines. Each collection has 3 shards, with each machine hosting one
shard and replicas of the other 2 shards. The default timeout for ZooKeeper
communication is 60 seconds. At any time I can go to any machine/port combo,
open the "Cloud" view, and everything looks peachy. All nodes are green
and each shard of each collection has an active leader (albeit they all
eventually end up with the SAME leader, which does stump me as to how it
gets that way, but one thing at a time).
Despite everything looking good, the logs on any of the nodes are enough
to make me wonder how the cloud is functioning at all, with errors like
the following:
*Error while trying to recover.
core=MyCollection.shard2.replica:org.apache.solr.client.solrj.SolrServerException:
Timeout occured while waiting response from server at:
http://MYNODE2.MYDOMAIN.LOCAL:8983/solr
* (what's funny about this one is that MYNODE2:8983/solr responds with no
issue and appears healthy (all green), yet these errors come in 5 to 10 at
a time on MYNODE1 and MYNODE3.)
*org.apache.solr.common.SolrException: I was asked to wait on state
recovering for MYNODE3.MYDOMAIN.LOCAL:8983_solr but I still do not see the
requested state. I see state: active live:true* (this is from the logs of
the leader node, MYNODE2:8983/solr, viewed in the admin site. Again, all
appears OK and reads/writes to the cloud are working.)
To top it all off, we have monitors that call the solr/admin/ping handler
on each node of each cloud. Normally these pings are very quick (under
100ms), but at various points throughout the day the 60-second timeout is
exceeded, the monitor raises an alarm, and then the very next ping goes
right back to being quick.
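For reference, the monitor's check amounts to roughly the following (the
host here is just an example, and the exact ping URL depends on which core
the ping handler is configured on):

  # time a single ping request; prints HTTP status and total time taken
  curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" \
      "http://MYNODE1.MYDOMAIN.LOCAL:8983/solr/admin/ping"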
I've checked resource usage on the machines when I see these ping
slowdowns, but I'm not seeing any memory pressure (in terms of free
memory) or CPU thrashing. I'm at a loss for what can cause the system to
be so unstable and would appreciate any thoughts on the messages from the
log or ideas about the cause of the ping issue.
Also, to confirm: there is currently no way to force a leader election,
correct? With all of our collections inevitably rolling themselves onto
the same leader over time, I feel that performance will suffer, since all
writes will be trying to happen on the same machine when there are other
healthy machines that could lead the other shards and allow a better
distribution of requests.
I am guessing that you are running into resource starvation, mostly
memory. You've probably got a lot of slow garbage collections, and you
might even be going to swap (UNIX) or the pagefile (Windows) from
allocating too much memory to Solr instances. You may find that you
need to add memory to the machines. I wouldn't try what you are doing
without at least 16GB per server, and depending on how big those indexes
are, I might want 32 or 64GB.
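One way to confirm the GC theory is to turn on GC logging for each Solr
JVM and see whether long pauses line up with the ping timeouts. Something
along these lines added to the Jetty start command would do it (the log
path is just an example):

  # log GC activity, including how long the application was stopped
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
      -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/solr/gc.log

If you see multi-second stop times at the same moments the pings go slow,
that points squarely at heap pressure.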
The first thing I recommend is getting rid of all those extra
ZooKeepers. You can run many clouds on one three-node ZooKeeper
ensemble. You just need zkHost parameters like the following,
where "/test1" gets replaced by a different chroot value for each cloud.
You do not need the chroot on every server in the list, just once at
the end:
-DzkHost=server1:2181,server2:2181,server3:2181/test1
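One note: as far as I know, Solr 4.x won't create the chroot path for you,
so it needs to exist in ZooKeeper first. You can create it with the
zkcli.sh script that ships with Solr (under example/cloud-scripts in the
download) or with ZooKeeper's own zkCli.sh. Roughly like this - the chroot
names are just examples:

  # create one chroot per cloud in the shared ensemble
  ./zkcli.sh -zkhost server1:2181,server2:2181,server3:2181 -cmd makepath /test1
  ./zkcli.sh -zkhost server1:2181,server2:2181,server3:2181 -cmd makepath /test2

  # each cloud's Solr instances then point at the same ensemble with their own chroot:
  -DzkHost=server1:2181,server2:2181,server3:2181/test2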
The next thing is to size the max heap appropriately for each of your
Solr instances. The total amount of RAM allocated to all the JVMs -
ZooKeeper and Solr - must not exceed the total memory in the server, and
you should have RAM left over for OS disk caching as well. Unless your
max heap is below 1GB, you'll also want to tune your garbage collection.
Included in the following wiki page are some good tips on memory and
garbage collection tuning:
http://wiki.apache.org/solr/SolrPerformanceProblems
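Purely as an illustration (the right numbers depend entirely on your index
sizes and on what else runs on the box), a Solr start line might end up
looking something like this:

  # 4GB heap and CMS collector settings - illustrative values only,
  # not a recommendation for your hardware
  java -Xms4g -Xmx4g \
       -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
       -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
       -DzkHost=server1:2181,server2:2181,server3:2181/test1 \
       -jar start.jar

The wiki page above goes into more detail on picking heap sizes and GC
options.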
Thanks,
Shawn