Hi,

I'm running a Solr 4.3 cloud in a 3-machine setup that has the following
configuration:
each machine runs 3 ZooKeeper instances on different ports
each machine runs one Jetty instance per ZooKeeper.

Essentially, this gives us the ability to host 3 isolated clouds across the
3 machines: 3 shards per collection, with each machine hosting a shard and
replicas of the other 2 shards. The default timeout for ZooKeeper
communication is 60 seconds. At any time I can go to any machine/port combo,
open the "Cloud" view, and everything looks peachy. All nodes are green and
each shard of each collection has an active leader (albeit they all
eventually end up with the SAME leader, which does stump me as to how it
gets that way, but one thing at a time).
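
For reference, each isolated cloud is reached through its own ZooKeeper
ensemble (one ZK per machine). A minimal SolrJ sketch of how we connect,
assuming port 2181 for one of the ensembles (the port and class name are
hypothetical; the hostnames and collection are the real ones):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class CloudCheck {
        public static void main(String[] args) throws Exception {
            // One ZooKeeper per machine for this particular cloud.
            String zkHost = "MYNODE1.MYDOMAIN.LOCAL:2181,"
                          + "MYNODE2.MYDOMAIN.LOCAL:2181,"
                          + "MYNODE3.MYDOMAIN.LOCAL:2181";
            CloudSolrServer server = new CloudSolrServer(zkHost);
            server.setDefaultCollection("MyCollection");
            server.setZkClientTimeout(60000); // matches our 60-second timeout
            server.connect();
            QueryResponse rsp = server.query(new SolrQuery("*:*"));
            System.out.println("numFound=" + rsp.getResults().getNumFound());
            server.shutdown();
        }
    }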

Despite everything looking good, looking at the logs on any of the nodes is
enough to make me wonder how the cloud is functioning at all, with errors
like the following:

*Error while trying to recover.
core=MyCollection.shard2.replica:org.apache.solr.client.solrj.SolrServerException:
Timeout occured while waiting response from server at:
http://MYNODE2.MYDOMAIN.LOCAL:8983/solr
* (what's funny about this one is that MYNODE2:8983/solr responds with no
issues and appears healthy (all green), yet these errors come in 5 to 10 at
a time on MYNODE1 and MYNODE3.)

*org.apache.solr.common.SolrException: I was asked to wait on state
recovering for MYNODE3.MYDOMAIN.LOCAL:8983_solr but I still do not see the
requested state. I see state: active live:true* (this is from the leader
node's logs, MYNODE2:8983/solr, via the admin site. Again, all appears OK
and reads/writes to the cloud are working.)

To top it all off, we have monitors that call the solr/admin/ping handler
for each node of each cloud. Normally these pings are very quick (under
100ms), but at various points throughout the day the monitor's 60-second
timeout is exceeded and it raises an alarm, only for the next ping to go
right back to being quick.
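
In case it matters, the monitor is essentially equivalent to this SolrJ
sketch (the class name is hypothetical; the URL is one real node/core, and
the timeouts mirror the monitor's 60-second budget):

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.SolrPingResponse;

    public class PingMonitor {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer(
                    "http://MYNODE2.MYDOMAIN.LOCAL:8983/solr/MyCollection");
            server.setConnectionTimeout(60000); // 60 s connect budget
            server.setSoTimeout(60000);         // 60 s read budget
            long start = System.currentTimeMillis();
            SolrPingResponse rsp = server.ping(); // hits /admin/ping
            long wall = System.currentTimeMillis() - start;
            System.out.println("status=" + rsp.getStatus()
                    + " qtime=" + rsp.getQTime() + "ms wall=" + wall + "ms");
            server.shutdown();
        }
    }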

I've checked resource usage on the machines when I see these ping
slowdowns, but I'm not seeing any memory pressure (in terms of free memory)
or CPU thrashing. I'm at a loss as to what could make the system so
unstable and would appreciate any thoughts on the log messages above or
theories on the cause of the ping issue.
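
One thing I haven't ruled out yet is whole-JVM stalls (GC or otherwise)
that wouldn't show up as free-memory or CPU numbers. A minimal pause
detector I could run inside each Jetty JVM to catch those (the 100ms tick
and 1-second threshold are arbitrary choices of mine):

    public class PauseDetector {
        public static void main(String[] args) throws Exception {
            long last = System.currentTimeMillis();
            while (true) {
                Thread.sleep(100);
                long now = System.currentTimeMillis();
                // Anything well beyond the 100ms sleep means the JVM
                // (or the whole box) was frozen for roughly that long.
                long stall = now - last - 100;
                if (stall > 1000) {
                    System.out.println("stalled ~" + stall + "ms");
                }
                last = now;
            }
        }
    }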

Also, to confirm: there is currently no way to force a leader election,
correct? With all of our collections inevitably rolling over to the same
leader over time, I feel that performance will suffer, since all writes
will be hitting the same machine when there are other healthy machines
that could lead the other shards and give a better distribution of
requests.

Desperate for a stable cloud, thanks in advance for any help.

--Dave


