Hi, I'm running a Solr 4.3 cloud in a 3-machine setup with the following configuration: each machine runs 3 ZooKeeper instances on different ports, and each machine runs one Jetty instance per ZooKeeper instance. Essentially, this gives us the ability to host 3 isolated clouds across the 3 machines; a sketch of how each cloud is addressed is below.
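To make the layout concrete, each cloud is reachable through its own 3-node ZooKeeper ensemble that spans all three machines on a port unique to that cloud. In SolrJ terms it looks roughly like this (hostnames, ports, and the collection name are placeholders, not our real values):

    import org.apache.solr.client.solrj.impl.CloudSolrServer;

    public class CloudAddressing {
      public static void main(String[] args) throws Exception {
        // Each isolated cloud has its own three-node ZooKeeper ensemble:
        // one ZooKeeper per machine, on a port unique to that cloud.
        String cloud1Zk = "mynode1:2181,mynode2:2181,mynode3:2181";
        String cloud2Zk = "mynode1:2182,mynode2:2182,mynode3:2182";
        String cloud3Zk = "mynode1:2183,mynode2:2183,mynode3:2183";

        // A client participates in exactly one cloud by pointing at that
        // cloud's ensemble; the three clouds never see each other.
        CloudSolrServer cloud1 = new CloudSolrServer(cloud1Zk);
        cloud1.setDefaultCollection("MyCollection");
        cloud1.connect();
        System.out.println("connected to cloud 1");
        cloud1.shutdown();
      }
    }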
Each collection has 3 shards, with each machine hosting one shard and replicas of the other two. The default timeout for the ZooKeeper communication is 60 seconds.

At any time I can go to any machine/port combination, open the "Cloud" view, and everything looks peachy. All nodes are green and each shard of each collection has an active leader (albeit they all eventually end up with the SAME leader, which does stump me as to how it gets that way, but one thing at a time).

Despite everything looking good, the logs on any of the nodes are enough to make me wonder how the cloud is functioning at all, with errors like the following:

    Error while trying to recover. core=MyCollection.shard2.replica:org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://MYNODE2.MYDOMAIN.LOCAL:8983/solr

(What's funny about this one is that MYNODE2:8983/solr responds with no issue and appears healthy, all green, yet these errors come in 5 to 10 at a time on MYNODE1 and MYNODE3.)

    org.apache.solr.common.SolrException: I was asked to wait on state recovering for MYNODE3.MYDOMAIN.LOCAL:8983_solr but I still do not see the requested state. I see state: active live:true

(This one is from the logs of the leader node, MYNODE2:8983/solr, as seen in the admin site. Again, everything appears OK and reads/writes to the cloud are working.)

To top it all off, we have monitors that call the solr/admin/ping handler for each node of each cloud. Normally these pings are very quick (under 100 ms), but at various points throughout the day a ping exceeds the monitor's 60-second timeout and raises an alarm, only for the next ping to go right back to quick. I've checked resource usage on the machines when I see these ping slowdowns, but I'm not seeing any memory pressure (in terms of free memory) or CPU thrashing.

I'm at a loss for what can cause the system to be so unstable and would appreciate any thoughts on the messages from the log or proposed ideas for the cause of the ping issue.

Also, to confirm: there is currently no way to force a leader election, correct? With all of our collections inevitably rolling themselves over to the same leader, I worry that performance will suffer, since all writes will be hitting the same machine while other healthy machines could be leading the other shards and giving a better distribution of requests.

Desperate for a stable cloud. Thanks in advance for any help.

--Dave
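P.S. In case it helps frame the two symptoms, here is a rough SolrJ sketch of how one could watch both from the outside: print the node that ZooKeeper currently records as leader for each shard, and time a direct ping against every replica's core, which is essentially the same check our monitors do. The ZooKeeper address, collection name, and the 5-second timeout are placeholders for illustration:

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.cloud.ClusterState;
    import org.apache.solr.common.cloud.Replica;
    import org.apache.solr.common.cloud.Slice;
    import org.apache.solr.common.cloud.ZkStateReader;

    public class CloudCheck {
      public static void main(String[] args) throws Exception {
        // ZooKeeper ensemble of the cloud to inspect -- placeholder addresses
        CloudSolrServer cloud = new CloudSolrServer(
            "mynode1:2181,mynode2:2181,mynode3:2181");
        cloud.connect();
        ClusterState state = cloud.getZkStateReader().getClusterState();

        for (Slice slice : state.getSlices("MyCollection")) {
          // the node ZooKeeper currently records as leader for this shard
          Replica leader = slice.getLeader();
          System.out.println("shard " + slice.getName() + " -> leader: "
              + (leader == null ? "none" : leader.getStr(ZkStateReader.BASE_URL_PROP)));

          // time a direct ping against each replica's core, as the monitor does
          for (Replica replica : slice.getReplicas()) {
            String coreUrl = replica.getStr(ZkStateReader.BASE_URL_PROP)
                + "/" + replica.getStr(ZkStateReader.CORE_NAME_PROP);
            HttpSolrServer core = new HttpSolrServer(coreUrl);
            core.setConnectionTimeout(5000); // fail fast instead of hanging 60s
            core.setSoTimeout(5000);
            long start = System.currentTimeMillis();
            try {
              core.ping(); // hits <coreUrl>/admin/ping
              System.out.println("  " + replica.getName() + " ping "
                  + (System.currentTimeMillis() - start) + " ms");
            } catch (Exception e) {
              System.out.println("  " + replica.getName() + " ping FAILED: " + e);
            } finally {
              core.shutdown();
            }
          }
        }
        cloud.shutdown();
      }
    }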