My guess is that you are hitting garbage collection issues on the nodes hosting the replicas that go into recovery. If a leader tries to contact a follower in a shard and the request times out, it effectively says "that one must be gone, let's put it into recovery". Look for LeaderInitiatedRecovery (I don't remember the exact capitalization/spacing of that, though) in the Solr logs on both the leader and the follower.
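Something along these lines on each node will show whether the leader kicked the replica into recovery; the log path is just a guess, so adjust it for your install:

    # Hypothetical log location -- check SOLR_LOGS_DIR / solr.in.sh for where your logs actually live.
    # Run on both the leader's and the follower's hosts and compare timestamps.
    grep -i "leaderinitiatedrecovery" /var/solr/logs/solr.log*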
Next I'd turn on GC logging and look for stop-the-world collections that take a long time; GCViewer is a nice tool for looking at those, IIRC. ZooKeeper also periodically pings the Solr nodes, and if ZK can't get a response (again, possibly due to excessive GC) it'll signal that the node is down. If that happens, though, I'd expect multiple replicas on that particular Solr instance to go into recovery. And finally, you can consider lengthening the timeouts; a rough sketch of both settings is below the quoted message.

Best,
Erick

On Sat, Jun 25, 2016 at 1:18 PM, Roshan Kamble
<roshan.kam...@smartstreamrdu.com> wrote:
> Hello,
>
> I am using Solr 6.0.0 in SolrCloud mode with 3 nodes, one ZooKeeper, and
> 3 shards and 2 replicas per collection.
>
> Getting the below error for some inserts/updates when trying to add
> documents to Solr.
>
> It has also been observed that a few shards are in either recovery or
> failed-recovery state. (At least one shard is up.)
>
>
> org.apache.solr.common.SolrException: Could not load collection from ZK: MY_COLLECTION
>         at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:969) ~[solr-solrj-6.0.0.jar:6.0.0 48c80f91b8e5cd9b3a9b48e6184bd53e7619e7e3 - nknize - 2016-04-01 14:41:50]
>         at org.apache.solr.common.cloud.ZkStateReader$LazyCollectionRef.get(ZkStateReader.java:519) ~[solr-solrj-6.0.0.jar:6.0.0 48c80f91b8e5cd9b3a9b48e6184bd53e7619e7e3 - nknize - 2016-04-01 14:41:50]
>         at org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(ClusterState.java:189) ~[solr-solrj-6.0.0.jar:6.0.0 48c80f91b8e5cd9b3a9b48e6184bd53e7619e7e3 - nknize - 2016-04-01 14:41:50]
>         at org.apache.solr.common.cloud.ClusterState.hasCollection(ClusterState.java:119) ~[solr-solrj-6.0.0.jar:6.0.0 48c80f91b8e5cd9b3a9b48e6184bd53e7619e7e3 - nknize - 2016-04-01 14:41:50]
>         at org.apache.solr.client.solrj.impl.CloudSolrClient.getCollectionNames(CloudSolrClient.java:1111) ~[solr-solrj-6.0.0.jar:6.0.0 48c80f91b8e5cd9b3a9b48e6184bd53e7619e7e3 - nknize - 2016-04-01 14:41:50]
>         at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:833) ~[solr-solrj-6.0.0.jar:6.0.0 48c80f91b8e5cd9b3a9b48e6184bd53e7619e7e3 - nknize - 2016-04-01 14:41:50]
>         at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:806) ~[solr-solrj-6.0.0.jar:6.0.0 48c80f91b8e5cd9b3a9b48e6184bd53e7619e7e3 - nknize - 2016-04-01 14:41:50]
>         at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149) ~[solr-solrj-6.0.0.jar:6.0.0 48c80f91b8e5cd9b3a9b48e6184bd53e7619e7e3 - nknize - 2016-04-01 14:41:50]
>         at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:106) ~[solr-solrj-6.0.0.jar:6.0.0 48c80f91b8e5cd9b3a9b48e6184bd53e7619e7e3 - nknize - 2016-04-01 14:41:50]
>         at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:71) ~[solr-solrj-6.0.0.jar:6.0.0 48c80f91b8e5cd9b3a9b48e6184bd53e7619e7e3 - nknize - 2016-04-01 14:41:50]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_60]
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_60]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_60]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_60]
>         at java.lang.Thread.run(Thread.java:745) [?:1.8.0_60]
> Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /collections/MY_COLLECTION/state.json
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:127) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
>         at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:348) ~[solr-solrj-6.0.0.jar:6.0.0 48c80f91b8e5cd9b3a9b48e6184bd53e7619e7e3 - nknize - 2016-04-01 14:41:50]
>         at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345) ~[solr-solrj-6.0.0.jar:6.0.0 48c80f91b8e5cd9b3a9b48e6184bd53e7619e7e3 - nknize - 2016-04-01 14:41:50]
>         at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60) ~[solr-solrj-6.0.0.jar:6.0.0 48c80f91b8e5cd9b3a9b48e6184bd53e7619e7e3 - nknize - 2016-04-01 14:41:50]
>         at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:345) ~[solr-solrj-6.0.0.jar:6.0.0 48c80f91b8e5cd9b3a9b48e6184bd53e7619e7e3 - nknize - 2016-04-01 14:41:50]
>         at org.apache.solr.common.cloud.ZkStateReader.fetchCollectionState(ZkStateReader.java:980) ~[solr-solrj-6.0.0.jar:6.0.0 48c80f91b8e5cd9b3a9b48e6184bd53e7619e7e3 - nknize - 2016-04-01 14:41:50]
>         at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:967) ~[solr-solrj-6.0.0.jar:6.0.0 48c80f91b8e5cd9b3a9b48e6184bd53e7619e7e3 - nknize - 2016-04-01 14:41:50]
>         ... 16 more
>
>
> Regards,
>
> Roshan
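As promised above, here's roughly what I mean by GC logging and a longer ZK timeout, in solr.in.sh terms. This is only a sketch -- the exact variable names, defaults, and log locations depend on your Solr version and install layout, so double-check against your own bin/solr and solr.in.sh:

    # GC logging -- recent Solr releases enable something like this out of the box and
    # write it to solr_gc.log under the Solr logs directory, so it may only need confirming:
    GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
                 -XX:+PrintGCApplicationStoppedTime"

    # Give each node more slack before its ZooKeeper session expires (milliseconds).
    # ZooKeeper itself caps sessions at maxSessionTimeout (20 * tickTime by default),
    # so the ZK server config may need raising as well.
    ZK_CLIENT_TIMEOUT="30000"

Then load the resulting solr_gc.log files into GCViewer and look for long "application threads were stopped" pauses around the times the replicas went into recovery.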