I dug deeper and found the following exception, thrown while it tries to replicate:

135531514-ERROR - 2014-03-07 23:08:35.454; org.apache.solr.common.SolrException; SnapPull failed :org.apache.lucene.store.AlreadyClosedException: Already closed
135531665- at org.apache.solr.core.CachingDirectoryFactory.get(CachingDirectoryFactory.java:336)
135531752- at org.apache.solr.handler.ReplicationHandler.loadReplicationProperties(ReplicationHandler.java:806)
135531854- at org.apache.solr.handler.SnapPuller.logReplicationTimeAndConfFiles(SnapPuller.java:522)
135531945- at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:464)

I looked into SolrCloud and found that it encounters this exception while ReplicationStrategy is trying to open the index directory.

I searched the Solr JIRA and found this issue, https://issues.apache.org/jira/browse/SOLR-4960, which looks closely related to mine (but I do not know for sure).

Can anyone familiar with the JIRA let me know if this issue will go away if we upgrade to 4.4?

Thanks again
Nitin

On Fri, Mar 7, 2014 at 11:46 AM, Veera Raghavan <veera.raghavan...@gmail.com> wrote:

> Forgot to attach the log from when the recovery failed
>
> solr.log.129:1625677:ERROR - 2014-03-06 13:29:31.909; org.apache.solr.common.SolrException; Error while trying to recover:org.apache.solr.common.SolrException: Replication for recovery failed.
> solr.log.129-1625849- at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:156)
> solr.log.129-1625929- at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:409)
> solr.log.129-1626010- at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:223)
>
> solr.log.129-1626085-INFO - 2014-03-06 13:29:31.910; org.apache.solr.update.UpdateLog; Dropping buffered updates FSUpdateLog{state=BUFFERING, tlog=tlog{file=/mnt/search/solr/testcollection_shard1_replica2/data/tlog/tlog.0000000000000000000 refcount=1}}
>
> solr.log.129-1626353-ERROR - 2014-03-06 13:29:31.910; org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again... (7) core=testcollection_shard1_replica2
>
> On Fri, Mar 7, 2014 at 11:24 AM, Veera Raghavan <veera.raghavan...@gmail.com> wrote:
>
>> Hi there
>>
>> I have a 6-node SolrCloud cluster with 50 collections. All collections
>> are sharded across all 6 nodes. I am seeing a weird behavior where both
>> replicas for a shard go down, enter a "recovering" state, and never
>> come back (no specific correlation to writes or reads).
>>
>> I am manually unloading and recreating the cores as a band-aid for the
>> problem.
>>
>> In the Solr logs I see this:
>>
>> org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null path=/admin/cores params={coreNodeName=<ip>:8983_solr_testcollection_shard1_replica1&state=recovering&nodeName=<ip>:8983_solr&action=PREPRECOVERY&checkLive=true&core=solr_testcollection_shard1_replica2&wt=javabin&onlyIfLeader=true&version=2} status=0 QTime=99
>>
>> Have any of you seen this issue before? Is it a known bug that can be
>> fixed with an upgrade? Should I increase the ZooKeeper timeout, maybe?
>>
>> Any pointers are much appreciated.
>> Thanks
>> Veera