I've simplified things from my previous email, and I'm still seeing errors.
Using Solr 4.4.0 with two nodes, starting with a single shard. Collection is named "marin", host names are dumbo and solrcloud1. I bring up an empty cloud and index 50 documents. I can query them and everything looks fine.

This is clusterstate.json at that point:

    {"marin":{
        "shards":{"shard1":{
            "range":"80000000-7fffffff",
            "state":"active",
            "replicas":{
              "dumbo:8983_solr_marin":{
                "state":"active",
                "core":"marin",
                "node_name":"dumbo:8983_solr",
                "base_url":"http://dumbo:8983/solr",
                "leader":"true"},
              "solrcloud1:8983_solr_marin":{
                "state":"active",
                "core":"marin",
                "node_name":"solrcloud1:8983_solr",
                "base_url":"http://solrcloud1:8983/solr"}}}},
        "router":"compositeId"}}

I attempt to split with

    http://dumbo:8983/solr/admin/collections?action=SPLITSHARD&collection=marin&shard=shard1

After 127559ms, that call returns with

    org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:I was asked to wait on state active for solrcloud1:8983_solr but I still do not see the requested state. I see state: recovering live:true

clusterstate.json at this point:

    {"marin":{
        "shards":{
          "shard1":{
            "range":"80000000-7fffffff",
            "state":"active",
            "replicas":{
              "dumbo:8983_solr_marin":{
                "state":"active",
                "core":"marin",
                "node_name":"dumbo:8983_solr",
                "base_url":"http://dumbo:8983/solr",
                "leader":"true"},
              "solrcloud1:8983_solr_marin":{
                "state":"active",
                "core":"marin",
                "node_name":"solrcloud1:8983_solr",
                "base_url":"http://solrcloud1:8983/solr"}}},
          "shard1_0":{
            "range":"80000000-ffffffff",
            "state":"construction",
            "replicas":{
              "dumbo:8983_solr_marin_shard1_0_replica1":{
                "state":"active",
                "core":"marin_shard1_0_replica1",
                "node_name":"dumbo:8983_solr",
                "base_url":"http://dumbo:8983/solr",
                "leader":"true"},
              "solrcloud1:8983_solr_marin_shard1_0_replica2":{
                "state":"active",
                "core":"marin_shard1_0_replica2",
                "node_name":"solrcloud1:8983_solr",
                "base_url":"http://solrcloud1:8983/solr"}}},
          "shard1_1":{
            "range":"0-7fffffff",
            "state":"construction",
            "replicas":{
              "dumbo:8983_solr_marin_shard1_1_replica1":{
                "state":"active",
                "core":"marin_shard1_1_replica1",
                "node_name":"dumbo:8983_solr",
                "base_url":"http://dumbo:8983/solr",
                "leader":"true"},
              "solrcloud1:8983_solr_marin_shard1_1_replica2":{
                "state":"recovering",
                "core":"marin_shard1_1_replica2",
                "node_name":"solrcloud1:8983_solr",
                "base_url":"http://solrcloud1:8983/solr"}}}},
        "router":"compositeId"}}

In the logs on dumbo, I see several of these:

    290391 [qtp243983770-60] INFO org.apache.solr.update.processor.LogUpdateProcessor – [marin_shard1_1_replica1] webapp=/solr path=/update params={waitSearcher=true&openSearcher=false&commit=true&wt=javabin&commit_end_point=true&version=2&softCommit=false} {} 0 2
    290392 [qtp243983770-60] ERROR org.apache.solr.core.SolrCore – java.io.IOException: cannot uncache file="_1.nvm": it was separately also created in the delegate directory
        at org.apache.lucene.store.NRTCachingDirectory.unCache(NRTCachingDirectory.java:297)
        at org.apache.lucene.store.NRTCachingDirectory.sync(NRTCachingDirectory.java:216)
        at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4109)
        at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2809)
        at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2897)
        at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2872)
        at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:549)
        at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:95)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)

and then finally this:

    406671 [qtp243983770-22] ERROR org.apache.solr.core.SolrCore – org.apache.solr.common.SolrException: I was asked to wait on state active for solrcloud1:8983_solr but I still do not see the requested state. I see state: recovering live:true
        at org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:966)
        at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:191)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:611)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:209)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)

On solrcloud1 I see several of these:

    259170 [RecoveryThread] INFO org.apache.solr.update.PeerSync – PeerSync: core=marin_shard1_0_replica2 url=http://solrcloud1:8983/solr START replicas=[http://dumbo:8983/solr/marin_shard1_0_replica1/] nUpdates=100
    259192 [RecoveryThread] WARN org.apache.solr.update.PeerSync – PeerSync: core=marin_shard1_0_replica2 url=http://solrcloud1:8983/solr exception talking to http://dumbo:8983/solr/marin_shard1_0_replica1/, failed
    org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server at http://dumbo:8983/solr/marin_shard1_0_replica1 returned non ok status:404, message:Not Found
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:385)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
        at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:156)
        at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:119)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

and the same messages for marin_shard1_1_replica1.

I can post my solrconfig.xml or schema.xml if that would be useful. Maybe I'll try switching to the example configs and see if I can reproduce the issue. Otherwise, I'm kinda stumped here.

Any suggestions? I can reproduce this consistently in under 5 minutes, so I'm happy to try ideas.

Thanks.
-Greg
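
For anyone trying to reproduce this, a minimal sketch of scanning a saved copy of clusterstate.json for replicas that have not reached "active". The function name `non_active_replicas` and the embedded `SAMPLE` are illustrative only (the sample is the shard1_1 entry from the post-split state above, trimmed); this is not part of Solr.

```python
import json


def non_active_replicas(clusterstate_text, collection):
    """Return (shard, core, node_name, state) for every replica not in state "active"."""
    state = json.loads(clusterstate_text)
    pending = []
    for shard_name, shard in state[collection]["shards"].items():
        for replica in shard["replicas"].values():
            if replica["state"] != "active":
                pending.append(
                    (shard_name, replica["core"], replica["node_name"], replica["state"])
                )
    return sorted(pending)


# Hypothetical sample: only the shard1_1 entry from the post-split state, trimmed.
SAMPLE = """
{"marin":{
    "shards":{
      "shard1_1":{
        "range":"0-7fffffff",
        "state":"construction",
        "replicas":{
          "dumbo:8983_solr_marin_shard1_1_replica1":{
            "state":"active",
            "core":"marin_shard1_1_replica1",
            "node_name":"dumbo:8983_solr",
            "base_url":"http://dumbo:8983/solr",
            "leader":"true"},
          "solrcloud1:8983_solr_marin_shard1_1_replica2":{
            "state":"recovering",
            "core":"marin_shard1_1_replica2",
            "node_name":"solrcloud1:8983_solr",
            "base_url":"http://solrcloud1:8983/solr"}}}},
    "router":"compositeId"}}
"""

if __name__ == "__main__":
    for entry in non_active_replicas(SAMPLE, "marin"):
        print(entry)
```

Run against the full post-split state, this should single out marin_shard1_1_replica2 on solrcloud1, the replica the SPLITSHARD call was still waiting on when it timed out.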