Hi Anshum,

This is about restarting Solr with roughly 1000 collections. Today I created an environment with 1023 collections, all of them empty. During a repeated-restart test, one of the cores was marked as "recovering" and stayed stuck in that state forever. We run Solr 4.6.1 with 3 ZooKeeper hosts and 8 Solr hosts. Here are the relevant logs:

--- These are the logs for the core stuck at "recovering":

INFO - 2016-05-16 22:47:04.984; org.apache.solr.cloud.ZkController; publishing core=test_collection_112_shard1_replica2 state=down
INFO - 2016-05-16 22:47:05.999; org.apache.solr.core.SolrCore; [test_collection_112_shard1_replica2] CLOSING SolrCore org.apache.solr.core.SolrCore@1e48619
INFO - 2016-05-16 22:47:06.001; org.apache.solr.core.SolrCore; [test_collection_112_shard1_replica2] Closing main searcher on request.
INFO - 2016-05-16 22:47:06.001; org.apache.solr.core.CachingDirectoryFactory; looking to close /mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2/data/index [CachedDir<<refCount=0;path=/mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2/data/index;done=false>>]
...
INFO - 2016-05-16 22:47:15.745; org.apache.solr.core.CorePropertiesLocator; Found core test_collection_112_shard1_replica2 in /mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2/
INFO - 2016-05-16 22:47:15.906; org.apache.solr.cloud.ZkController; publishing core=test_collection_112_shard1_replica2 state=down
INFO - 2016-05-16 22:47:15.973; org.apache.solr.cloud.ZkController; waiting to find shard id in clusterstate for test_collection_112_shard1_replica2
INFO - 2016-05-16 22:47:15.974; org.apache.solr.core.CoreContainer; Creating SolrCore 'test_collection_112_shard1_replica2' using instanceDir: /mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2
INFO - 2016-05-16 22:47:15.975; org.apache.solr.cloud.ZkController; Check for collection zkNode:test_collection_112
INFO - 2016-05-16 22:47:16.136; org.apache.solr.cloud.ZkController; Load collection config from:/collections/test_collection_112
INFO - 2016-05-16 22:47:16.509; org.apache.solr.core.SolrResourceLoader; new SolrResourceLoader for directory: '/mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2/'
INFO - 2016-05-16 22:49:18.409; org.apache.solr.core.SolrCore; [test_collection_112_shard1_replica2] Opening new SolrCore at /mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2/, dataDir=/mnt/solrcloud_latest/solr//test_collection_112_shard1_replica2/data/
INFO - 2016-05-16 22:49:54.860; org.apache.solr.cloud.ZkController; Register replica - core:test_collection_112_shard1_replica2 address: http://10.10.1.8:8983/solr collection:test_collection_112 shard:shard1
INFO - 2016-05-16 22:49:55.324; org.apache.solr.cloud.ZkController; We are http://10.10.1.8:8983/solr/test_collection_112_shard1_replica2/ and leader is http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/
INFO - 2016-05-16 22:49:55.324; org.apache.solr.cloud.ZkController; No LogReplay needed for core=test_collection_112_shard1_replica2 baseURL=http://10.10.1.8:8983/solr
INFO - 2016-05-16 22:49:55.324; org.apache.solr.cloud.ZkController; Core needs to recover:test_collection_112_shard1_replica2
INFO - 2016-05-16 22:49:55.545; org.apache.solr.cloud.RecoveryStrategy; Starting recovery process. core=test_collection_112_shard1_replica2 recoveringAfterStartup=true
INFO - 2016-05-16 22:49:55.546; org.apache.solr.cloud.ZkController; publishing core=test_collection_112_shard1_replica2 state=recovering
INFO - 2016-05-16 22:50:01.562; org.apache.solr.cloud.RecoveryStrategy; Attempting to PeerSync from http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/ core=test_collection_112_shard1_replica2 - recoveringAfterStartup=true
INFO - 2016-05-16 22:50:01.562; org.apache.solr.update.PeerSync; PeerSync: core=test_collection_112_shard1_replica2 url=http://10.10.1.8:8983/solr START replicas=[http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/] nUpdates=100
INFO - 2016-05-16 22:50:01.572; org.apache.solr.cloud.RecoveryStrategy; PeerSync Recovery was not successful - trying replication. core=test_collection_112_shard1_replica2
INFO - 2016-05-16 22:50:01.572; org.apache.solr.cloud.RecoveryStrategy; Starting Replication Recovery. core=test_collection_112_shard1_replica2
INFO - 2016-05-16 22:50:01.572; org.apache.solr.cloud.RecoveryStrategy; Begin buffering updates. core=test_collection_112_shard1_replica2
INFO - 2016-05-16 22:50:01.577; org.apache.solr.cloud.RecoveryStrategy; Attempting to replicate from http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/. core=test_collection_112_shard1_replica2

----- After this line there is no further information about the core, and its status stays stuck forever.

On the leader node, there are no logs regarding test_collection_112 after these messages:

INFO - 2016-05-16 22:47:07.572; org.apache.solr.cloud.SyncStrategy; Sync replicas to http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/
INFO - 2016-05-16 22:47:07.572; org.apache.solr.cloud.SyncStrategy; http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/ has no replicas
INFO - 2016-05-16 22:47:07.572; org.apache.solr.cloud.ShardLeaderElectionContext; I am the new leader: http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/ shard1
INFO - 2016-05-16 22:47:07.573; org.apache.solr.common.cloud.SolrZkClient; makePath: /collections/test_collection_112/leaders/shard1
INFO - 2016-05-16 22:49:59.554; org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null path=/admin/cores params={coreNodeName=core_node2&onlyIfLeaderActive=true&state=recovering&nodeName=10.10.1.8:8983_solr&action=PREPRECOVERY&checkLive=true&core=test_collection_112_shard1_replica1&wt=javabin&onlyIfLeader=true&version=2} status=0 QTime=4001

Is there a known bug here? All collections are empty.

Thanks,

Li

On Mon, May 16, 2016 at 12:50 PM, Anshum Gupta <ans...@anshumgupta.net> wrote:

> I think you are approaching the problem all wrong. This seems to be what is
> described as an XY problem (https://people.apache.org/~hossman/#xyproblem).
> Can you tell us more about:
> * What's your setup like? SolrCloud version, number of shards, is there
> any custom code, etc.
> * Did you start seeing this more recently? If so, what did you change?
>
> To already answer your question, there is no way in SolrCloud to disable or
> remove the concept of 'leaders'. However, there would be other ways to fix
> your setup and get rid of the issues you are facing once you share more
> details.
>
>
> On Mon, May 16, 2016 at 12:33 PM, Li Ding <li.d...@bloomreach.com> wrote:
>
> > Hi all,
> >
> > We have a unique scenario where we don't need leaders in every collection
> > to recover from failures. The index never changes. But we have faced
> > problems where either zk marked a core as down while the core works fine
> > for non-distributed queries, or the core never comes up during a restart.
> > My question is: is there any simple way to disable leaders and leader
> > election in SolrCloud? We do use multi-shard and distributed queries. But
> > in our unique situation, we don't need leaders to maintain the correct
> > status of the index. So if we can get rid of that part, our Solr restart
> > will be more robust.
> >
> > Any suggestions will be appreciated.
> >
> > Thanks,
> >
> > Li
> >
> >
>
>
> --
> Anshum Gupta
>
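
When comparing many restarts, it can help to pull two things out of a per-core log excerpt like the ones in this thread: the last state the core published, and any long silences between consecutive entries. The following is only a minimal sketch, assuming log lines shaped as "LEVEL - timestamp; logger; message" as in the excerpts. The names `parse`, `last_published_state`, and `gaps`, and the 60-second threshold, are hypothetical choices for illustration and not part of Solr.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the excerpts in this thread:
# "INFO - 2016-05-16 22:47:04.984; org.apache.solr.cloud.ZkController; message"
LOG_LINE = re.compile(
    r"^(?P<level>[A-Z]+)\s+-\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3});\s+"
    r"(?P<logger>[\w.$]+);\s+(?P<msg>.*)$"
)
# ZkController message seen in the excerpts: "publishing core=<name> state=<state>"
STATE = re.compile(r"publishing core=(?P<core>\S+) state=(?P<state>\S+)")


def parse(lines):
    """Return a list of (timestamp, logger, message) for lines matching LOG_LINE."""
    entries = []
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
            entries.append((ts, m.group("logger"), m.group("msg")))
    return entries


def last_published_state(entries, core):
    """Return the most recent state published for `core`, or None if none was seen."""
    state = None
    for _, _, msg in entries:
        m = STATE.search(msg)
        if m and m.group("core") == core:
            state = m.group("state")
    return state


def gaps(entries, threshold_seconds=60.0):
    """Return (seconds, message_before, message_after) for silences above the threshold."""
    found = []
    for (t0, _, m0), (t1, _, m1) in zip(entries, entries[1:]):
        delta = (t1 - t0).total_seconds()
        if delta > threshold_seconds:
            found.append((delta, m0, m1))
    return found


# Four lines copied from the replica's log excerpt in this thread.
sample = [
    "INFO - 2016-05-16 22:47:15.906; org.apache.solr.cloud.ZkController; "
    "publishing core=test_collection_112_shard1_replica2 state=down",
    "INFO - 2016-05-16 22:47:16.509; org.apache.solr.core.SolrResourceLoader; "
    "new SolrResourceLoader for directory: "
    "'/mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2/'",
    "INFO - 2016-05-16 22:49:18.409; org.apache.solr.core.SolrCore; "
    "[test_collection_112_shard1_replica2] Opening new SolrCore at "
    "/mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2/",
    "INFO - 2016-05-16 22:49:55.546; org.apache.solr.cloud.ZkController; "
    "publishing core=test_collection_112_shard1_replica2 state=recovering",
]

entries = parse(sample)
print(last_published_state(entries, "test_collection_112_shard1_replica2"))
for seconds, before, after in gaps(entries):
    print(f"{seconds:.1f}s of silence between: {before!r} and {after!r}")
```

Run against a full solr.log filtered to one core, this would show whether a core's last published state is "recovering" and where the startup sequence paused. It does not explain why the replication request never completes.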