My guess is that recovery is indeed successful but the leader is repeatedly marking this replicas as 'down' using what we call Leader-Initiated-Recovery or LIR. We need to understand why that is happening. Are there GC issues on this new node?
Can we see the logs on the leader from a bit before and after a line that should look like the following: "Put replica core=documents_shard1_replica2 coreNodeName="... Can we also see the complete cluster state? On Thu, Sep 1, 2016 at 9:15 PM, Jon Hawkesworth < jon.hawkeswo...@medquist.onmicrosoft.com> wrote: > Hi > > > > If anyone has any suggestions of things I could try to resolve my issue > where one replica on one of my solcloud 6.0.1 shards refuses to stay up, > I'd love to hear them. In fact, I'll get you something off your amazon > wishlist, within reason, if you can solve this puzzle. > > > > Today we pruned the dead replica, restarted the machine where it ran and > once the node had rejoined the cluster, we added a new replica. > > The replica was marked as Active for about 10 minutes then went down > > > > I put some example logging from below, but it looks much the same as last > time. > > > > There's a bunch of warnings about a checksum being different even though > the file size is the same and then RecoveryStrategy > > reports 'Could not publish as ACTIVE after succesful recovery' > > > > I think I've found where that message comes from in the code here: > https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a= > blob;f=solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java;h= > abd00aef19a731b42b314f8b526cdb2d77baf89f;hb=refs/heads/master > > (I am running 6.0.1 though so could have changed in latest devel). > > > > So it seems this chunk of code⦠> > > > 451 if (successfulRecovery) { > > 452 LOG.info("Registering as Active after recovery."); > > 453 try { > > 454 zkController.publish(core.getCoreDescriptor(), > Replica.State.ACTIVE); > > 455 } catch (Exception e) { > > 456 LOG.error("Could not publish as ACTIVE after succesful > recovery", e); > > 457 successfulRecovery = false; > > 458 } > > 459 > > 460 if (successfulRecovery) { > > 461 close = true; > > 462 recoveryListener.recovered(); > > 463 } > > 464 } > > > > results in this: > > > > org.apache.solr.common.SolrException: Cannot publish state of core > 'documents_shard1_replica2' as active without recovering first! > > at org.apache.solr.cloud.ZkController.publish( > ZkController.java:1141) > > at org.apache.solr.cloud.ZkController.publish( > ZkController.java:1097) > > at org.apache.solr.cloud.ZkController.publish( > ZkController.java:1093) > > at org.apache.solr.cloud.RecoveryStrategy.doRecovery( > RecoveryStrategy.java:457) > > at org.apache.solr.cloud.RecoveryStrategy.run( > RecoveryStrategy.java:224) > > at java.util.concurrent.Executors$RunnableAdapter.call(Unknown > Source) > > at java.util.concurrent.FutureTask.run(Unknown Source) > > at org.apache.solr.common.util.ExecutorUtil$ > MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229) > > at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown > Source) > > at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown > Source) > > at java.lang.Thread.run(Unknown Source) > > > > I don't yet understand the interaction with zookeeper but there's some > disagreement about whether recovery has happened or not (if it hadn't from > solr's point of view the successfulRecovery boolean would presumably be > false. > > > > Should I raise a JIRA? Is there any other useful information I could > gather? > > > > I haven't really had any similar problems with the other 3 shards, just > shard1. > > > > The nodes that it is running on are all pretty similar - all vms built to > the same specification and the deployment of java and solrcloud is > automated so there shouldn't be any differences in the stack. > > > > Many thanks, > > > > Jon > > > > > > > > > > Example log output below > > > > > > WARN false > > > > > > IndexFetcher > > > > File _jnux.si did not match. expected checksum is 1186898951 and actual > is checksum 1994281621. expected length is 417 and actual length is 417 > > 9/1/2016, 12:37:06 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnuy.nvd did not match. expected checksum is 2200422612 and actual > is checksum 3635321041. expected length is 63 and actual length is 65 > > 9/1/2016, 12:37:06 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnuy.fdx did not match. expected checksum is 281622189 and actual is > checksum 838341528. expected length is 84 and actual length is 84 > > 9/1/2016, 12:37:06 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnuy.nvm did not match. expected checksum is 1875012021 and actual > is checksum 524812847. expected length is 108 and actual length is 108 > > 9/1/2016, 12:37:06 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnuy.fnm did not match. expected checksum is 1681449973 and actual > is checksum 3351426142. expected length is 1265 and actual length is 1265 > > 9/1/2016, 12:37:06 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnuy_Lucene54_0.dvm did not match. expected checksum is 355987228 > and actual is checksum 847034886. expected length is 380 and actual length > is 404 > > 9/1/2016, 12:37:06 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnuy_Lucene50_0.pos did not match. expected checksum is 806636274 > and actual is checksum 2272195325. expected length is 1059 and actual > length is 1172 > > 9/1/2016, 12:37:06 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnuy_Lucene50_0.doc did not match. expected checksum is 4041316671 > and actual is checksum 3122885740. expected length is 212 and actual > length is 281 > > 9/1/2016, 12:37:06 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnuy_Lucene50_0.tim did not match. expected checksum is 2891628412 > and actual is checksum 2420913910. expected length is 5346 and actual > length is 6251 > > 9/1/2016, 12:37:06 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnuy_Lucene50_0.tip did not match. expected checksum is 1652105503 > and actual is checksum 807238796. expected length is 336 and actual length > is 349 > > 9/1/2016, 12:37:06 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnuy_Lucene54_0.dvd did not match. expected checksum is 2664049801 > and actual is checksum 2930561414. expected length is 130 and actual length > is 145 > > 9/1/2016, 12:37:06 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnuy.fdt did not match. expected checksum is 4175958592 and actual > is checksum 3650490510. expected length is 4280 and actual length is 4983 > > 9/1/2016, 12:37:07 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnuy.si did not match. expected checksum is 2223401636 and actual > is checksum 734463570. expected length is 535 and actual length is 535 > > 9/1/2016, 12:37:07 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnuz_Lucene54_0.dvd did not match. expected checksum is 202072236 > and actual is checksum 4194802930. expected length is 96 and actual > length is 264 > > 9/1/2016, 12:37:07 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnuz_Lucene50_0.tip did not match. expected checksum is 2123658306 > and actual is checksum 435878007. expected length is 298 and actual length > is 639 > > 9/1/2016, 12:37:07 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnuz.nvd did not match. expected checksum is 4214748910 and actual > is checksum 3784036105. expected length is 59 and actual length is 77 > > 9/1/2016, 12:37:07 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnuz.fdt did not match. expected checksum is 3837568601 and actual > is checksum 2542454689. expected length is 896 and actual length is 20338 > > 9/1/2016, 12:37:07 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnuz.fdx did not match. expected checksum is 2070429440 and actual > is checksum 3279752998. expected length is 84 and actual length is 86 > > 9/1/2016, 12:37:07 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnuz_Lucene50_0.pos did not match. expected checksum is 2299588010 > and actual is checksum 2299553846. expected length is 190 and actual > length is 5717 > > 9/1/2016, 12:37:07 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnuz_Lucene54_0.dvm did not match. expected checksum is 914650440 > and actual is checksum 2852383192. expected length is 312 and actual length > is 548 > > 9/1/2016, 12:37:07 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnuz.nvm did not match. expected checksum is 3037735995 and actual > is checksum 4023026424. expected length is 108 and actual length is 108 > > 9/1/2016, 12:37:07 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnuz_Lucene50_0.doc did not match. expected checksum is 3813274592 > and actual is checksum 189237707. expected length is 110 and actual length > is 1945 > > 9/1/2016, 12:37:07 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnuz_Lucene50_0.tim did not match. expected checksum is 3013245878 > and actual is checksum 2122722316. expected length is 1757 and actual > length is 16642 > > 9/1/2016, 12:37:07 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnuz.fnm did not match. expected checksum is 2117653105 and actual > is checksum 3401755804. expected length is 1265 and actual length is 1265 > > 9/1/2016, 12:37:07 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnuz.si did not match. expected checksum is 2715978927 and actual > is checksum 3653125964. expected length is 535 and actual length is 535 > > 9/1/2016, 12:37:07 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnv0.fdt did not match. expected checksum is 1699019853 and actual > is checksum 3731775500. expected length is 15865 and actual length is 12728 > > 9/1/2016, 12:37:07 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnv0_Lucene50_0.pos did not match. expected checksum is 2189908204 > and actual is checksum 2338139479. expected length is 4475 and actual > length is 3431 > > 9/1/2016, 12:37:07 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnv0_Lucene50_0.doc did not match. expected checksum is 1522019614 > and actual is checksum 969681917. expected length is 1394 and actual length > is 1093 > > 9/1/2016, 12:37:07 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnv0_Lucene50_0.tim did not match. expected checksum is 813529901 > and actual is checksum 529669468. expected length is 13843 and actual > length is 12535 > > 9/1/2016, 12:37:07 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnv0.si did not match. expected checksum is 3802482417 and actual > is checksum 1865633126. expected length is 535 and actual length is 535 > > 9/1/2016, 12:37:07 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnv0_Lucene54_0.dvm did not match. expected checksum is 4236057860 > and actual is checksum 2986112802. expected length is 500 and actual length > is 476 > > 9/1/2016, 12:37:07 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnv0.fdx did not match. expected checksum is 2497099401 and actual > is checksum 990046808. expected length is 85 and actual length is 84 > > 9/1/2016, 12:37:07 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnv0.nvd did not match. expected checksum is 1736308969 and actual > is checksum 3657480551. expected length is 73 and actual length is 71 > > 9/1/2016, 12:37:07 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnv0_Lucene50_0.tip did not match. expected checksum is 1362235492 > and actual is checksum 640196019. expected length is 570 and actual length > is 531 > > 9/1/2016, 12:37:07 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnv0.fnm did not match. expected checksum is 1975043794 and actual > is checksum 1035049893. expected length is 1265 and actual length is 1265 > > 9/1/2016, 12:37:07 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnv0.nvm did not match. expected checksum is 2985228383 and actual > is checksum 2603407196. expected length is 108 and actual length is 108 > > 9/1/2016, 12:37:07 PM > > > > WARN false > > > > > > IndexFetcher > > > > File _jnv0_Lucene54_0.dvd did not match. expected checksum is 762056409 > and actual is checksum 1514176651. expected length is 228 and actual length > is 211 > > 9/1/2016, 12:37:09 PM > > > > WARN false > > > > > > UpdateLog > > > > Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_ > replica2\data\tlog\tlog.0000000000000000005 refcount=2} active=true > starting pos=20222 > > 9/1/2016, 12:37:10 PM > > > > WARN false > > > > > > UpdateLog > > > > Log replay finished. recoveryInfo=RecoveryInfo{adds=10 deletes=0 > deleteByQuery=1 errors=0 positionOfStart=20222} > > 9/1/2016, 12:37:10 PM > > > > ERROR false > > > > > > RecoveryStrategy > > > > Could not publish as ACTIVE after succesful recovery > > 9/1/2016, 12:37:10 PM > > > > ERROR false > > > > > > RecoveryStrategy > > > > Recovery failed - trying again... (0) > > 9/1/2016, 12:37:37 PM > > > > WARN false > > > > > > UpdateLog > > > > Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_ > replica2\data\tlog\tlog.0000000000000000006 refcount=2} active=true > starting pos=0 > > 9/1/2016, 12:37:38 PM > > > > WARN false > > > > > > UpdateLog > > > > Log replay finished. recoveryInfo=RecoveryInfo{adds=2 deletes=0 > deleteByQuery=0 errors=0 positionOfStart=0} > > 9/1/2016, 12:37:41 PM > > > > WARN false > > > > > > RecoveryStrategy > > > > Stopping recovery for core=[transcribedReports_shard1_replica2] > coreNodeName=[core_node14] > > 9/1/2016, 12:42:13 PM > > > > WARN false > > > > > > UpdateLog > > > > Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_ > replica2\data\tlog\tlog.0000000000000000007 refcount=2} active=true > starting pos=1748 > > 9/1/2016, 12:42:14 PM > > > > WARN false > > > > > > UpdateLog > > > > Log replay finished. recoveryInfo=RecoveryInfo{adds=12 deletes=0 > deleteByQuery=0 errors=0 positionOfStart=1748} > > 9/1/2016, 12:42:14 PM > > > > ERROR false > > > > > > RecoveryStrategy > > > > Could not publish as ACTIVE after succesful recovery > > 9/1/2016, 12:42:14 PM > > > > ERROR false > > > > > > RecoveryStrategy > > > > Recovery failed - trying again... (0) > > 9/1/2016, 12:42:43 PM > > > > ERROR false > > > > > > RecoveryStrategy > > > > Could not publish as ACTIVE after succesful recovery > > 9/1/2016, 12:42:43 PM > > > > ERROR false > > > > > > RecoveryStrategy > > > > Recovery failed - trying again... (0) > > > > *Jon Hawkesworth* > Software Developer > > > > > > Hanley Road, Malvern, WR13 6NP. UK > > O: +44 (0) 1684 312313 > > *jon.hawkeswo...@mmodal.com <jon.hawkeswo...@mmodal.com> www.mmodal.com > <http://www.medquist.com/>* > > > > *This electronic mail transmission contains confidential information > intended only for the person(s) named. Any use, distribution, copying or > disclosure by another person is strictly prohibited. If you are not the > intended recipient of this e-mail, promptly delete it and all attachments.* > > > -- Regards, Shalin Shekhar Mangar.