Hi If anyone has any suggestions of things I could try to resolve my issue where one replica on one of my solcloud 6.0.1 shards refuses to stay up, I'd love to hear them. In fact, I'll get you something off your amazon wishlist, within reason, if you can solve this puzzle.
Today we pruned the dead replica, restarted the machine where it ran and once the node had rejoined the cluster, we added a new replica. The replica was marked as Active for about 10 minutes then went down I put some example logging from below, but it looks much the same as last time. There's a bunch of warnings about a checksum being different even though the file size is the same and then RecoveryStrategy reports 'Could not publish as ACTIVE after succesful recovery' I think I've found where that message comes from in the code here: https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=blob;f=solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java;h=abd00aef19a731b42b314f8b526cdb2d77baf89f;hb=refs/heads/master (I am running 6.0.1 though so could have changed in latest devel). So it seems this chunk of code... 451 if (successfulRecovery) { 452 LOG.info("Registering as Active after recovery."); 453 try { 454 zkController.publish(core.getCoreDescriptor(), Replica.State.ACTIVE); 455 } catch (Exception e) { 456 LOG.error("Could not publish as ACTIVE after succesful recovery", e); 457 successfulRecovery = false; 458 } 459 460 if (successfulRecovery) { 461 close = true; 462 recoveryListener.recovered(); 463 } 464 } results in this: org.apache.solr.common.SolrException: Cannot publish state of core 'documents_shard1_replica2' as active without recovering first! at org.apache.solr.cloud.ZkController.publish(ZkController.java:1141) at org.apache.solr.cloud.ZkController.publish(ZkController.java:1097) at org.apache.solr.cloud.ZkController.publish(ZkController.java:1093) at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:457) at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224) at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) I don't yet understand the interaction with zookeeper but there's some disagreement about whether recovery has happened or not (if it hadn't from solr's point of view the successfulRecovery boolean would presumably be false. Should I raise a JIRA? Is there any other useful information I could gather? I haven't really had any similar problems with the other 3 shards, just shard1. The nodes that it is running on are all pretty similar - all vms built to the same specification and the deployment of java and solrcloud is automated so there shouldn't be any differences in the stack. Many thanks, Jon Example log output below WARN false IndexFetcher File _jnux.si did not match. expected checksum is 1186898951 and actual is checksum 1994281621. expected length is 417 and actual length is 417 9/1/2016, 12:37:06 PM WARN false IndexFetcher File _jnuy.nvd did not match. expected checksum is 2200422612 and actual is checksum 3635321041. expected length is 63 and actual length is 65 9/1/2016, 12:37:06 PM WARN false IndexFetcher File _jnuy.fdx did not match. expected checksum is 281622189 and actual is checksum 838341528. expected length is 84 and actual length is 84 9/1/2016, 12:37:06 PM WARN false IndexFetcher File _jnuy.nvm did not match. expected checksum is 1875012021 and actual is checksum 524812847. expected length is 108 and actual length is 108 9/1/2016, 12:37:06 PM WARN false IndexFetcher File _jnuy.fnm did not match. expected checksum is 1681449973 and actual is checksum 3351426142. expected length is 1265 and actual length is 1265 9/1/2016, 12:37:06 PM WARN false IndexFetcher File _jnuy_Lucene54_0.dvm did not match. expected checksum is 355987228 and actual is checksum 847034886. expected length is 380 and actual length is 404 9/1/2016, 12:37:06 PM WARN false IndexFetcher File _jnuy_Lucene50_0.pos did not match. expected checksum is 806636274 and actual is checksum 2272195325. expected length is 1059 and actual length is 1172 9/1/2016, 12:37:06 PM WARN false IndexFetcher File _jnuy_Lucene50_0.doc did not match. expected checksum is 4041316671 and actual is checksum 3122885740. expected length is 212 and actual length is 281 9/1/2016, 12:37:06 PM WARN false IndexFetcher File _jnuy_Lucene50_0.tim did not match. expected checksum is 2891628412 and actual is checksum 2420913910. expected length is 5346 and actual length is 6251 9/1/2016, 12:37:06 PM WARN false IndexFetcher File _jnuy_Lucene50_0.tip did not match. expected checksum is 1652105503 and actual is checksum 807238796. expected length is 336 and actual length is 349 9/1/2016, 12:37:06 PM WARN false IndexFetcher File _jnuy_Lucene54_0.dvd did not match. expected checksum is 2664049801 and actual is checksum 2930561414. expected length is 130 and actual length is 145 9/1/2016, 12:37:06 PM WARN false IndexFetcher File _jnuy.fdt did not match. expected checksum is 4175958592 and actual is checksum 3650490510. expected length is 4280 and actual length is 4983 9/1/2016, 12:37:07 PM WARN false IndexFetcher File _jnuy.si did not match. expected checksum is 2223401636 and actual is checksum 734463570. expected length is 535 and actual length is 535 9/1/2016, 12:37:07 PM WARN false IndexFetcher File _jnuz_Lucene54_0.dvd did not match. expected checksum is 202072236 and actual is checksum 4194802930. expected length is 96 and actual length is 264 9/1/2016, 12:37:07 PM WARN false IndexFetcher File _jnuz_Lucene50_0.tip did not match. expected checksum is 2123658306 and actual is checksum 435878007. expected length is 298 and actual length is 639 9/1/2016, 12:37:07 PM WARN false IndexFetcher File _jnuz.nvd did not match. expected checksum is 4214748910 and actual is checksum 3784036105. expected length is 59 and actual length is 77 9/1/2016, 12:37:07 PM WARN false IndexFetcher File _jnuz.fdt did not match. expected checksum is 3837568601 and actual is checksum 2542454689. expected length is 896 and actual length is 20338 9/1/2016, 12:37:07 PM WARN false IndexFetcher File _jnuz.fdx did not match. expected checksum is 2070429440 and actual is checksum 3279752998. expected length is 84 and actual length is 86 9/1/2016, 12:37:07 PM WARN false IndexFetcher File _jnuz_Lucene50_0.pos did not match. expected checksum is 2299588010 and actual is checksum 2299553846. expected length is 190 and actual length is 5717 9/1/2016, 12:37:07 PM WARN false IndexFetcher File _jnuz_Lucene54_0.dvm did not match. expected checksum is 914650440 and actual is checksum 2852383192. expected length is 312 and actual length is 548 9/1/2016, 12:37:07 PM WARN false IndexFetcher File _jnuz.nvm did not match. expected checksum is 3037735995 and actual is checksum 4023026424. expected length is 108 and actual length is 108 9/1/2016, 12:37:07 PM WARN false IndexFetcher File _jnuz_Lucene50_0.doc did not match. expected checksum is 3813274592 and actual is checksum 189237707. expected length is 110 and actual length is 1945 9/1/2016, 12:37:07 PM WARN false IndexFetcher File _jnuz_Lucene50_0.tim did not match. expected checksum is 3013245878 and actual is checksum 2122722316. expected length is 1757 and actual length is 16642 9/1/2016, 12:37:07 PM WARN false IndexFetcher File _jnuz.fnm did not match. expected checksum is 2117653105 and actual is checksum 3401755804. expected length is 1265 and actual length is 1265 9/1/2016, 12:37:07 PM WARN false IndexFetcher File _jnuz.si did not match. expected checksum is 2715978927 and actual is checksum 3653125964. expected length is 535 and actual length is 535 9/1/2016, 12:37:07 PM WARN false IndexFetcher File _jnv0.fdt did not match. expected checksum is 1699019853 and actual is checksum 3731775500. expected length is 15865 and actual length is 12728 9/1/2016, 12:37:07 PM WARN false IndexFetcher File _jnv0_Lucene50_0.pos did not match. expected checksum is 2189908204 and actual is checksum 2338139479. expected length is 4475 and actual length is 3431 9/1/2016, 12:37:07 PM WARN false IndexFetcher File _jnv0_Lucene50_0.doc did not match. expected checksum is 1522019614 and actual is checksum 969681917. expected length is 1394 and actual length is 1093 9/1/2016, 12:37:07 PM WARN false IndexFetcher File _jnv0_Lucene50_0.tim did not match. expected checksum is 813529901 and actual is checksum 529669468. expected length is 13843 and actual length is 12535 9/1/2016, 12:37:07 PM WARN false IndexFetcher File _jnv0.si did not match. expected checksum is 3802482417 and actual is checksum 1865633126. expected length is 535 and actual length is 535 9/1/2016, 12:37:07 PM WARN false IndexFetcher File _jnv0_Lucene54_0.dvm did not match. expected checksum is 4236057860 and actual is checksum 2986112802. expected length is 500 and actual length is 476 9/1/2016, 12:37:07 PM WARN false IndexFetcher File _jnv0.fdx did not match. expected checksum is 2497099401 and actual is checksum 990046808. expected length is 85 and actual length is 84 9/1/2016, 12:37:07 PM WARN false IndexFetcher File _jnv0.nvd did not match. expected checksum is 1736308969 and actual is checksum 3657480551. expected length is 73 and actual length is 71 9/1/2016, 12:37:07 PM WARN false IndexFetcher File _jnv0_Lucene50_0.tip did not match. expected checksum is 1362235492 and actual is checksum 640196019. expected length is 570 and actual length is 531 9/1/2016, 12:37:07 PM WARN false IndexFetcher File _jnv0.fnm did not match. expected checksum is 1975043794 and actual is checksum 1035049893. expected length is 1265 and actual length is 1265 9/1/2016, 12:37:07 PM WARN false IndexFetcher File _jnv0.nvm did not match. expected checksum is 2985228383 and actual is checksum 2603407196. expected length is 108 and actual length is 108 9/1/2016, 12:37:07 PM WARN false IndexFetcher File _jnv0_Lucene54_0.dvd did not match. expected checksum is 762056409 and actual is checksum 1514176651. expected length is 228 and actual length is 211 9/1/2016, 12:37:09 PM WARN false UpdateLog Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_replica2\data\tlog\tlog.0000000000000000005 refcount=2} active=true starting pos=20222 9/1/2016, 12:37:10 PM WARN false UpdateLog Log replay finished. recoveryInfo=RecoveryInfo{adds=10 deletes=0 deleteByQuery=1 errors=0 positionOfStart=20222} 9/1/2016, 12:37:10 PM ERROR false RecoveryStrategy Could not publish as ACTIVE after succesful recovery 9/1/2016, 12:37:10 PM ERROR false RecoveryStrategy Recovery failed - trying again... (0) 9/1/2016, 12:37:37 PM WARN false UpdateLog Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_replica2\data\tlog\tlog.0000000000000000006 refcount=2} active=true starting pos=0 9/1/2016, 12:37:38 PM WARN false UpdateLog Log replay finished. recoveryInfo=RecoveryInfo{adds=2 deletes=0 deleteByQuery=0 errors=0 positionOfStart=0} 9/1/2016, 12:37:41 PM WARN false RecoveryStrategy Stopping recovery for core=[transcribedReports_shard1_replica2] coreNodeName=[core_node14] 9/1/2016, 12:42:13 PM WARN false UpdateLog Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_replica2\data\tlog\tlog.0000000000000000007 refcount=2} active=true starting pos=1748 9/1/2016, 12:42:14 PM WARN false UpdateLog Log replay finished. recoveryInfo=RecoveryInfo{adds=12 deletes=0 deleteByQuery=0 errors=0 positionOfStart=1748} 9/1/2016, 12:42:14 PM ERROR false RecoveryStrategy Could not publish as ACTIVE after succesful recovery 9/1/2016, 12:42:14 PM ERROR false RecoveryStrategy Recovery failed - trying again... (0) 9/1/2016, 12:42:43 PM ERROR false RecoveryStrategy Could not publish as ACTIVE after succesful recovery 9/1/2016, 12:42:43 PM ERROR false RecoveryStrategy Recovery failed - trying again... (0) Jon Hawkesworth Software Developer [cid:image002.png@01D20470.2D2826A0] Hanley Road, Malvern, WR13 6NP. UK O: +44 (0) 1684 312313 jon.hawkeswo...@mmodal.com www.mmodal.com<http://www.medquist.com/> This electronic mail transmission contains confidential information intended only for the person(s) named. Any use, distribution, copying or disclosure by another person is strictly prohibited. If you are not the intended recipient of this e-mail, promptly delete it and all attachments.