Hi

If anyone has any suggestions for things I could try to resolve my issue, where 
one replica on one of my SolrCloud 6.0.1 shards refuses to stay up, I'd love to 
hear them.  In fact, I'll get you something off your Amazon wishlist, within 
reason, if you can solve this puzzle.

Today we pruned the dead replica, restarted the machine where it ran and, once 
the node had rejoined the cluster, added a new replica.
The replica was marked as Active for about 10 minutes, then went down.

I've put some example logging below, but it looks much the same as last time.

There are a number of IndexFetcher warnings about checksums being different, 
in some cases even though the file size is the same, and then RecoveryStrategy 
reports 'Could not publish as ACTIVE after succesful recovery'.

I think I've found where that message comes from in the code here: 
https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=blob;f=solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java;h=abd00aef19a731b42b314f8b526cdb2d77baf89f;hb=refs/heads/master
(I am running 6.0.1, though, so the code may have changed in the latest development branch.)

So it seems this chunk of code...

    if (successfulRecovery) {
      LOG.info("Registering as Active after recovery.");
      try {
        zkController.publish(core.getCoreDescriptor(), Replica.State.ACTIVE);
      } catch (Exception e) {
        LOG.error("Could not publish as ACTIVE after succesful recovery", e);
        successfulRecovery = false;
      }

      if (successfulRecovery) {
        close = true;
        recoveryListener.recovered();
      }
    }

results in this:

org.apache.solr.common.SolrException: Cannot publish state of core 'documents_shard1_replica2' as active without recovering first!
    at org.apache.solr.cloud.ZkController.publish(ZkController.java:1141)
    at org.apache.solr.cloud.ZkController.publish(ZkController.java:1097)
    at org.apache.solr.cloud.ZkController.publish(ZkController.java:1093)
    at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:457)
    at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

I don't yet understand the interaction with ZooKeeper, but there seems to be 
some disagreement about whether recovery has happened or not (if it hadn't from 
Solr's point of view, the successfulRecovery boolean would presumably be false).
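
To make that apparent disagreement concrete, here is a minimal, self-contained 
sketch of the control flow in the chunk quoted above. It is not Solr code: the 
StatePublisher interface, the State enum and the finishRecovery method are 
hypothetical stand-ins, and the rejecting publisher simply assumes that 
something on the cluster side still regards the replica as unrecovered.

```java
import java.util.ArrayList;
import java.util.List;

public class RecoveryPublishSketch {

    enum State { DOWN, RECOVERING, ACTIVE }

    /** Hypothetical stand-in for the component that accepts or rejects a state change. */
    interface StatePublisher {
        void publish(String coreName, State state) throws Exception;
    }

    /** Collects messages in place of a real logger so the flow is easy to inspect. */
    static final List<String> log = new ArrayList<>();

    /**
     * Same shape as the quoted chunk: the local recovery work succeeded,
     * but the publish step can still veto the result.
     * Returns true only if ACTIVE was actually published.
     */
    static boolean finishRecovery(String coreName, boolean successfulRecovery,
                                  StatePublisher publisher) {
        if (successfulRecovery) {
            log.add("INFO Registering as Active after recovery.");
            try {
                publisher.publish(coreName, State.ACTIVE);
            } catch (Exception e) {
                log.add("ERROR Could not publish as ACTIVE after succesful recovery: "
                        + e.getMessage());
                successfulRecovery = false; // the caller would now retry recovery
            }
        }
        return successfulRecovery;
    }

    public static void main(String[] args) {
        // Assumption: the cluster-side view says this replica has not recovered yet.
        StatePublisher rejecting = (core, state) -> {
            throw new IllegalStateException("Cannot publish state of core '" + core
                    + "' as active without recovering first!");
        };
        // Assumption: the cluster-side view agrees with the local one.
        StatePublisher accepting = (core, state) -> { };

        System.out.println(finishRecovery("documents_shard1_replica2", true, rejecting)); // false
        System.out.println(finishRecovery("documents_shard1_replica2", true, accepting)); // true
        log.forEach(System.out::println);
    }
}
```

In this toy version the local flag starts out true in both runs, so the only 
thing that decides the outcome is what the publisher believes, which matches 
the pattern in the logs of a "successful" recovery followed by a retry.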

Should I raise a JIRA?  Is there any other useful information I could gather?

I haven't really had any similar problems with the other 3 shards, just shard1.

The nodes that it is running on are all pretty similar - all VMs built to the 
same specification, and the deployment of Java and SolrCloud is automated, so 
there shouldn't be any differences in the stack.

Many thanks,

Jon




Example log output below


WARN false  IndexFetcher  File _jnux.si did not match. expected checksum is 1186898951 and actual is checksum 1994281621. expected length is 417 and actual length is 417
9/1/2016, 12:37:06 PM  WARN false  IndexFetcher  File _jnuy.nvd did not match. expected checksum is 2200422612 and actual is checksum 3635321041. expected length is 63 and actual length is 65
9/1/2016, 12:37:06 PM  WARN false  IndexFetcher  File _jnuy.fdx did not match. expected checksum is 281622189 and actual is checksum 838341528. expected length is 84 and actual length is 84
9/1/2016, 12:37:06 PM  WARN false  IndexFetcher  File _jnuy.nvm did not match. expected checksum is 1875012021 and actual is checksum 524812847. expected length is 108 and actual length is 108
9/1/2016, 12:37:06 PM  WARN false  IndexFetcher  File _jnuy.fnm did not match. expected checksum is 1681449973 and actual is checksum 3351426142. expected length is 1265 and actual length is 1265
9/1/2016, 12:37:06 PM  WARN false  IndexFetcher  File _jnuy_Lucene54_0.dvm did not match. expected checksum is 355987228 and actual is checksum 847034886. expected length is 380 and actual length is 404
9/1/2016, 12:37:06 PM  WARN false  IndexFetcher  File _jnuy_Lucene50_0.pos did not match. expected checksum is 806636274 and actual is checksum 2272195325. expected length is 1059 and actual length is 1172
9/1/2016, 12:37:06 PM  WARN false  IndexFetcher  File _jnuy_Lucene50_0.doc did not match. expected checksum is 4041316671 and actual is checksum 3122885740. expected length is 212 and actual length is 281
9/1/2016, 12:37:06 PM  WARN false  IndexFetcher  File _jnuy_Lucene50_0.tim did not match. expected checksum is 2891628412 and actual is checksum 2420913910. expected length is 5346 and actual length is 6251
9/1/2016, 12:37:06 PM  WARN false  IndexFetcher  File _jnuy_Lucene50_0.tip did not match. expected checksum is 1652105503 and actual is checksum 807238796. expected length is 336 and actual length is 349
9/1/2016, 12:37:06 PM  WARN false  IndexFetcher  File _jnuy_Lucene54_0.dvd did not match. expected checksum is 2664049801 and actual is checksum 2930561414. expected length is 130 and actual length is 145
9/1/2016, 12:37:06 PM  WARN false  IndexFetcher  File _jnuy.fdt did not match. expected checksum is 4175958592 and actual is checksum 3650490510. expected length is 4280 and actual length is 4983
9/1/2016, 12:37:07 PM  WARN false  IndexFetcher  File _jnuy.si did not match. expected checksum is 2223401636 and actual is checksum 734463570. expected length is 535 and actual length is 535
9/1/2016, 12:37:07 PM  WARN false  IndexFetcher  File _jnuz_Lucene54_0.dvd did not match. expected checksum is 202072236 and actual is checksum 4194802930. expected length is 96 and actual length is 264
9/1/2016, 12:37:07 PM  WARN false  IndexFetcher  File _jnuz_Lucene50_0.tip did not match. expected checksum is 2123658306 and actual is checksum 435878007. expected length is 298 and actual length is 639
9/1/2016, 12:37:07 PM  WARN false  IndexFetcher  File _jnuz.nvd did not match. expected checksum is 4214748910 and actual is checksum 3784036105. expected length is 59 and actual length is 77
9/1/2016, 12:37:07 PM  WARN false  IndexFetcher  File _jnuz.fdt did not match. expected checksum is 3837568601 and actual is checksum 2542454689. expected length is 896 and actual length is 20338
9/1/2016, 12:37:07 PM  WARN false  IndexFetcher  File _jnuz.fdx did not match. expected checksum is 2070429440 and actual is checksum 3279752998. expected length is 84 and actual length is 86
9/1/2016, 12:37:07 PM  WARN false  IndexFetcher  File _jnuz_Lucene50_0.pos did not match. expected checksum is 2299588010 and actual is checksum 2299553846. expected length is 190 and actual length is 5717
9/1/2016, 12:37:07 PM  WARN false  IndexFetcher  File _jnuz_Lucene54_0.dvm did not match. expected checksum is 914650440 and actual is checksum 2852383192. expected length is 312 and actual length is 548
9/1/2016, 12:37:07 PM  WARN false  IndexFetcher  File _jnuz.nvm did not match. expected checksum is 3037735995 and actual is checksum 4023026424. expected length is 108 and actual length is 108
9/1/2016, 12:37:07 PM  WARN false  IndexFetcher  File _jnuz_Lucene50_0.doc did not match. expected checksum is 3813274592 and actual is checksum 189237707. expected length is 110 and actual length is 1945
9/1/2016, 12:37:07 PM  WARN false  IndexFetcher  File _jnuz_Lucene50_0.tim did not match. expected checksum is 3013245878 and actual is checksum 2122722316. expected length is 1757 and actual length is 16642
9/1/2016, 12:37:07 PM  WARN false  IndexFetcher  File _jnuz.fnm did not match. expected checksum is 2117653105 and actual is checksum 3401755804. expected length is 1265 and actual length is 1265
9/1/2016, 12:37:07 PM  WARN false  IndexFetcher  File _jnuz.si did not match. expected checksum is 2715978927 and actual is checksum 3653125964. expected length is 535 and actual length is 535
9/1/2016, 12:37:07 PM  WARN false  IndexFetcher  File _jnv0.fdt did not match. expected checksum is 1699019853 and actual is checksum 3731775500. expected length is 15865 and actual length is 12728
9/1/2016, 12:37:07 PM  WARN false  IndexFetcher  File _jnv0_Lucene50_0.pos did not match. expected checksum is 2189908204 and actual is checksum 2338139479. expected length is 4475 and actual length is 3431
9/1/2016, 12:37:07 PM  WARN false  IndexFetcher  File _jnv0_Lucene50_0.doc did not match. expected checksum is 1522019614 and actual is checksum 969681917. expected length is 1394 and actual length is 1093
9/1/2016, 12:37:07 PM  WARN false  IndexFetcher  File _jnv0_Lucene50_0.tim did not match. expected checksum is 813529901 and actual is checksum 529669468. expected length is 13843 and actual length is 12535
9/1/2016, 12:37:07 PM  WARN false  IndexFetcher  File _jnv0.si did not match. expected checksum is 3802482417 and actual is checksum 1865633126. expected length is 535 and actual length is 535
9/1/2016, 12:37:07 PM  WARN false  IndexFetcher  File _jnv0_Lucene54_0.dvm did not match. expected checksum is 4236057860 and actual is checksum 2986112802. expected length is 500 and actual length is 476
9/1/2016, 12:37:07 PM  WARN false  IndexFetcher  File _jnv0.fdx did not match. expected checksum is 2497099401 and actual is checksum 990046808. expected length is 85 and actual length is 84
9/1/2016, 12:37:07 PM  WARN false  IndexFetcher  File _jnv0.nvd did not match. expected checksum is 1736308969 and actual is checksum 3657480551. expected length is 73 and actual length is 71
9/1/2016, 12:37:07 PM  WARN false  IndexFetcher  File _jnv0_Lucene50_0.tip did not match. expected checksum is 1362235492 and actual is checksum 640196019. expected length is 570 and actual length is 531
9/1/2016, 12:37:07 PM  WARN false  IndexFetcher  File _jnv0.fnm did not match. expected checksum is 1975043794 and actual is checksum 1035049893. expected length is 1265 and actual length is 1265
9/1/2016, 12:37:07 PM  WARN false  IndexFetcher  File _jnv0.nvm did not match. expected checksum is 2985228383 and actual is checksum 2603407196. expected length is 108 and actual length is 108
9/1/2016, 12:37:07 PM  WARN false  IndexFetcher  File _jnv0_Lucene54_0.dvd did not match. expected checksum is 762056409 and actual is checksum 1514176651. expected length is 228 and actual length is 211
9/1/2016, 12:37:09 PM  WARN false  UpdateLog  Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_replica2\data\tlog\tlog.0000000000000000005 refcount=2} active=true starting pos=20222
9/1/2016, 12:37:10 PM  WARN false  UpdateLog  Log replay finished. recoveryInfo=RecoveryInfo{adds=10 deletes=0 deleteByQuery=1 errors=0 positionOfStart=20222}
9/1/2016, 12:37:10 PM  ERROR false  RecoveryStrategy  Could not publish as ACTIVE after succesful recovery
9/1/2016, 12:37:10 PM  ERROR false  RecoveryStrategy  Recovery failed - trying again... (0)
9/1/2016, 12:37:37 PM  WARN false  UpdateLog  Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_replica2\data\tlog\tlog.0000000000000000006 refcount=2} active=true starting pos=0
9/1/2016, 12:37:38 PM  WARN false  UpdateLog  Log replay finished. recoveryInfo=RecoveryInfo{adds=2 deletes=0 deleteByQuery=0 errors=0 positionOfStart=0}
9/1/2016, 12:37:41 PM  WARN false  RecoveryStrategy  Stopping recovery for core=[transcribedReports_shard1_replica2] coreNodeName=[core_node14]
9/1/2016, 12:42:13 PM  WARN false  UpdateLog  Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_replica2\data\tlog\tlog.0000000000000000007 refcount=2} active=true starting pos=1748
9/1/2016, 12:42:14 PM  WARN false  UpdateLog  Log replay finished. recoveryInfo=RecoveryInfo{adds=12 deletes=0 deleteByQuery=0 errors=0 positionOfStart=1748}
9/1/2016, 12:42:14 PM  ERROR false  RecoveryStrategy  Could not publish as ACTIVE after succesful recovery
9/1/2016, 12:42:14 PM  ERROR false  RecoveryStrategy  Recovery failed - trying again... (0)
9/1/2016, 12:42:43 PM  ERROR false  RecoveryStrategy  Could not publish as ACTIVE after succesful recovery
9/1/2016, 12:42:43 PM  ERROR false  RecoveryStrategy  Recovery failed - trying again... (0)
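
Since the IndexFetcher warnings above mix two cases (checksum differs with the 
same length, and checksum differs with a different length), it may help to 
count them separately. The following is only a rough sketch: the class and 
method names are made up for illustration, and the pattern is written against 
the message text as it appears in this log, tolerating line wrapping.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IndexFetcherMismatchTally {

    // Matches the warning text as pasted above; \s+ tolerates wrapped lines.
    private static final Pattern MISMATCH = Pattern.compile(
        "File\\s+(\\S+)\\s+did\\s+not\\s+match\\.\\s+expected\\s+checksum\\s+is\\s+(\\d+)"
        + "\\s+and\\s+actual\\s+is\\s+checksum\\s+(\\d+)\\.\\s+expected\\s+length\\s+is\\s+(\\d+)"
        + "\\s+and\\s+actual\\s+length\\s+is\\s+(\\d+)");

    /** Returns {sameLength, differentLength} counts for all warnings found in the text. */
    static int[] tally(String logText) {
        int same = 0;
        int different = 0;
        Matcher m = MISMATCH.matcher(logText);
        while (m.find()) {
            long expectedLength = Long.parseLong(m.group(4));
            long actualLength = Long.parseLong(m.group(5));
            if (expectedLength == actualLength) {
                same++;
            } else {
                different++;
            }
        }
        return new int[] { same, different };
    }

    public static void main(String[] args) {
        // Two warnings copied from the log above, including the original wrapping.
        String sample =
            "File _jnux.si did not match. expected checksum is 1186898951 and actual is \n"
            + "checksum 1994281621. expected length is 417 and actual length is 417\n"
            + "File _jnuy.nvd did not match. expected checksum is 2200422612 and actual is \n"
            + "checksum 3635321041. expected length is 63 and actual length is 65\n";
        int[] counts = tally(sample);
        System.out.println("same length: " + counts[0] + ", different length: " + counts[1]);
    }
}
```

In the log above, the same-length mismatches are on the .si, .fnm and .nvm 
files plus one .fdx file, so splitting the counts this way might show whether 
that pattern holds across a longer log.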

Jon Hawkesworth
Software Developer


Hanley Road, Malvern, WR13 6NP. UK
O: +44 (0) 1684 312313
jon.hawkeswo...@mmodal.com
www.mmodal.com<http://www.medquist.com/>
