For what it's worth, we have basic auth enabled on our cluster, in case that 
makes a difference, although it seems unlikely: the problem appears to be 
something specific to shard1, as the other shards look healthy.

Jon

-----Original Message-----
From: Jon Hawkesworth 
Sent: Wednesday, August 31, 2016 3:14 PM
To: 'solr-user@lucene.apache.org'
Subject: RE: solrcloud 6.0.1 any suggestions for fixing a replica that 
stubbornly remains down

A little more information - I spotted this in the logging:

Error while trying to recover. core=documents_shard1_replica2:java.util.concurrent.ExecutionException: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://10.201.19.8:8983/solr: I was asked to wait on state recovering for shard1 in documents on 10.201.19.9:8983_solr but I still do not see the requested state. I see state: down live:true leader from ZK: http://10.201.19.8:8983/solr/documents_shard1_replica1/
        at java.util.concurrent.FutureTask.report(Unknown Source)
        at java.util.concurrent.FutureTask.get(Unknown Source)
        at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:596)
        at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:353)
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://10.201.19.8:8983/solr: I was asked to wait on state recovering for shard1 in documents on 10.201.19.9:8983_solr but I still do not see the requested state. I see state: down live:true leader from ZK: http://10.201.19.8:8983/solr/documents_shard1_replica1/
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:577)
        at org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:285)
        at org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:281)
        ... 5 more
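
For reference, the state the leader is complaining about can be cross-checked 
against what ZooKeeper has actually published, e.g. with a CLUSTERSTATUS call. 
A rough sketch, reusing the host, collection and shard names from the error 
above (adjust for your own cluster, and pass credentials if you have 
authentication enabled):

    curl "http://10.201.19.8:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=documents&shard=shard1&wt=json"

That should show the replica state (down/recovering/active) and the leader as 
ZooKeeper sees them.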

-----Original Message-----
From: Jon Hawkesworth [mailto:jon.hawkeswo...@medquist.onmicrosoft.com] 
Sent: Wednesday, August 31, 2016 2:09 PM
To: solr-user@lucene.apache.org
Subject: RE: solrcloud 6.0.1 any suggestions for fixing a replica that 
stubbornly remains down

Just to follow up on this.  Sorry it's taken a while, but the need for a 
downtime window has slowed things down somewhat.
The index checker didn't find any errors on either replica of the shard that 
refused to come out of the Down state.
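
(For anyone wanting to run the same kind of check: this sort of verification 
can be done offline with Lucene's CheckIndex tool pointed at the replica's 
index directory while the node is stopped. Roughly the following, though the 
lucene-core jar version and the core path here are just from our layout and 
will differ elsewhere:

    java -cp server\solr-webapp\webapp\WEB-INF\lib\lucene-core-6.0.1.jar org.apache.lucene.index.CheckIndex E:\solr_home\transcribedReports_shard1_replica2\data\index
)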

By deleting the down replica, stopping all the solr nodes, deleting the 
leader-initiated recovery node in zookeeper for the affected shard, restarting 
the solr nodes, and creating a new replica, we were able to get a fully 
recovered cluster for a few minutes.
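
Roughly, the sequence was along the lines below. The replica name, ZooKeeper 
host and znode path here are illustrative (based on our layout and the 
SOLR-7021 workaround), so treat it as a sketch rather than a recipe:

    REM 1. delete the stuck replica
    curl "http://10.201.19.8:8983/solr/admin/collections?action=DELETEREPLICA&collection=documents&shard=shard1&replica=core_node7"

    REM 2. with all solr nodes stopped, remove the leader-initiated recovery znode
    server\scripts\cloud-scripts\zkcli.bat -zkhost zk1:2181 -cmd clear /collections/documents/leader_initiated_recovery/shard1

    REM 3. restart the solr nodes, then add a fresh replica on the desired node
    curl "http://10.201.19.8:8983/solr/admin/collections?action=ADDREPLICA&collection=documents&shard=shard1&node=10.201.19.9:8983_solr"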

However, as soon as we started indexing again, the shard1 replica went into the 
down state again.  We stopped indexing temporarily and eventually the replica 
recovered (I don't recall now whether we also restarted the node that was 
hosting the down replica).

We have had to turn indexing on again as our downtime window was closing, and 
that replica is still stuck in the down state, with the following kinds of 
errors appearing in the logs for the node (see below).

If anyone has any further suggestions I would be very grateful to hear them.

Some more info about our system, in case it is relevant:

The ingestion program is a .NET exe which uses SolrNet to send documents to 
solr via a load balancer.
The solr 6.0.1 nodes are running on Windows Server 2012 R2.

We discovered that the load balancer had been sending documents for indexing to 
just 2 of the 4 nodes.  That has been resolved, but one replica still 
remains Down two hours later.
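
The REQUESTRECOVERY mentioned further down the thread was a CoreAdmin call 
along these lines (host and core name are illustrative of our layout):

    curl "http://10.201.19.9:8983/solr/admin/cores?action=REQUESTRECOVERY&core=documents_shard1_replica2"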

Jon

Error log sample:

Log4j (org.slf4j.impl.Log4jLoggerFactory)
Time (Local)           Level  Logger            Message
8/31/2016, 1:06:22 PM  WARN   UpdateLog         Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_replica2\data\tlog\tlog.0000000000002920482 refcount=2} active=true starting pos=0
8/31/2016, 1:06:22 PM  WARN   UpdateLog         Log replay finished. recoveryInfo=RecoveryInfo{adds=14 deletes=0 deleteByQuery=0 errors=0 positionOfStart=0}
8/31/2016, 1:06:30 PM  WARN   UpdateLog         Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_replica2\data\tlog\tlog.0000000000002920483 refcount=2} active=true starting pos=0
8/31/2016, 1:06:31 PM  WARN   UpdateLog         Log replay finished. recoveryInfo=RecoveryInfo{adds=79 deletes=0 deleteByQuery=7 errors=0 positionOfStart=0}
8/31/2016, 1:06:42 PM  WARN   RecoveryStrategy  Stopping recovery for core=[transcribedReports_shard1_replica2] coreNodeName=[core_node7]
8/31/2016, 1:10:41 PM  WARN   IndexFetcher      File _jhqh_Lucene50_0.pos did not match. expected checksum is 3474133002 and actual is checksum 1597775173. expected length is 6754 and actual length is 6754
8/31/2016, 1:10:41 PM  WARN   IndexFetcher      File _jhqh.nvm did not match. expected checksum is 2175070547 and actual is checksum 916368144. expected length is 108 and actual length is 108
8/31/2016, 1:10:41 PM  WARN   IndexFetcher      File _jhqh.fnm did not match. expected checksum is 699955709 and actual is checksum 1393673372. expected length is 1265 and actual length is 1265
8/31/2016, 1:10:41 PM  WARN   IndexFetcher      File _jhqh.fdx did not match. expected checksum is 3794586996 and actual is checksum 3444689798. expected length is 86 and actual length is 86
8/31/2016, 1:10:41 PM  WARN   IndexFetcher      File _jhqh.nvd did not match. expected checksum is 481052979 and actual is checksum 478734907. expected length is 101 and actual length is 101
8/31/2016, 1:10:41 PM  WARN   IndexFetcher      File _jhqh.fdt did not match. expected checksum is 959677932 and actual is checksum 3014384448. expected length is 24686 and actual length is 24686
8/31/2016, 1:10:41 PM  WARN   IndexFetcher      File _jhqh.si did not match. expected checksum is 3370845336 and actual is checksum 2875338275. expected length is 535 and actual length is 535
8/31/2016, 1:10:41 PM  WARN   IndexFetcher      File _jhqh_Lucene54_0.dvd did not match. expected checksum is 3692441424 and actual is checksum 827877016. expected length is 374 and actual length is 374
8/31/2016, 1:10:41 PM  WARN   IndexFetcher      File _jhqh_Lucene50_0.tip did not match. expected checksum is 4147813022 and actual is checksum 1614500446. expected length is 702 and actual length is 702
8/31/2016, 1:10:41 PM  WARN   IndexFetcher      File _jhqh_Lucene50_0.tim did not match. expected checksum is 2595739732 and actual is checksum 908816165. expected length is 19898 and actual length is 19898
8/31/2016, 1:10:41 PM  WARN   IndexFetcher      File _jhqh_Lucene54_0.dvm did not match. expected checksum is 3783736439 and actual is checksum 515112915. expected length is 668 and actual length is 668
8/31/2016, 1:10:41 PM  WARN   IndexFetcher      File _jhqh_Lucene50_0.doc did not match. expected checksum is 711973339 and actual is checksum 3741904955. expected length is 2606 and actual length is 2606
8/31/2016, 1:10:42 PM  WARN   UpdateLog         Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_replica2\data\tlog\tlog.0000000000002920484 refcount=2} active=true starting pos=4153
8/31/2016, 1:10:42 PM  WARN   UpdateLog         Log replay finished. recoveryInfo=RecoveryInfo{adds=17 deletes=0 deleteByQuery=1 errors=0 positionOfStart=4153}
8/31/2016, 1:10:42 PM  ERROR  RecoveryStrategy  Could not publish as ACTIVE after succesful recovery

org.apache.solr.common.SolrException: Cannot publish state of core 'transcribedReports_shard1_replica2' as active without recovering first!
        at org.apache.solr.cloud.ZkController.publish(ZkController.java:1141)
        at org.apache.solr.cloud.ZkController.publish(ZkController.java:1097)
        at org.apache.solr.cloud.ZkController.publish(ZkController.java:1093)
        at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:457)
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
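
For reference, whether a leader-initiated recovery flag is still present in 
ZooKeeper for this core can be checked with the zkcli script bundled with Solr; 
a sketch, with the ZooKeeper host illustrative and the znode path based on the 
layout we deleted from earlier:

    server\scripts\cloud-scripts\zkcli.bat -zkhost zk1:2181 -cmd get /collections/transcribedReports/leader_initiated_recovery/shard1/core_node7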






From: Jon Hawkesworth [mailto:jon.hawkeswo...@medquist.onmicrosoft.com]
Sent: Friday, August 26, 2016 6:48 AM
To: solr-user@lucene.apache.org
Subject: RE: solrcloud 6.0.1 any suggestions for fixing a replica that 
stubbornly remains down

Thanks for your suggestion.  Here's a chunk of info from the logging in the 
solr admin page below.  Is there somewhere else I should be looking too?

It looks to me like it's stuck in a never-ending loop of attempting a recovery 
that fails.

I don't know whether the warnings from IndexFetcher are relevant, and if they 
are, what I can do about them.
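
If it helps in comparing notes: one way to see whether the replica is actually 
converging on the leader's index is to compare index versions via the 
replication handler on both cores, something like the following (hosts and core 
names are just illustrative of our layout):

    curl "http://<leader-host>:8983/solr/transcribedReports_shard1_replica1/replication?command=indexversion&wt=json"
    curl "http://<replica-host>:8983/solr/transcribedReports_shard1_replica3/replication?command=indexversion&wt=json"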

Our system has been feeding 150k docs a day into this cluster for nearly two 
months now.  I have a backlog of approx 45 million more documents I need to get 
loaded, but until the cluster looks healthy it would be foolish to start 
loading even more.

Jon


Time (Local)           Level  Logger            Message
8/26/2016, 6:17:52 AM  WARN   IndexFetcher      File _iu0b.nvm did not match. expected checksum is 1754812894 and actual is checksum 3450541029. expected length is 108 and actual length is 108
8/26/2016, 6:17:52 AM  WARN   IndexFetcher      File _iu0b.fnm did not match. expected checksum is 2714900770 and actual is checksum 1393668596. expected length is 1265 and actual length is 1265
8/26/2016, 6:17:52 AM  WARN   IndexFetcher      File _iu0b_Lucene50_0.doc did not match. expected checksum is 1374818988 and actual is checksum 1039421217. expected length is 110 and actual length is 433
8/26/2016, 6:17:52 AM  WARN   IndexFetcher      File _iu0b_Lucene50_0.tim did not match. expected checksum is 1001343351 and actual is checksum 3395571641. expected length is 2025 and actual length is 7662
8/26/2016, 6:17:52 AM  WARN   IndexFetcher      File _iu0b_Lucene50_0.tip did not match. expected checksum is 814607015 and actual is checksum 1271109784. expected length is 301 and actual length is 421
8/26/2016, 6:17:52 AM  WARN   IndexFetcher      File _iu0b_Lucene54_0.dvd did not match. expected checksum is 875968405 and actual is checksum 4024097898. expected length is 96 and actual length is 144
8/26/2016, 6:17:52 AM  WARN   IndexFetcher      File _iu0b.si did not match. expected checksum is 2341973651 and actual is checksum 281320882. expected length is 535 and actual length is 535
8/26/2016, 6:17:52 AM  WARN   IndexFetcher      File _iu0b.fdx did not match. expected checksum is 2874533507 and actual is checksum 3545673052. expected length is 84 and actual length is 84
8/26/2016, 6:17:52 AM  WARN   IndexFetcher      File _iu0b.nvd did not match. expected checksum is 663721296 and actual is checksum 1107475498. expected length is 59 and actual length is 68
8/26/2016, 6:17:52 AM  WARN   IndexFetcher      File _iu0b.fdt did not match. expected checksum is 2953417110 and actual is checksum 471758721. expected length is 1109 and actual length is 7185
8/26/2016, 6:17:52 AM  WARN   IndexFetcher      File segments_h7g8 did not match. expected checksum is 2040860271 and actual is checksum 1873911116. expected length is 2056 and actual length is 1926
8/26/2016, 6:17:53 AM  WARN   UpdateLog         Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_replica3\data\tlog\tlog.0000000000000000321 refcount=2} active=true starting pos=0
8/26/2016, 6:17:53 AM  WARN   UpdateLog         Log replay finished. recoveryInfo=RecoveryInfo{adds=12 deletes=0 deleteByQuery=0 errors=0 positionOfStart=0}
8/26/2016, 6:17:53 AM  ERROR  RecoveryStrategy  Could not publish as ACTIVE after succesful recovery
8/26/2016, 6:17:53 AM  ERROR  RecoveryStrategy  Recovery failed - trying again... (0)
8/26/2016, 6:18:13 AM  WARN   UpdateLog         Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_replica3\data\tlog\tlog.0000000000000000322 refcount=2} active=true starting pos=0
8/26/2016, 6:18:13 AM  WARN   UpdateLog         Log replay finished. recoveryInfo=RecoveryInfo{adds=1 deletes=0 deleteByQuery=0 errors=0 positionOfStart=0}
8/26/2016, 6:22:12 AM  WARN   IndexFetcher      File _iu0x.fdt did not match. expected checksum is 4059848174 and actual is checksum 4234063128. expected length is 3060 and actual length is 1772
8/26/2016, 6:22:12 AM  WARN   IndexFetcher      File _iu0x.fdx did not match. expected checksum is 2421590578 and actual is checksum 1492609115. expected length is 84 and actual length is 84
8/26/2016, 6:22:12 AM  WARN   IndexFetcher      File _iu0x_Lucene54_0.dvd did not match. expected checksum is 2898024557 and actual is checksum 3762900089. expected length is 99 and actual length is 97
8/26/2016, 6:22:12 AM  WARN   IndexFetcher      File _iu0x.si did not match. expected checksum is 730964774 and actual is checksum 1292368805. expected length is 535 and actual length is 535
8/26/2016, 6:22:12 AM  WARN   IndexFetcher      File _iu0x.nvd did not match. expected checksum is 2920743481 and actual is checksum 2869652522. expected length is 59 and actual length is 59
8/26/2016, 6:22:12 AM  WARN   IndexFetcher      File _iu0x.nvm did not match. expected checksum is 328126313 and actual is checksum 1484623710. expected length is 108 and actual length is 108
8/26/2016, 6:22:12 AM  WARN   IndexFetcher      File _iu0x_Lucene54_0.dvm did not match. expected checksum is 3300364001 and actual is checksum 3819493713. expected length is 312 and actual length is 312
8/26/2016, 6:22:12 AM  WARN   IndexFetcher      File _iu0x_Lucene50_0.pos did not match. expected checksum is 93672274 and actual is checksum 1261080786. expected length is 845 and actual length is 403
8/26/2016, 6:22:12 AM  WARN   IndexFetcher      File _iu0x.fnm did not match. expected checksum is 3818945769 and actual is checksum 2677014577. expected length is 1265 and actual length is 1265
8/26/2016, 6:22:12 AM  WARN   IndexFetcher      File _iu0x_Lucene50_0.doc did not match. expected checksum is 2160820791 and actual is checksum 3998191027. expected length is 110 and actual length is 110
8/26/2016, 6:22:12 AM  WARN   IndexFetcher      File _iu0x_Lucene50_0.tim did not match. expected checksum is 2334039520 and actual is checksum 2647062877. expected length is 3923 and actual length is 2573
8/26/2016, 6:22:12 AM  WARN   IndexFetcher      File _iu0x_Lucene50_0.tip did not match. expected checksum is 3632944779 and actual is checksum 1632973124. expected length is 325 and actual length is 304
8/26/2016, 6:22:12 AM  WARN   IndexFetcher      File segments_h7gk did not match. expected checksum is 633644562 and actual is checksum 1552102209. expected length is 2121 and actual length is 2056
8/26/2016, 6:22:13 AM  WARN   UpdateLog         Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_replica3\data\tlog\tlog.0000000000000000323 refcount=2} active=true starting pos=0
8/26/2016, 6:22:13 AM  WARN   UpdateLog         Log replay finished. recoveryInfo=RecoveryInfo{adds=1 deletes=0 deleteByQuery=0 errors=0 positionOfStart=0}
8/26/2016, 6:22:13 AM  ERROR  RecoveryStrategy  Could not publish as ACTIVE after succesful recovery
8/26/2016, 6:22:13 AM  ERROR  RecoveryStrategy  Recovery failed - trying again... (0)
8/26/2016, 6:22:32 AM  WARN   UpdateLog         Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_replica3\data\tlog\tlog.00



-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Thursday, August 25, 2016 10:35 PM
To: solr-user
Subject: Re: solrcloud 6.0.1 any suggestions for fixing a replica that 
stubbornly remains down



This is odd. The ADDREPLICA _should_ be immediately listed as "down", but 
should shortly go to "recovering" and then "active". The transition to "active" 
may take a while as the index has to be copied from the leader, but you 
shouldn't be stuck at "down" for very long.



Take a look at the Solr logs for both the leader of the shard and the replica 
you're trying to add. They often have more complete and helpful error 
messages...



Also note that you occasionally have to be patient. For instance, there's a 
3 minute wait period for leader election at times. It sounds, though, like 
things aren't getting better for far longer than 3 minutes.



Best,

Erick



On Thu, Aug 25, 2016 at 2:00 PM, Jon Hawkesworth 
<jon.hawkeswo...@medquist.onmicrosoft.com> wrote:



> Anyone got any suggestions how I can fix up my solrcloud 6.0.1 "replica
> remains down" issue?
>
> Today we stopped all the loading and querying, brought down all 4 solr
> nodes, went into zookeeper and deleted everything under
> /collections/transcribedReports/leader_initiated_recovery/shard1/ and
> brought the cluster back up (this seemed to be a reasonably similar
> situation to https://issues.apache.org/jira/browse/SOLR-7021, where this
> workaround is described, albeit for an older version of solr).
>
> After a while things looked ok, but when we attempted to move the second
> replica back to the original node (by creating a third and then deleting
> the temp one which wasn't on the node we wanted it on), we immediately
> got a 'down' status on the node (and it's stayed that way ever since),
> with 'Could not publish as ACTIVE after succesful recovery' messages
> appearing in the logs.
>
> It's as if there is something specifically wrong with that node that
> stops us from ever having a functioning replica of shard1 on it.
>
> The weird thing is that shard2 on the same (problematic) node seems fine.
>
> Other stuff we have tried includes:
>
> issuing a REQUESTRECOVERY
> moving from 2 to 4 nodes
> adding more replicas on other nodes (new replicas immediately go into
> the down state and stay that way).
>
> System is solrcloud 6.0.1 running on 4 nodes.  There's 1 collection
> with 4 shards and I'm trying to have 2 replicas on each of the 4 nodes.
> Currently each shard is managing approx 1.2 million docs (mostly just
> text, 10-20k in size each, usually).
>
> Any suggestions would be gratefully appreciated.
>
> Many thanks,
>
> Jon



Jon Hawkesworth
Software Developer

Hanley Road, Malvern, WR13 6NP. UK
O: +44 (0) 1684 312313
jon.hawkeswo...@mmodal.com
www.mmodal.com

This electronic mail transmission contains confidential information intended 
only for the person(s) named. Any use, distribution, copying or disclosure by 
another person is strictly prohibited. If you are not the intended recipient of 
this e-mail, promptly delete it and all attachments.
