Hello All,

I am running into a frequent issue where a shard leader in SolrCloud stays
active but is not acknowledged as the "leader". This brings down the other
replicas: they go into recovery mode and eventually fail trying to sync
up.

The error seen in "solr.log" is below; it is similar to what was shared
in this email thread:
https://www.mail-archive.com/solr-user@lucene.apache.org/msg127969.html

This has consumed a lot of time, but I have not been able to get any
direction on it. Any help will be appreciated.

Solr version used: 5.5.2 (comes packaged with HDP 2.5.3)
The indexes are stored on HDFS.

==error==

> completed with
> http://node06.test.net:8984/solr/TEST_COLLECTION2_shard5_replica1/
> 2018-02-21 20:41:10.148 INFO
> (zkCallback-5-thread-4294-processing-n:node04.test.net:8984_solr)
> [c:TEST_COLLECTION2 s:shard5 r:core_node16
> x:TEST_COLLECTION2_shard5_replica2] o.a.s.c.SyncStrategy
> http://node04.test.net:8984/solr/TEST_COLLECTION2_shard5_replica2/:  sync
> completed with
> http://node17.test.net:8984/solr/TEST_COLLECTION2_shard5_replica3/
> 2018-02-21 20:41:10.149 INFO
> (zkCallback-5-thread-4294-processing-n:node04.test.net:8984_solr)
> [c:TEST_COLLECTION2 s:shard5 r:core_node16
> x:TEST_COLLECTION2_shard5_replica2]
> o.a.s.c.ShardLeaderElectionContextBase Creating leader registration node
> /collections/TEST_COLLECTION2/leaders/shard5/leader after winning as
> /collections/TEST_COLLECTION2/leader_elect/shard5/election/171270658970051676-core_node16-n_0000001784
> 2018-02-21 20:41:10.151 INFO
> (zkCallback-5-thread-4294-processing-n:node04.test.net:8984_solr)
> [c:TEST_COLLECTION2 s:shard5 r:core_node16
> x:TEST_COLLECTION2_shard5_replica2] o.a.s.c.u.RetryUtil Retry due to
> Throwable, org.apache.zookeeper.KeeperException$NodeExistsException
> KeeperErrorCode = NodeExists
> 2018-02-21 20:41:10.498 ERROR
> (recoveryExecutor-3-thread-55-processing-s:shard10
> x:TEST_COLLECTION_shard10_replica3 c:TEST_COLLECTION 
> n:node04.test.net:8984_solr
> r:core_node59) [c:TEST_COLLECTION s:shard10 r:core_node59
> x:TEST_COLLECTION_shard10_replica3] o.a.s.c.RecoveryStrategy Error while
> trying to recover.
> core=TEST_COLLECTION_shard10_replica3:org.apache.solr.common.SolrException:
> No registered leader was found after waiting for 4000ms , collection:
> TEST_COLLECTION slice: shard10
>         at
> org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:626)
>         at
> org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:612)
>         at
> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:306)
>         at
> org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)
> 2018-02-21 20:41:10.498 INFO
> (recoveryExecutor-3-thread-55-processing-s:shard10
> x:TEST_COLLECTION_shard10_replica3 c:TEST_COLLECTION 
> n:node04.test.net:8984_solr
> r:core_node59) [c:TEST_COLLECTION s:shard10 r:core_node59
> x:TEST_COLLECTION_shard10_replica3] o.a.s.c.RecoveryStrategy Replay not
> started, or was not successful... still buffering updates.
> 2018-02-21 20:41:10.498 ERROR
> (recoveryExecutor-3-thread-55-processing-s:shard10
> x:TEST_COLLECTION_shard10_replica3 c:TEST_COLLECTION 
> n:node04.test.net:8984_solr
> r:core_node59) [c:TEST_COLLECTION s:shard10 r:core_node59
> x:TEST_COLLECTION_shard10_replica3] o.a.s.c.RecoveryStrategy Recovery
> failed - trying again... (0)
> 2018-02-21 20:41:10.498 INFO
> (recoveryExecutor-3-thread-55-processing-s:shard10
> x:TEST_COLLECTION_shard10_replica3 c:TEST_COLLECTION 
> n:node04.test.net:8984_solr
> r:core_node59) [c:TEST_COLLECTION s:shard10 r:core_node59
> x:TEST_COLLECTION_shard10_replica3] o.a.s.c.RecoveryStrategy Wait [2.0]
> seconds before trying to recover again (attempt=1)
> 2018-02-21 20:41:10.928 INFO
> (zkCallback-5-thread-4295-processing-n:node04.test.net:8984_solr) [   ]
> o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent
> state:SyncConnected type:NodeDataChanged
> path:/collections/TEST_COLLECTION3/state.json] for collection
> [TEST_COLLECTION3] has occurred - updating... (live nodes size: [17])
> 2018-02-21 20:41:10.928 INFO
> (zkCallback-5-thread-4293-processing-n:node04.test.net:8984_solr) [   ]
> o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent
> state:SyncConnected type:NodeDataChanged
> path:/collections/TEST_COLLECTION3/state.json] for collection
> [TEST_COLLECTION3] has occurred - updating... (live nodes size: [17])
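For reference, the NodeExistsException above suggests a stale leader znode
left behind in ZooKeeper, which the new election then fails to replace. A
minimal sketch of how this could be checked (the ZooKeeper host is a
placeholder; collection/shard names are taken from the log; I am not
certain FORCELEADER is the right fix here, only that the API exists in
this Solr line):

```shell
#!/bin/sh
# Sketch only: host names below are placeholders for your environment.
ZK_HOST="zk1.test.net:2181"       # assumption: your ZooKeeper ensemble
SOLR_NODE="node04.test.net:8984"  # any live Solr node

# 1) Inspect the leader znode the election keeps failing to create
#    (zkcli.sh ships with Solr under server/scripts/cloud-scripts/):
#    zkcli.sh -zkhost "$ZK_HOST" -cmd get \
#        /collections/TEST_COLLECTION2/leaders/shard5/leader

# 2) If it names a core that is no longer live, Solr's FORCELEADER
#    collections API can be used to force a new election for the shard:
FORCE_URL="http://$SOLR_NODE/solr/admin/collections?action=FORCELEADER&collection=TEST_COLLECTION2&shard=shard5"
echo "$FORCE_URL"   # review the URL first, then: curl "$FORCE_URL"
```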
