Thank you, Walter.  I ended up dropping the collection.  We have two primary collections - one with all the data (100 shards, no replicas), and one with 30 days of data (40 shards, 3 replicas each).  We hardly ever have issues with the collection that has no replicas.  I tried bringing the nodes down several times.  I then updated the ZooKeeper node and put the necessary information into it with a leader selected.  Then I restarted the nodes again - still no luck.
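
For anyone following along, the kind of check I mean is roughly this - the ZooKeeper address is a placeholder for our ensemble, and elara:9100 is just one of our Solr nodes:

   bin/solr zk ls /collections/UNCLASS_30DAYS/leaders/shard31 -z zk1:2181
   curl "http://elara:9100/solr/admin/collections?action=CLUSTERSTATUS&collection=UNCLASS_30DAYS&shard=shard31"

The first shows whether the leader znode exists at all; the second shows whether Solr itself considers any replica of shard31 to be the leader.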

-Joe

On 5/30/2019 10:42 AM, Walter Underwood wrote:
We had a 6.6.2 prod cluster get into a state like this. It did not have an 
overseer, so any command just sat in the overseer queue. After I figured that 
out, I could see a bunch of queued stuff in the tree view under /overseer. That 
included an ADDROLE command to set an overseer. Sigh.
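
If you don't want to dig through the admin UI tree, the same thing is visible straight from the ZooKeeper CLI - the ensemble address here is just a placeholder:

   zkCli.sh -server zk1:2181
   ls /overseer/queue
   ls /overseer/collection-queue-work
   get /overseer_elect/leader

If /overseer_elect/leader is missing or empty, there is no overseer and nothing will drain those queues.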

Fixed it by shutting down all the nodes, then bringing up one. That one 
realized there was no overseer and assumed the role. Then we brought up the 
rest of the nodes.
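
You can check which node is the current overseer with OVERSEERSTATUS, and pin a preferred one with ADDROLE - keeping in mind that ADDROLE is itself a collections API command that goes through the overseer queue, which is why ours just sat there. The host and node name below are placeholders:

   curl "http://node1:8983/solr/admin/collections?action=OVERSEERSTATUS"
   curl "http://node1:8983/solr/admin/collections?action=ADDROLE&role=overseer&node=node1:8983_solr"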

I do not know how it got into that situation. We had some messed up networking 
conditions where I could HTTP from node A to port 8983 on node B, but it would 
hang when I tried that from B to A. This is all in AWS.
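
A quick way to test for that kind of asymmetry is to run something like this from each node toward the other (the node name, port, and timeout are just examples):

   curl -sS -m 5 -o /dev/null -w '%{http_code}\n' "http://nodeB:8983/solr/admin/info/system"

If one direction prints 200 and the other hangs until the timeout, you have the same one-way problem we did.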

Yours might be different.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

On May 30, 2019, at 5:47 AM, Joe Obernberger <joseph.obernber...@gmail.com> wrote:

More info - it looks like a ZooKeeper node got deleted somehow:
NoNode for /collections/UNCLASS_30DAYS/leaders/shard31/leader

I then created that node using solr zk mkroot, and now I get this error:

:org.apache.solr.common.SolrException: Error getting leader from zk for shard shard31
     at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1299)
     at org.apache.solr.cloud.ZkController.register(ZkController.java:1150)
     at org.apache.solr.cloud.ZkController.register(ZkController.java:1081)
     at org.apache.solr.core.ZkContainer.lambda$registerInZk$0(ZkContainer.java:187)
     at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
     at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.solr.common.SolrException: Could not get leader props
     at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1346)
     at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1310)
     at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1266)
     ... 7 more
Caused by: java.lang.NullPointerException
     at org.apache.solr.common.util.Utils.fromJSON(Utils.java:239)
     at org.apache.solr.common.cloud.ZkNodeProps.load(ZkNodeProps.java:92)
     at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1328)
     ... 9 more

Can I manually enter information for the leader? How would I get that?
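
My guess at the NPE: solr zk mkroot creates the znode with no data, and ZkNodeProps.load / Utils.fromJSON falls over on the missing JSON. On a working shard that node holds a small JSON blob written by the leader election - something with the leader core's name, base_url, and node_name in it, though the exact keys probably vary by version - and judging by the "ephemeral leader" messages it is created as an ephemeral node tied to the leader's ZooKeeper session, so a hand-made persistent node may not behave the same way. If I were going to try it anyway, I'd copy the exact format from a shard that does have a leader, e.g. (ZK address is a placeholder, and shard30 just stands in for any healthy shard):

   zkCli.sh -server zk1:2181
   get /collections/UNCLASS_30DAYS/leaders/shard30/leader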

-Joe

On 5/30/2019 8:39 AM, Joe Obernberger wrote:
Hi All - I have a 40-node cluster that had been running great for a long while, but it all came down due to OOM errors.  I adjusted the parameters and restarted, but one shard with 3 replicas (all NRT) will not elect a leader.  I see messages like:

2019-05-30 12:35:30.597 INFO  (zkCallback-7-thread-3) [c:UNCLASS_30DAYS 
s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] 
o.a.s.c.SyncStrategy Sync replicas to 
http://elara:9100/solr/UNCLASS_30DAYS_shard31_replica_n182/
2019-05-30 12:35:30.597 INFO  (zkCallback-7-thread-3) [c:UNCLASS_30DAYS 
s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] 
o.a.s.u.PeerSync PeerSync: core=UNCLASS_30DAYS_shard31_replica_n182 
url=http://elara:9100/solr START 
replicas=[http://enceladus:9100/solr/UNCLASS_30DAYS_shard31_replica_n180/, 
http://rosalind:9100/solr/UNCLASS_30DAYS_shard31_replica_n184/] nUpdates=100
2019-05-30 12:35:30.651 INFO  (zkCallback-7-thread-3) [c:UNCLASS_30DAYS 
s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] 
o.a.s.u.PeerSync PeerSync: core=UNCLASS_30DAYS_shard31_replica_n182 
url=http://elara:9100/solr  Received 100 versions from 
http://enceladus:9100/solr/UNCLASS_30DAYS_shard31_replica_n180/ fingerprint:null
2019-05-30 12:35:30.652 INFO  (zkCallback-7-thread-3) [c:UNCLASS_30DAYS 
s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] 
o.a.s.u.PeerSync PeerSync: core=UNCLASS_30DAYS_shard31_replica_n182 
url=http://elara:9100/solr  Our versions are too old. 
ourHighThreshold=1634891841359839232 otherLowThreshold=1634892098551414784 
ourHighest=1634892003501146112 otherHighest=1634892708023631872
2019-05-30 12:35:30.652 INFO  (zkCallback-7-thread-3) [c:UNCLASS_30DAYS 
s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] 
o.a.s.u.PeerSync PeerSync: core=UNCLASS_30DAYS_shard31_replica_n182 
url=http://elara:9100/solr DONE. sync failed
2019-05-30 12:35:30.652 INFO  (zkCallback-7-thread-3) [c:UNCLASS_30DAYS 
s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] 
o.a.s.c.SyncStrategy Leader's attempt to sync with shard failed, moving to the 
next candidate
2019-05-30 12:35:30.683 INFO  (zkCallback-7-thread-3) [c:UNCLASS_30DAYS 
s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] 
o.a.s.c.ShardLeaderElectionContext There may be a better leader candidate than 
us - going back into recovery
2019-05-30 12:35:30.693 INFO  (zkCallback-7-thread-3) [c:UNCLASS_30DAYS 
s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] 
o.a.s.c.ShardLeaderElectionContextBase No version found for ephemeral leader 
parent node, won't remove previous leader registration.
2019-05-30 12:35:30.694 WARN 
(updateExecutor-3-thread-4-processing-n:elara:9100_solr 
x:UNCLASS_30DAYS_shard31_replica_n182 c:UNCLASS_30DAYS s:shard31 
r:core_node185) [c:UNCLASS_30DAYS s:shard31 r:core_node185 
x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.c.RecoveryStrategy Stopping 
recovery for core=[UNCLASS_30DAYS_shard31_replica_n182] 
coreNodeName=[core_node185]

and

2019-05-30 12:25:39.522 INFO  (zkCallback-7-thread-1) [c:UNCLASS_30DAYS 
s:shard31 r:core_node187 x:UNCLASS_30DAYS_shard31_replica_n184] 
o.a.s.c.ActionThrottle Throttling leader attempts - waiting for 136ms
2019-05-30 12:25:39.672 INFO  (zkCallback-7-thread-1) [c:UNCLASS_30DAYS 
s:shard31 r:core_node187 x:UNCLASS_30DAYS_shard31_replica_n184] 
o.a.s.c.ShardLeaderElectionContext Can't become leader, other replicas with 
higher term participated in leader election
2019-05-30 12:25:39.672 INFO  (zkCallback-7-thread-1) [c:UNCLASS_30DAYS 
s:shard31 r:core_node187 x:UNCLASS_30DAYS_shard31_replica_n184] 
o.a.s.c.ShardLeaderElectionContext There may be a better leader candidate than 
us - going back into recovery
2019-05-30 12:25:39.677 INFO  (zkCallback-7-thread-1) [c:UNCLASS_30DAYS 
s:shard31 r:core_node187 x:UNCLASS_30DAYS_shard31_replica_n184] 
o.a.s.c.ShardLeaderElectionContextBase No version found for ephemeral leader 
parent node, won't remove previous leader registration.

and

2019-05-30 12:26:39.820 INFO  (zkCallback-7-thread-5) [c:UNCLASS_30DAYS 
s:shard31 r:core_node183 x:UNCLASS_30DAYS_shard31_replica_n180] 
o.a.s.c.ShardLeaderElectionContext Can't become leader, other replicas with 
higher term participated in leader election
2019-05-30 12:26:39.820 INFO  (zkCallback-7-thread-5) [c:UNCLASS_30DAYS 
s:shard31 r:core_node183 x:UNCLASS_30DAYS_shard31_replica_n180] 
o.a.s.c.ShardLeaderElectionContext There may be a better leader candidate than 
us - going back into recovery
2019-05-30 12:26:39.826 INFO  (zkCallback-7-thread-5) [c:UNCLASS_30DAYS 
s:shard31 r:core_node183 x:UNCLASS_30DAYS_shard31_replica_n180] 
o.a.s.c.ShardLeaderElectionContextBase No version found for ephemeral leader 
parent node, won't remove previous leader registration.

I've tried FORCELEADER, but it had no effect.  I also tried adding a shard, but 
that one didn't come up either.  The index is on HDFS.
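
For reference, FORCELEADER is just a collections API request against any live node, and adding another replica to the stuck shard would be ADDREPLICA - elara:9100 is one of our nodes; any live one should work:

   curl "http://elara:9100/solr/admin/collections?action=FORCELEADER&collection=UNCLASS_30DAYS&shard=shard31"
   curl "http://elara:9100/solr/admin/collections?action=ADDREPLICA&collection=UNCLASS_30DAYS&shard=shard31"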

Help!

-Joe


