We had a 6.6.2 prod cluster get into a state like this. It did not have an overseer, so any command just sat in the overseer queue. Once I figured that out, I could see a backlog of queued commands in the admin tree view under /overseer, including an ADDROLE command to set an overseer. Sigh.
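If you want to check the same thing without the admin UI, you can ask ZooKeeper directly whether an overseer is elected and what is piling up in its queues. Rough sketch using the plain ZooKeeper Java client; the connect string and the standard SolrCloud znode paths (/overseer_elect/leader, the queues under /overseer) are assumptions on my part, so adjust for your ensemble and chroot:

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class OverseerCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string; point at your ensemble and chroot (e.g. "zk1:2181,zk2:2181/solr").
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});
        try {
            // In a healthy cluster this znode names the node currently acting as overseer.
            Stat stat = zk.exists("/overseer_elect/leader", false);
            if (stat == null) {
                System.out.println("No overseer elected (/overseer_elect/leader is missing)");
            } else {
                byte[] data = zk.getData("/overseer_elect/leader", false, null);
                System.out.println("Overseer: " + new String(data, StandardCharsets.UTF_8));
            }
            // Anything piling up under /overseer is work the (missing) overseer never processed.
            if (zk.exists("/overseer", false) != null) {
                for (String queue : zk.getChildren("/overseer", false)) {
                    int backlog = zk.getChildren("/overseer/" + queue, false).size();
                    System.out.println("/overseer/" + queue + " -> " + backlog + " queued item(s)");
                }
            }
        } finally {
            zk.close();
        }
    }
}

The same checks can be done interactively with zkCli.sh; the point is just to confirm whether /overseer_elect/leader exists and how deep the queues under /overseer are.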
We fixed it by shutting down all the nodes, then bringing up one. That node realized there was no overseer and assumed the role. Then we brought up the rest of the nodes.

I do not know how it got into that situation. We had some messed-up networking conditions where I could HTTP from node A to port 8983 on node B, but the same request would hang going from B to A. This is all in AWS. Yours might be different.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 30, 2019, at 5:47 AM, Joe Obernberger <joseph.obernber...@gmail.com> wrote:
>
> More info - looks like a zookeeper node got deleted somehow:
>
>     NoNode for /collections/UNCLASS_30DAYS/leaders/shard31/leader
>
> I then made that node using solr zk mkroot, and now I get the error:
>
> org.apache.solr.common.SolrException: Error getting leader from zk for shard shard31
>         at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1299)
>         at org.apache.solr.cloud.ZkController.register(ZkController.java:1150)
>         at org.apache.solr.cloud.ZkController.register(ZkController.java:1081)
>         at org.apache.solr.core.ZkContainer.lambda$registerInZk$0(ZkContainer.java:187)
>         at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.solr.common.SolrException: Could not get leader props
>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1346)
>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1310)
>         at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1266)
>         ... 7 more
> Caused by: java.lang.NullPointerException
>         at org.apache.solr.common.util.Utils.fromJSON(Utils.java:239)
>         at org.apache.solr.common.cloud.ZkNodeProps.load(ZkNodeProps.java:92)
>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1328)
>         ... 9 more
>
> Can I manually enter information for the leader? How would I get that?
>
> -Joe
>
> On 5/30/2019 8:39 AM, Joe Obernberger wrote:
>> Hi All - I have a 40 node cluster that has been running great for a long while, but it all came down due to OOM. I adjusted the parameters and restarted, but one shard with 3 replicas (all NRT) will not elect a leader.
>> I see messages like:
>>
>> 2019-05-30 12:35:30.597 INFO (zkCallback-7-thread-3) [c:UNCLASS_30DAYS s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.c.SyncStrategy Sync replicas to http://elara:9100/solr/UNCLASS_30DAYS_shard31_replica_n182/
>> 2019-05-30 12:35:30.597 INFO (zkCallback-7-thread-3) [c:UNCLASS_30DAYS s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.u.PeerSync PeerSync: core=UNCLASS_30DAYS_shard31_replica_n182 url=http://elara:9100/solr START replicas=[http://enceladus:9100/solr/UNCLASS_30DAYS_shard31_replica_n180/, http://rosalind:9100/solr/UNCLASS_30DAYS_shard31_replica_n184/] nUpdates=100
>> 2019-05-30 12:35:30.651 INFO (zkCallback-7-thread-3) [c:UNCLASS_30DAYS s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.u.PeerSync PeerSync: core=UNCLASS_30DAYS_shard31_replica_n182 url=http://elara:9100/solr Received 100 versions from http://enceladus:9100/solr/UNCLASS_30DAYS_shard31_replica_n180/ fingerprint:null
>> 2019-05-30 12:35:30.652 INFO (zkCallback-7-thread-3) [c:UNCLASS_30DAYS s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.u.PeerSync PeerSync: core=UNCLASS_30DAYS_shard31_replica_n182 url=http://elara:9100/solr Our versions are too old. ourHighThreshold=1634891841359839232 otherLowThreshold=1634892098551414784 ourHighest=1634892003501146112 otherHighest=1634892708023631872
>> 2019-05-30 12:35:30.652 INFO (zkCallback-7-thread-3) [c:UNCLASS_30DAYS s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.u.PeerSync PeerSync: core=UNCLASS_30DAYS_shard31_replica_n182 url=http://elara:9100/solr DONE. sync failed
>> 2019-05-30 12:35:30.652 INFO (zkCallback-7-thread-3) [c:UNCLASS_30DAYS s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.c.SyncStrategy Leader's attempt to sync with shard failed, moving to the next candidate
>> 2019-05-30 12:35:30.683 INFO (zkCallback-7-thread-3) [c:UNCLASS_30DAYS s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.c.ShardLeaderElectionContext There may be a better leader candidate than us - going back into recovery
>> 2019-05-30 12:35:30.693 INFO (zkCallback-7-thread-3) [c:UNCLASS_30DAYS s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.c.ShardLeaderElectionContextBase No version found for ephemeral leader parent node, won't remove previous leader registration.
>> 2019-05-30 12:35:30.694 WARN (updateExecutor-3-thread-4-processing-n:elara:9100_solr x:UNCLASS_30DAYS_shard31_replica_n182 c:UNCLASS_30DAYS s:shard31 r:core_node185) [c:UNCLASS_30DAYS s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.c.RecoveryStrategy Stopping recovery for core=[UNCLASS_30DAYS_shard31_replica_n182] coreNodeName=[core_node185]
>>
>> and
>>
>> 2019-05-30 12:25:39.522 INFO (zkCallback-7-thread-1) [c:UNCLASS_30DAYS s:shard31 r:core_node187 x:UNCLASS_30DAYS_shard31_replica_n184] o.a.s.c.ActionThrottle Throttling leader attempts - waiting for 136ms
>> 2019-05-30 12:25:39.672 INFO (zkCallback-7-thread-1) [c:UNCLASS_30DAYS s:shard31 r:core_node187 x:UNCLASS_30DAYS_shard31_replica_n184] o.a.s.c.ShardLeaderElectionContext Can't become leader, other replicas with higher term participated in leader election
>> 2019-05-30 12:25:39.672 INFO (zkCallback-7-thread-1) [c:UNCLASS_30DAYS s:shard31 r:core_node187 x:UNCLASS_30DAYS_shard31_replica_n184] o.a.s.c.ShardLeaderElectionContext There may be a better leader candidate than us - going back into recovery
>> 2019-05-30 12:25:39.677 INFO (zkCallback-7-thread-1) [c:UNCLASS_30DAYS s:shard31 r:core_node187 x:UNCLASS_30DAYS_shard31_replica_n184] o.a.s.c.ShardLeaderElectionContextBase No version found for ephemeral leader parent node, won't remove previous leader registration.
>>
>> and
>>
>> 2019-05-30 12:26:39.820 INFO (zkCallback-7-thread-5) [c:UNCLASS_30DAYS s:shard31 r:core_node183 x:UNCLASS_30DAYS_shard31_replica_n180] o.a.s.c.ShardLeaderElectionContext Can't become leader, other replicas with higher term participated in leader election
>> 2019-05-30 12:26:39.820 INFO (zkCallback-7-thread-5) [c:UNCLASS_30DAYS s:shard31 r:core_node183 x:UNCLASS_30DAYS_shard31_replica_n180] o.a.s.c.ShardLeaderElectionContext There may be a better leader candidate than us - going back into recovery
>> 2019-05-30 12:26:39.826 INFO (zkCallback-7-thread-5) [c:UNCLASS_30DAYS s:shard31 r:core_node183 x:UNCLASS_30DAYS_shard31_replica_n180] o.a.s.c.ShardLeaderElectionContextBase No version found for ephemeral leader parent node, won't remove previous leader registration.
>>
>> I've tried FORCELEADER, but it had no effect. I also tried adding a shard, but that one didn't come up either. The index is on HDFS.
>>
>> Help!
>>
>> -Joe
>>
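Regarding Joe's question in the quoted thread about manually entering information for the leader: that NullPointerException in ZkNodeProps.load is what you get when the leader znode exists but is empty, which is exactly what solr zk mkroot creates. The leader znode normally holds a small JSON properties blob, so the easiest way to see the expected shape is to read it off a shard in the same collection that still has a leader. Rough sketch with the plain ZooKeeper Java client; the connect string and the shard30 path are placeholders, not something from the thread:

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.ZooKeeper;

public class LeaderPropsDump {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string; point at your ensemble and chroot (e.g. "zk1:2181,zk2:2181/solr").
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});
        try {
            // A shard that still has a working leader; shard30 is just an example name.
            String healthy = "/collections/UNCLASS_30DAYS/leaders/shard30/leader";
            // The node recreated with `solr zk mkroot`, which carries no data.
            String broken  = "/collections/UNCLASS_30DAYS/leaders/shard31/leader";

            // The healthy node prints the JSON props a leader registration normally contains.
            byte[] good = zk.getData(healthy, false, null);
            System.out.println(healthy + " = " + new String(good, StandardCharsets.UTF_8));

            // The empty node shows zero bytes, which is what trips ZkNodeProps.load.
            byte[] bad = zk.getData(broken, false, null);
            System.out.println(broken + " data bytes = " + (bad == null ? 0 : bad.length));
        } finally {
            zk.close();
        }
    }
}

Whether to hand-write those props for shard31 or delete the empty node again and let a normal election (or another FORCELEADER attempt once the replica terms agree) repopulate it is a judgment call.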