Thank you Walter. I ended up dropping the collection. We have two
primary collections - one is all the data (100 shards, no replicas), and
one is 30 days of data (40 shards, 3 replicas each). We hardly ever
have any issues with the no-replica collection. I tried bringing the nodes
down several times, then updated the ZooKeeper node with the necessary
information (with a leader selected) and restarted the nodes again - no luck.
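To be concrete about what I mean by "the necessary information": something
like the following - a rough sketch, not exactly what I ran. The field values
are just the ones for the replica I wanted as leader (they come from the
collection's state.json), and Solr normally creates this znode itself as an
ephemeral node, so a hand-written one may not behave the same.

# Sketch: push hand-built leader props into the leader znode.
# core / core_node_name / base_url / node_name must match the replica
# you want promoted (see that replica's entry in state.json).
cat > leader.json <<'EOF'
{
  "core":"UNCLASS_30DAYS_shard31_replica_n182",
  "core_node_name":"core_node185",
  "base_url":"http://elara:9100/solr",
  "node_name":"elara:9100_solr"
}
EOF
./bin/solr zk cp file:leader.json \
  zk:/collections/UNCLASS_30DAYS/leaders/shard31/leader -z <zkhost:2181>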
-Joe
On 5/30/2019 10:42 AM, Walter Underwood wrote:
We had a 6.6.2 prod cluster get into a state like this. It did not have an
overseer, so any command just sat in the overseer queue. After I figured that
out, I could see a bunch of queued stuff in the tree view under /overseer. That
included an ADDROLE command to set an overseer. Sigh.
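For anyone hitting the same thing, the checks and the command involved look
roughly like this (a sketch; hosts and the node name are placeholders for
your own cluster):

# Ask any live node who the overseer is (if any) and how big the queues are:
curl 'http://localhost:8983/solr/admin/collections?action=OVERSEERSTATUS'

# Peek at queued collection work directly in ZooKeeper:
./bin/solr zk ls /overseer/collection-queue-work -z <zkhost:2181>

# Tell a specific node to take the overseer role - this is the ADDROLE that
# was stuck in our queue (and of course it can't run without an overseer):
curl 'http://localhost:8983/solr/admin/collections?action=ADDROLE&role=overseer&node=nodeA:8983_solr'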
Fixed it by shutting down all the nodes, then bringing up one. That one
realized there was no overseer and assumed the role. Then we brought up the
rest of the nodes.
I do not know how it got into that situation. We had some messed up networking
conditions where I could HTTP from node A to port 8983 on node B, but it would
hang when I tried that from B to A. This is all in AWS.
Yours might be different.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
On May 30, 2019, at 5:47 AM, Joe Obernberger <joseph.obernber...@gmail.com> wrote:
More info - looks like a ZooKeeper node got deleted somehow:
NoNode for /collections/UNCLASS_30DAYS/leaders/shard31/leader
I then made that node using solr zk mkroot, and now I get the error:
org.apache.solr.common.SolrException: Error getting leader from zk for shard shard31
    at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1299)
    at org.apache.solr.cloud.ZkController.register(ZkController.java:1150)
    at org.apache.solr.cloud.ZkController.register(ZkController.java:1081)
    at org.apache.solr.core.ZkContainer.lambda$registerInZk$0(ZkContainer.java:187)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.solr.common.SolrException: Could not get leader props
    at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1346)
    at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1310)
    at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1266)
    ... 7 more
Caused by: java.lang.NullPointerException
    at org.apache.solr.common.util.Utils.fromJSON(Utils.java:239)
    at org.apache.solr.common.cloud.ZkNodeProps.load(ZkNodeProps.java:92)
    at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1328)
    ... 9 more
Can I manually enter information for the leader? How would I get that?
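My guess from the trace: the znode created by "solr zk mkroot" has no data,
so ZkNodeProps.load hands Utils.fromJSON a null and it NPEs. If leader props
can be entered by hand, I assume the values (core, base_url, node_name) would
come from the chosen replica's entry in the collection's state.json, which
can be pulled down with something like:

# Sketch: dump the collection state to find each replica's core, base_url
# and node_name; zkhost is a placeholder.
./bin/solr zk cp zk:/collections/UNCLASS_30DAYS/state.json file:state.json -z <zkhost:2181>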
-Joe
On 5/30/2019 8:39 AM, Joe Obernberger wrote:
Hi All - I have a 40-node cluster that has been running great for a long while,
but it all came down due to OOM. I adjusted the parameters and restarted, but
one shard with 3 replicas (all NRT) will not elect a leader. I see messages
like:
2019-05-30 12:35:30.597 INFO (zkCallback-7-thread-3) [c:UNCLASS_30DAYS
s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182]
o.a.s.c.SyncStrategy Sync replicas to
http://elara:9100/solr/UNCLASS_30DAYS_shard31_replica_n182/
2019-05-30 12:35:30.597 INFO (zkCallback-7-thread-3) [c:UNCLASS_30DAYS
s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182]
o.a.s.u.PeerSync PeerSync: core=UNCLASS_30DAYS_shard31_replica_n182
url=http://elara:9100/solr START
replicas=[http://enceladus:9100/solr/UNCLASS_30DAYS_shard31_replica_n180/,
http://rosalind:9100/solr/UNCLASS_30DAYS_shard31_replica_n184/] nUpdates=100
2019-05-30 12:35:30.651 INFO (zkCallback-7-thread-3) [c:UNCLASS_30DAYS
s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182]
o.a.s.u.PeerSync PeerSync: core=UNCLASS_30DAYS_shard31_replica_n182
url=http://elara:9100/solr Received 100 versions from
http://enceladus:9100/solr/UNCLASS_30DAYS_shard31_replica_n180/ fingerprint:null
2019-05-30 12:35:30.652 INFO (zkCallback-7-thread-3) [c:UNCLASS_30DAYS
s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182]
o.a.s.u.PeerSync PeerSync: core=UNCLASS_30DAYS_shard31_replica_n182
url=http://elara:9100/solr Our versions are too old.
ourHighThreshold=1634891841359839232 otherLowThreshold=1634892098551414784
ourHighest=1634892003501146112 otherHighest=1634892708023631872
2019-05-30 12:35:30.652 INFO (zkCallback-7-thread-3) [c:UNCLASS_30DAYS
s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182]
o.a.s.u.PeerSync PeerSync: core=UNCLASS_30DAYS_shard31_replica_n182
url=http://elara:9100/solr DONE. sync failed
2019-05-30 12:35:30.652 INFO (zkCallback-7-thread-3) [c:UNCLASS_30DAYS
s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182]
o.a.s.c.SyncStrategy Leader's attempt to sync with shard failed, moving to the
next candidate
2019-05-30 12:35:30.683 INFO (zkCallback-7-thread-3) [c:UNCLASS_30DAYS
s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182]
o.a.s.c.ShardLeaderElectionContext There may be a better leader candidate than
us - going back into recovery
2019-05-30 12:35:30.693 INFO (zkCallback-7-thread-3) [c:UNCLASS_30DAYS
s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182]
o.a.s.c.ShardLeaderElectionContextBase No version found for ephemeral leader
parent node, won't remove previous leader registration.
2019-05-30 12:35:30.694 WARN
(updateExecutor-3-thread-4-processing-n:elara:9100_solr
x:UNCLASS_30DAYS_shard31_replica_n182 c:UNCLASS_30DAYS s:shard31
r:core_node185) [c:UNCLASS_30DAYS s:shard31 r:core_node185
x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.c.RecoveryStrategy Stopping
recovery for core=[UNCLASS_30DAYS_shard31_replica_n182]
coreNodeName=[core_node185]
and
2019-05-30 12:25:39.522 INFO (zkCallback-7-thread-1) [c:UNCLASS_30DAYS
s:shard31 r:core_node187 x:UNCLASS_30DAYS_shard31_replica_n184]
o.a.s.c.ActionThrottle Throttling leader attempts - waiting for 136ms
2019-05-30 12:25:39.672 INFO (zkCallback-7-thread-1) [c:UNCLASS_30DAYS
s:shard31 r:core_node187 x:UNCLASS_30DAYS_shard31_replica_n184]
o.a.s.c.ShardLeaderElectionContext Can't become leader, other replicas with
higher term participated in leader election
2019-05-30 12:25:39.672 INFO (zkCallback-7-thread-1) [c:UNCLASS_30DAYS
s:shard31 r:core_node187 x:UNCLASS_30DAYS_shard31_replica_n184]
o.a.s.c.ShardLeaderElectionContext There may be a better leader candidate than
us - going back into recovery
2019-05-30 12:25:39.677 INFO (zkCallback-7-thread-1) [c:UNCLASS_30DAYS
s:shard31 r:core_node187 x:UNCLASS_30DAYS_shard31_replica_n184]
o.a.s.c.ShardLeaderElectionContextBase No version found for ephemeral leader
parent node, won't remove previous leader registration.
and
2019-05-30 12:26:39.820 INFO (zkCallback-7-thread-5) [c:UNCLASS_30DAYS
s:shard31 r:core_node183 x:UNCLASS_30DAYS_shard31_replica_n180]
o.a.s.c.ShardLeaderElectionContext Can't become leader, other replicas with
higher term participated in leader election
2019-05-30 12:26:39.820 INFO (zkCallback-7-thread-5) [c:UNCLASS_30DAYS
s:shard31 r:core_node183 x:UNCLASS_30DAYS_shard31_replica_n180]
o.a.s.c.ShardLeaderElectionContext There may be a better leader candidate than
us - going back into recovery
2019-05-30 12:26:39.826 INFO (zkCallback-7-thread-5) [c:UNCLASS_30DAYS
s:shard31 r:core_node183 x:UNCLASS_30DAYS_shard31_replica_n180]
o.a.s.c.ShardLeaderElectionContextBase No version found for ephemeral leader
parent node, won't remove previous leader registration.
I've tried FORCELEADER, but it had no effect. I also tried adding a shard, but
that one didn't come up either. The index is on HDFS.
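For reference, the FORCELEADER call was of this form (sketch; any live Solr
node's host and port should work as the endpoint):

# Force a leader for the stuck shard via the Collections API:
curl 'http://elara:9100/solr/admin/collections?action=FORCELEADER&collection=UNCLASS_30DAYS&shard=shard31'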
Help!
-Joe