Walter: I have no idea what the root cause is here; this really shouldn’t happen. But the Overseer role (and I’m assuming you’re talking about Solr’s Overseer) is assigned much like a shard leader: the same election process happens, and all the election nodes are ephemeral ZK nodes.
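To make that concrete, here’s a minimal Python sketch (kazoo client; the hosts and the /solr chroot below are assumptions, not values from your cluster) of what to look for under /overseer_elect: the election queue is a set of ephemeral sequential znodes, and the elected Overseer is recorded in the leader znode.

from kazoo.client import KazooClient

# Hosts and the /solr chroot are placeholders -- point this at your ensemble.
zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181/solr")
zk.start()

# Ephemeral sequential nodes: the Overseer election queue.
print(zk.get_children("/overseer_elect/election"))

# The elected Overseer, if any. In the situation you describe, the election
# children exist but this znode is missing.
if zk.exists("/overseer_elect/leader"):
    data, _stat = zk.get("/overseer_elect/leader")
    print(data.decode("utf-8"))
else:
    print("no /overseer_elect/leader znode -- no Overseer elected")

zk.stop()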
Solr’s Overseer is _not_ fixed to a particular Solr node, although you can assign a preferred role of Overseer in those (rare) cases where there are so many state changes for ZooKeeper that it’s advisable for the Overseer to run on a dedicated machine. Overseer assignment is automatic.

This should work:

1> shut everything down, Solr and ZooKeeper
2> start your ZooKeepers and let them all get in sync with each other
3> start your Solr nodes

It might take 3 minutes or more to bring up the first Solr node; there’s up to a 180-second delay if leaders are not findable easily. That should cause Solr to elect an Overseer, probably the first Solr node to come up.

It _might_ work to bounce just one Solr node; seeing the Overseer election queue empty, it may elect itself. That said, the Overseer election queue won’t contain the rest of the Solr nodes like it should, so if that works you should probably bounce the rest of the Solr servers one by one to restore the proper election queue. Not a fix for the root cause, of course, but it should get things operating again.

I’ll add that I haven’t seen this happen in the field, to my recollection, if at all.

Best,
Erick

> On May 21, 2019, at 9:04 PM, Will Martin <wmar...@urgent.ly> wrote:
>
> Worked with Fusion and ZooKeeper at GSA for 18 months: admin role.
>
> Before blowing it away, you could try:
>
> - ID a candidate node with a snapshot you think is old enough to be robust.
> - Clean the data for the other ZK nodes.
> - Bring up the chosen node and wait for it to settle [wish I could remember why I called what I saw that].
> - Bring up the other nodes one at a time; let each one fully sync as a follower of the new leader.
> - They should each in turn request the snapshot from the leader. Then you have to align your collections with the ensemble, and for the life of me I can’t remember there being anything particularly tricky about that with Fusion, which means I can’t remember what I did... or have it doc’d at home. ;-)
>
> Will Martin
> DEVOPS ENGINEER
> 540.454.9565
>
> 8609 WESTWOOD CENTER DR, SUITE 475
> VIENNA, VA 22182
> geturgently.com
>
> On Tue, May 21, 2019 at 11:40 PM Walter Underwood <wun...@wunderwood.org> wrote:
>
>> Yes, please. I have the logs from each of the Zookeepers.
>>
>> We are running 3.4.12.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/ (my blog)
>>
>>> On May 21, 2019, at 6:49 PM, Will Martin <wmar...@urgent.ly> wrote:
>>>
>>> Walter. Can I cross-post to zk-dev?
>>>
>>> Will Martin
>>> DEVOPS ENGINEER
>>> 540.454.9565
>>>
>>> 8609 WESTWOOD CENTER DR, SUITE 475
>>> VIENNA, VA 22182
>>> geturgently.com <http://geturgently.com/>
>>>
>>>> On May 21, 2019, at 9:26 PM, Will Martin <wmar...@urgent.ly <mailto:wmar...@urgent.ly>> wrote:
>>>>
>>>> +1
>>>>
>>>> Will Martin
>>>> DEVOPS ENGINEER
>>>> 540.454.9565
>>>>
>>>> 8609 WESTWOOD CENTER DR, SUITE 475
>>>> VIENNA, VA 22182
>>>> geturgently.com <http://geturgently.com/>
>>>>
>>>> On Tue, May 21, 2019 at 7:39 PM Walter Underwood <wun...@wunderwood.org <mailto:wun...@wunderwood.org>> wrote:
>>>> ADDROLE times out after 180 seconds. This seems to be an unrecoverable state for the cluster, so that is a pretty serious bug.
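For reference, a minimal sketch of driving the two Collections API calls under discussion (OVERSEERSTATUS and ADDROLE) from Python; the Solr base URL and node name below are placeholders, not values from this cluster.

import json
import urllib.request

SOLR = "http://solr1:8983/solr"  # placeholder base URL

def collections_api(params):
    # wt=json so the response parses with the standard library alone.
    url = SOLR + "/admin/collections?" + params + "&wt=json"
    with urllib.request.urlopen(url, timeout=200) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Ask the cluster who it thinks the Overseer is.
print(collections_api("action=OVERSEERSTATUS"))

# Nominate a node as preferred Overseer -- the call that timed out after
# 180 seconds here. The node name format is host:port_solr.
print(collections_api("action=ADDROLE&role=overseer&node=solr1:8983_solr"))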
>>>>
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/> (my blog)
>>>>
>>>>> On May 21, 2019, at 4:10 PM, Walter Underwood <wun...@wunderwood.org <mailto:wun...@wunderwood.org>> wrote:
>>>>>
>>>>> We have a 6.6.2 cluster in prod that appears to have no overseer. In /overseer_elect on ZK, there is an election folder, but no leader document. An OVERSEERSTATUS request fails with a timeout.
>>>>>
>>>>> I’m going to try ADDROLE, but I’d be delighted to hear any other ideas. We’ve diverted all the traffic to the backing cluster, so we can blow this one away and rebuild.
>>>>>
>>>>> Looking at the Zookeeper logs, I see a few instances of network failures across all three nodes.
>>>>>
>>>>> wunder
>>>>> Walter Underwood
>>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/> (my blog)
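A footnote on the ZooKeeper side: given the network failures seen across all three nodes, it’s worth confirming the ensemble itself has a leader and that every server is serving before bringing Solr back up. A minimal sketch using the "srvr" four-letter-word command (available in 3.4.12); the host names below are placeholders.

import socket

ZK_HOSTS = [("zk1", 2181), ("zk2", 2181), ("zk3", 2181)]  # placeholder ensemble

def zk_mode(host, port):
    # Send the "srvr" four-letter word and pick out the "Mode:" line
    # (leader / follower / standalone).
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(b"srvr")
        sock.shutdown(socket.SHUT_WR)
        reply = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            reply += chunk
    for line in reply.decode("utf-8", "replace").splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return "unknown"

for host, port in ZK_HOSTS:
    try:
        print(host, zk_mode(host, port))
    except OSError as exc:
        print(host, "unreachable:", exc)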