Good luck. This kind of assumes that your ZK ensemble is healthy, of course...
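
A quick way to check that before touching anything is to ask each ZooKeeper
for its status with the srvr four-letter-word command. A minimal sketch in
Python; the hostnames are placeholders, and it assumes four-letter-word
commands are enabled on your servers:

```python
import socket

# Placeholder hostnames; substitute your own ensemble members.
ZK_NODES = [("zk1", 2181), ("zk2", 2181), ("zk3", 2181)]

def four_letter_word(host, port, cmd, timeout=5.0):
    """Send a four-letter-word command (e.g. 'srvr') and return the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

for host, port in ZK_NODES:
    try:
        reply = four_letter_word(host, port, "srvr")
        # srvr output includes a "Mode: leader|follower|standalone" line.
        mode = next((line for line in reply.splitlines()
                     if line.startswith("Mode:")), "Mode: unknown")
        print(f"{host}:{port} -> {mode}")
    except OSError as exc:
        print(f"{host}:{port} -> unreachable ({exc})")
```

A healthy three-node ensemble should report exactly one leader and two
followers; anything else is worth sorting out before restarting Solr.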
> On May 22, 2019, at 8:23 AM, Walter Underwood <wun...@wunderwood.org> wrote:
> 
> Thanks, we’ll try that. Bouncing one Solr node doesn’t fix it, because we
> did a rolling restart yesterday.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
> 
>> On May 22, 2019, at 8:21 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>> 
>> Walter:
>> 
>> I have no idea what the root cause is here; this really shouldn’t happen.
>> But the Overseer role (and I’m assuming you’re talking about Solr’s
>> Overseer) is assigned much like a shard leader: the same election process
>> happens, and all the election nodes are ephemeral ZK nodes.
>> 
>> Solr’s Overseer is _not_ fixed to a particular Solr node, although you can
>> assign a preferred role of Overseer in those (rare) cases where there are
>> so many state changes for ZooKeeper that it’s advisable to run it on a
>> dedicated machine.
>> 
>> Overseer assignment is automatic. This should work:
>> 1> Shut everything down, Solr and ZooKeeper.
>> 2> Start your ZooKeepers and let them all get in sync with each other.
>> 3> Start your Solr nodes. It might take 3 minutes or more to bring up the
>> first Solr node; there’s up to a 180-second delay if leaders are not
>> easily findable.
>> 
>> That should cause Solr to elect an Overseer, probably the first Solr node
>> to come up.
>> 
>> It _might_ work to bounce just one Solr node: seeing the Overseer election
>> queue empty, it may elect itself. That said, the Overseer election queue
>> won’t contain the rest of the Solr nodes like it should, so if that works
>> you should probably bounce the rest of the Solr servers one by one to
>> restore the proper election queue.
>> 
>> Not a fix for the root cause, of course, but it should get things
>> operating again. I’ll add that I haven’t seen this happen in the field to
>> my recollection, if at all.
>> 
>> Best,
>> Erick
>> 
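
Once the nodes are back, a quick way to confirm that step 3 actually
produced an Overseer is to ask the Collections API. A minimal sketch in
Python, stdlib only; the host below is a placeholder, and note that
OVERSEERSTATUS will still hang and time out if no Overseer was elected:

```python
import json
import urllib.request

# Placeholder host; point this at any live Solr node in the cluster.
SOLR = "http://localhost:8983/solr"

# OVERSEERSTATUS reports the current Overseer node in its "leader" field.
# In the broken state described in this thread it times out instead.
url = SOLR + "/admin/collections?action=OVERSEERSTATUS&wt=json"
with urllib.request.urlopen(url, timeout=30) as resp:
    status = json.load(resp)

print("Overseer leader:", status.get("leader", "<none reported>"))
```

If that prints a node name, the election completed; a timeout means the
cluster is still in the no-Overseer state.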
>>> On May 21, 2019, at 9:04 PM, Will Martin <wmar...@urgent.ly> wrote:
>>> 
>>> Worked with Fusion and ZooKeeper at GSA for 18 months, in an admin role.
>>> 
>>> Before blowing it away, you could try:
>>> 
>>> - Identify a candidate node with a snapshot you think is old enough to
>>> be robust.
>>> - Clean the data directories on the other ZK nodes.
>>> - Bring up the chosen node and wait for it to settle [wish I could
>>> remember why I called what I saw that].
>>> - Bring up the other nodes one at a time; let each one fully sync as a
>>> follower of the new leader.
>>> - They should each in turn request the snapshot from the leader.
>>> 
>>> Then you have to align your collections with the ensemble, and for the
>>> life of me I can’t remember there being anything particularly tricky
>>> about that with Fusion, which means I can’t remember what I did... or
>>> have it doc’d at home. ;-)
>>> 
>>> Will Martin
>>> DEVOPS ENGINEER
>>> 540.454.9565
>>> 
>>> 8609 WESTWOOD CENTER DR, SUITE 475
>>> VIENNA, VA 22182
>>> geturgently.com
>>> 
>>> On Tue, May 21, 2019 at 11:40 PM Walter Underwood <wun...@wunderwood.org>
>>> wrote:
>>> 
>>>> Yes, please. I have the logs from each of the ZooKeepers.
>>>> 
>>>> We are running 3.4.12.
>>>> 
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/ (my blog)
>>>> 
>>>>> On May 21, 2019, at 6:49 PM, Will Martin <wmar...@urgent.ly> wrote:
>>>>> 
>>>>> Walter. Can I cross-post to zk-dev?
>>>>> 
>>>>> Will Martin
>>>>> DEVOPS ENGINEER
>>>>> 540.454.9565
>>>>> 
>>>>> 8609 WESTWOOD CENTER DR, SUITE 475
>>>>> VIENNA, VA 22182
>>>>> geturgently.com
>>>>> 
>>>>>> On May 21, 2019, at 9:26 PM, Will Martin <wmar...@urgent.ly> wrote:
>>>>>> 
>>>>>> +1
>>>>>> 
>>>>>> Will Martin
>>>>>> DEVOPS ENGINEER
>>>>>> 540.454.9565
>>>>>> 
>>>>>> 8609 WESTWOOD CENTER DR, SUITE 475
>>>>>> VIENNA, VA 22182
>>>>>> geturgently.com
>>>>>> 
>>>>>> On Tue, May 21, 2019 at 7:39 PM Walter Underwood <wun...@wunderwood.org>
>>>>>> wrote:
>>>>>> 
>>>>>> ADDROLE times out after 180 seconds. This seems to be an unrecoverable
>>>>>> state for the cluster, so that is a pretty serious bug.
>>>>>> 
>>>>>> wunder
>>>>>> Walter Underwood
>>>>>> wun...@wunderwood.org
>>>>>> http://observer.wunderwood.org/ (my blog)
>>>>>> 
>>>>>>> On May 21, 2019, at 4:10 PM, Walter Underwood <wun...@wunderwood.org>
>>>>>>> wrote:
>>>>>>> 
>>>>>>> We have a 6.6.2 cluster in prod that appears to have no Overseer. In
>>>>>>> /overseer_elect on ZK, there is an election folder, but no leader
>>>>>>> document. An OVERSEERSTATUS request fails with a timeout.
>>>>>>> 
>>>>>>> I’m going to try ADDROLE, but I’d be delighted to hear any other
>>>>>>> ideas. We’ve diverted all the traffic to the backing cluster, so we
>>>>>>> can blow this one away and rebuild.
>>>>>>> 
>>>>>>> Looking at the ZooKeeper logs, I see a few instances of network
>>>>>>> failures across all three nodes.
>>>>>>> 
>>>>>>> wunder
>>>>>>> Walter Underwood
>>>>>>> wun...@wunderwood.org
>>>>>>> http://observer.wunderwood.org/ (my blog)
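
For the record, the bad state described above (an election queue under
/overseer_elect but no leader znode) can be confirmed straight from
ZooKeeper. A minimal sketch using the third-party kazoo client; the connect
string is a placeholder:

```python
from kazoo.client import KazooClient

# Placeholder connect string; use your own ensemble hosts.
zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()
try:
    # In a healthy SolrCloud cluster, /overseer_elect holds both an
    # election queue and an ephemeral leader znode. The symptom in this
    # thread: the election children exist, the leader does not.
    print("election queue:",
          sorted(zk.get_children("/overseer_elect/election")))
    if zk.exists("/overseer_elect/leader"):
        data, _stat = zk.get("/overseer_elect/leader")
        print("leader:", data.decode("utf-8"))
    else:
        print("leader: MISSING (no Overseer elected)")
finally:
    zk.stop()
```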