Thanks, we’ll try that. We know bouncing one Solr node doesn’t fix it, because we did a rolling restart yesterday.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On May 22, 2019, at 8:21 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> Walter:
>
> I have no idea what the root cause is here; this really shouldn’t happen. But
> the Overseer role (and I’m assuming you’re talking about Solr’s Overseer) is
> assigned similarly to a shard leader; the same election process happens. All
> the election nodes are ephemeral ZK nodes.
>
> Solr’s Overseer is _not_ fixed to a particular Solr node, although you can
> assign a preferred role of Overseer in those (rare) cases where there are so
> many state changes for ZooKeeper that it’s advisable for it to run on a
> dedicated machine.
>
> Overseer assignment is automatic. This should work:
> 1> shut everything down, Solr and ZooKeeper
> 2> start your ZooKeepers and let them all get in sync with each other
> 3> start your Solr nodes. It might take 3 minutes or more to bring up the
> first Solr node; there’s up to a 180-second delay if leaders are not easily
> findable.
>
> That should cause Solr to elect an Overseer, probably the first Solr node to
> come up.
>
> It _might_ work to bounce just one Solr node; seeing the Overseer election
> queue empty, it may elect itself. That said, the Overseer election queue won’t
> contain the rest of the Solr nodes like it should, so if that works you
> should probably bounce the rest of the Solr servers one by one to restore the
> proper election queue process.
>
> Not a fix for the root cause of course, but it should get things operating
> again. I’ll add that I haven’t seen this happen in the field, to my
> recollection, if at all.
>
> Best,
> Erick
>
>> On May 21, 2019, at 9:04 PM, Will Martin <wmar...@urgent.ly> wrote:
>>
>> I worked with Fusion and ZooKeeper at GSA for 18 months in an admin role.
>>
>> Before blowing it away, you could try:
>>
>> - Identify a candidate node with a snapshot you think is old enough
>> to be robust.
>> - Clean the data directories for the other ZK nodes.
>> - Bring up the chosen node and wait for it to settle [wish I could remember
>> why I called what I saw that].
>> - Bring up the other nodes one at a time. Let each one fully sync as a
>> follower of the new leader.
>> - They should each in turn request the snapshot from the leader. Then you
>> have to align your collections with the ensemble. And for the life of me I
>> can’t remember there being anything particularly tricky about that with
>> Fusion, which means I can’t remember what I did... or have it doc’d at
>> home. ;-)
>>
>> Will Martin
>> DEVOPS ENGINEER
>> 540.454.9565
>>
>> 8609 WESTWOOD CENTER DR, SUITE 475
>> VIENNA, VA 22182
>> geturgently.com
>>
>> On Tue, May 21, 2019 at 11:40 PM Walter Underwood <wun...@wunderwood.org> wrote:
>>
>>> Yes, please. I have the logs from each of the ZooKeepers.
>>>
>>> We are running 3.4.12.
>>>
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/ (my blog)
>>>
>>>> On May 21, 2019, at 6:49 PM, Will Martin <wmar...@urgent.ly> wrote:
>>>>
>>>> Walter, can I cross-post to zk-dev?
>>>>
>>>>> On May 21, 2019, at 9:26 PM, Will Martin <wmar...@urgent.ly> wrote:
>>>>>
>>>>> +1
>>>>>
>>>>> On Tue, May 21, 2019 at 7:39 PM Walter Underwood <wun...@wunderwood.org> wrote:
>>>>>
>>>>> ADDROLE times out after 180 seconds. This seems to be an unrecoverable
>>>>> state for the cluster, so that is a pretty serious bug.
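[Editor's note] For reference, the two Collections API calls discussed above (OVERSEERSTATUS, which reports the current Overseer, and ADDROLE, which nominates a preferred one) look roughly like this. The host and node name are hypothetical placeholders, not values from the thread; Collections API requests are themselves processed via the Overseer's work queue, which is consistent with ADDROLE also timing out here.

```shell
# Hypothetical Solr host and node name -- substitute your own cluster's
# values. NODE uses Solr's internal node_name form (host:port_solr).
SOLR="http://solr1.example.com:8983/solr"
NODE="solr1.example.com:8983_solr"

# Reports which node holds the Overseer role; this is the request that
# fails with a timeout when no Overseer has been elected.
STATUS_URL="${SOLR}/admin/collections?action=OVERSEERSTATUS&wt=json"

# Marks a node as preferred Overseer. This request is handled through
# the Overseer's queue, so it times out for the same reason.
ADDROLE_URL="${SOLR}/admin/collections?action=ADDROLE&role=overseer&node=${NODE}"

# Against a live cluster you would run, e.g.:  curl "$STATUS_URL"
echo "$STATUS_URL"
echo "$ADDROLE_URL"
```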
>>>>> wunder
>>>>> Walter Underwood
>>>>> wun...@wunderwood.org
>>>>> http://observer.wunderwood.org/ (my blog)
>>>>>
>>>>>> On May 21, 2019, at 4:10 PM, Walter Underwood <wun...@wunderwood.org> wrote:
>>>>>>
>>>>>> We have a 6.6.2 cluster in prod that appears to have no Overseer. In
>>>>>> /overseer_elect on ZK, there is an election folder, but no leader
>>>>>> document. An OVERSEERSTATUS request fails with a timeout.
>>>>>>
>>>>>> I’m going to try ADDROLE, but I’d be delighted to hear any other ideas.
>>>>>> We’ve diverted all the traffic to the backing cluster, so we can blow
>>>>>> this one away and rebuild.
>>>>>>
>>>>>> Looking at the ZooKeeper logs, I see a few instances of network
>>>>>> failures across all three nodes.
>>>>>>
>>>>>> wunder
>>>>>> Walter Underwood
>>>>>> wun...@wunderwood.org
>>>>>> http://observer.wunderwood.org/ (my blog)
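[Editor's note] The ZooKeeper-side symptom Walter describes (an election queue present, but no leader znode) can be confirmed with zkCli.sh, which ships in ZooKeeper's bin/ directory and accepts a single command non-interactively. A minimal sketch; the ensemble address is a hypothetical placeholder:

```shell
# Hypothetical ensemble member -- substitute one of your own ZooKeepers.
ZK="zk1.example.com:2181"

# On a healthy cluster, the first command lists one ephemeral election
# node per live Solr node, and the second prints the znode naming the
# current Overseer. In the broken state described in this thread,
# /overseer_elect/leader is simply absent.
LIST_ELECTION="zkCli.sh -server $ZK ls /overseer_elect/election"
GET_LEADER="zkCli.sh -server $ZK get /overseer_elect/leader"

# Printed rather than executed, since they need a live ensemble:
echo "$LIST_ELECTION"
echo "$GET_LEADER"
```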