Good luck. This kind of assumes that your ZK ensemble is healthy, of course...
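
A quick way to check that before touching anything is to ask each ZooKeeper
for its status with the srvr four-letter-word command. A minimal sketch in
Python; the hostnames are placeholders, and it assumes four-letter-word
commands are enabled on your servers:

```python
import socket

# Placeholder hostnames; substitute your own ensemble members.
ZK_NODES = [("zk1", 2181), ("zk2", 2181), ("zk3", 2181)]

def four_letter_word(host, port, cmd, timeout=5.0):
    """Send a four-letter-word command (e.g. 'srvr') and return the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

for host, port in ZK_NODES:
    try:
        reply = four_letter_word(host, port, "srvr")
        # srvr output includes a "Mode: leader|follower|standalone" line.
        mode = next((line for line in reply.splitlines()
                     if line.startswith("Mode:")), "Mode: unknown")
        print(f"{host}:{port} -> {mode}")
    except OSError as exc:
        print(f"{host}:{port} -> unreachable ({exc})")
```

A healthy three-node ensemble should report exactly one leader and two
followers; anything else is worth sorting out before restarting Solr.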
> On May 22, 2019, at 8:23 AM, Walter Underwood <wun...@wunderwood.org> wrote:
> 
> Thanks, we’ll try that. Bouncing one Solr node doesn’t fix it, because we
> did a rolling restart yesterday.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
> 
>> On May 22, 2019, at 8:21 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>> 
>> Walter:
>> 
>> I have no idea what the root cause is here; this really shouldn’t happen.
>> But the Overseer role (and I’m assuming you’re talking about Solr’s
>> Overseer) is assigned much like a shard leader: the same election process
>> happens, and all the election nodes are ephemeral ZK nodes.
>> 
>> Solr’s Overseer is _not_ fixed to a particular Solr node, although you can
>> assign a preferred role of Overseer in those (rare) cases where there are
>> so many state changes for ZooKeeper that it’s advisable to run it on a
>> dedicated machine.
>> 
>> Overseer assignment is automatic. This should work:
>> 1> Shut everything down, Solr and ZooKeeper.
>> 2> Start your ZooKeepers and let them all get in sync with each other.
>> 3> Start your Solr nodes. It might take 3 minutes or more to bring up the
>> first Solr node; there’s up to a 180-second delay if leaders are not
>> easily findable.
>> 
>> That should cause Solr to elect an Overseer, probably the first Solr node
>> to come up.
>> 
>> It _might_ work to bounce just one Solr node: seeing the Overseer election
>> queue empty, it may elect itself. That said, the Overseer election queue
>> won’t contain the rest of the Solr nodes like it should, so if that works
>> you should probably bounce the rest of the Solr servers one by one to
>> restore the proper election queue.
>> 
>> Not a fix for the root cause, of course, but it should get things
>> operating again. I’ll add that I haven’t seen this happen in the field to
>> my recollection, if at all.
>> 
>> Best,
>> Erick
>> 
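
Once the nodes are back, a quick way to confirm that step 3 actually
produced an Overseer is to ask the Collections API. A minimal sketch in
Python, stdlib only; the host below is a placeholder, and note that
OVERSEERSTATUS will still hang and time out if no Overseer was elected:

```python
import json
import urllib.request

# Placeholder host; point this at any live Solr node in the cluster.
SOLR = "http://localhost:8983/solr"

# OVERSEERSTATUS reports the current Overseer node in its "leader" field.
# In the broken state described in this thread it times out instead.
url = SOLR + "/admin/collections?action=OVERSEERSTATUS&wt=json"
with urllib.request.urlopen(url, timeout=30) as resp:
    status = json.load(resp)

print("Overseer leader:", status.get("leader", "<none reported>"))
```

If that prints a node name, the election completed; a timeout means the
cluster is still in the no-Overseer state.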
>>> On May 21, 2019, at 9:04 PM, Will Martin <wmar...@urgent.ly> wrote:
>>> 
>>> Worked with Fusion and ZooKeeper at GSA for 18 months, in an admin role.
>>> 
>>> Before blowing it away, you could try:
>>> 
>>> - Identify a candidate node with a snapshot you think is old enough to
>>> be robust.
>>> - Clean the data directories on the other ZK nodes.
>>> - Bring up the chosen node and wait for it to settle [wish I could
>>> remember why I called what I saw that].
>>> - Bring up the other nodes one at a time; let each one fully sync as a
>>> follower of the new leader.
>>> - They should each in turn request the snapshot from the leader.
>>> 
>>> Then you have to align your collections with the ensemble, and for the
>>> life of me I can’t remember there being anything particularly tricky
>>> about that with Fusion, which means I can’t remember what I did... or
>>> have it doc’d at home. ;-)
>>> 
>>> Will Martin
>>> DEVOPS ENGINEER
>>> 540.454.9565
>>> 
>>> 8609 WESTWOOD CENTER DR, SUITE 475
>>> VIENNA, VA 22182
>>> geturgently.com
>>> 
>>> On Tue, May 21, 2019 at 11:40 PM Walter Underwood <wun...@wunderwood.org>
>>> wrote:
>>> 
>>>> Yes, please. I have the logs from each of the ZooKeepers.
>>>> 
>>>> We are running 3.4.12.
>>>> 
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/ (my blog)
>>>> 
>>>>> On May 21, 2019, at 6:49 PM, Will Martin <wmar...@urgent.ly> wrote:
>>>>> 
>>>>> Walter. Can I cross-post to zk-dev?
>>>>> 
>>>>> Will Martin
>>>>> DEVOPS ENGINEER
>>>>> 540.454.9565
>>>>> 
>>>>> 8609 WESTWOOD CENTER DR, SUITE 475
>>>>> VIENNA, VA 22182
>>>>> geturgently.com
>>>>> 
>>>>>> On May 21, 2019, at 9:26 PM, Will Martin <wmar...@urgent.ly> wrote:
>>>>>> 
>>>>>> +1
>>>>>> 
>>>>>> Will Martin
>>>>>> DEVOPS ENGINEER
>>>>>> 540.454.9565
>>>>>> 
>>>>>> 8609 WESTWOOD CENTER DR, SUITE 475
>>>>>> VIENNA, VA 22182
>>>>>> geturgently.com
>>>>>> 
>>>>>> On Tue, May 21, 2019 at 7:39 PM Walter Underwood <wun...@wunderwood.org>
>>>>>> wrote:
>>>>>> 
>>>>>> ADDROLE times out after 180 seconds. This seems to be an unrecoverable
>>>>>> state for the cluster, so that is a pretty serious bug.
>>>>>> 
>>>>>> wunder
>>>>>> Walter Underwood
>>>>>> wun...@wunderwood.org
>>>>>> http://observer.wunderwood.org/ (my blog)
>>>>>> 
>>>>>>> On May 21, 2019, at 4:10 PM, Walter Underwood <wun...@wunderwood.org>
>>>>>>> wrote:
>>>>>>> 
>>>>>>> We have a 6.6.2 cluster in prod that appears to have no Overseer. In
>>>>>>> /overseer_elect on ZK, there is an election folder, but no leader
>>>>>>> document. An OVERSEERSTATUS request fails with a timeout.
>>>>>>> 
>>>>>>> I’m going to try ADDROLE, but I’d be delighted to hear any other
>>>>>>> ideas. We’ve diverted all the traffic to the backing cluster, so we
>>>>>>> can blow this one away and rebuild.
>>>>>>> 
>>>>>>> Looking at the ZooKeeper logs, I see a few instances of network
>>>>>>> failures across all three nodes.
>>>>>>> 
>>>>>>> wunder
>>>>>>> Walter Underwood
>>>>>>> wun...@wunderwood.org
>>>>>>> http://observer.wunderwood.org/ (my blog)
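
For the record, the bad state described above (an election queue under
/overseer_elect but no leader znode) can be confirmed straight from
ZooKeeper. A minimal sketch using the third-party kazoo client; the connect
string is a placeholder:

```python
from kazoo.client import KazooClient

# Placeholder connect string; use your own ensemble hosts.
zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()
try:
    # In a healthy SolrCloud cluster, /overseer_elect holds both an
    # election queue and an ephemeral leader znode. The symptom in this
    # thread: the election children exist, the leader does not.
    print("election queue:",
          sorted(zk.get_children("/overseer_elect/election")))
    if zk.exists("/overseer_elect/leader"):
        data, _stat = zk.get("/overseer_elect/leader")
        print("leader:", data.decode("utf-8"))
    else:
        print("leader: MISSING (no Overseer elected)")
finally:
    zk.stop()
```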