Thanks, we’ll try that. Bouncing one Solr node doesn’t fix it; we did a 
rolling restart yesterday and the problem persisted.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 22, 2019, at 8:21 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> Walter:
> 
> I have no idea what the root cause is here; this really shouldn’t happen. But 
> the Overseer role (and I’m assuming you’re talking about Solr’s Overseer) is 
> assigned much like a shard leader: the same election process happens, and all 
> the election nodes are ephemeral ZK nodes.
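For illustration only, here is a toy sketch of that election scheme, not Solr code: each participant registers an ephemeral sequential znode, and the holder of the lowest sequence number is leader. If its session dies, its znode vanishes and the next-lowest takes over. The node-name format below is made up for the example.

```python
# Toy model of ZooKeeper-style leader election over ephemeral
# sequential nodes (names are illustrative, not Solr's actual format).

def elect_leader(election_nodes):
    """Pick the node with the lowest sequence suffix, as ZK election does."""
    if not election_nodes:
        return None  # empty election queue: nobody can be elected
    # Node names look like "n_0000000003-solr1:8983_solr"; the digits
    # after "n_" are the ZooKeeper-assigned sequence number.
    return min(election_nodes,
               key=lambda n: int(n.split("-")[0].split("_")[1]))

queue = ["n_0000000007-solr2:8983_solr",
         "n_0000000003-solr1:8983_solr",
         "n_0000000012-solr3:8983_solr"]
print(elect_leader(queue))  # lowest sequence number wins
print(elect_leader([]))     # an empty queue elects nobody
```

This is also why a missing leader znode with a non-empty election directory is such an odd state: with live ephemeral nodes in the queue, someone should always win.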
> 
> Solr’s Overseer is _not_ fixed to a particular Solr node, although you can 
> assign a preferred role of Overseer in those (rare) cases where there are so 
> many state changes flowing through ZooKeeper that it’s advisable for the 
> Overseer to run on a dedicated machine.
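If you ever do want to pin a preferred Overseer, the Collections API call looks roughly like this; the host, port, and node name below are placeholders (the node name must match an entry in live_nodes):

```shell
# Assign the preferred "overseer" role to a specific Solr node
# via the Collections API ADDROLE action.
# "solr1.example.com:8983_solr" is a placeholder node name.
curl "http://solr1.example.com:8983/solr/admin/collections?action=ADDROLE&role=overseer&node=solr1.example.com:8983_solr"
```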
> 
> Overseer assignment is automatic. This should work:
> 1> shut everything down, Solr and ZooKeeper
> 2> start your ZooKeepers and let them all get in sync with each other
> 3> start your Solr nodes. It might take 3 minutes or more to bring up the 
> first Solr node; there’s up to a 180-second delay if leaders are not easily 
> findable.
> 
> That should cause Solr to elect an overseer, probably the first Solr node to 
> come up.
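Sketched as commands, assuming systemd-style service names and the hypothetical hosts solr1-3/zk1-3; adjust to however your ZooKeeper and Solr are actually managed:

```shell
# 1. Stop everything: Solr first, then ZooKeeper, on every host.
for h in solr1 solr2 solr3; do ssh "$h" sudo systemctl stop solr; done
for h in zk1 zk2 zk3; do ssh "$h" sudo systemctl stop zookeeper; done

# 2. Start the ZooKeepers and confirm the ensemble has a quorum
#    using the "srvr" four-letter command.
for h in zk1 zk2 zk3; do ssh "$h" sudo systemctl start zookeeper; done
for h in zk1 zk2 zk3; do
  echo srvr | nc "$h" 2181 | grep Mode   # expect one leader, two followers
done

# 3. Start the Solr nodes; the first may take up to ~180s while it
#    waits for leaders to become findable.
for h in solr1 solr2 solr3; do ssh "$h" sudo systemctl start solr; done
```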
> 
> It _might_ work to bounce just one Solr node: seeing the Overseer election 
> queue empty, it may elect itself. That said, the Overseer election queue won’t 
> contain the rest of the Solr nodes like it should, so if that works you 
> should probably bounce the rest of the Solr servers one by one to restore the 
> proper election queue.
> 
> Not a fix for the root cause, of course, but it should get things operating 
> again. I’ll add that I haven’t seen this happen in the field, to my 
> recollection, if at all.
> 
> Best,
> Erick
> 
>> On May 21, 2019, at 9:04 PM, Will Martin <wmar...@urgent.ly> wrote:
>> 
>> I worked with Fusion and ZooKeeper at GSA for 18 months in an admin role.
>> 
>> Before blowing it away, you could try:
>> 
>> - identify a candidate node with a snapshot you think is old enough
>> to be robust.
>> - clean the data directories on the other ZK nodes.
>> - bring up the chosen node and wait for it to settle [wish I could remember
>> why I called what I saw that].
>> - bring up the other nodes one at a time; let each one fully sync as a
>> follower of the new leader.
>> - they should each in turn request the snapshot from the leader. Then you
>> have to align your collections with the ensemble, and for the life of me I
>> can't remember there being anything particularly tricky about that with
>> Fusion, which means I can't remember what I did... or have it doc'd at
>> home. ;-)
>> 
>> 
>> Will Martin
>> DEVOPS ENGINEER
>> 540.454.9565
>> 
>> 8609 WESTWOOD CENTER DR, SUITE 475
>> VIENNA, VA 22182
>> geturgently.com
>> 
>> 
>> On Tue, May 21, 2019 at 11:40 PM Walter Underwood <wun...@wunderwood.org>
>> wrote:
>> 
>>> Yes, please. I have the logs from each of the Zookeepers.
>>> 
>>> We are running 3.4.12.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>>> On May 21, 2019, at 6:49 PM, Will Martin <wmar...@urgent.ly> wrote:
>>>> 
>>>> Walter. Can I cross-post to zk-dev?
>>>> 
>>>> 
>>>> 
>>>> Will Martin
>>>> DEVOPS ENGINEER
>>>> 540.454.9565
>>>> 
>>>> 
>>>> 8609 WESTWOOD CENTER DR, SUITE 475
>>>> VIENNA, VA 22182
>>>> geturgently.com
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On May 21, 2019, at 9:26 PM, Will Martin <wmar...@urgent.ly> wrote:
>>>>> 
>>>>> +1
>>>>> 
>>>>> Will Martin
>>>>> DEVOPS ENGINEER
>>>>> 540.454.9565
>>>>> 
>>>>> 8609 WESTWOOD CENTER DR, SUITE 475
>>>>> VIENNA, VA 22182
>>>>> geturgently.com
>>>>> 
>>>>> 
>>>>> On Tue, May 21, 2019 at 7:39 PM Walter Underwood <wun...@wunderwood.org> wrote:
>>>>> ADDROLE times out after 180 seconds. This seems to be an unrecoverable
>>> state for the cluster, so that is a pretty serious bug.
>>>>> 
>>>>> wunder
>>>>> Walter Underwood
>>>>> wun...@wunderwood.org
>>>>> http://observer.wunderwood.org/  (my blog)
>>>>> 
>>>>>> On May 21, 2019, at 4:10 PM, Walter Underwood <wun...@wunderwood.org> wrote:
>>>>>> 
>>>>>> We have a 6.6.2 cluster in prod that appears to have no overseer. In
>>> /overseer_elect on ZK, there is an election folder, but no leader document.
>>> An OVERSEERSTATUS request fails with a timeout.
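For comparison, a healthy cluster shows both an election directory and a leader znode under /overseer_elect. These zkCli.sh commands, run against a placeholder ensemble host, illustrate the check:

```shell
# Inspect Overseer election state from ZooKeeper's CLI.
# zk1.example.com:2181 is a placeholder ensemble address.
zkCli.sh -server zk1.example.com:2181 <<'EOF'
ls /overseer_elect
ls /overseer_elect/election
get /overseer_elect/leader
EOF
# A healthy cluster lists both "election" and "leader" under
# /overseer_elect; the symptom here is the "leader" znode missing.
```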
>>>>>> 
>>>>>> I’m going to try ADDROLE, but I’d be delighted to hear any other
>>> ideas. We’ve diverted all the traffic to the backing cluster, so we can
>>> blow this one away and rebuild.
>>>>>> 
>>>>>> Looking at the Zookeeper logs, I see a few instances of network
>>> failures across all three nodes.
>>>>>> 
>>>>>> wunder
>>>>>> Walter Underwood
>>>>>> wun...@wunderwood.org
>>>>>> http://observer.wunderwood.org/  (my blog)
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
> 
