Thanks for the response. I have never noticed any inconsistencies with
ZooKeeper, and the leader knows it has two followers (of course these are
stats from now; I wish I had looked at this before the restarts):

#echo mntr | nc 127.0.0.1 2181
zk_version      3.4.8--1, built on 02/06/2016 03:18 GMT
…
zk_server_state leader
..
zk_followers    2
zk_synced_followers     2
..
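A minimal sketch of how the mntr check above could be scripted, so the follower counts are recorded before a restart rather than after. It assumes the four-letter commands are reachable over plain TCP (as they are for nc here); the function names and the expected_followers parameter are hypothetical, not part of ZooKeeper or Solr.

```python
import socket


def four_letter_word(host, port, cmd, timeout=5.0):
    """Send a ZooKeeper four-letter command (e.g. 'mntr') and return the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text):
    """Turn 'key<whitespace>value' lines into a dict; other lines are skipped."""
    stats = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            stats[parts[0]] = parts[1].strip()
    return stats


def leader_problems(stats, expected_followers):
    """Return a list of problems found in a leader's mntr stats."""
    problems = []
    if stats.get("zk_server_state") != "leader":
        return problems  # only the leader reports follower counts
    for key in ("zk_followers", "zk_synced_followers"):
        value = int(stats.get(key, "0"))
        if value != expected_followers:
            problems.append(f"{key}={value}, expected {expected_followers}")
    return problems
```

For a 3-node ensemble this could be called as leader_problems(parse_mntr(four_letter_word(host, 2181, "mntr")), 2) against each ZK host in turn; an empty list from the leader means both followers are connected and synced.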

Also, the solr nodes are connecting to the different ZK nodes without
problems (one box as an example):
#echo stat | nc 127.0.0.1 2181
Zookeeper version: 3.4.8--1, built on 02/06/2016 03:18 GMT
Clients:
 /192.91.6.40:45450[1](queued=0,recved=598765,sent=601713)
 /192.91.6.27:50060[1](queued=0,recved=595002,sent=598030)
 /192.91.6.204:39008[1](queued=0,recved=625580,sent=628276)
 /127.0.0.1:38748[0](queued=0,recved=1,sent=0)

Each ZK node lists one or more of the solr boxes as clients.
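
A rough sketch of how the client lists from stat could be checked automatically, assuming the client line format shown above. The function names are hypothetical, and the list of expected solr IPs would have to be supplied by whoever runs it.

```python
import re

# Matches lines such as:
#  /192.91.6.40:45450[1](queued=0,recved=598765,sent=601713)
CLIENT_LINE = re.compile(
    r"^\s*/(?P<ip>.+):(?P<port>\d+)\[\d+\]"
    r"\(queued=(?P<queued>\d+),recved=(?P<recved>\d+),sent=(?P<sent>\d+)\)\s*$"
)


def parse_clients(stat_text):
    """Extract the client connections listed in one node's 'stat' output."""
    clients = []
    for line in stat_text.splitlines():
        match = CLIENT_LINE.match(line)
        if match:
            clients.append({
                "ip": match.group("ip"),
                "port": int(match.group("port")),
                "queued": int(match.group("queued")),
                "recved": int(match.group("recved")),
                "sent": int(match.group("sent")),
            })
    return clients


def unseen_clients(stat_texts, expected_ips):
    """Return the expected IPs that no ZK node lists as a client."""
    seen = {c["ip"] for text in stat_texts for c in parse_clients(text)}
    return sorted(set(expected_ips) - seen)
```

Passing the stat output of all three ZK nodes together with the six solr IPs to unseen_clients should return an empty list when every solr box holds a session somewhere. The 127.0.0.1 entry is the nc probe itself and can be ignored.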

Cheers,
Brendan

On 28/11/16, 12:11 pm, "Erick Erickson" <erickerick...@gmail.com> wrote:

    This seems very weird. Do the Zookeepers know about each other
    correctly? Some evidence for mis-configured Zookeepers might be if you
    rebooted ZK3 and had this happen again.
    
    But that's a wild shot in the dark.
    
    Best,
    Erick
    
    On Sun, Nov 27, 2016 at 4:42 PM, The Dobbo <ozb...@gmail.com> wrote:
    > Hi,
    > I have a 3 external node ZK (zookeeper-3.4.8) cluster managing my 6 node
    > solrcloud (solr 6.1) cluster. Recently due to dirty cow I had to reboot my
    > Solr and zookeeper clusters. I rebooted the solr nodes one by one and all
    > was fine. I then rebooted zookeeper nodes 1 and 2 (with at least a 10
    > minute delay between reboots) and again all was fine - no errors reported
    > by zookeeper's ruok, solrcloud cluster health was all green. When I
    > rebooted ZK 3, solr reported it could no longer connect to ZK and all the
    > leaders lost their replicas. After a short time solr started rebuilding
    > its replicas (it recovered all automagically) - I didn’t restart solr. The
    > only issue was a spike in load on the solr leaders.
    >
    > My best guess is that solrcloud doesn’t reconnect effectively if a
    > zookeeper node disappears for a period (zkClientTimeout is set to 15 sec
    > (15000)).
    >
    > Relevant config in start-up script: -DzkClientTimeout=1500
    > -DzkHost=zookeeper01:2181,zookeeper02:2181,zookeeper03:2181/solr/production
    >
    > My questions:
    > Has anyone experienced this upon rebooting zookeeper? Any advice if
    > anything I did above was wrong? Should I increase zkClientTimeout?
    > Is there any monitoring that would alert me that solr has an issue
    > connecting to an individual ZK node (that is, that would have alerted me
    > before I rebooted ZK3)?
    > Any other relevant info from the docs I should be reading? (I believe I
    > have read/looked relatively exhaustively.)
    >
    > Thanks, let me know if further info is required. I unfortunately didn’t
    > collect logs for this period. My next step is to reproduce in non-prod
    > (but thought I’d reach out first).
    > - Brendan
    >
    

