Good news, or bad... I’ve just tried to reproduce this in a new environment and everything worked as expected. Perhaps something else was at work in the environment at the time.

I’ll wait until I next need to reboot prod zookeeper, and grab logs and zookeeper health while I do the reboots to see whether prod reproduces the issue. I’ll also add checks on each solr node to verify its connection with each zookeeper node.

Thanks again for your help,
Brendan
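
PS: for the per-node check, I'm thinking of something roughly like the sketch below, run from each solr box. It sends the same four-letter "ruok" command used earlier in this thread to every zookeeper node and flags any that don't answer "imok". The host names and port come from my -DzkHost setting; the function names and the 2 second nc timeout are just my own placeholders. Note this only shows that the box can reach each ZK node, not the state of solr's own ZK session.

```shell
#!/usr/bin/env bash
# Sketch only: check that this box can reach every ZooKeeper node.
# Hosts/port mirror the -DzkHost setting; override via environment if needed.
ZK_HOSTS="${ZK_HOSTS:-zookeeper01 zookeeper02 zookeeper03}"
ZK_PORT="${ZK_PORT:-2181}"

# Send one four-letter command to one node and print the reply
# (prints nothing if the node is unreachable). The 2s timeout is an assumption.
zk_cmd() {
  local host="$1" cmd="$2"
  echo "$cmd" | nc -w 2 "$host" "$ZK_PORT" 2>/dev/null
}

# Print one OK/FAIL line per node; return 0 only if every node answers "imok".
check_all_zk() {
  local failed=0 host reply
  for host in $ZK_HOSTS; do
    reply="$(zk_cmd "$host" ruok)"
    if [ "$reply" = "imok" ]; then
      echo "OK   $host"
    else
      echo "FAIL $host (reply: '${reply}')"
      failed=1
    fi
  done
  return "$failed"
}
```

Calling `check_all_zk` from cron or a monitoring agent and alerting on a non-zero exit status should be enough to catch a solr box that can't see one particular ZK node before the next reboot.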

> On 28 Nov. 2016, at 1:43 pm, Brendan Dobinson <ozb...@gmail.com> wrote:
>
> Thanks for the response. I have never noticed any inconsistencies with
> zookeeper and the leader knows it has two followers (of course this is stats
> from now, I wish I had looked at this before the restarts)
>
> #echo mntr | nc 127.0.0.1 2181
> zk_version 3.4.8--1, built on 02/06/2016 03:18 GMT
> …
> zk_server_state leader
> ..
> zk_followers 2
> zk_synced_followers 2
> ..
>
> Also the solr nodes are connecting effectively with the different ZK nodes:
> (one box for example)
> #echo stat | nc 127.0.0.1 2181
> Zookeeper version: 3.4.8--1, built on 02/06/2016 03:18 GMT
> Clients:
> /192.91.6.40:45450[1](queued=0,recved=598765,sent=601713)
> /192.91.6.27:50060[1](queued=0,recved=595002,sent=598030)
> /192.91.6.204:39008[1](queued=0,recved=625580,sent=628276)
> /127.0.0.1:38748[0](queued=0,recved=1,sent=0)
>
> Each ZK node lists one or more of the solr boxes as clients.
>
> Cheers,
> Brendan
>
> On 28/11/16, 12:11 pm, "Erick Erickson" <erickerick...@gmail.com> wrote:
>
> This seems very weird. Do the Zookeepers know about each other
> correctly? Some evidence for mis-configured Zookeepers might be if you
> rebooted ZK3 and had this happen again.
>
> But that's a wild shot in the dark.
>
> Best,
> Erick
>
> On Sun, Nov 27, 2016 at 4:42 PM, The Dobbo <ozb...@gmail.com> wrote:
>> Hi,
>> I have a 3 external node ZK (zookeeper-3.4.8) cluster managing my 6 node
>> solrcloud (solr 6.1) cluster. Recently due to dirty cow I had to reboot my
>> Solr and zookeeper clusters. I rebooted the solr nodes one by one and all
>> was fine. I then rebooted zookeeper nodes 1 and 2 (with at least 10 minute
>> delay between reboots) and again all was fine - no errors reported in
>> zookeepers RUOK, solcloud cluster health was all green. When I rebooted ZK 3
>> solr reported it could no longer connect to ZK and all the leaders lost
>> their replicas. After a short time solr started rebuilding its replicas (it
>> recovered all automagically) - I didn’t restart solr. The only issue was a
>> spike in load on the solr leaders.
>>
>> My best guess is that solrcloud doesn’t reconnect effectively if a zookeeper
>> node disappears for a period (zkClientTimeout is set to 15 sec (15000)).
>>
>> Relevant config in start-up script: -DzkClientTimeout=1500
>> -DzkHost=zookeeper01:2181,zookeeper02:2181,zookeeper03:2181/solr/production
>>
>> My questions:
>> Has anyone experienced this upon rebooting zookeeper? Any advice if anything
>> I did above was wrong? - should I increase zkClientTimeout?
>> Any monitoring that would alert me that solr has an issue connecting to an
>> individual ZK node (well that would have alerted me before I rebooted ZK3).
>> Any other relevant info from the docs I should be reading? (I believe have
>> read/looked relatively exhaustively)
>>
>> Thanks, let me know if further info is required, I unfortunately didn’t
>> collect logs for this period. My next step is to reproduce in non-prod (but
>> thought I’d reach out first).
>> - Brendan
>>
>
>
>