This seems very weird. Do the three ZooKeeper nodes know about each
other, i.e. is the ensemble configured correctly? One piece of evidence
for a misconfigured ensemble would be if you rebooted ZK3 again and saw
the same thing happen.

But that's a wild shot in the dark.
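
If it helps, here is a rough Python sketch for checking whether the
ensemble members agree on who they are. It assumes the four-letter-word
commands (ruok, srvr) are reachable on the client port, which I believe
is the default on 3.4.8, and it takes the zkHost string from your
start-up script. The function names are made up for illustration and are
not part of Solr or ZooKeeper. A node reporting "standalone", or an
ensemble without exactly one leader, would point at a config problem.

```python
import socket


def parse_zk_host(zk_host):
    """Split a Solr zkHost string into ([(host, port), ...], chroot)."""
    hosts_part, slash, chroot = zk_host.partition("/")
    servers = []
    for entry in hosts_part.split(","):
        host, _, port = entry.strip().partition(":")
        # Assumption: fall back to 2181 when no port is given.
        servers.append((host, int(port) if port else 2181))
    return servers, (slash + chroot) if slash else ""


def four_letter_word(host, port, command, timeout=5.0):
    """Send a ZooKeeper four-letter command and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the socket after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(srvr_reply):
    """Extract the 'Mode:' value (e.g. leader, follower) from srvr output."""
    for line in srvr_reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def check_ensemble(zk_host):
    """Probe every node in zkHost; return {"host:port": (ok, mode)}."""
    servers, _chroot = parse_zk_host(zk_host)
    report = {}
    for host, port in servers:
        try:
            ok = four_letter_word(host, port, "ruok").strip() == "imok"
            mode = parse_mode(four_letter_word(host, port, "srvr"))
        except OSError as exc:  # refused, timed out, DNS failure, ...
            ok, mode = False, "unreachable: %s" % exc
        report["%s:%d" % (host, port)] = (ok, mode)
    return report


def looks_healthy(report):
    """True if every node answered and there is one leader plus followers."""
    modes = [mode for ok, mode in report.values() if ok]
    return (
        len(modes) == len(report)
        and modes.count("leader") == 1
        and modes.count("follower") == len(modes) - 1
    )


# Example (needs network access to the ZK nodes):
# report = check_ensemble(
#     "zookeeper01:2181,zookeeper02:2181,zookeeper03:2181/solr/production")
# print(report, looks_healthy(report))
```

Run from cron on each Solr host, something like this would also cover
the "alert me before I reboot ZK3" case, since it tests each ZK node
individually from where Solr sits. Note that it only shows the node is
reachable, not that Solr's own session is healthy.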

Best,
Erick

On Sun, Nov 27, 2016 at 4:42 PM, The Dobbo <ozb...@gmail.com> wrote:
> Hi,
> I have a 3 external node ZK (zookeeper-3.4.8) cluster managing my 6 node 
> solrcloud (solr 6.1) cluster. Recently due to dirty cow I had to reboot my 
> Solr and zookeeper clusters. I rebooted the solr nodes one by one and all was 
> fine. I then rebooted zookeeper nodes 1 and 2 (with at least 10 minute delay 
> between reboots) and again all was fine - no errors reported in zookeeper's 
> RUOK, solrcloud cluster health was all green. When I rebooted ZK 3, solr 
> reported it could no longer connect to ZK and all the leaders lost their 
> replicas. After a short time solr started rebuilding its replicas (it 
> recovered all automagically) - I didn’t restart solr. The only issue was a 
> spike in load on the solr leaders.
>
> My best guess is that solrcloud doesn’t reconnect effectively if a zookeeper 
> node disappears for a period (zkClientTimeout is set to 15 sec (15000)).
>
> Relevant config in start-up script: -DzkClientTimeout=1500 
> -DzkHost=zookeeper01:2181,zookeeper02:2181,zookeeper03:2181/solr/production
>
> My questions:
> Has anyone experienced this upon rebooting zookeeper? Any advice if anything 
> I did above was wrong? - should I increase zkClientTimeout?
> Is there any monitoring that would alert me when solr has an issue connecting 
> to an individual ZK node (ideally something that would have alerted me before 
> I rebooted ZK3)?
> Any other relevant info from the docs I should be reading? (I believe I have 
> read/looked relatively exhaustively.)
>
> Thanks, let me know if further info is required, I unfortunately didn’t 
> collect logs for this period. My next step is to reproduce in non-prod (but 
> thought I’d reach out first).
> - Brendan
>
