Hello everyone, I have a nasty problem with the scheduled backup of Solr collections. From time to time, when a scheduled backup is triggered (the backup operation takes around 10 minutes), Solr freezes for 20-30 seconds. The freeze happens on one Solr instance at a time, but it affects the latency of all queries (because queries are distributed across 6 shards). I can reproduce the problem only when updates in the Solr cluster are enabled. When I disable updates, the problem is gone.
The Lucene index is not big and fits into the OS cache. I am wondering whether taking the backup itself could be the culprit, and whether the process disturbs the operating system caches. Maybe all the files being copied to NFS eat up the OS cache, and when the OS reaches high memory usage it starts cleaning up memory, causing Solr to freeze. During the freeze, monitoring charts show higher IO wait times. In addition, the Solr nodes that seem to be affected reach 95-100% total memory usage (used + buffers + caches). I cannot see anything valuable in the GC logs apart from a message suggesting that the application was stopped for 20-30 seconds (Application time).

The cluster consists of 12 machines. Each Solr node runs on Ubuntu 16.04, inside Docker. All the servers run in AWS EC2. The EC2 instances have local SSD disks (but the same problem appeared with EBS).

Has anyone had a similar problem and can share some thoughts? I'll appreciate any help.

-- Pawel Rog