Hello everyone,
I have a nasty problem with the scheduled Solr collections backup. From
time to time when a scheduled backup is triggered (backup operation takes
around 10 minutes) Solr freezes for 20-30 seconds. The freeze happens on
one Solr instance at time but this affects all queries latency (because of
distributed queries on 6 shards). I can reproduce the problem only when
updates in the Solr cluster are enabled. When I disable updates, the
problem is gone.

Lucene index is not big and fits into OS cache. I am wondering if taking a
backup can be the culprit of the problem. I'm wondering if the process
messes up operating system caches. Maybe all the files which are copied to
NFS are eating up the OS cache and when the OS reaches high memory usage it
starts cleaning up memory and making Solr to freeze.

During the time of freeze monitoring charts are showing higher IO wait
times. In addition to that Solr nodes which seem to be affected are
reaching 95-100% total memory usage (used + buffers + caches).

I cannot see anything valuable in GC logs apart from a message which
suggests that the application was stopped for 20-30 seconds (Application
time).

The cluster consists of 12 machines. Each Solr is running on Ubuntu 16.04.
All the servers are running in AWS EC2. Each Solr node is running inside
Docker. EC2 instances have local SSD disks (but the same problem appeared
with EBS).

Does anyone have a similar problem and can share some thoughts? I'll
appreciate all help.

--
Pawel Rog

Reply via email to