On 8/18/2017 1:05 PM, Joe Obernberger wrote:
> Thank you Shawn. Please see:
> http://www.lovehorsepower.com/Vesta
> for screen shots of top
> (http://www.lovehorsepower.com/Vesta/VestaSolr6.6.0_top.jpg) and
> several screen shots over various times of jvisualvm.
>
> There is also the GC log and the regular solr.log for one server
> (named Vesta). Please note that we are using HDFS for storage. I love
> top, but also use htop and atop as they show additional information.
> In general we are RAM limited and therefore do not have as much cache
> for OS/disk as we would like, but this issue is CPU related. After
> restarting the one node, the CPU usage stayed low for a while, but
> then eventually comes up to ~800% where it will stay.
Your GC log does not show any evidence of extreme GC activity. The
longest pause in the whole thing is 1.4 seconds, and the average pause
is only seven milliseconds. Looking at percentile statistics, GC
performance is amazing, especially given the rather large heap size.

Problems with insufficient disk caching memory do frequently manifest
as high CPU usage, because that situation requires waiting on I/O.
When the CPU spends a lot of time in iowait, total CPU usage tends to
be very high.

The iowait CPU percentage in the top screenshot is 8.5. That sounds
like a small number, but it is actually quite high. Very healthy Solr
installs will have an extremely low iowait percentage -- possibly
zero -- because they will rarely read off the disk. On the atop
screenshot, the iowait percentage is 172, and the load average on the
system is well above 11. The atop output shows 24 CPU cores (which
might actually be 12 if the CPUs have hyperthreading). Even with all
those CPUs, that load average is high enough to be concerning.

I can see that the system has about 70GB of memory directly allocated
to various Java processes, leaving about 30GB for disk caching
purposes. Walter has noted that those same Java processes have
allocated over 200GB of virtual memory. If we subtract the 70GB of
allocated heap, this would tend to indicate that those processes, one
of which is Solr, are accessing about 130GB of data. I have no idea
how the memory situation works with HDFS, or how this screenshot
should look on a healthy system. Having 30GB of memory to cache the
130GB of data opened by these Java processes might be enough, or it
might not.

If this were a system NOT running HDFS, I would say that there isn't
enough memory. Putting HDFS into the mix makes it difficult for me to
say anything useful, simply because I do not know much about it. You
should consult with an HDFS expert and ask how to make sure that
actual disk accesses are rare -- you want as much of the index data
sitting in RAM on the Solr server as you can possibly get.

Addressing a message later in the thread: the concern with high
virtual memory is actually NOT swapping. It's effective use of disk
caching memory.

Let's examine a hypothetical situation with a machine running nothing
but Solr, using a standard filesystem for data storage. The "top"
output in this hypothetical indicates that total system memory is
128GB and there is no swap usage. The Solr process has a RES memory
size of 25GB, a SHR size of a few megabytes, and a VIRT size of
1000GB. This tells me that the heap is approximately 25GB, and that
Solr is accessing 975GB of index data. At that point, I know that
there is only about 103GB of memory (128GB total minus the 25GB
resident size) available to cache nearly a terabyte of index data --
nowhere near enough for good performance.

Thanks,
Shawn
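
P.S. In case it helps, here is the arithmetic from that hypothetical
written out as a small Python sketch. The numbers are the made-up ones
from above, and the assumption that VIRT minus RES approximates the
amount of memory-mapped index data holds for a plain mmap-based setup,
not necessarily for HDFS:

    # Rough memory arithmetic for the hypothetical "top" output above.
    # Assumption: a standard filesystem, where VIRT - RES roughly
    # equals the amount of memory-mapped index data.
    total_gb = 128   # total system memory
    res_gb = 25      # Solr RES size, approximately the heap
    virt_gb = 1000   # Solr VIRT size

    index_gb = virt_gb - res_gb    # ~975GB of index data
    cache_gb = total_gb - res_gb   # ~103GB left for the OS disk cache
    print("index: %dGB, cache: %dGB, coverage: %.0f%%"
          % (index_gb, cache_gb, 100.0 * cache_gb / index_gb))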
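
If you want to verify the pause numbers I quoted from your GC log, a
quick script along these lines will do it. This is just a sketch: it
assumes the log contains the "Total time for which application threads
were stopped" lines that -XX:+PrintGCApplicationStoppedTime emits, so
adjust the regex if your log format differs:

    import re
    import statistics
    import sys

    # Matches lines like:
    # "Total time for which application threads were stopped:
    #  0.0071234 seconds"
    PAUSE_RE = re.compile(r"stopped: ([0-9.]+) seconds")

    # Usage: python gc_pauses.py solr_gc.log
    pauses = []
    with open(sys.argv[1]) as log:
        for line in log:
            match = PAUSE_RE.search(line)
            if match:
                pauses.append(float(match.group(1)))

    pauses.sort()
    print("pauses: %d" % len(pauses))
    print("max:    %.3f s" % pauses[-1])
    print("mean:   %.3f s" % statistics.mean(pauses))
    print("p99:    %.3f s" % pauses[int(len(pauses) * 0.99)])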
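
Similarly, if you want an iowait number over a controlled interval
rather than whatever top happens to show at the moment, you can sample
/proc/stat directly on Linux. Another rough sketch -- the fifth cpu
field in /proc/stat is iowait:

    import time

    def cpu_times():
        # First line of /proc/stat looks like:
        # cpu  user nice system idle iowait irq softirq ...
        with open("/proc/stat") as f:
            return [int(v) for v in f.readline().split()[1:]]

    before = cpu_times()
    time.sleep(5)
    after = cpu_times()

    delta = [b - a for a, b in zip(before, after)]
    iowait_pct = 100.0 * delta[4] / sum(delta)
    print("iowait over 5s: %.1f%% of total CPU time" % iowait_pct)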