On 8/18/2017 1:05 PM, Joe Obernberger wrote:
> Thank you Shawn.  Please see:
> http://www.lovehorsepower.com/Vesta
> for screen shots of top
> (http://www.lovehorsepower.com/Vesta/VestaSolr6.6.0_top.jpg) and
> several screen shots over various times of jvisualvm.
>
> There is also the GC log and the regular solr.log for one server
> (named Vesta).  Please note that we are using HDFS for storage.  I
> love top, but also use htop and atop as they show additional
> information.  In general we are RAM limited and therefore do not have
> much cache for OS/disk as we would like, but this issue is CPU
> related.  After restarting the one node, the CPU usage stayed low for
> a while, but then eventually comes up to ~800% where it will stay. 

Your GC log does not show any evidence of extreme GC activity.  The
longest pause in the whole log is 1.4 seconds, and the average pause
is only seven milliseconds.  The percentile statistics are also
excellent, especially given the rather large heap size.

Problems with insufficient disk caching memory do frequently manifest
as high CPU usage, because reads that miss the cache have to wait on
the disk.  When the CPU spends a lot of time in iowait, total CPU
usage tends to be very high.  The iowait percentage in the top
screenshot was 8.5.  That sounds like a small number, but it is
actually quite high.  Very healthy Solr installs have an extremely low
iowait percentage -- possibly zero -- because they rarely read off the
disk.  The atop screenshot shows an iowait percentage of 172; atop
sums the percentage across all cores, so on this 24-core box that is
roughly 7 percent of total capacity, consistent with what top
reported.
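
If you want to watch iowait over time rather than eyeballing
screenshots, here is a quick Linux-only Python sketch that samples the
aggregate CPU counters in /proc/stat twice and reports iowait as a
fraction of the interval:

#!/usr/bin/env python3
# Sample the aggregate "cpu" line in /proc/stat twice and report the
# fraction of the interval spent in iowait.  Field order on Linux is:
#   user nice system idle iowait irq softirq steal ...
import time

def cpu_times():
    with open("/proc/stat") as f:
        return [int(v) for v in f.readline().split()[1:]]

before = cpu_times()
time.sleep(5)
after = cpu_times()
delta = [b - a for b, a in zip(after, before)]
print(f"iowait: {100.0 * delta[4] / sum(delta):.1f}% of CPU time over 5s")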

The load average on the system is well above 11.  The atop output
shows 24 CPU cores (which might actually be 12 physical cores if the
CPUs have hyperthreading).  Even with that many CPUs, the load average
is high enough to be concerning, because on Linux the load average
counts tasks stuck in uninterruptible I/O wait as well as runnable
tasks -- heavy disk waiting drives it up just like CPU demand does.
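
That check is easy to script as well -- a minimal sketch:

#!/usr/bin/env python3
# Compare the 1/5/15-minute load averages against the logical CPU count.
import os

load1, load5, load15 = os.getloadavg()
cpus = os.cpu_count() or 1
print(f"load: {load1:.2f} {load5:.2f} {load15:.2f} across {cpus} CPUs")
if load5 >= cpus:
    print("sustained load exceeds the CPU count -- tasks are queueing")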

I can see that the system has about 70GB of memory directly allocated
to various Java processes, leaving about 30GB for disk caching
purposes.  Walter has noted that those same Java processes have
allocated over 200GB of virtual memory.  Subtracting the 70GB of
allocated heap from that figure suggests that those processes, one of
which is Solr, are accessing about 130GB of data.
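
Those figures came straight off the screenshots, but you can gather
them on the box itself.  Here is a quick Linux-only Python sketch (my
own, nothing standard) that totals virtual and resident sizes for
every java process and does the same subtraction -- VIRT minus RSS is
a rough proxy for the file-backed data those processes are mapping:

#!/usr/bin/env python3
# Total VmSize (virtual) and VmRSS (resident) for all "java" processes,
# then report the difference as a rough proxy for file-backed mappings.
import os

def read_kb(pid, key):
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith(key + ":"):
                    return int(line.split()[1])  # value is in kB
    except OSError:
        pass
    return 0

def is_java(pid):
    try:
        with open(f"/proc/{pid}/comm") as f:
            return f.read().strip() == "java"
    except OSError:
        return False

pids = [p for p in os.listdir("/proc") if p.isdigit() and is_java(p)]
virt = sum(read_kb(p, "VmSize") for p in pids)
rss = sum(read_kb(p, "VmRSS") for p in pids)
gib = 1024 ** 2  # kB -> GiB
print(f"{len(pids)} java processes: VIRT={virt / gib:.0f}GiB  "
      f"RSS={rss / gib:.0f}GiB  VIRT-RSS={(virt - rss) / gib:.0f}GiB")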

I have no idea how the memory situation works with HDFS, or how this
screenshot should look on a healthy system.  Having 30GB of memory to
cache the 130GB of data opened by these Java processes might be enough,
or it might not.  If this were a system NOT running HDFS, then I would
say that there isn't enough memory.  Putting HDFS into this mix makes it
difficult for me to say anything useful, simply because I do not know
much about it.  You should consult with an HDFS expert and ask them how
to make sure that actual disk accesses are rare -- you want as much of
the index data sitting in RAM on the Solr server as you can possibly get.

Addressing a message later in the thread: the concern with high
virtual memory is actually NOT swapping.  It is whether disk caching
memory is being used effectively.  Let's examine a hypothetical
situation with a machine running nothing but Solr, using a standard
filesystem for data storage.

The "top" output in this hypothetical situation indicates that total
system memory is 128GB and there is no swap usage.  The Solr process has
a RES memory size of 25GB, a SHR size of a few megabytes, and a VIRT
size of 1000GB.  This tells me that their heap is approximately 25 GB,
and that Solr is accessing 975GB of index data.  At that point, I know
that they have about 103GB of memory to cache nearly a terabyte of index
data.  This is a situation where there is nowhere near enough memory for
good performance.
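
To spell out the arithmetic in that hypothetical as a quick Python
sketch:

#!/usr/bin/env python3
# Worked version of the hypothetical top output above.  With a tiny
# SHR, RES is approximately the Java heap, so VIRT - RES approximates
# the memory-mapped index data, and total RAM - RES is what the kernel
# has left for the page cache.
total_ram_gb = 128
res_gb = 25      # ~= heap size, since SHR is negligible
virt_gb = 1000

index_gb = virt_gb - res_gb       # 975 GB of index data being accessed
cache_gb = total_ram_gb - res_gb  # ~103 GB available for disk caching
print(f"index data: {index_gb} GB, cache available: {cache_gb} GB")
print(f"the cache can hold about {100 * cache_gb / index_gb:.0f}% of the index")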

Thanks,
Shawn
