We have a webapp running with a very large heap (24GB) and we have had no
problems with it since we enabled the new GC that is meant to replace the
CMS GC sometime in the future. You do need a certain Java 6 update to be able
to use it (some number I couldn't find, but the latest should cover it). To try it:
1. Remove all GC options you have and...
2. Replace them with "-XX:+UseG1GC -XX:MaxGCPauseMillis=50"
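For example, applied to the java invocation Patrick posted further down in this
thread, the GC-related flags would become something like this (just a sketch;
everything other than the GC options left as he already has it):

# sketch only: the two suggested G1 flags in place of the CMS options
/usr/java/jre/bin/java \
-verbose:gc \
-XX:+PrintGCDetails \
-server \
-Dcom.sun.management.jmxremote \
-XX:+UseG1GC \
-XX:MaxGCPauseMillis=50 \
-Xms30720M \
-Xmx30720M \
... (the Tomcat-specific options unchanged)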
As a test, of course. You can read more in the following (and interesting)
article. We also have Solr running with these options: no more pauses and no
more heap size hitting the sky.
Don't get bored reading the first (and short) introduction page of the
article; pages 2 and 3 will make a lot of sense:
http://www.drdobbs.com/jvm/g1-javas-garbage-first-garbage-collector/219401061
HTH,
Guido.
On 26/11/13 21:59, Patrick O'Lone wrote:
We do perform a lot of sorting - on multiple fields in fact. We have
different kinds of Solr configurations - our news searches do little
with regards to faceting, but heavily sort. We provide classified ad
searches and that heavily uses faceting. I might try reducing the JVM
memory some, and the size of the perm generation, as suggested earlier. It feels
like a GC issue, and loading the cache just happens to be the victim of a
stop-the-world event at the worst possible time.
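Something along these lines, presumably (illustrative sizes only, they would
need tuning against our real usage; MaxPermSize assumes the Java 6 permanent
generation):

# example only: smaller heap plus an explicit perm gen cap
-Xms10240M -Xmx10240M -XX:MaxPermSize=256M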
My gut instinct is that your heap size is way too high. Try decreasing it to
like 5-10G. I know you say it uses more than that, but that just seems bizarre
unless you're doing something like faceting and/or sorting on every field.
-Michael
-----Original Message-----
From: Patrick O'Lone [mailto:pol...@townnews.com]
Sent: Tuesday, November 26, 2013 11:59 AM
To: solr-user@lucene.apache.org
Subject: Solr 3.6.1 stalling with high CPU and blocking on field cache
I've been tracking a problem in our Solr environment for a while now: periodic
stalls of Solr 3.6.1. I'm running up against a wall on ideas to try, and thought
I might get some insight from others on this list.
The load on the server is normally anywhere between 1-3. It's an 8-core machine
with 40GB of RAM. I have about 25GB of index data that is replicated to this
server every 5 minutes. It's taking about 200 connections per second and
roughly every 5-10 minutes it will stall for about 30 seconds to a minute. The
stall causes the load to go as high as 90. It is all CPU-bound in user space
- all cores go to 99% utilization (spinlock?). When doing a thread dump, the
following line is blocked in all running Tomcat threads:
org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:230)
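(For anyone wanting to reproduce this, a dump can be captured during a stall
with something like the following, assuming a full JDK is installed on the box
rather than just the JRE:

# <tomcat-pid> is the id of the Tomcat/Solr java process
jstack -l <tomcat-pid> > /tmp/solr-threads-$(date +%s).txt

or with "kill -3 <tomcat-pid>", which writes the dump to catalina.out.)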
Looking at the source code in 3.6.1, that call sits inside a synchronized
block, which serializes all the threads and causes the backlog. I've tried to
correlate these events to the replication events - but even with replication
disabled - this still happens. We run multiple data centers using Solr and I
was comparing garbage collection behavior between them, and noted that the old
generation is collected very differently in this data center versus the others.
Here, the old generation is collected in one massive collection event (several
gigabytes' worth); the other data center shows a more sawtoothed pattern,
collecting only 500MB-1GB at a time. Here are my parameters to java (the same
in all environments):
/usr/java/jre/bin/java \
-verbose:gc \
-XX:+PrintGCDetails \
-server \
-Dcom.sun.management.jmxremote \
-XX:+UseConcMarkSweepGC \
-XX:+UseParNewGC \
-XX:+CMSIncrementalMode \
-XX:+CMSParallelRemarkEnabled \
-XX:+CMSIncrementalPacing \
-XX:NewRatio=3 \
-Xms30720M \
-Xmx30720M \
-Djava.endorsed.dirs=/usr/local/share/apache-tomcat/endorsed \
-classpath /usr/local/share/apache-tomcat/bin/bootstrap.jar \
-Dcatalina.base=/usr/local/share/apache-tomcat \
-Dcatalina.home=/usr/local/share/apache-tomcat \
-Djava.io.tmpdir=/tmp \
org.apache.catalina.startup.Bootstrap start
I've tried a few GC option changes from this (it's been running this way for a
couple of years now), primarily removing CMS incremental mode, since we have 8
cores and remarks on the internet suggest that it is only intended for smaller
SMP setups. Removing the incremental mode did not fix anything.
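For what it's worth, the GC logging could be made a little more detailed so the
stalls can be lined up against actual pause times, e.g. by adding something
like this to the existing flags (HotSpot flag names for Java 6; the log path is
just an example):

# appended to the java command line above
-Xloggc:/var/log/solr-gc.log \
-XX:+PrintGCTimeStamps \
-XX:+PrintGCApplicationStoppedTime \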
I've considered that the heap is way too large (30GB from 40GB) and may not
leave enough memory for mmap operations (MMap appears to be used in the field
cache). Based on active memory utilization in Java, it seems like I might be
able to reduce the heap to 22GB safely, but I'm not sure if that will help with
the CPU issues.
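If I do try that, it would just be a matter of something like (sketch only):

# 22GB heap, leaving the rest of the 40GB of RAM to the OS
-Xms22528M -Xmx22528M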
I think field cache is used for sorting and faceting. I've started to
investigate facet.method, but from what I can tell, this doesn't seem to
influence sorting at all, only facet queries. I've tried setting
useFilterForSortedQuery, and it seems to require less field cache, but it
doesn't address the stalling issues.
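For reference, those knobs are set roughly like this: useFilterForSortedQuery
lives in the <query> section of solrconfig.xml, and facet.method is a request
parameter, e.g. (host, core path and field name below are placeholders):

# illustrative request only
curl 'http://localhost:8080/solr/select?q=*:*&facet=true&facet.field=section&facet.method=enum'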
Is there something I am overlooking? Perhaps the system is becoming
oversubscribed in terms of resources? Thanks for any help that is offered.
--
Patrick O'Lone
Director of Software Development
TownNews.com
E-mail ... pol...@townnews.com
Phone .... 309-743-0809
Fax ...... 309-743-0830