I've been tracking a problem in our Solr environment for awhile with
periodic stalls of Solr 3.6.1. I'm running up to a wall on ideas to try
and thought I might get some insight from some others on this list.

The load on the server is normally anywhere between 1-3. It's an 8-core
machine with 40GB of RAM. I have about 25GB of index data that is
replicated to this server every 5 minutes. It's taking about 200
connections per second and roughly every 5-10 minutes it will stall for
about 30 seconds to a minute. The stall causes the load to go to as high
as 90. It is all CPU bound in user space - all cores go to 99%
utilization (spinlock?). When doing a thread dump, the following line is
blocked in all running Tomcat threads:

org.apache.lucene.search.FieldCacheImpl$Cache.get (
FieldCacheImpl.java:230 )

Looking the source code in 3.6.1, that is a function call to
syncronized() which blocks all threads and causes the backlog. I've
tried to correlate these events to the replication events - but even
with replication disabled - this still happens. We run multiple data
centers using Solr and I was comparing garbage collection processes
between and noted that the old generation is collected very differently
on this data center versus others. The old generation is collected as a
massive collect event (several gigabytes worth) - the other data center
is more saw toothed and collects only in 500MB-1GB at a time. Here's my
parameters to java (the same in all environments):

/usr/java/jre/bin/java \
-verbose:gc \
-XX:+PrintGCDetails \
-server \
-Dcom.sun.management.jmxremote \
-XX:+UseConcMarkSweepGC \
-XX:+UseParNewGC \
-XX:+CMSIncrementalMode \
-XX:+CMSParallelRemarkEnabled \
-XX:+CMSIncrementalPacing \
-XX:NewRatio=3 \
-Xms30720M \
-Xmx30720M \
-Djava.endorsed.dirs=/usr/local/share/apache-tomcat/endorsed \
-classpath /usr/local/share/apache-tomcat/bin/bootstrap.jar \
-Dcatalina.base=/usr/local/share/apache-tomcat \
-Dcatalina.home=/usr/local/share/apache-tomcat \
-Djava.io.tmpdir=/tmp \
org.apache.catalina.startup.Bootstrap start

I've tried a few GC option changes from this (been running this way for
a couple of years now) - primarily removing CMS Incremental mode as we
have 8 cores and remarks on the internet suggest that it is only for
smaller SMP setups. Removing CMS did not fix anything.

I've considered that the heap is way too large (30GB from 40GB) and may
not leave enough memory for mmap operations (MMap appears to be used in
the field cache). Based on active memory utilization in Java, seems like
I might be able to reduce down to 22GB safely - but I'm not sure if that
will help with the CPU issues.

I think field cache is used for sorting and faceting. I've started to
investigate facet.method, but from what I can tell, this doesn't seem to
influence sorting at all - only facet queries. I've tried setting
useFilterForSortQuery, and seems to require less field cache but doesn't
address the stalling issues.

Is there something I am overlooking? Perhaps the system is becoming
oversubscribed in terms of resources? Thanks for any help that is offered.

-- 
Patrick O'Lone
Director of Software Development
TownNews.com

E-mail ... pol...@townnews.com
Phone .... 309-743-0809
Fax ...... 309-743-0830

Reply via email to