I am not completely sure about this, but if I remember correctly (it has been 
more than a year since I did it and I was lazy enough not to take notes 
:( ), it helped when I reduced the permanent generation's share of the heap. 
That meant more GC activity on a smaller permanent generation, but those 
collections did not block the system, and they may have prevented the really 
large GCs at the cost of more small ones. This is far from sound advice, 
though - it is a fairly distant memory and I may have mixed things up since 
(I have been doing many other things in between), so it could just as well be 
misleading. Also make sure that your heap stays big enough; once you drop 
below a reasonable value, nothing will help.
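
If it helps, the knobs I would have touched at the time were along these 
lines (the 256m values here are made up for illustration - I no longer have 
the originals):

-XX:PermSize=256m
-XX:MaxPermSize=256m
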
P.S. if it worked for you, just let us know. 

Regards
Patrice Monroe Pustavrh, 
Software developer, 
Bisnode Slovenia d.o.o.

-----Original Message-----
From: Patrick O'Lone [mailto:pol...@townnews.com] 
Sent: Tuesday, November 26, 2013 5:59 PM
To: solr-user@lucene.apache.org
Subject: Solr 3.6.1 stalling with high CPU and blocking on field cache

I've been tracking a problem in our Solr environment for a while, with periodic 
stalls of Solr 3.6.1. I'm running into a wall on ideas to try and thought I 
might get some insight from others on this list.

The load on the server is normally anywhere between 1-3. It's an 8-core machine 
with 40GB of RAM. I have about 25GB of index data that is replicated to this 
server every 5 minutes. It's taking about 200 connections per second and 
roughly every 5-10 minutes it will stall for about 30 seconds to a minute. The 
stall causes the load to go as high as 90. It is all CPU-bound in user space 
- all cores go to 99% utilization (spinlock?). When doing a thread dump, the 
following line is blocked in all running Tomcat threads:

org.apache.lucene.search.FieldCacheImpl$Cache.get (FieldCacheImpl.java:230)
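
To make the contention concrete, the pattern at that call site is essentially 
the following (this is just a self-contained illustration I put together, not 
the actual Lucene source - FieldCacheLikeCache is my own stand-in name):

import java.util.HashMap;
import java.util.Map;

// Illustration of the pattern behind FieldCacheImpl$Cache.get: a single lock
// guards every field-cache lookup, hit or miss, so under load all query
// threads serialize on it.
class FieldCacheLikeCache<K> {
    private final Map<K, Object> cache = new HashMap<K, Object>();

    Object get(K key) {
        Object value;
        synchronized (cache) {          // every thread, even on a cache hit,
            value = cache.get(key);     // has to take this one lock
            if (value == null) {
                value = new Object();   // stand-in for the real placeholder
                cache.put(key, value);  // object created on a cache miss
            }
        }
        return value;                   // the real code then builds the field
                                        // values outside the lock on a miss
    }
}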

Looking at the source code in 3.6.1, that is a call into a synchronized block, 
which serializes all threads and causes the backlog. I've tried to correlate 
these events with replication events - but even with replication disabled, 
this still happens.

We run multiple data centers using Solr, and comparing garbage collection 
between them I noted that the old generation is collected very differently on 
this data center versus the others. Here the old generation is collected in 
one massive collection event (several gigabytes' worth); the other data center 
is more saw-toothed and collects only 500MB-1GB at a time. Here are my 
parameters to java (the same in all environments):

/usr/java/jre/bin/java \
-verbose:gc \
-XX:+PrintGCDetails \
-server \
-Dcom.sun.management.jmxremote \
-XX:+UseConcMarkSweepGC \
-XX:+UseParNewGC \
-XX:+CMSIncrementalMode \
-XX:+CMSParallelRemarkEnabled \
-XX:+CMSIncrementalPacing \
-XX:NewRatio=3 \
-Xms30720M \
-Xmx30720M \
-Djava.endorsed.dirs=/usr/local/share/apache-tomcat/endorsed \
-classpath /usr/local/share/apache-tomcat/bin/bootstrap.jar \
-Dcatalina.base=/usr/local/share/apache-tomcat \
-Dcatalina.home=/usr/local/share/apache-tomcat \
-Djava.io.tmpdir=/tmp \
org.apache.catalina.startup.Bootstrap start

I've tried a few GC option changes from this (we've been running this way for a 
couple of years now) - primarily removing CMS incremental mode, since we have 8 
cores and remarks on the internet suggest it is only meant for smaller SMP 
setups. Removing incremental mode did not fix anything.
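
Concretely, the main change in that experiment was dropping these two flags 
from the command line above:

-XX:+CMSIncrementalMode \
-XX:+CMSIncrementalPacing \
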

I've considered that the heap is way too large (30GB out of 40GB) and may not 
leave enough memory for mmap operations (mmap appears to be used by the field 
cache). Based on active memory utilization in Java, it seems like I could 
safely reduce the heap to 22GB - but I'm not sure whether that will help with 
the CPU issues.
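
If I do shrink it, the change would just be the heap flags, e.g. (22GB written 
the same way as the current 30GB setting; not a tested number, just the figure 
mentioned above):

-Xms22528M \
-Xmx22528M \
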

I think the field cache is used for sorting and faceting. I've started to 
investigate facet.method, but from what I can tell it doesn't influence sorting 
at all - only facet queries. I've tried setting useFilterForSortedQuery, and it 
seems to require less field cache, but it doesn't address the stalling issues.
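
For reference, these are the two settings I mean (example values only, not a 
recommendation):

facet.method=enum
    (a request parameter; affects faceting, not sorting)

<useFilterForSortedQuery>true</useFilterForSortedQuery>
    (in solrconfig.xml)
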

Is there something I am overlooking? Perhaps the system is becoming 
oversubscribed in terms of resources? Thanks for any help that is offered.

--
Patrick O'Lone
Director of Software Development
TownNews.com

E-mail ... pol...@townnews.com
Phone .... 309-743-0809
Fax ...... 309-743-0830
