We do perform a lot of sorting - on multiple fields, in fact. We have
different kinds of Solr configurations - our news searches do little
faceting but sort heavily, while our classified ad searches use
faceting heavily. I might try reducing the JVM heap somewhat, along
with the perm generation size, as suggested earlier. It feels like a
GC issue, and loading the field cache just happens to be the victim of
a stop-the-world event at the worst possible time.
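
For reference, the change I have in mind is roughly along these lines
(the 22GB heap matches the estimate in my original message below; the
perm size is only a first guess to experiment with, not a tuned value):

/usr/java/jre/bin/java \
-Xms22528M \
-Xmx22528M \
-XX:MaxPermSize=256M \
... (remaining options unchanged)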

> My gut instinct is that your heap size is way too high. Try decreasing it to 
> like 5-10G. I know you say it uses more than that, but that just seems 
> bizarre unless you're doing something like faceting and/or sorting on every 
> field.
> 
> -Michael
> 
> -----Original Message-----
> From: Patrick O'Lone [mailto:pol...@townnews.com] 
> Sent: Tuesday, November 26, 2013 11:59 AM
> To: solr-user@lucene.apache.org
> Subject: Solr 3.6.1 stalling with high CPU and blocking on field cache
> 
> I've been tracking a problem in our Solr environment for a while, with
> periodic stalls of Solr 3.6.1. I'm running into a wall on ideas to try and
> thought I might get some insight from others on this list.
> 
> The load on the server is normally anywhere between 1-3. It's an 8-core 
> machine with 40GB of RAM. I have about 25GB of index data that is replicated 
> to this server every 5 minutes. It's taking about 200 connections per second 
> and roughly every 5-10 minutes it will stall for about 30 seconds to a 
> minute. The stall causes the load to go as high as 90. It is entirely CPU
> bound in user space - all cores go to 99% utilization (spinlock?). When I
> take a thread dump, all running Tomcat threads are blocked at the
> following line:
> 
> org.apache.lucene.search.FieldCacheImpl$Cache.get (
> FieldCacheImpl.java:230 )
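> 
> A dump like this can be taken with plain jstack against the Tomcat
> process (the PID below is just a placeholder):
> 
> jstack -l <tomcat-pid>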
> 
> Looking at the source code in 3.6.1, that call sits inside a synchronized
> block, which serializes all threads and causes the backlog. I've tried to
> correlate these events with replication events, but even with replication
> disabled this still happens. We run multiple data centers using Solr, and
> comparing garbage collection behavior between them I noticed that the old
> generation is collected very differently in this data center versus the
> others. Here the old generation is collected in one massive collection
> event (several gigabytes' worth), while the other data center shows a
> sawtooth pattern, collecting only 500MB-1GB at a time. Here are my
> parameters to java (the same in all environments):
> 
> /usr/java/jre/bin/java \
> -verbose:gc \
> -XX:+PrintGCDetails \
> -server \
> -Dcom.sun.management.jmxremote \
> -XX:+UseConcMarkSweepGC \
> -XX:+UseParNewGC \
> -XX:+CMSIncrementalMode \
> -XX:+CMSParallelRemarkEnabled \
> -XX:+CMSIncrementalPacing \
> -XX:NewRatio=3 \
> -Xms30720M \
> -Xmx30720M \
> -Djava.endorsed.dirs=/usr/local/share/apache-tomcat/endorsed \
> -classpath /usr/local/share/apache-tomcat/bin/bootstrap.jar \
> -Dcatalina.base=/usr/local/share/apache-tomcat \
> -Dcatalina.home=/usr/local/share/apache-tomcat \
> -Djava.io.tmpdir=/tmp \
> org.apache.catalina.startup.Bootstrap start
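> 
> The GC comparison above is based on the -verbose:gc /
> -XX:+PrintGCDetails output, which with this invocation goes to stdout
> (catalina.out below is only an assumption about where the init script
> redirects it). Something as simple as the following makes any
> stop-the-world full collections easy to spot:
> 
> grep 'Full GC' catalina.out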
> 
> I've tried a few GC option changes from this (we've been running this way
> for a couple of years now) - primarily removing CMS incremental mode,
> since we have 8 cores and advice on the internet suggests it is only
> intended for smaller SMP setups. Removing incremental mode did not fix
> anything.
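> 
> In other words, the GC-related portion of the variant I tried looked
> roughly like this (incremental mode and pacing dropped, everything else
> unchanged):
> 
> -XX:+UseConcMarkSweepGC \
> -XX:+UseParNewGC \
> -XX:+CMSParallelRemarkEnabled \
> -XX:NewRatio=3 \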
> 
> I've considered that the heap is way too large (30GB out of 40GB) and may
> not leave enough memory for mmap operations (MMap appears to be used in
> the field cache). Based on active memory utilization in Java, it seems
> like I could safely reduce the heap to 22GB, but I'm not sure whether that
> will help with the CPU issue.
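> 
> One quick way to sanity-check the mmap headroom is to look at how much
> address space the Tomcat process actually has mapped, e.g. on Linux (the
> PID is a placeholder):
> 
> pmap -x <tomcat-pid> | tail -n 1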
> 
> I think the field cache is used for both sorting and faceting. I've
> started to investigate facet.method, but from what I can tell it only
> influences facet queries, not sorting. I've also tried setting
> useFilterForSortedQuery; it seems to require less field cache but doesn't
> address the stalling issue.
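> 
> For the facet.method experiment, the kind of request I mean is something
> like the following (host, port, core path and field name are just
> placeholders):
> 
> curl 'http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=category&facet.method=enum'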
> 
> Is there something I am overlooking? Perhaps the system is becoming 
> oversubscribed in terms of resources? Thanks for any help that is offered.
> 
> --
> Patrick O'Lone
> Director of Software Development
> TownNews.com
> 
> E-mail ... pol...@townnews.com
> Phone .... 309-743-0809
> Fax ...... 309-743-0830
> 
> 


-- 
Patrick O'Lone
Director of Software Development
TownNews.com

E-mail ... pol...@townnews.com
Phone .... 309-743-0809
Fax ...... 309-743-0830
