We do perform a lot of sorting - on multiple fields, in fact. We run different kinds of Solr configurations: our news searches do very little faceting but sort heavily, while our classified ad searches use faceting heavily. I might try reducing the JVM heap somewhat and shrinking the permanent generation, as suggested earlier. It feels like a GC issue, and loading the field cache just happens to be the victim of a stop-the-world event at the worst possible time.
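Roughly what I'm thinking of trying on the next restart. The heap and perm generation sizes below are guesses to show the direction, not values we've tested, and the GC log path is just an example - the idea is to also log collections with timestamps so the stalls can be lined up against them:

  /usr/java/jre/bin/java \
   -verbose:gc \
   -XX:+PrintGCDetails \
   -XX:+PrintGCDateStamps \
   -Xloggc:/var/log/tomcat/solr-gc.log \
   -server \
   -Dcom.sun.management.jmxremote \
   -XX:+UseConcMarkSweepGC \
   -XX:+UseParNewGC \
   -XX:+CMSParallelRemarkEnabled \
   -XX:NewRatio=3 \
   -Xms22528M \
   -Xmx22528M \
   -XX:PermSize=256M \
   -XX:MaxPermSize=256M \
   -Djava.endorsed.dirs=/usr/local/share/apache-tomcat/endorsed \
   -classpath /usr/local/share/apache-tomcat/bin/bootstrap.jar \
   -Dcatalina.base=/usr/local/share/apache-tomcat \
   -Dcatalina.home=/usr/local/share/apache-tomcat \
   -Djava.io.tmpdir=/tmp \
   org.apache.catalina.startup.Bootstrap start

If the stalls line up with full CMS collections in that log, that would at least confirm the stop-the-world theory before changing anything else.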
> My gut instinct is that your heap size is way too high. Try decreasing it
> to something like 5-10G. I know you say it uses more than that, but that
> just seems bizarre unless you're doing something like faceting and/or
> sorting on every field.
>
> -Michael
>
> -----Original Message-----
> From: Patrick O'Lone [mailto:pol...@townnews.com]
> Sent: Tuesday, November 26, 2013 11:59 AM
> To: solr-user@lucene.apache.org
> Subject: Solr 3.6.1 stalling with high CPU and blocking on field cache
>
> I've been tracking a problem in our Solr environment for a while with
> periodic stalls of Solr 3.6.1. I'm running up against a wall on ideas to
> try and thought I might get some insight from others on this list.
>
> The load on the server is normally anywhere between 1-3. It's an 8-core
> machine with 40GB of RAM. I have about 25GB of index data that is
> replicated to this server every 5 minutes. It's taking about 200
> connections per second, and roughly every 5-10 minutes it will stall for
> about 30 seconds to a minute. The stall causes the load to go as high as
> 90. It is all CPU bound in user space - all cores go to 99% utilization
> (spinlock?). When doing a thread dump, the following line is blocked in
> all running Tomcat threads:
>
> org.apache.lucene.search.FieldCacheImpl$Cache.get (
> FieldCacheImpl.java:230 )
>
> Looking at the source code in 3.6.1, that call sits inside a
> synchronized block, which blocks all threads and causes the backlog.
> I've tried to correlate these events to the replication events - but
> even with replication disabled, this still happens. We run multiple data
> centers using Solr, and comparing garbage collection behavior between
> them, I noted that the old generation is collected very differently on
> this data center versus the others. Here, the old generation is
> collected in one massive collection event (several gigabytes' worth) -
> the other data center is more saw-toothed and collects only 500MB-1GB at
> a time. Here are my parameters to java (the same in all environments):
>
> /usr/java/jre/bin/java \
> -verbose:gc \
> -XX:+PrintGCDetails \
> -server \
> -Dcom.sun.management.jmxremote \
> -XX:+UseConcMarkSweepGC \
> -XX:+UseParNewGC \
> -XX:+CMSIncrementalMode \
> -XX:+CMSParallelRemarkEnabled \
> -XX:+CMSIncrementalPacing \
> -XX:NewRatio=3 \
> -Xms30720M \
> -Xmx30720M \
> -Djava.endorsed.dirs=/usr/local/share/apache-tomcat/endorsed \
> -classpath /usr/local/share/apache-tomcat/bin/bootstrap.jar \
> -Dcatalina.base=/usr/local/share/apache-tomcat \
> -Dcatalina.home=/usr/local/share/apache-tomcat \
> -Djava.io.tmpdir=/tmp \
> org.apache.catalina.startup.Bootstrap start
>
> I've tried a few GC option changes from this (it has been running this
> way for a couple of years now) - primarily removing CMS incremental
> mode, since we have 8 cores and remarks on the internet suggest it is
> only for smaller SMP setups. Removing incremental mode did not fix
> anything.
>
> I've considered that the heap is way too large (30GB of 40GB) and may
> not leave enough memory for mmap operations (MMap appears to be used in
> the field cache). Based on active memory utilization in Java, it seems
> like I might be able to reduce it to 22GB safely - but I'm not sure if
> that will help with the CPU issues.
>
> I think the field cache is used for sorting and faceting. I've started
> to investigate facet.method, but from what I can tell, this doesn't seem
> to influence sorting at all - only facet queries.
>
> I've tried setting useFilterForSortedQuery, and it seems to require less
> field cache but doesn't address the stalling issues.
>
> Is there something I am overlooking? Perhaps the system is becoming
> oversubscribed in terms of resources? Thanks for any help that is
> offered.
>
> --
> Patrick O'Lone
> Director of Software Development
> TownNews.com
>
> E-mail ... pol...@townnews.com
> Phone .... 309-743-0809
> Fax ...... 309-743-0830

--
Patrick O'Lone
Director of Software Development
TownNews.com

E-mail ... pol...@townnews.com
Phone .... 309-743-0809
Fax ...... 309-743-0830
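
P.S. For anyone following along, here is a minimal sketch of why sorted queries end up in that synchronized FieldCache lookup. This is against the stock Lucene 3.6.x API as I understand it - the index path and field name are made up for illustration, not our real schema:

  import java.io.File;

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.FieldCache;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.MatchAllDocsQuery;
  import org.apache.lucene.search.Sort;
  import org.apache.lucene.search.SortField;
  import org.apache.lucene.search.TopFieldDocs;
  import org.apache.lucene.store.FSDirectory;

  public class FieldCacheSortSketch {
      public static void main(String[] args) throws Exception {
          // Placeholder index location - not our real data directory.
          IndexReader reader = IndexReader.open(
                  FSDirectory.open(new File("/tmp/example-index")));
          IndexSearcher searcher = new IndexSearcher(reader);

          // Sorting on a string field makes Lucene un-invert that field into
          // an in-memory array the first time it is needed. That lookup/build
          // goes through the synchronized code in FieldCacheImpl$Cache.get(),
          // which is where our thread dumps show everything waiting.
          Sort sort = new Sort(new SortField("section", SortField.STRING));
          TopFieldDocs hits = searcher.search(new MatchAllDocsQuery(), 10, sort);
          System.out.println("hits: " + hits.totalHits);

          // The same cache can be populated or inspected directly. Entries are
          // tied to the reader they were built for, so opening a new reader
          // (e.g. after an index update) means the arrays are rebuilt on the
          // first sorted or faceted query against it.
          String[] sections = FieldCache.DEFAULT.getStrings(reader, "section");
          System.out.println("values cached for " + sections.length + " docs");

          searcher.close();
          reader.close();
      }
  }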