I am not completely sure about this, but if I remember correctly (it has been more than a year since I did it, and I was, hmm... careless enough not to take notes :( ), it helped when I reduced the size of the permanent generation. Somehow a smaller perm gen meant more frequent collections there, but those do not block the system, and it may be that they prevent the really large GC pauses at the cost of more small ones. This is far from sound advice, though; it is only a distant memory and I may have mixed things up since then (I have been doing many other things in between), so it could just as well be misleading. Also make sure your heap is still big enough - once it drops below a reasonable value, nothing will help.

P.S. If it works for you, please let us know.
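If you want to experiment with that, the permanent generation on the Sun/Oracle JVMs of that era is sized with explicit flags rather than a percentage; something along these lines could be added to the java command line (the values here are placeholders for illustration only, not a recommendation - they need to be tuned against your own -verbose:gc output):

  -XX:PermSize=128M \
  -XX:MaxPermSize=256M \

Comparing the perm gen lines in the GC log before and after the change should show whether it makes any difference for you.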
Regards,
Patrice Monroe Pustavrh, Software developer, Bisnode Slovenia d.o.o.

-----Original Message-----
From: Patrick O'Lone [mailto:pol...@townnews.com]
Sent: Tuesday, November 26, 2013 5:59 PM
To: solr-user@lucene.apache.org
Subject: Solr 3.6.1 stalling with high CPU and blocking on field cache

I've been tracking a problem in our Solr environment for a while with periodic stalls of Solr 3.6.1. I'm running into a wall on ideas to try and thought I might get some insight from others on this list.

The load on the server is normally anywhere between 1-3. It's an 8-core machine with 40GB of RAM. I have about 25GB of index data that is replicated to this server every 5 minutes. It's taking about 200 connections per second, and roughly every 5-10 minutes it will stall for about 30 seconds to a minute. The stall causes the load to go as high as 90. It is all CPU bound in user space - all cores go to 99% utilization (spinlock?). When doing a thread dump, the following line is blocked in all running Tomcat threads:

  org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:230)

Looking at the source code in 3.6.1, that line is inside a synchronized block, which serializes all threads and causes the backlog. I've tried to correlate these events to the replication events - but even with replication disabled, this still happens.

We run multiple data centers using Solr, and when comparing garbage collection between them I noted that the old generation is collected very differently in this data center versus the others. Here the old generation is collected in one massive event (several gigabytes' worth); the other data center is more sawtoothed and collects only 500MB-1GB at a time. Here are my parameters to java (the same in all environments):

/usr/java/jre/bin/java \
-verbose:gc \
-XX:+PrintGCDetails \
-server \
-Dcom.sun.management.jmxremote \
-XX:+UseConcMarkSweepGC \
-XX:+UseParNewGC \
-XX:+CMSIncrementalMode \
-XX:+CMSParallelRemarkEnabled \
-XX:+CMSIncrementalPacing \
-XX:NewRatio=3 \
-Xms30720M \
-Xmx30720M \
-Djava.endorsed.dirs=/usr/local/share/apache-tomcat/endorsed \
-classpath /usr/local/share/apache-tomcat/bin/bootstrap.jar \
-Dcatalina.base=/usr/local/share/apache-tomcat \
-Dcatalina.home=/usr/local/share/apache-tomcat \
-Djava.io.tmpdir=/tmp \
org.apache.catalina.startup.Bootstrap start

I've tried a few GC option changes from this (we've been running this way for a couple of years now) - primarily removing CMS incremental mode, since we have 8 cores and remarks on the internet suggest it is only for smaller SMP setups. Removing CMS did not fix anything.

I've considered that the heap is far too large (30GB out of 40GB) and may not leave enough memory for mmap operations (mmap appears to be used by the field cache). Based on active memory utilization in Java, it seems like I could safely reduce the heap to 22GB, but I'm not sure whether that will help with the CPU issue.

I think the field cache is used for sorting and faceting. I've started to investigate facet.method, but from what I can tell it only influences facet queries, not sorting. I've tried setting useFilterForSortedQuery, which seems to require less field cache but doesn't address the stalling issue.

Is there something I am overlooking? Perhaps the system is becoming oversubscribed in terms of resources? Thanks for any help that is offered.

--
Patrick O'Lone
Director of Software Development
TownNews.com

E-mail ... pol...@townnews.com
Phone .... 309-743-0809
Fax ...... 309-743-0830
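For readers of the archive: the pileup described in the thread dump above is what you get from a cache lookup guarded by a single monitor - every query thread that touches the field cache (sorting, faceting) queues behind whichever thread currently holds the lock. A rough sketch of that pattern (illustrative only, not the actual Lucene 3.6.1 source) looks like this:

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of a cache whose lookups all funnel through one lock.
// At ~200 requests/second, even a short critical section becomes a
// serialization point, and a slow miss (loading field values) stalls
// every other thread waiting on the same monitor.
class SingleLockCache {
    private final Map<String, Object> innerCache = new HashMap<String, Object>();

    public synchronized Object get(String key) {
        Object value = innerCache.get(key);
        if (value == null) {
            value = load(key);           // expensive when the entry is missing
            innerCache.put(key, value);
        }
        return value;
    }

    private Object load(String key) {
        return new Object();             // stand-in for uninverting a field
    }
}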