Yeah, I tried G1, but it did not help - I don't think it is a garbage collection issue. I've made various changes to iCMS as well and the issue ALWAYS happens, no matter what I do. Under heavy traffic (200 requests per second) the world stops as soon as I hit a 5-minute mark; garbage collection would be less predictable than that. Nearly all of my queries use this 5-minute time window, which is why it is now my strongest suspect. If everything blocks on that filter, even for a couple of seconds, my backlog grows to 600-800 requests.
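
To make that concrete, the variant I'm thinking of trying next (untested on my side, and assuming the cache local param for fq - which I believe arrived around Solr 3.4 - behaves the same in 3.6.1) keeps the same 5-minute window but skips the filterCache entirely:

fq={!cache=false}start_time:[* TO NOW/5MINUTE]

I don't know yet whether that touches the FieldCache contention at all - it only changes whether the filter result is cached - but it should at least tell me whether the NOW/5MINUTE rollover of the cached filter is what lines up with the stalls.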
> Did you add the garbage collection JVM options I suggested?
>
> -XX:+UseG1GC -XX:MaxGCPauseMillis=50
>
> Guido.
>
> On 09/12/13 16:33, Patrick O'Lone wrote:
>> Unfortunately, in a test environment, this happens in version 4.4.0 of Solr as well.
>>
>>> I was trying to locate the release notes for 3.6.x but it is too old. If I were you I would update to 3.6.2 (from 3.6.1); it shouldn't affect you since it is a minor release. Locate the release notes and see if something that is affecting you got fixed. I would also think about moving on to 4.x, which is quite stable and fast.
>>>
>>> Like anything with Java and concurrency, it will just get better (and faster) with bigger numbers and concurrency frameworks becoming more and more reliable, standard and stable.
>>>
>>> Regards,
>>>
>>> Guido.
>>>
>>> On 09/12/13 15:07, Patrick O'Lone wrote:
>>>> I have a new question about this issue - I create filter queries of the form:
>>>>
>>>> fq=start_time:[* TO NOW/5MINUTE]
>>>>
>>>> This is used to restrict the set of documents to only items that have a start time within the next 5 minutes. Most of my indexes have millions of documents with few documents that start sometime in the future. Nearly all of my queries include this. Would this cause every other search thread to block until the filter query is re-cached every 5 minutes, and if so, is there a better way to do it? Thanks for any continued help with this issue!
>>>>
>>>>> We have a webapp running with a very high HEAP size (24GB) and we have no problems with it AFTER we enabled the new GC that is meant to replace the CMS GC sometime in the future, but you have to have Java 6 update "Some number I couldn't find but latest should cover" to be able to use it:
>>>>>
>>>>> 1. Remove all GC options you have and...
>>>>> 2. Replace them with "-XX:+UseG1GC -XX:MaxGCPauseMillis=50"
>>>>>
>>>>> As a test of course. You can read more in the following (and interesting) article; we also have Solr running with these options, no more pauses or HEAP size hitting the sky.
>>>>>
>>>>> Don't get bored reading the 1st (and small) introduction page of the article, pages 2 and 3 will make a lot of sense:
>>>>> http://www.drdobbs.com/jvm/g1-javas-garbage-first-garbage-collector/219401061
>>>>>
>>>>> HTH,
>>>>>
>>>>> Guido.
>>>>>
>>>>> On 26/11/13 21:59, Patrick O'Lone wrote:
>>>>>> We do perform a lot of sorting - on multiple fields in fact. We have different kinds of Solr configurations - our news searches do little with regard to faceting, but sort heavily. We provide classified ad searches and those heavily use faceting. I might try reducing the JVM memory some and the amount of perm generation as suggested earlier. It feels like a GC issue and loading the cache just happens to be the victim of a stop-the-world event at the worst possible time.
>>>>>>
>>>>>>> My gut instinct is that your heap size is way too high. Try decreasing it to like 5-10G. I know you say it uses more than that, but that just seems bizarre unless you're doing something like faceting and/or sorting on every field.
>>>>>>>
>>>>>>> -Michael
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Patrick O'Lone [mailto:pol...@townnews.com]
>>>>>>> Sent: Tuesday, November 26, 2013 11:59 AM
>>>>>>> To: solr-user@lucene.apache.org
>>>>>>> Subject: Solr 3.6.1 stalling with high CPU and blocking on field cache
>>>>>>>
>>>>>>> I've been tracking a problem in our Solr environment for a while with periodic stalls of Solr 3.6.1. I'm running up against a wall on ideas to try and thought I might get some insight from others on this list.
>>>>>>>
>>>>>>> The load on the server is normally anywhere between 1-3. It's an 8-core machine with 40GB of RAM. I have about 25GB of index data that is replicated to this server every 5 minutes. It's taking about 200 connections per second and roughly every 5-10 minutes it will stall for about 30 seconds to a minute. The stall causes the load to go as high as 90. It is all CPU bound in user space - all cores go to 99% utilization (spinlock?). When doing a thread dump, the following line is blocked in all running Tomcat threads:
>>>>>>>
>>>>>>> org.apache.lucene.search.FieldCacheImpl$Cache.get ( FieldCacheImpl.java:230 )
>>>>>>>
>>>>>>> Looking at the source code in 3.6.1, that call sits inside a synchronized block, which blocks all other threads and causes the backlog. I've tried to correlate these events to the replication events - but even with replication disabled, this still happens. We run multiple data centers using Solr and I was comparing garbage collection between them; the old generation is collected very differently on this data center versus the others. The old generation is collected as one massive collection event (several gigabytes worth); the other data center is more saw-toothed and collects only 500MB-1GB at a time. Here are my parameters to java (the same in all environments):
>>>>>>>
>>>>>>> /usr/java/jre/bin/java \
>>>>>>>   -verbose:gc \
>>>>>>>   -XX:+PrintGCDetails \
>>>>>>>   -server \
>>>>>>>   -Dcom.sun.management.jmxremote \
>>>>>>>   -XX:+UseConcMarkSweepGC \
>>>>>>>   -XX:+UseParNewGC \
>>>>>>>   -XX:+CMSIncrementalMode \
>>>>>>>   -XX:+CMSParallelRemarkEnabled \
>>>>>>>   -XX:+CMSIncrementalPacing \
>>>>>>>   -XX:NewRatio=3 \
>>>>>>>   -Xms30720M \
>>>>>>>   -Xmx30720M \
>>>>>>>   -Djava.endorsed.dirs=/usr/local/share/apache-tomcat/endorsed \
>>>>>>>   -classpath /usr/local/share/apache-tomcat/bin/bootstrap.jar \
>>>>>>>   -Dcatalina.base=/usr/local/share/apache-tomcat \
>>>>>>>   -Dcatalina.home=/usr/local/share/apache-tomcat \
>>>>>>>   -Djava.io.tmpdir=/tmp \
>>>>>>>   org.apache.catalina.startup.Bootstrap start
>>>>>>>
>>>>>>> I've tried a few GC option changes from this (been running this way for a couple of years now) - primarily removing CMS incremental mode, as we have 8 cores and remarks on the internet suggest that it is only for smaller SMP setups. Removing it did not fix anything.
>>>>>>>
>>>>>>> I've considered that the heap is way too large (30GB out of 40GB) and may not leave enough memory for mmap operations (MMap appears to be used by the field cache). Based on active memory utilization in Java, it seems like I might be able to reduce it down to 22GB safely - but I'm not sure if that will help with the CPU issues.
>>>>>>>
>>>>>>> I think the field cache is used for sorting and faceting.
>>>>>>> I've started to investigate facet.method, but from what I can tell, this doesn't seem to influence sorting at all - only facet queries. I've tried setting useFilterForSortQuery, and it seems to require less field cache but doesn't address the stalling issues.
>>>>>>>
>>>>>>> Is there something I am overlooking? Perhaps the system is becoming oversubscribed in terms of resources? Thanks for any help that is offered.

--
Patrick O'Lone
Director of Software Development
TownNews.com

E-mail ... pol...@townnews.com
Phone .... 309-743-0809
Fax ...... 309-743-0830