Re: Solr 3.6.1 stalling with high CPU and blocking on field cache

Patrick O'Lone Mon, 09 Dec 2013 13:49:07 -0800

I initially thought this was the case as well. These are slave nodes
that receive updates every 5-10 minutes. However, this issue happens
even if replication is turned off and no update handler is provided at all.


I have confirmed against my data that simply running the fq for a
start_time range takes 11-13 seconds to actually populate the
cache. If I make the fq not cache at all, my QTime rises by about
100ms, but there is no stalling effect. A purely negative query
also seems to have this effect, that is:

fq=-start_time:[NOW/MINUTE TO *]

But, I'm not sure if that is because it actually caches the negative
query or if it discards it entirely.
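
For illustration, here is a minimal Python sketch of the kind of filter
discussed in this thread, built on the client side instead of with Solr
date math. It rounds the current time up to the next 5-minute boundary, so
the fq string (and therefore the filter cache key) stays identical within
a window and changes only when the window rolls over. The function names
are hypothetical, and the {!cache=false} local param is only an assumption
about how caching might be switched off for a single fq. This does not make
repopulating the cache any cheaper; it only shows when a new cache entry
would be needed.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ceil_to_interval(ts, minutes=5):
    """Round a timezone-aware UTC datetime up to the next multiple of `minutes`."""
    step = timedelta(minutes=minutes)
    elapsed = ts - EPOCH
    # Ceiling division on timedeltas: floor-divide the negated value, negate again.
    buckets = -(-elapsed // step)
    return EPOCH + buckets * step


def start_time_filter(now, minutes=5, cache=True):
    """Build an fq value with an explicit upper bound (hypothetical helper).

    With cache=False the value is prefixed with {!cache=false}, which is an
    assumption about how per-filter caching would be disabled.
    """
    upper = ceil_to_interval(now, minutes)
    stamp = upper.strftime("%Y-%m-%dT%H:%M:%SZ")
    prefix = "" if cache else "{!cache=false}"
    return f"{prefix}start_time:[* TO {stamp}]"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(start_time_filter(now))               # cached variant
    print(start_time_filter(now, cache=False))  # uncached variant
```

Two requests made at 13:49:07 and 13:49:59 UTC would produce the same
string, with an upper bound of 13:50:00Z, so only the first request after
each boundary would have to populate a new filter cache entry.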

> Patrick,
> 
> Are you getting these stalls following a commit? If so then the issue is
> most likely fieldCache warming pauses. To stop your users from seeing
> this pause you'll need to add static warming queries to your
> solrconfig.xml to warm the fieldCache before it's registered.
> 
> 
> On Mon, Dec 9, 2013 at 12:33 PM, Patrick O'Lone <pol...@townnews.com> wrote:
> 
>     Well, I want to include everything that will start in the next 5-minute
>     interval and everything that came before. The query is more like:
> 
>     fq=start_time:[* TO NOW+5MINUTE/5MINUTE]
> 
>     so that it rounds to the nearest 5-minute interval on the right-hand
>     side. But, as soon as 1 second past that 5-minute window, everything
>     pauses waiting for the filter cache (at least that's my working theory
>     based on observation). Is it possible to do something like:
> 
>     fq=start_time:[* TO NOW+1DAY/DAY]&q=start_time:[* TO NOW/MINUTE]
> 
>     where it would use the filter cache to narrow down by day resolution and
>     then filter as part of the standard query, or something like that?
> 
>     My thought is that this would still gain a benefit from a query cache,
>     but would be somewhat slower since it must remove results for things
>     appearing later in the day.
> 
>     > If you want a start time within the next 5 minutes, I think your
>     > filter is not the right one.
>     > * will be replaced by the first date in your field.
>     >
>     > Try :
>     > fq=start_time:[NOW TO NOW+5MINUTE]
>     >
>     > Franck Brisbart
>     >
>     >
>     > On Monday 09 December 2013 at 09:07 -0600, Patrick O'Lone wrote:
>     >> I have a new question about this issue - I create filter queries of
>     >> the form:
>     >>
>     >> fq=start_time:[* TO NOW/5MINUTE]
>     >>
>     >> This is used to restrict the set of documents to only items that have a
>     >> start time within the next 5 minutes. Most of my indexes have millions
>     >> of documents, with few documents that start sometime in the future.
>     >> Nearly all of my queries include this. Would this cause every other
>     >> search thread to block until the filter query is re-cached every 5
>     >> minutes and, if so, is there a better way to do it? Thanks for any
>     >> continued help with this issue!
>     >>
>     >>> We have a webapp running with a very large heap (24GB) and we have
>     >>> had no problems with it since we enabled the new GC that is meant to
>     >>> eventually replace the CMS GC, but you need Java 6 update
>     >>> "Some number I couldn't find but latest should cover" to be able to use it:
>     >>>
>     >>> 1. Remove all GC options you have and...
>     >>> 2. Replace them with "-XX:+UseG1GC -XX:MaxGCPauseMillis=50"
>     >>>
>     >>> As a test, of course. You can read more in the following
>     >>> (interesting) article. We also have Solr running with these options:
>     >>> no more pauses or heap size hitting the sky.
>     >>>
>     >>> Don't get bored reading the first (short) introduction page of the
>     >>> article; pages 2 and 3 will make a lot of sense:
>     >>>
>     >>> http://www.drdobbs.com/jvm/g1-javas-garbage-first-garbage-collector/219401061
>     >>>
>     >>>
>     >>> HTH,
>     >>>
>     >>> Guido.
>     >>>
>     >>> On 26/11/13 21:59, Patrick O'Lone wrote:
>     >>>> We do perform a lot of sorting - on multiple fields in fact. We have
>     >>>> different kinds of Solr configurations - our news searches do little
>     >>>> with regard to faceting, but sort heavily. We provide classified ad
>     >>>> searches, and those use faceting heavily. I might try reducing the JVM
>     >>>> memory some and the amount of perm generation as suggested earlier. It
>     >>>> feels like a GC issue and loading the cache just happens to be the
>     >>>> victim of a stop-the-world event at the worst possible time.
>     >>>>
>     >>>>> My gut instinct is that your heap size is way too high. Try
>     >>>>> decreasing it to like 5-10G. I know you say it uses more than that,
>     >>>>> but that just seems bizarre unless you're doing something like
>     >>>>> faceting and/or sorting on every field.
>     >>>>>
>     >>>>> -Michael
>     >>>>>
>     >>>>> -----Original Message-----
>     >>>>> From: Patrick O'Lone [mailto:pol...@townnews.com]
>     >>>>> Sent: Tuesday, November 26, 2013 11:59 AM
>     >>>>> To: solr-user@lucene.apache.org
>     >>>>> Subject: Solr 3.6.1 stalling with high CPU and blocking on field cache
>     >>>>>
>     >>>>> I've been tracking a problem in our Solr environment for a while with
>     >>>>> periodic stalls of Solr 3.6.1. I'm running up against a wall on ideas
>     >>>>> to try and thought I might get some insight from others on this list.
>     >>>>>
>     >>>>> The load on the server is normally anywhere between 1-3. It's an
>     >>>>> 8-core machine with 40GB of RAM. I have about 25GB of index data that
>     >>>>> is replicated to this server every 5 minutes. It's taking about 200
>     >>>>> connections per second and roughly every 5-10 minutes it will stall
>     >>>>> for about 30 seconds to a minute. The stall causes the load to go
>     >>>>> as high as 90. It is all CPU bound in user space - all cores go to
>     >>>>> 99% utilization (spinlock?). When doing a thread dump, the following
>     >>>>> line is blocked in all running Tomcat threads:
>     >>>>>
>     >>>>> org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:230)
>     >>>>>
>     >>>>> Looking at the source code in 3.6.1, that is a function call to
>     >>>>> synchronized() which blocks all threads and causes the backlog. I've
>     >>>>> tried to correlate these events to the replication events - but even
>     >>>>> with replication disabled - this still happens. We run multiple data
>     >>>>> centers using Solr and I was comparing garbage collection processes
>     >>>>> between them and noted that the old generation is collected very
>     >>>>> differently on this data center versus others. The old generation is
>     >>>>> collected as a massive collect event (several gigabytes worth) - the
>     >>>>> other data center is more saw-toothed and collects only 500MB-1GB
>     >>>>> at a time. Here are my parameters to java (the same in all environments):
>     >>>>>
>     >>>>> /usr/java/jre/bin/java \
>     >>>>> -verbose:gc \
>     >>>>> -XX:+PrintGCDetails \
>     >>>>> -server \
>     >>>>> -Dcom.sun.management.jmxremote \
>     >>>>> -XX:+UseConcMarkSweepGC \
>     >>>>> -XX:+UseParNewGC \
>     >>>>> -XX:+CMSIncrementalMode \
>     >>>>> -XX:+CMSParallelRemarkEnabled \
>     >>>>> -XX:+CMSIncrementalPacing \
>     >>>>> -XX:NewRatio=3 \
>     >>>>> -Xms30720M \
>     >>>>> -Xmx30720M \
>     >>>>> -Djava.endorsed.dirs=/usr/local/share/apache-tomcat/endorsed \
>     >>>>> -classpath /usr/local/share/apache-tomcat/bin/bootstrap.jar \
>     >>>>> -Dcatalina.base=/usr/local/share/apache-tomcat \
>     >>>>> -Dcatalina.home=/usr/local/share/apache-tomcat \
>     >>>>> -Djava.io.tmpdir=/tmp \
>     >>>>> org.apache.catalina.startup.Bootstrap start
>     >>>>>
>     >>>>> I've tried a few GC option changes from this (been running this way
>     >>>>> for a couple of years now) - primarily removing CMS incremental mode,
>     >>>>> as we have 8 cores and remarks on the internet suggest that it is
>     >>>>> only for smaller SMP setups. Removing CMS did not fix anything.
>     >>>>>
>     >>>>> I've considered that the heap is way too large (30GB of 40GB) and
>     >>>>> may not leave enough memory for mmap operations (MMap appears to be
>     >>>>> used in the field cache). Based on active memory utilization in Java,
>     >>>>> it seems like I might be able to reduce it to 22GB safely - but I'm
>     >>>>> not sure if that will help with the CPU issues.
>     >>>>>
>     >>>>> I think field cache is used for sorting and faceting. I've started to
>     >>>>> investigate facet.method, but from what I can tell, this doesn't seem
>     >>>>> to influence sorting at all - only facet queries. I've tried setting
>     >>>>> useFilterForSortedQuery, and it seems to require less field cache but
>     >>>>> doesn't address the stalling issues.
>     >>>>>
>     >>>>> Is there something I am overlooking? Perhaps the system is becoming
>     >>>>> oversubscribed in terms of resources? Thanks for any help that is
>     >>>>> offered.
>     >>>>>
>     >>>>> --
>     >>>>> Patrick O'Lone
>     >>>>> Director of Software Development
>     >>>>> TownNews.com
>     >>>>>
>     >>>>> E-mail ... pol...@townnews.com
>     >>>>> Phone .... 309-743-0809
>     >>>>> Fax ...... 309-743-0830
>     >>>>>
>     >>>>>
>     >>>>
>     >>>
>     >>>
>     >>
>     >>
>     >
>     >
>     >
>     >
> 
> 
>     --
>     Patrick O'Lone
>     Director of Software Development
>     TownNews.com
> 
>     E-mail ... pol...@townnews.com
>     Phone .... 309-743-0809
>     Fax ...... 309-743-0830
> 
> 
> 
> 
> -- 
> Joel Bernstein
> Search Engineer at Heliosearch


-- 
Patrick O'Lone
Director of Software Development
TownNews.com

E-mail ... pol...@townnews.com
Phone .... 309-743-0809
Fax ...... 309-743-0830
