On 9/19/2013 9:20 AM, Neil Prosser wrote:
> Apologies for the giant email. Hopefully it makes sense.

Because of its size, I'm going to reply inline like this and trim out
portions of your original message.  I hope that's not horribly confusing
to you!  Looking through my archive of the mailing list, I see that I
have given you some of this information before.

> Our index size ranges between 144GB and 200GB (when we optimise it back
> down, since we've had bad experiences with large cores). We've got just
> over 37M documents; some are smallish, but most range between 1000 and 6000
> bytes. We regularly update documents, so large portions of the index will be
> touched leading to a maxDocs value of around 43M.
> 
> Query load ranges between 400req/s to 800req/s across the five slaves
> throughout the day, increasing and decreasing gradually over a period of
> hours, rather than bursting.

With indexes of that size and 96GB of RAM, you're getting into the size
range where severe performance problems start to appear.  Also, with no
GC tuning other than turning on CMS (and a HUGE 48GB heap on top of
that), you're going to run into extremely long GC pause times.  Your
query load is what I would call quite high, which will make those GC
pauses frequent.
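To put rough numbers on that (my own back-of-envelope math, not figures
from your message): a 48GB heap out of 96GB leaves at most 48GB of RAM
for the OS disk cache, which can hold only about a quarter to a third of
a 144-200GB index.  Every query that touches an uncached portion of the
index has to wait on the disk.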

The GC pauses are the problem I was running into with only an 8GB heap,
with similar tuning where I had just turned on CMS.  When Solr
disappears for 10+
seconds at a time for garbage collection, the load balancer will
temporarily drop that server from the available pool.

I'm aware that this is your old setup, so we'll set it aside for now and
concentrate on your SolrCloud setup.

> Most of our documents have upwards of twenty fields. We use different
> fields to store territory-variant values (we have around 30 territories)
> and also boost based on the values in some of these fields (integer ones).
> 
> So an average query can do a range filter by two of the territory-variant
> fields, filter by a non-territory-variant field. Facet by a field or two
> (maybe territory-variant). Bring back the values of 60 fields. Boost query
> on field values of a non-territory-variant field. Boost by values of two
> territory-variant fields. Dismax query on up to 20 fields (with boosts) and
> phrase boost on those fields too. They're pretty big queries. We don't do
> any index-time boosting. We try to keep things dynamic so we can alter our
> boosts on-the-fly.

The nature of your main queries (and possibly your filters) is probably
always going to be a little memory-hungry, but it sounds like the facets
are what's requiring such incredible amounts of heap RAM.

Try putting a facet.method parameter into your request handler defaults
and setting it to "enum".  The default is "fc", which means fieldcache -
it loads all the indexed terms for that field across the entire index
into the field cache.  Multiply that by the number of fields that you
facet on (across all your queries), and it can be a real problem.
Memory is always going to be required for quick facets, but it's
generally better to let the OS handle that automatically with disk
caching than to load everything into the java heap.
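As a sketch, the change in solrconfig.xml looks something like this
(the handler name "/select" is just a placeholder for whatever handler
you actually use):

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <!-- your existing defaults stay as they are -->
      <str name="facet.method">enum</str>
    </lst>
  </requestHandler>

One thing to keep in mind: the enum method runs a filter for each term
in the field, so it leans heavily on the filterCache.  Watch that
cache's size and hit ratio after you switch.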

Your next paragraph (which I trimmed) talks about sorting, which is
another thing that eats up java heap.  The amount taken scales with the
number of documents in the index, and a separate chunk is taken for
every field you sort on.  See if you can reduce the number of sort
fields.
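To give a very rough sense of the cost (my numbers, assuming a simple
numeric field; actual per-entry sizes vary by field type): sorting
caches one value per document, so a single int sort field on an index
with 43M maxDocs takes about 43,000,000 * 4 bytes, or roughly 170MB of
heap.  String sort fields cost considerably more, and every additional
sort field adds its own chunk.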

> Even when running on six machines in AWS with SSDs, 24GB heap (out of 60GB
> memory) and four shards on two boxes and three on the rest I still see
> concurrent mode failure. This looks like it's causing ZooKeeper to mark the
> node as down and things begin to struggle.
> 
> Is concurrent mode failure just something that will inevitably happen or is
> it avoidable by dropping the CMSInitiatingOccupancyFraction?

I assume that concurrent mode failure is what gets logged preceding a
full garbage collection.  Aggressively tuning your GC will help
immensely.  The link below has what I am currently using.  Someone on
IRC was saying that they have a 48GB heap with similar settings and they
never see huge pauses.  These tuning parameters don't use fixed memory
sizes, so they should work with any max heap size:

http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning
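For reference, CMS tuning along these lines looks roughly like the
following.  This is a sketch of commonly used HotSpot options, not
necessarily the exact set on that wiki page (check the page for the
current list), and the specific numbers are only illustrative:

  -XX:+UseConcMarkSweepGC
  -XX:+UseParNewGC
  -XX:+CMSParallelRemarkEnabled
  -XX:+ParallelRefProcEnabled
  -XX:+UseCMSInitiatingOccupancyOnly
  -XX:CMSInitiatingOccupancyFraction=70
  -XX:NewRatio=3
  -XX:MaxTenuringThreshold=8

Note that none of these hard-code a memory size, which is why they
scale with whatever -Xmx you choose.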

Otis has mentioned G1.  What I found when I used G1 was that it worked
extremely well *almost* all of the time.  Full garbage collections were
a LOT less frequent, but when they happened, the pause was *even longer*
than with untuned CMS.  That caused big problems for me
and my load balancer.  Until someone can come up with some awesome G1
tuning parameters, I personally will continue to avoid it except for
small-heap applications.  G1 is an awesome idea.  If it can be tuned, it
will probably be better than a tuned CMS.
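(For anyone who wants to experiment anyway: G1 is enabled with
-XX:+UseG1GC, and its main tuning knob is a pause-time goal such as
-XX:MaxGCPauseMillis=200.  Both are standard HotSpot options; treat any
specific value as a starting point, not a recommendation.)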

Switching to facet.method=enum as outlined above will probably do the
most to let you decrease your max java heap.  Combining that with GC
tuning parameters might get rid of these problems entirely.  Here's a
wiki page where I have distilled the performance problems I've
encountered myself and while helping others:

http://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn
