On 1/25/2017 12:04 PM, Greg Harris wrote:
> I think my experience to this point is that G1 (barring unknown Lucene
> bug risk) is actually a lower-risk, easier collector to use. However,
> that doesn't necessarily mean better. You don't have to set the space
> sizes or the assortment of other parameters you seem to have to set
> with CMS, and it controls pause variability much better than CMS does.
> CMS also has the dubious distinction of working well when things are
> fine and becoming a single-threaded full-GC disaster on failures. Some
> of the settings that Solr uses are really more matters of opinion than
> one-size-fits-all. A 50 percent initiating ratio can consume 4 CPUs
> permanently at the default settings if you haven't set your sizes
> correctly. A 90 percent target survivor ratio can actually cause very
> long minor GCs if the survivor space fills, creating much more variable
> behavior than most people realize, although for the most part they
> don't notice. And CMS goes on and on with all these settings that
> really require more thorough analysis and learning about each setting,
> to an almost absurd level. G1 has a very small number of easily
> understandable settings that control pauses and variability really
> well. It does come with some risk to throughput, but for Solr, pause
> goals are far more important to me than throughput. All that said,
> I've still typically used and seen CMS in most circumstances because I
> have way more experience with it. And I think a well-functioning CMS
> is more likely to have lower pauses and better throughput. It's just
> riskier in that it might work much worse. I also don't feel like I
> know all the warts of G1 yet, so that has also kept me hesitant to
> use it more.

My foray into GC tuning was sparked by seeing the load balancer take
Solr machines out of rotation every now and then, because Solr was not
answering the load balancer's health check requests within the 5 second
timeout.  I could not see any reason on the Solr side for these
requests to take so long.

Early in the deployment, I had connected to the running Solr install
using jconsole, and based on what I saw in the statistics reported,
concluded that memory usage and garbage collection were working well. 
At that time, I wasn't using ANY tuning parameters, so whatever a 64-bit
JVM on Linux chooses by default is what I was getting.  I think that's
the parallel collector.
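
As an aside, one way to confirm which collector an untuned JVM will
pick on a given machine is to ask the JVM itself.  This is just an
illustration; the exact output depends on the Java version and
hardware:

  java -XX:+PrintCommandLineFlags -version

On a typical 64-bit server-class Linux machine running Java 7 or 8,
the printed flags include -XX:+UseParallelGC, which is the throughput
(parallel) collector.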

I was completely clueless for a long time about what was causing the
super-long ping requests.  Somehow, and I can no longer remember how, I
was prompted to check into GC pauses.  I started Solr with jHiccup to
get a graph of JVM pauses, and enabled GC logging.  This led to a
discovery -- when a full GC occurred, it was taking in the neighborhood
of 12 seconds on an 8GB heap.
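
In case it helps anyone reproduce that kind of measurement, the general
recipe looks something like the following.  The paths and the jHiccup
location are placeholders, the start.jar invocation reflects how older
Solr versions were launched, and the GC logging flags are the Java 7/8
style options:

  # attach jHiccup as a java agent when starting Solr
  java -javaagent:/path/to/jHiccup.jar \
       -verbose:gc -Xloggc:/var/solr/logs/solr_gc.log \
       -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
       -XX:+PrintGCApplicationStoppedTime \
       -jar start.jar

jHiccup writes a histogram log that can be turned into a pause graph,
and the GC log shows which collections line up with the long pauses.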

My first attempt at a remedy was to try G1GC.  I had absolutely no
knowledge of how to tune it, so all I did was just enable G1.  While
this made most collections faster than the default collector, the
worst-case full GC pause was actually even longer with un-tuned G1 -- 15
seconds was common.
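
To be clear, "just enable G1" means nothing more than adding the
collector flag to the existing heap settings, something like this (the
8GB heap matches what I described above, everything else at defaults):

  -Xms8g -Xmx8g -XX:+UseG1GC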

I couldn't really find any info on how to tune G1, but there was a LOT
of information about tuning with CMS, and TONS of tunable settings, so
I started experimenting with that.  Through a whole lot of trial and
error,
most of which was not very rigorously handled, I was able to come up
with parameters that kept most of the worst-case GC pauses down below
the 5 second timeout for health checks.  Those worst-case pauses were
still more expensive than I wanted to see, but at least my load balancer
wasn't taking functional servers out of rotation most of the time.
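
The exact CMS settings I ended up with are on the wiki page linked
later in this message.  Purely as an illustration of the kind of
parameter soup that CMS tuning involves, a tuned CMS command line tends
to look something like this; the specific values here are examples, not
recommendations:

  -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
  -XX:+UseCMSInitiatingOccupancyOnly
  -XX:CMSInitiatingOccupancyFraction=70
  -XX:NewRatio=3 -XX:SurvivorRatio=4
  -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=8
  -XX:+CMSScavengeBeforeRemark -XX:+CMSParallelRemarkEnabled
  -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4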

At one point, I contacted Azul Systems to inquire about costs for the
Zing JVM.  They wanted me to answer a whole lot of questions about my
install ... the sort of questions that indicated their pricing model
would probably be out of my reach.  Without the answers to those
questions, they were unwilling to give me ANY information about their
prices.  I still do not know how much Zing costs, but I suspect it's
quite pricy.

Later, mostly due to contact with the hotspot-gc-use mailing list at
openjdk, I was able to obtain a small amount of concrete information
about how to tune G1GC.  Armed with that, I did further experiments,
again not very rigorous, and came up with a set of tuning parameters
with even better characteristics than what I had with CMS.  The
ParallelRefProcEnabled option is one of the most important things to
turn on for good GC performance with Solr.  It seems that Lucene/Solr
creates a lot of references as it runs, and collecting those in parallel
offers a significant performance advantage.
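
The complete option list is on the wiki page mentioned below, but the
general shape of a G1 command line along these lines is shown here.
UseG1GC and ParallelRefProcEnabled are the options discussed above; the
other values are illustrative examples, not exact copies of what's on
the wiki:

  -XX:+UseG1GC
  -XX:+ParallelRefProcEnabled
  -XX:G1HeapRegionSize=8m
  -XX:MaxGCPauseMillis=250
  -XX:InitiatingHeapOccupancyPercent=75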

The end results of my GC tuning experiments are documented on a wiki
page that has already been mentioned in this thread:

https://wiki.apache.org/solr/ShawnHeisey

The overall conclusion I have drawn about GC performance with Solr is
that full garbage collections must be entirely avoided.  No matter what
collector is in use, a full GC on a large heap is going to take a very
long time.  The generation-specific collections are typically either
concurrent or low-pause, unlike a full GC.  A well-tuned system can
handle all (or at least almost all) the necessary collections without
full GC.
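
One simple way to verify that a system is meeting that goal is to look
for full collections in the GC log.  With the Java 7/8 style logging
shown earlier, those events are tagged "Full GC", so a count like this
(the log path is a placeholder) should stay at or near zero on a
well-tuned system:

  grep -c "Full GC" /var/solr/logs/solr_gc.log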

Thanks,
Shawn
