We index a large number of relatively small documents: at least about
6k/second, peaking at up to 20k/second.  At the same time we delete
400-900 documents per second.  Our shards are organized by time, so
the bulk of our indexing happens in one 'hot' shard, but deletes can
reach all the way back to our epoch.

Recently I turned on INFO level logging to get better insight into
what our Solr cluster is doing.  Sometimes, as often as nearly 3 times
a second, we get messages like:
[CMS][qtp896644936-33133]: too many merges; stalling...

Less frequently we get:
[TMP][commitScheduler-8-thread-1]: seg=_5dy(4.10.3):C13520226/1044084:delGen=318 size=2784.291 MB [skip: too large]

where size is 2500-4900MB.
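
The segment descriptor in the TMP line packs several numbers together.
Below is a minimal Python sketch for pulling them apart.  It assumes
the C<maxDoc>/<delCount> layout that Lucene's segment string appears
to use (that is my reading of the format, not something confirmed
here), and the names SegmentInfo and parse_segment are made up for
illustration:

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: seg=<name>(<version>):C<maxDoc>/<delCount>:delGen=<n> size=<x> MB [skip: <reason>]
SEG_RE = re.compile(
    r"seg=(?P<name>_\w+)\((?P<version>[\d.]+)\):[cC](?P<max_doc>\d+)"
    r"(?:/(?P<del_count>\d+))?(?::delGen=(?P<del_gen>\d+))?"
    r"\s+size=(?P<size_mb>[\d.]+) MB"
    r"(?:\s+\[skip: (?P<skip>[^\]]+)\])?"
)


class SegmentInfo(NamedTuple):
    name: str
    max_doc: int
    del_count: int
    size_mb: float
    skip_reason: Optional[str]

    @property
    def deleted_fraction(self) -> float:
        """Share of documents in the segment that are marked deleted."""
        return self.del_count / self.max_doc if self.max_doc else 0.0


def parse_segment(line: str) -> Optional[SegmentInfo]:
    """Parse one TMP segment line; return None if it does not match."""
    # Collapse whitespace so a message wrapped over several lines still parses.
    text = " ".join(line.split())
    m = SEG_RE.search(text)
    if m is None:
        return None
    return SegmentInfo(
        name=m.group("name"),
        max_doc=int(m.group("max_doc")),
        del_count=int(m.group("del_count") or 0),
        size_mb=float(m.group("size_mb")),
        skip_reason=m.group("skip"),
    )
```

Under that assumption, the example line above describes a segment with
13520226 documents, of which 1044084 (roughly 7.7%) are deleted.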

Given these log statements, am I correct in assuming that CMS
(ConcurrentMergeScheduler) is getting overwhelmed by merge activity?
I notice the number of index data files grows to ~1000 (~50-70GB on
disk), whereas a shard that is not being actively indexed generally
uses ~400-450 data files in this SolrCloud.  Also, transaction logs
tend to accumulate, which I suspect is related to how far behind CMS
gets.

We are using the default TieredMergePolicy on Solr 4.10.3, with
mergeFactor set to 5.  I notice maxMergedSegmentBytes defaults to 5GB.
Has anyone had any success (or horror stories) trying to tune this
value?  Should we be looking into custom merge policies at our
indexing rate?  Is there any advice for getting better performance out
of merging, or any work in progress in this area?  Worst case, are
there any metrics we can look at to monitor these sorts of situations?
It seems like I need to parse log files to get a useful set of metrics
data...
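
As a rough illustration of that log-parsing approach, here is a
minimal Python sketch that counts the CMS stall messages per second.
It assumes every log line starts with a "YYYY-MM-DD HH:MM:SS"
timestamp, which the excerpts above do not show, so TS_RE is a guess
that would need adjusting to the real log layout; stalls_per_second is
a hypothetical helper name:

```python
import re
from collections import Counter
from typing import Iterable

# Assumption: lines begin with a timestamp like "2015-01-05 10:00:00,101".
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
STALL_MARKER = "too many merges; stalling"


def stalls_per_second(lines: Iterable[str]) -> Counter:
    """Count stall messages, keyed by the second in which they were logged."""
    counts: Counter = Counter()
    for line in lines:
        if STALL_MARKER not in line:
            continue
        m = TS_RE.match(line)
        if m:  # lines without a recognizable timestamp are ignored
            counts[m.group(1)] += 1
    return counts
```

Feeding it an open log file, for example stalls_per_second(open(path)),
yields a per-second histogram that could be pushed to whatever
monitoring system is already in place.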

One thought I had was deferring deletes until our 'hot' shard rotates
out of its active indexing time window.  I suspect that may make a
large enough difference, but I need to see whether we can do this and
still satisfy our business rules.

https://issues.apache.org/jira/browse/SOLR-6816 and
https://issues.apache.org/jira/browse/SOLR-6838 and
https://issues.apache.org/jira/browse/LUCENE-6161 seem relevant.  We
set ramBufferSizeMB to 256, but I don't know whether this is the same
setting as the one described in the LUCENE issue.

Thanks for any thoughts,

--Ralph
