I guess my first question is why you're splitting up your shards this way. You
may have very good reasons, but as you outline, a huge amount of your work is
on a single shard.

Is it even possible to spread docs randomly instead, thus spreading the
load across the entire cluster rather than concentrating it on the hot shard?
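
For instance, if re-creating the collection is an option, letting the
compositeId router hash docs across shards would be something along these
lines (collection name, shard count, and config name here are just
placeholders):

  curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=mycoll&numShards=8&router.name=compositeId&collection.configName=myconf'

That only helps, of course, if your queries don't depend on the
time-based shard layout.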

As you can probably tell, I don't have much specific to say, but
LUCENE-6161 seems quite possibly relevant in your situation. Of course,
turning off deletes would pinpoint whether that's really the problem.

Best,
Erick

On Mon, Feb 16, 2015 at 7:12 PM, ralph tice <ralph.t...@gmail.com> wrote:
> We index lots of relatively small documents, at a minimum of around
> 6k/second but up to 20k/second.  At the same time we are deleting
> 400-900 documents a second.  We have our shards organized by time, so
> the bulk of our indexing happens in one 'hot' shard, but deletes can
> go back in time to our epoch.
>
> Recently I turned on INFO level logging to get better insight into
> what our Solr cluster is doing.  Sometimes, as frequently as almost
> 3 times a second, we get messages like:
> [CMS][qtp896644936-33133]: too many merges; stalling...
>
> Less frequently we get:
> [TMP][commitScheduler-8-thread-1]:
> seg=_5dy(4.10.3):C13520226/1044084:delGen=318 size=2784.291 MB [skip:
> too large]
>
> where size is 2500-4900MB.
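>
> For reference, the merge scheduler is configured in solrconfig.xml's
> <indexConfig> with something like the following (values illustrative,
> roughly the stock Solr 4.x shape, not necessarily our exact settings):
>
>   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
>     <int name="maxMergeCount">4</int>
>     <int name="maxThreadCount">2</int>
>   </mergeScheduler>
>
> As I understand it, CMS stalls incoming indexing threads once too many
> merges are pending relative to maxMergeCount, which would explain the
> "stalling" messages above.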
>
> Am I correct in assuming that CMS is getting overwhelmed by merge
> activity, given these log statements?  I notice the number of index
> data files grows to ~1000 (~50-70GB on disk), whereas a non-actively
> indexed shard in this SolrCloud generally uses around 400-450 data
> files.  Also, transaction logs tend to accumulate, I suspect in
> relation to how far behind CMS gets.
>
> We are using the default TieredMergePolicy on Solr 4.10.3, with
> mergeFactor set to 5.  I notice maxMergedSegmentBytes defaults to
> 5GB; has anyone had any success (or horror stories) trying to tune
> this value?  Should we be looking into custom merge policies at our
> indexing rate?  Any advice for getting better performance out of
> merging, or work in progress in this area?  Worst case, are there any
> metrics we can monitor for these sorts of situations?  It seems
> like I need to parse log files to get a useful set of metrics data...
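>
> For reference, the knobs I mean live in solrconfig.xml's <indexConfig>
> section; e.g. raising the segment size cap would look roughly like
> this (values illustrative, not a recommendation):
>
>   <indexConfig>
>     <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>       <int name="maxMergeAtOnce">5</int>
>       <int name="segmentsPerTier">5</int>
>       <double name="maxMergedSegmentMB">10240</double>
>     </mergePolicy>
>   </indexConfig>
>
> (maxMergedSegmentMB is, as far as I can tell, the config name for the
> maxMergedSegmentBytes default mentioned above; 5120 is the 5GB default.)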
>
> One thought I had was deferring deletes until our 'hot' shard rotates
> out of its active indexing time window.  I suspect that would make a
> large enough difference, but I need to see whether our business rule
> constraints can accommodate it.
>
> https://issues.apache.org/jira/browse/SOLR-6816,
> https://issues.apache.org/jira/browse/SOLR-6838, and
> https://issues.apache.org/jira/browse/LUCENE-6161 all seem relevant.
> We set ramBufferSizeMB to 256, but I don't know whether that is the
> same setting as the one described in the LUCENE issue.
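>
> Concretely, what we set is ramBufferSizeMB in solrconfig.xml's
> <indexConfig> block:
>
>   <ramBufferSizeMB>256</ramBufferSizeMB>
>
> which, as far as I know, maps to IndexWriterConfig.setRAMBufferSizeMB
> under the hood -- I'm not sure whether that's the buffer the LUCENE
> issue is referring to.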
>
> Thanks for any thoughts,
>
> --Ralph
