I guess my first question is why you're splitting up your shards this way. You may have very good reasons, but as you outline, a huge amount of your work is on a single shard.
Is it even possible to spread docs randomly instead, thus spreading the load
over the entire cluster rather than concentrating it on the hot shard?

As you can probably tell, I don't have much specific to say, but LUCENE-6161
seems quite possibly relevant to your situation. Of course, turning off
deletes would pinpoint whether that's really the problem.

Best,
Erick

On Mon, Feb 16, 2015 at 7:12 PM, ralph tice <ralph.t...@gmail.com> wrote:
> We index lots of relatively small documents, a minimum of around
> 6k/second, but up to 20k/second. At the same time we are deleting
> 400-900 documents a second. We have our shards organized by time, so
> the bulk of our indexing happens in one 'hot' shard, but deletes can
> go back in time to our epoch.
>
> Recently I turned on INFO-level logging in order to get better insight
> into what our Solr cluster is doing. Sometimes as frequently as almost
> 3 times a second we get messages like:
> [CMS][qtp896644936-33133]: too many merges; stalling...
>
> Less frequently we get:
> [TMP][commitScheduler-8-thread-1]:
> seg=_5dy(4.10.3):C13520226/1044084:delGen=318 size=2784.291 MB [skip:
> too large]
>
> where size is 2500-4900 MB.
>
> Am I correct in assuming that CMS is getting overwhelmed by merge
> activity, given these log statements? I notice the number of index data
> files grows to ~1000 (~50-70 GB on disk), whereas a shard that isn't
> being actively indexed generally uses around 400-450 data files in this
> SolrCloud. Also, transaction logs tend to accumulate, I suspect in
> relation to how far behind CMS gets.
>
> We are using the default TieredMergePolicy on Solr 4.10.3, with
> mergeFactor set to 5. I notice maxMergedSegmentBytes defaults to 5 GB;
> has anyone had any success (or horror stories) trying to tune this
> value? Should we be looking into custom merge policies at our indexing
> rate? Any advice for getting better performance out of merging, or work
> in progress in this area? Worst case, are there any metrics to look at
> to monitor these sorts of situations? It seems like I need to parse log
> files to get a useful set of metrics data...
>
> One thought I had was deferring deletes until our 'hot' shard rotates
> out of its active indexing time window; I suspect that may make a large
> enough difference, but I need to see whether we can satisfy our
> business-rule constraints to accommodate this.
>
> https://issues.apache.org/jira/browse/SOLR-6816 and
> https://issues.apache.org/jira/browse/SOLR-6838 and
> https://issues.apache.org/jira/browse/LUCENE-6161 seem relevant; we set
> ramBufferSizeMB to 256, but I don't know whether this is the same
> setting as the one described in the LUCENE issue.
>
> Thanks for any thoughts,
>
> --Ralph
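
For reference, the merge-related knobs discussed above live in the
<indexConfig> section of solrconfig.xml on Solr 4.x. A minimal sketch of
where each one is set (the values are purely illustrative, not the poster's
actual configuration, and the maxMergeCount/maxThreadCount numbers are
assumptions included only to show where the merge scheduler is tuned):

    <indexConfig>
      <!-- In-memory indexing buffer size before a new segment is flushed -->
      <ramBufferSizeMB>256</ramBufferSizeMB>

      <!-- TieredMergePolicy: maxMergeAtOnce/segmentsPerTier correspond to
           mergeFactor=5; maxMergedSegmentMB caps merged segment size
           (default 5120 MB, i.e. the 5GB mentioned above) -->
      <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
        <int name="maxMergeAtOnce">5</int>
        <int name="segmentsPerTier">5</int>
        <double name="maxMergedSegmentMB">5120.0</double>
      </mergePolicy>

      <!-- ConcurrentMergeScheduler: maxMergeCount is the merge backlog at
           which indexing threads hit "too many merges; stalling..." -->
      <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
        <int name="maxMergeCount">6</int>
        <int name="maxThreadCount">3</int>
      </mergeScheduler>
    </indexConfig>

Raising maxMergeCount (and maxThreadCount, if the disks can keep up) is one
common way to reduce the stalling, though it only treats the symptom if
deletes are what is driving the merge load.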