We index a lot of relatively small documents: a minimum of around 6k/second, and up to 20k/second. At the same time we are deleting 400-900 documents a second. Our shards are organized by time, so the bulk of our indexing happens in one 'hot' shard, but deletes can reach all the way back in time to our epoch.

Recently I turned on INFO level logging to get better insight into what our Solr cluster is doing. Sometimes, as often as almost 3 times a second, we get messages like:

[CMS][qtp896644936-33133]: too many merges; stalling...

Less frequently we get:

[TMP][commitScheduler-8-thread-1]: seg=_5dy(4.10.3):C13520226/1044084:delGen=318 size=2784.291 MB [skip: too large]

where size is 2500-4900MB. Am I correct in assuming, given these log statements, that CMS is getting overwhelmed by merge activity?

I notice that the number of index data files in an actively indexed shard grows to ~1000 (~50-70GB on disk), whereas a non-actively indexed shard in this SolrCloud generally uses around 400-450 data files. Transaction logs also tend to accumulate, I suspect in relation to how far behind CMS gets.

We are using the default TieredMergePolicy on Solr 4.10.3, with mergeFactor set to 5. I notice maxMergedSegmentBytes defaults to 5GB. Has anyone had any success (or horror stories) trying to tune this value? Should we be looking into custom merge policies at our indexing rate? Any advice for getting better performance out of merging, or work in progress in this area?

Worst case, are there any metrics to look at to monitor these sorts of situations? It seems like I need to parse log files to get a useful set of metrics data...

One thought I had was deferring deletes until our 'hot' shard rotates out of its active indexing time window. I suspect that may make a large enough difference, but I need to see whether we can satisfy our business rule constraints to accommodate this.

https://issues.apache.org/jira/browse/SOLR-6816 and https://issues.apache.org/jira/browse/SOLR-6838 and https://issues.apache.org/jira/browse/LUCENE-6161 seem relevant. We set ramBufferSizeMB to 256, but I don't know whether this is the same setting as described in the LUCENE issue.

Thanks for any thoughts,
--Ralph
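
P.S. To make the log-parsing idea concrete, here is a minimal Python sketch of the kind of metrics extraction I have in mind. It is not anything Solr ships: the function name summarize and the regular expressions are my own, written against the two sample log lines above only, and I am assuming the C13520226/1044084 part of the segment string means document count / deleted document count. Other log layouts would need different patterns.

```python
import re

# Patterns are assumptions derived from the two sample lines in this post.
STALL_RE = re.compile(r"\[CMS\]\[[^\]]+\]: too many merges; stalling")
SKIP_RE = re.compile(
    r"\[TMP\]\[[^\]]+\]: seg=(?P<name>\w+)\([^)]*\):[cC](?P<docs>\d+)"
    r"(?:/(?P<dels>\d+))?\S* size=(?P<size_mb>[\d.]+) MB \[skip: too large\]"
)


def summarize(lines):
    """Count CMS stall messages and collect segments TMP skipped as too large."""
    stalls = 0
    skipped = {}
    for line in lines:
        if STALL_RE.search(line):
            stalls += 1
            continue
        match = SKIP_RE.search(line)
        if match:
            docs = int(match.group("docs"))
            dels = int(match.group("dels") or 0)
            # Keep the most recent report for each segment name.
            skipped[match.group("name")] = {
                "size_mb": float(match.group("size_mb")),
                # Assumed meaning: fraction of the segment's docs marked deleted.
                "deleted_fraction": dels / docs if docs else 0.0,
            }
    return {"stalls": stalls, "skipped": skipped}


if __name__ == "__main__":
    import sys

    # Usage: python merge_metrics.py < solr.log
    print(summarize(sys.stdin))
```

Under that reading, the sample segment _5dy has roughly 7.7% of its documents deleted (1044084 of 13520226). Dividing the stall count by the time window covered by the log would give a stalls-per-second figure that could be graphed or alerted on.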