<1> It's not that explicit commits are expensive, it's that they happen too fast. An explicit commit and an internal autocommit have exactly the same cost. Your "overlapping onDeck searchers" warning is definitely an indication that your commits are happening from somewhere too quickly and are piling up.
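If you can't stop the client from committing and go the route you outline in your 1), the wiring in solrconfig.xml looks roughly like the stock example below. Just a sketch -- check it against your own update chain before relying on it:

  <!-- Sketch: swallow commit/optimize requests sent by clients.
       statusCode 200 means the client still gets a success response. -->
  <updateRequestProcessorChain name="ignore-commit-from-client" default="true">
    <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
      <int name="statusCode">200</int>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.DistributedUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>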
<2> Likely a good thing, each collection increases overhead. And 1,000,000 documents is quite small in Solr terms unless the individual documents are enormous. I'd do this for a number of reasons.

<3> Certainly an option, but I'd put that last. Fix the commit problem first ;)

<4> If you do this, make the autowarm count quite small. That said, it will be of very little use if you have frequent commits. Let's say you commit every second: the autowarming will warm caches, which will then be thrown out a second later, and it will increase the time it takes to open a new searcher.

<5> Yeah, this would probably just be a band-aid.

If I were prioritizing these, I'd do <1> first. If you control the client, just don't call commit. If you do not control the client, then what you've outlined is fine. Tip: set your soft commit settings to be as long as you can stand. If you must have very short intervals, consider disabling your caches completely. Here's a long article on commits:
https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
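To make the soft commit tip concrete, this is the sort of thing I mean in solrconfig.xml. A sketch only -- the intervals are placeholders, so pick the longest values your application can stand:

  <!-- Sketch: leave commits to the server rather than the client.
       Hard commits flush the tlog but don't open a new searcher;
       soft commits control visibility, so keep them as infrequent as possible. -->
  <autoCommit>
    <maxTime>60000</maxTime>            <!-- hard commit every 60s (placeholder) -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>300000</maxTime>           <!-- new searcher every 5 minutes (placeholder) -->
  </autoSoftCommit>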
<2> Actually, this and <1> are pretty close in priority. Then re-evaluate. Fixing the commit issue may buy you quite a bit of time. Having 1,000 collections is pushing the boundaries presently. Each collection will establish watchers on the bits it cares about in ZooKeeper, and reducing the watchers by a factor approaching 1,000 is A Good Thing.

Frankly, between these two things I'd pretty much expect your problems to disappear. Wouldn't be the first time I've been totally wrong, but it's where I'd start ;)

Best,
Erick

On Wed, Oct 25, 2017 at 8:54 AM, Fengtan <fengtan...@gmail.com> wrote:
> Hi,
>
> We run a SolrCloud 6.4.2 cluster with ZooKeeper 3.4.6 on 3 VMs.
> Each VM runs RHEL 7 with 16 GB RAM, 8 CPUs and OpenJDK 1.8.0_131; each
> VM has one Solr and one ZK instance.
> The cluster hosts 1,000 collections; each collection has 1 shard and
> between 500 and 50,000 documents.
> Documents are indexed incrementally every day; the Solr client mostly does
> searching.
> Solr runs with -Xms7g -Xmx7g.
>
> Everything has been working fine for about one month, but a few days ago we
> started to see Solr timeouts: https://pastebin.com/raw/E2prSrQm
>
> Also, we have always seen these:
> PERFORMANCE WARNING: Overlapping onDeckSearchers=2
>
> We are not sure what is causing the timeouts, although we have identified a
> few things that could be improved:
>
> 1) Ignore explicit commits using IgnoreCommitOptimizeUpdateProcessorFactory
> -- we are aware that explicit commits are expensive
>
> 2) Drop the 1,000 collections and use a single one instead (all our
> collections use the same schema/solrconfig.xml), since stability problems
> are expected when the number of collections reaches the low hundreds
> <https://wiki.apache.org/solr/SolrPerformanceProblems#SolrCloud>. The
> downside is that the new collection would contain 1,000,000 documents, which
> may bring new challenges.
>
> 3) Tune the GC and possibly switch from CMS to G1, as it seems to bring
> better performance according to this
> <https://wiki.apache.org/solr/SolrPerformanceProblems#GC_pause_problems>,
> this
> <https://wiki.apache.org/solr/ShawnHeisey#G1_.28Garbage_First.29_Collector>
> and this
> <http://lucene.472066.n3.nabble.com/java-util-concurrent-TimeoutException-Idle-timeout-expired-50001-50000-ms-td4321209.html>.
> The downside is that Lucene explicitly discourages the usage of G1
> <https://wiki.apache.org/lucene-java/JavaBugs#Java_Bugs_in_various_JVMs_affecting_Lucene_.2F_Solr>,
> so we are not sure what to expect. We use the default GC settings:
> -XX:NewRatio=3
> -XX:SurvivorRatio=4
> -XX:TargetSurvivorRatio=90
> -XX:MaxTenuringThreshold=8
> -XX:+UseConcMarkSweepGC
> -XX:+UseParNewGC
> -XX:ConcGCThreads=4
> -XX:ParallelGCThreads=4
> -XX:+CMSScavengeBeforeRemark
> -XX:PretenureSizeThreshold=64m
> -XX:+UseCMSInitiatingOccupancyOnly
> -XX:CMSInitiatingOccupancyFraction=50
> -XX:CMSMaxAbortablePrecleanTime=6000
> -XX:+CMSParallelRemarkEnabled
> -XX:+ParallelRefProcEnabled
>
> 4) Tune the caches, possibly by increasing autowarmCount on filterCache --
> our current config is:
> <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
> <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>
> <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
>
> 5) Tweak the timeout settings, although this would not fix the underlying
> issue
>
> Does any of these options seem relevant? Is there anything else that might
> address the timeouts?
>
> Thanks
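P.S. Re your 4): if you do experiment with autowarming once the commit interval is under control, I'd start small. Purely a sketch, the counts are placeholders:

  <!-- Sketch: modest caches, tiny autowarm. With long soft commit intervals a small
       autowarmCount is cheap; with very frequent commits leave it at 0 (or drop the
       caches altogether). -->
  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="16"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>
  <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>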