Hi Fengtan,

I would just add that when merging collections, you might want to use document routing (https://lucene.apache.org/solr/guide/6_6/shards-and-indexing-data-in-solrcloud.html#ShardsandIndexingDatainSolrCloud-DocumentRouting). Since you currently keep separate collections, I guess you already have a “collection ID” that you could use as the routing key. That would let you have a single collection but still query only the shard(s) holding data from one “collection”.
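For example, a rough sketch with the default compositeId router (the “siteA” prefix and the field names below are made up for illustration) -- each document is indexed with its “collection ID” as an id prefix:

  <add>
    <doc>
      <field name="id">siteA!doc-123</field>
      <field name="collection_id">siteA</field>
    </doc>
  </add>

and a query can then be restricted to the shard(s) holding that prefix by passing the same key as the routing parameter, e.g. ...&q=title:foo&_route_=siteA! on the select request.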
HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

> On 25 Oct 2017, at 19:25, Erick Erickson <erickerick...@gmail.com> wrote:
>
> <1> It's not that explicit commits are expensive, it's that they happen too fast. An explicit commit and an internal autocommit have exactly the same cost. Your "overlapping ondeck searchers" warning is definitely an indication that your commits are coming from somewhere too quickly and are piling up.
>
> <2> Likely a good thing, each collection increases overhead. And 1,000,000 documents is quite small in Solr's terms unless the individual documents are enormous. I'd do this for a number of reasons.
>
> <3> Certainly an option, but I'd put that last. Fix the commit problem first ;)
>
> <4> If you do this, make the autowarm count quite small. That said, this will be of very little use if you have frequent commits. Let's say you commit every second. The autowarming will warm caches, which will then be thrown out a second later. And it will increase the time it takes to open a new searcher.
>
> <5> Yeah, this would probably just be a band-aid.
>
> If I were prioritizing these, I'd do <1> first. If you control the client, just don't call commit. If you do not control the client, then what you've outlined is fine. Tip: set your soft commit settings to be as long as you can stand. If you must have very short intervals, consider disabling your caches completely. Here's a long article on commits: https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> <2> Actually, this and <1> are pretty close in priority.
>
> Then re-evaluate. Fixing the commit issue may buy you quite a bit of time. Having 1,000 collections is pushing the boundaries presently. Each collection will establish watchers on the bits it cares about in ZooKeeper, and reducing the watchers by a factor approaching 1,000 is A Good Thing.
>
> Frankly, between these two things I'd pretty much expect your problems to disappear. Wouldn't be the first time I've been totally wrong, but it's where I'd start ;)
>
> Best,
> Erick
>
> On Wed, Oct 25, 2017 at 8:54 AM, Fengtan <fengtan...@gmail.com> wrote:
>> Hi,
>>
>> We run a SolrCloud 6.4.2 cluster with ZooKeeper 3.4.6 on 3 VMs. Each VM runs RHEL 7 with 16 GB RAM, 8 CPUs and OpenJDK 1.8.0_131; each VM has one Solr and one ZK instance. The cluster hosts 1,000 collections; each collection has 1 shard and between 500 and 50,000 documents. Documents are indexed incrementally every day; the Solr client mostly does searching. Solr runs with -Xms7g -Xmx7g.
>>
>> Everything had been working fine for about one month, but a few days ago we started to see Solr timeouts: https://pastebin.com/raw/E2prSrQm
>>
>> Also, we have always seen these:
>> PERFORMANCE WARNING: Overlapping onDeckSearchers=2
>>
>> We are not sure what is causing the timeouts, although we have identified a few things that could be improved:
>>
>> 1) Ignore explicit commits using IgnoreCommitOptimizeUpdateProcessorFactory -- we are aware that explicit commits are expensive.
>>
>> 2) Drop the 1,000 collections and use a single one instead (all our collections use the same schema/solrconfig.xml), since stability problems are expected when the number of collections reaches the low hundreds <https://wiki.apache.org/solr/SolrPerformanceProblems#SolrCloud>.
>> The downside is that the new collection would contain 1,000,000 documents, which may bring new challenges.
>>
>> 3) Tune the GC and possibly switch from CMS to G1, as it seems to bring better performance according to this <https://wiki.apache.org/solr/SolrPerformanceProblems#GC_pause_problems>, this <https://wiki.apache.org/solr/ShawnHeisey#G1_.28Garbage_First.29_Collector> and this <http://lucene.472066.n3.nabble.com/java-util-concurrent-TimeoutException-Idle-timeout-expired-50001-50000-ms-td4321209.html>. The downside is that Lucene explicitly discourages the use of G1 <https://wiki.apache.org/lucene-java/JavaBugs#Java_Bugs_in_various_JVMs_affecting_Lucene_.2F_Solr>, so we are not sure what to expect. We use the default GC settings:
>> -XX:NewRatio=3
>> -XX:SurvivorRatio=4
>> -XX:TargetSurvivorRatio=90
>> -XX:MaxTenuringThreshold=8
>> -XX:+UseConcMarkSweepGC
>> -XX:+UseParNewGC
>> -XX:ConcGCThreads=4
>> -XX:ParallelGCThreads=4
>> -XX:+CMSScavengeBeforeRemark
>> -XX:PretenureSizeThreshold=64m
>> -XX:+UseCMSInitiatingOccupancyOnly
>> -XX:CMSInitiatingOccupancyFraction=50
>> -XX:CMSMaxAbortablePrecleanTime=6000
>> -XX:+CMSParallelRemarkEnabled
>> -XX:+ParallelRefProcEnabled
>>
>> 4) Tune the caches, possibly by increasing autowarmCount on filterCache -- our current config is:
>> <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
>> <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>
>> <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
>>
>> 5) Tweak the timeout settings, although this would not fix the underlying issue.
>>
>> Does any of these options seem relevant? Is there anything else that might address the timeouts?
>>
>> Thanks
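A few config sketches below the quoted thread, in case they help. For <1>, something along these lines in solrconfig.xml (the chain name is made up; default="true" makes it apply when no update.chain is specified):

  <!-- Sketch only: swallow explicit commits/optimizes from clients and return success,
       so that visibility is controlled by the autocommit settings instead -->
  <updateRequestProcessorChain name="ignore-commit-from-client" default="true">
    <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
      <!-- respond with 200 instead of an error when a commit is ignored -->
      <int name="statusCode">200</int>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.DistributedUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>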
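And for Erick's tip about soft commit intervals, the knobs live in the <updateHandler> section of solrconfig.xml. The intervals below are only placeholders; the point is to make the soft commit (which opens a new searcher) as infrequent as you can stand:

  <autoCommit>
    <!-- hard commit for durability / flushing segments; does not open a new searcher -->
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <!-- new searcher (and cache invalidation) at most every 5 minutes -->
    <maxTime>300000</maxTime>
  </autoSoftCommit>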
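If you do end up experimenting with <4>, the change is just the autowarmCount attribute on the existing cache definitions, e.g.:

  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="16"/>

but, as Erick notes, keep the count small -- and it will buy little as long as a new searcher is opened every few seconds.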