Hi,

We run a SolrCloud 6.4.2 cluster with ZooKeeper 3.4.6 on 3 VMs.
Each VM runs RHEL 7 with 16 GB RAM, 8 CPUs, and OpenJDK 1.8.0_131; each
VM has one Solr and one ZK instance.
The cluster hosts 1,000 collections; each collection has 1 shard and
between 500 and 50,000 documents.
Documents are indexed incrementally every day; the Solr client mostly does
searching.
Solr runs with -Xms7g -Xmx7g.

Everything worked fine for about a month, but a few days ago we
started seeing Solr timeouts: https://pastebin.com/raw/E2prSrQm

We have also always seen these warnings in the logs:
  PERFORMANCE WARNING: Overlapping onDeckSearchers=2


We are not sure what is causing the timeouts, although we have identified a
few things that could be improved:

1) Ignore explicit commits using IgnoreCommitOptimizeUpdateProcessorFactory
-- we are aware that explicit commits are expensive

2) Drop the 1,000 collections and use a single one instead (all our
collections use the same schema/solrconfig.xml) since stability problems
are expected when the number of collections reaches the low hundreds
<https://wiki.apache.org/solr/SolrPerformanceProblems#SolrCloud>. The
downside is that the new collection would contain 1,000,000 documents, which
may bring new challenges.

3) Tune the GC and possibly switch from CMS to G1, as G1 seems to perform
better according to this
<https://wiki.apache.org/solr/SolrPerformanceProblems#GC_pause_problems>,
this
<https://wiki.apache.org/solr/ShawnHeisey#G1_.28Garbage_First.29_Collector>
and this
<http://lucene.472066.n3.nabble.com/java-util-concurrent-TimeoutException-Idle-timeout-expired-50001-50000-ms-td4321209.html>.
The downside is that Lucene explicitly discourages the use of G1
<https://wiki.apache.org/lucene-java/JavaBugs#Java_Bugs_in_various_JVMs_affecting_Lucene_.2F_Solr>
so we are not sure what to expect. We use the default GC settings:
  -XX:NewRatio=3
  -XX:SurvivorRatio=4
  -XX:TargetSurvivorRatio=90
  -XX:MaxTenuringThreshold=8
  -XX:+UseConcMarkSweepGC
  -XX:+UseParNewGC
  -XX:ConcGCThreads=4
  -XX:ParallelGCThreads=4
  -XX:+CMSScavengeBeforeRemark
  -XX:PretenureSizeThreshold=64m
  -XX:+UseCMSInitiatingOccupancyOnly
  -XX:CMSInitiatingOccupancyFraction=50
  -XX:CMSMaxAbortablePrecleanTime=6000
  -XX:+CMSParallelRemarkEnabled
  -XX:+ParallelRefProcEnabled

4) Tune the caches, possibly by increasing autowarmCount on filterCache --
our current config is:
  <filterCache class="solr.FastLRUCache" size="512" initialSize="512"
               autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512"
                    autowarmCount="32"/>
  <documentCache class="solr.LRUCache" size="512" initialSize="512"
                 autowarmCount="0"/>

5) Tweak the timeout settings, although this would not fix the underlying
issue
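
For reference, here is roughly what we have in mind for options 1 and 4 in
solrconfig.xml. This is an untested sketch, not something we currently run:
the chain name, the statusCode value, the autoCommit/autoSoftCommit intervals
and the autowarmCount value are all placeholders picked for illustration. The
auto-commit part is our assumption that, once client commits are ignored,
something else has to make new documents visible.

```xml
<!-- Option 1: ignore explicit commit/optimize requests sent by clients.
     Chain name and statusCode are illustrative values. -->
<updateRequestProcessorChain name="ignore-commit-from-client" default="true">
  <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
    <int name="statusCode">200</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<!-- With client commits ignored, visibility relies on automatic commits.
     These two elements go inside <updateHandler>; intervals are assumptions. -->
<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>300000</maxTime>
</autoSoftCommit>

<!-- Option 4: same filterCache as today, with a nonzero autowarmCount.
     The value 16 is a guess, not a measured setting. -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="512"
             autowarmCount="16"/>
```

We assume that fewer, less frequent searcher openings would also reduce the
overlapping onDeckSearchers warnings, but we have not verified this.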


Do any of these options seem relevant? Is there anything else that might
address the timeouts?

Thanks
