I forgot some additional details: the Solr version is 5.0.0, and when one of the nodes enters recovery mode the leader says this:
The current zkClientTimeout is 15 seconds; I am going to try raising it to 30 seconds. The process is running like this:

/usr/lib/jvm/java-8-oracle/bin/java -server -Xss256k -Xms16g -Xmx16g
-XX:NewRatio=3 -XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90
-XX:MaxTenuringThreshold=8 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
-XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 -XX:+CMSScavengeBeforeRemark
-XX:PretenureSizeThreshold=64m -XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=50 -XX:CMSMaxAbortablePrecleanTime=6000
-XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -verbose:gc
-XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime -DzkClientTimeout=15000
-DzkHost=zk1.dawanda.services,zk2.dawanda.services,zk3.dawanda.services
-DSTOP.PORT=7983 -DSTOP.KEY=solrrocks -Djetty.port=8983 -Duser.timezone=UTC
-Djava.net.preferIPv4Stack=true -jar start.jar

I didn't set any special JVM params apart from the heap size (-Xms16g -Xmx16g).

On 23 September 2015 at 18:08, Erick Erickson <erickerick...@gmail.com> wrote:

> Wow, this is not expected at all. There's no
> way you should, on the face of it, get
> overlapping on-deck searchers.
>
> I recommend you put your maxWarmingSearchers
> back to 2; that's a fail-safe that is there to make
> people look at why they're warming a bunch of
> searchers at once.
>
> With those settings, it's saying that autowarming is
> taking over 10 minutes. This is absurdly long, so either
> something is pathologically wrong with your Solr
> or you're really committing more often than you think.
> Possibly you have a client issuing commits? You
> can look at your Solr logs and see commits, just
> look for the word "commit". When reading those lines,
> it'll say whether it has openSearcher true or false.
> Are the timestamps when openSearcher=true really
> 10 minutes apart?
>
> You'll also see autowarm times in your logs, see how
> long they really take. If they really take 10 minutes,
> we need to get to the bottom of that because the
> autowarm counts you're showing in your cache
> configurations don't indicate any problem here.
>
> Bottom line:
> 1> you shouldn't be seeing nodes go into recovery in the
> first place. Are your Solr logs showing any ERROR
> level messages?
>
> 2> it's extremely surprising that you're getting any
> overlapping on-deck searchers. If it turns out that
> your autowarming is really taking more than a few
> seconds, getting a stack trace to see where Solr is
> spending all the time is warranted.
>
> 3> Any clues from the logs _why_ they're going
> into recovery? Also look at your leader's log file
> and see if there are any messages about "leader
> initiated recovery". If you see that, then perhaps
> one of the timeouts is too short.
>
> 4> the tlog size is quite reasonable. It's only relevant
> when a node goes down for some reason anyway,
> so I wouldn't expend too much energy worrying about
> them until we get to the bottom of overlapping
> searchers and nodes going into recovery.
>
> BTW, nice job of laying out the relevant issues and
> adding supporting information! I wish more problem
> statements were as complete. If your Solr is 4.7.0,
> there was a memory problem and you should definitely
> go to 4.7.2. The symptom here is that you'll see
> Out of Memory errors...
>
> Best,
> Erick
>
> On Wed, Sep 23, 2015 at 8:48 AM, Lorenzo Fundaró
> <lorenzo.fund...@dawandamail.com> wrote:
>
> > Hi!,
> >
> > I keep getting nodes that fall into recovery mode and then issue the
> > following WARN log every 10 seconds:
> >
> > WARN Stopping recovery for core=xxxx coreNodeName=core_node7
> >
> > and sometimes this appears as well:
> >
> > PERFORMANCE WARNING: Overlapping onDeckSearchers=2
> >
> > At higher-traffic times this gets worse, and out of 4 nodes only 1 is up.
> >
> > I have 4 Solr nodes, each running two cores, A and B, of 13GB and 1.5GB
> > respectively. Core A gets a lot of index updates and higher query traffic
> > compared to core B. Core A is going through active/recovery/down states
> > very often.
> >
> > Nodes are coordinated via ZooKeeper; we have three, running on different
> > machines than Solr.
> >
> > Each machine has around 24 cores and between 38 and 48 GB of RAM, with
> > each Solr getting 16GB of heap memory.
> >
> > I read this article:
> >
> > https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> >
> > and decided to apply:
> >
> > <autoCommit>
> >   <!-- Every 15 seconds -->
> >   <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
> >   <openSearcher>false</openSearcher>
> > </autoCommit>
> >
> > and
> >
> > <autoSoftCommit>
> >   <!-- Every 10 minutes -->
> >   <maxTime>${solr.autoSoftCommit.maxTime:600000}</maxTime>
> > </autoSoftCommit>
> >
> > I also have these cache configurations:
> >
> > <filterCache class="solr.LFUCache"
> >              size="64"
> >              initialSize="64"
> >              autowarmCount="32"/>
> >
> > <queryResultCache class="solr.LRUCache"
> >                   size="512"
> >                   initialSize="512"
> >                   autowarmCount="0"/>
> >
> > <documentCache class="solr.LRUCache"
> >                size="1024"
> >                initialSize="1024"
> >                autowarmCount="0"/>
> >
> > <cache name="perSegFilter"
> >        class="solr.search.LRUCache"
> >        size="10"
> >        initialSize="0"
> >        autowarmCount="10"
> >        regenerator="solr.NoOpRegenerator" />
> >
> > <fieldValueCache class="solr.FastLRUCache"
> >                  size="512"
> >                  autowarmCount="0"
> >                  showItems="32" />
> >
> > I also have this:
> >
> > <maxWarmingSearchers>6</maxWarmingSearchers>
> >
> > The size of the tlogs is usually between 1MB and 8MB.
> >
> > I thought the changes above could improve the situation, but I am not 100%
> > convinced they did, since after 15 minutes one of the nodes entered
> > recovery mode again.
> >
> > Any ideas?
> >
> > Thanks in advance.
> >
> > Cheers!
> >
> > --
> > Lorenzo Fundaro
> > Backend Engineer
> > E-Mail: lorenzo.fund...@dawandamail.com
> >
> > Fax + 49 - (0)30 - 25 76 08 52
> > Tel + 49 - (0)179 - 51 10 982
> >
> > DaWanda GmbH
> > Windscheidstraße 18
> > 10627 Berlin
> >
> > Geschäftsführer: Claudia Helming, Michael Pütz
> > Amtsgericht Charlottenburg HRB 104695 B
>

--
Lorenzo Fundaro
Backend Engineer
E-Mail: lorenzo.fund...@dawandamail.com

Fax + 49 - (0)30 - 25 76 08 52
Tel + 49 - (0)179 - 51 10 982

DaWanda GmbH
Windscheidstraße 18
10627 Berlin

Geschäftsführer: Claudia Helming, Michael Pütz
Amtsgericht Charlottenburg HRB 104695 B
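
For reference, the zkClientTimeout change mentioned at the top of this mail would amount to restarting each node with the same flags as the running process, only with -DzkClientTimeout=30000 instead of 15000; something like the following (a sketch, not a tested command):

/usr/lib/jvm/java-8-oracle/bin/java -server -Xss256k -Xms16g -Xmx16g \
  -XX:NewRatio=3 -XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90 \
  -XX:MaxTenuringThreshold=8 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
  -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 -XX:+CMSScavengeBeforeRemark \
  -XX:PretenureSizeThreshold=64m -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:CMSInitiatingOccupancyFraction=50 -XX:CMSMaxAbortablePrecleanTime=6000 \
  -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -verbose:gc \
  -XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution \
  -XX:+PrintGCApplicationStoppedTime \
  -DzkClientTimeout=30000 \
  -DzkHost=zk1.dawanda.services,zk2.dawanda.services,zk3.dawanda.services \
  -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks -Djetty.port=8983 \
  -Duser.timezone=UTC -Djava.net.preferIPv4Stack=true -jar start.jar

If nodes still drop into recovery after that, the GC logging flags already in place (-XX:+PrintGCApplicationStoppedTime and friends) should show whether stop-the-world pauses are exceeding the ZooKeeper session timeout, which is one common trigger for leader-initiated recovery.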
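
And a rough sketch of the log checks Erick suggests; the log path and the exact message wording below are assumptions about this setup (adjust to the local log4j configuration), not something taken from the thread:

# Assumed log location for a start.jar-based install; adjust as needed.
LOG=logs/solr.log

# Commits that open a searcher: check how far apart the
# openSearcher=true timestamps really are (they should be roughly
# 10 minutes apart with the autoSoftCommit setting above).
grep "start commit" "$LOG" | grep "openSearcher=true"

# Count all commits, to spot a client issuing explicit commits more
# often than the autoCommit settings would suggest.
grep -c "start commit" "$LOG"

# Searcher open/registration events; the gap between an "Opening ... Searcher"
# line and the matching "Registered new searcher" line approximates the
# autowarming time.
grep -E "Opening .*Searcher|Registered new searcher" "$LOG"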