On 23 September 2015 at 18:08, Erick Erickson <erickerick...@gmail.com> wrote:
> Wow, this is not expected at all. There's no
> way you should, on the face of it, get
> overlapping on-deck searchers.
>
> I recommend you put your maxWarmingSearchers
> back to 2, that's a fail-safe that is there to make
> people look at why they're warming a bunch of
> searchers at once.

Ok, will do.

> With those settings, it's saying that autowarming is
> taking over 10 minutes.

What do you mean? I don't think the autowarming is taking 10 minutes here.
Do you maybe mean 10 seconds apart? Every 10 minutes a soft commit is
issued, which opens a new searcher and runs several autowarmings, one for
each cache. Looking at the stats, the slowest autowarm is the filter
cache's, and it takes around 2 seconds.

> This is absurdly long, so either
> something is pathologically wrong with your Solr
> or you're really committing more often than you think.
> Possibly you have a client issuing commits?

No, I checked my client and I am not issuing any commits. Here are some
logs from the leader:

INFO - 2015-09-23 16:21:59.803; org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
INFO - 2015-09-23 16:22:00.981; org.apache.solr.core.SolrDeletionPolicy; SolrDeletionPolicy.onCommit: commits: num=2
        commit{dir=NRTCachingDirectory(MMapDirectory@/srv/loveos/solr/server/solr/dawanda/data/index.20150818133228229 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@66cd8efb; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_ilp,generation=24109}
        commit{dir=NRTCachingDirectory(MMapDirectory@/srv/loveos/solr/server/solr/dawanda/data/index.20150818133228229 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@66cd8efb; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_ilq,generation=24110}
INFO - 2015-09-23 16:22:00.981; org.apache.solr.core.SolrDeletionPolicy; newest commit generation = 24110
INFO - 2015-09-23 16:22:01.010; org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
INFO - 2015-09-23 16:22:16.967; org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
INFO - 2015-09-23 16:22:18.452; org.apache.solr.core.SolrDeletionPolicy; SolrDeletionPolicy.onCommit: commits: num=2
        commit{dir=NRTCachingDirectory(MMapDirectory@/srv/loveos/solr/server/solr/dawanda/data/index.20150818133228229 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@66cd8efb; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_ilq,generation=24110}

> You can look at your Solr logs and see commits, just
> look for the word "commit". When reading those lines,
> it'll say whether it has openSearcher true or false.
> Are the timestamps when openSearcher=true really
> 10 minutes apart?

You mean 10 seconds apart?

> You'll also see autowarm times in your logs, see how
> long they really take. If they really take 10 minutes,
> we need to get to the bottom of that because the
> autowarm counts you're showing in your cache
> configurations don't indicate any problem here.
>
> Bottom line:
>
> 1> you shouldn't be seeing nodes go into recovery in the
> first place. Are your Solr logs showing any ERROR
> level messages?
>
> 2> it's extremely surprising that you're getting any
> overlapping on-deck searchers. If it turns out that
> your autowarming is really taking more than a few
> seconds, getting a stack trace to see where Solr is
> spending all the time is warranted.

Will do.
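In the meantime, the change from your first point is going out right away.
For completeness, this is the relevant solrconfig.xml line as I'm applying
it (nothing more than reverting our override of 6 back to the fail-safe
value you mention):

<!-- back to the fail-safe default; we previously overrode this to 6 -->
<maxWarmingSearchers>2</maxWarmingSearchers>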
> 3> Any clues from the logs _why_ they're going
> into recovery? Also look at your leader's log file
> and see if there are any messages about "leader
> initiated recovery". If you see that, then perhaps
> one of the timeouts is too short.
>
> 4> the tlog size is quite reasonable. It's only relevant
> when a node goes down for some reason anyway,
> so I wouldn't expend too much energy worrying about
> them until we get to the bottom of overlapping
> searchers and nodes going into recovery.
>
> BTW, nice job of laying out the relevant issues and
> adding supporting information! I wish more problem
> statements were as complete. If your Solr is 4.7.0,
> there was a memory problem and you should definitely
> go to 4.7.2. The symptom here is that you'll see
> Out of Memory errors...
>
> Best,
> Erick

Thank you very much!
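One question about the timeouts you mention in 3>: if "leader initiated
recovery" does show up in the leader's log, I assume the knobs to look at
would be the ZooKeeper and distributed-update timeouts in solr.xml? Ours
looks like the stock layout, roughly this (a sketch; the defaults shown
are what I believe ships with Solr and may differ between versions):

<solrcloud>
  <!-- ZooKeeper session timeout: a node that fails to heartbeat within
       this window has its session expire and is considered down -->
  <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
  <!-- socket/connect timeouts for updates forwarded from the leader
       to replicas; values that are too short can push replicas into
       recovery under load -->
  <int name="distribUpdateSoTimeout">${distribUpdateSoTimeout:600000}</int>
  <int name="distribUpdateConnTimeout">${distribUpdateConnTimeout:60000}</int>
</solrcloud>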
> On Wed, Sep 23, 2015 at 8:48 AM, Lorenzo Fundaró <
> lorenzo.fund...@dawandamail.com> wrote:
>
> > Hi!,
> >
> > I keep getting nodes that fall into recovery mode and then issue the
> > following WARN log entry every 10 seconds:
> >
> > WARN Stopping recovery for core=xxxx coreNodeName=core_node7
> >
> > and sometimes this appears as well:
> >
> > PERFORMANCE WARNING: Overlapping onDeckSearchers=2
> >
> > At higher-traffic times this gets worse, and out of 4 nodes only 1
> > stays up.
> >
> > I have 4 Solr nodes, each running two cores, A and B, of 13GB and
> > 1.5GB respectively. Core A gets a lot of index updates and higher
> > query traffic compared to core B. Core A goes through
> > active/recovery/down states very often.
> >
> > The nodes are coordinated via ZooKeeper; we have three, running on
> > different machines than the Solr nodes.
> >
> > Each machine has around 24 cores and between 38 and 48 GB of RAM, with
> > each Solr getting 16GB of heap memory.
> >
> > I read this article:
> >
> > https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> >
> > and decided to apply:
> >
> > <autoCommit>
> >   <!-- Every 15 seconds -->
> >   <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
> >   <openSearcher>false</openSearcher>
> > </autoCommit>
> >
> > and
> >
> > <autoSoftCommit>
> >   <!-- Every 10 minutes -->
> >   <maxTime>${solr.autoSoftCommit.maxTime:600000}</maxTime>
> > </autoSoftCommit>
> >
> > I also have these cache configurations:
> >
> > <filterCache class="solr.LFUCache"
> >              size="64"
> >              initialSize="64"
> >              autowarmCount="32"/>
> >
> > <queryResultCache class="solr.LRUCache"
> >                   size="512"
> >                   initialSize="512"
> >                   autowarmCount="0"/>
> >
> > <documentCache class="solr.LRUCache"
> >                size="1024"
> >                initialSize="1024"
> >                autowarmCount="0"/>
> >
> > <cache name="perSegFilter"
> >        class="solr.search.LRUCache"
> >        size="10"
> >        initialSize="0"
> >        autowarmCount="10"
> >        regenerator="solr.NoOpRegenerator" />
> >
> > <fieldValueCache class="solr.FastLRUCache"
> >                  size="512"
> >                  autowarmCount="0"
> >                  showItems="32" />
> >
> > I also have this:
> >
> > <maxWarmingSearchers>6</maxWarmingSearchers>
> >
> > The tlogs are usually between 1MB and 8MB in size.
> >
> > I thought the changes above would improve the situation, but I am not
> > 100% convinced they did, since after 15 minutes one of the nodes
> > entered recovery mode again.
> >
> > Any ideas?
> >
> > Thanks in advance.
> >
> > Cheers!

--
Lorenzo Fundaro
Backend Engineer
E-Mail: lorenzo.fund...@dawandamail.com

Fax +49 - (0)30 - 25 76 08 52
Tel +49 - (0)179 - 51 10 982

DaWanda GmbH
Windscheidstraße 18
10627 Berlin

Geschäftsführer: Claudia Helming, Michael Pütz
Amtsgericht Charlottenburg HRB 104695 B