On 23 September 2015 at 18:08, Erick Erickson <erickerick...@gmail.com> wrote:
> Wow, this is not expected at all. There's no
> way you should, on the face of it, get
> overlapping on-deck searchers.
>
> I recommend you put your maxWarmingSearchers
> back to 2, that's a fail-safe that is there to make
> people look at why they're warming a bunch of
> searchers at once.

Ok, will do.

> With those settings, it's saying that autowarming is
> taking over 10 minutes.

What do you mean? I don't think the autowarming is taking 10 minutes here.
Do you maybe mean 10 seconds apart? Every 10 minutes a soft commit is
issued, which opens a new searcher and runs several autowarmings, one for
each cache. Looking at the stats, the slowest autowarm is the filter
cache's, and it takes around 2 seconds.

> This is absurdly long, so either
> something is pathologically wrong with your Solr
> or you're really committing more often than you think.
> Possibly you have a client issuing commits?

No, I checked my client and I am not issuing any commits. Here are some
logs from the leader:

INFO - 2015-09-23 16:21:59.803; org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
INFO - 2015-09-23 16:22:00.981; org.apache.solr.core.SolrDeletionPolicy; SolrDeletionPolicy.onCommit: commits: num=2
        commit{dir=NRTCachingDirectory(MMapDirectory@/srv/loveos/solr/server/solr/dawanda/data/index.20150818133228229 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@66cd8efb; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_ilp,generation=24109}
        commit{dir=NRTCachingDirectory(MMapDirectory@/srv/loveos/solr/server/solr/dawanda/data/index.20150818133228229 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@66cd8efb; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_ilq,generation=24110}
INFO - 2015-09-23 16:22:00.981; org.apache.solr.core.SolrDeletionPolicy; newest commit generation = 24110
INFO - 2015-09-23 16:22:01.010; org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
INFO - 2015-09-23 16:22:16.967; org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
INFO - 2015-09-23 16:22:18.452; org.apache.solr.core.SolrDeletionPolicy; SolrDeletionPolicy.onCommit: commits: num=2
        commit{dir=NRTCachingDirectory(MMapDirectory@/srv/loveos/solr/server/solr/dawanda/data/index.20150818133228229 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@66cd8efb; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_ilq,generation=24110}

> You can look at your Solr logs and see commits, just
> look for the word "commit". When reading those lines,
> it'll say whether it has openSearcher true or false.
> Are the timestamps when openSearcher=true really
> 10 minutes apart?

You mean 10 seconds apart?

> You'll also see autowarm times in your logs, see how
> long they really take. If they really take 10 minutes,
> we need to get to the bottom of that because the
> autowarm counts you're showing in your cache
> configurations don't indicate any problem here.
>
> Bottom line:
>
> 1> you shouldn't be seeing nodes go into recovery in the
> first place. Are your Solr logs showing any ERROR
> level messages?
>
> 2> it's extremely surprising that you're getting any
> overlapping on-deck searchers. If it turns out that
> your autowarming is really taking more than a few
> seconds, getting a stack trace to see where Solr is
> spending all the time is warranted.

Will do.
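In the meantime, the change from your first point is going out right away.
For completeness, this is the relevant solrconfig.xml line as I'm applying
it (nothing more than reverting our override of 6 back to the fail-safe
value you mention):

<!-- back to the fail-safe default; we previously overrode this to 6 -->
<maxWarmingSearchers>2</maxWarmingSearchers>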
> 3> Any clues from the logs _why_ they're going
> into recovery? Also look at your leader's log file
> and see if there are any messages about "leader
> initiated recovery". If you see that, then perhaps
> one of the timeouts is too short.
>
> 4> the tlog size is quite reasonable. It's only relevant
> when a node goes down for some reason anyway,
> so I wouldn't expend too much energy worrying about
> them until we get to the bottom of overlapping
> searchers and nodes going into recovery.
>
> BTW, nice job of laying out the relevant issues and
> adding supporting information! I wish more problem
> statements were as complete. If your Solr is 4.7.0,
> there was a memory problem and you should definitely
> go to 4.7.2. The symptom here is that you'll see
> Out of Memory errors...
>
> Best,
> Erick

Thank you very much!
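One question about the timeouts you mention in 3>: if "leader initiated
recovery" does show up in the leader's log, I assume the knobs to look at
would be the ZooKeeper and distributed-update timeouts in solr.xml? Ours
looks like the stock layout, roughly this (a sketch; the defaults shown
are what I believe ships with Solr and may differ between versions):

<solrcloud>
  <!-- ZooKeeper session timeout: a node that fails to heartbeat within
       this window has its session expire and is considered down -->
  <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
  <!-- socket/connect timeouts for updates forwarded from the leader
       to replicas; values that are too short can push replicas into
       recovery under load -->
  <int name="distribUpdateSoTimeout">${distribUpdateSoTimeout:600000}</int>
  <int name="distribUpdateConnTimeout">${distribUpdateConnTimeout:60000}</int>
</solrcloud>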
> On Wed, Sep 23, 2015 at 8:48 AM, Lorenzo Fundaró <
> lorenzo.fund...@dawandamail.com> wrote:
>
> > Hi!,
> >
> > I keep getting nodes that fall into recovery mode and then issue the
> > following WARN log entry every 10 seconds:
> >
> > WARN Stopping recovery for core=xxxx coreNodeName=core_node7
> >
> > and sometimes this appears as well:
> >
> > PERFORMANCE WARNING: Overlapping onDeckSearchers=2
> >
> > At higher-traffic times this gets worse, and out of 4 nodes only 1
> > stays up.
> >
> > I have 4 Solr nodes, each running two cores, A and B, of 13GB and
> > 1.5GB respectively. Core A gets a lot of index updates and higher
> > query traffic compared to core B. Core A goes through
> > active/recovery/down states very often.
> >
> > The nodes are coordinated via ZooKeeper; we have three, running on
> > different machines than the Solr nodes.
> >
> > Each machine has around 24 cores and between 38 and 48 GB of RAM, with
> > each Solr getting 16GB of heap memory.
> >
> > I read this article:
> >
> > https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> >
> > and decided to apply:
> >
> > <autoCommit>
> >   <!-- Every 15 seconds -->
> >   <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
> >   <openSearcher>false</openSearcher>
> > </autoCommit>
> >
> > and
> >
> > <autoSoftCommit>
> >   <!-- Every 10 minutes -->
> >   <maxTime>${solr.autoSoftCommit.maxTime:600000}</maxTime>
> > </autoSoftCommit>
> >
> > I also have these cache configurations:
> >
> > <filterCache class="solr.LFUCache"
> >              size="64"
> >              initialSize="64"
> >              autowarmCount="32"/>
> >
> > <queryResultCache class="solr.LRUCache"
> >                   size="512"
> >                   initialSize="512"
> >                   autowarmCount="0"/>
> >
> > <documentCache class="solr.LRUCache"
> >                size="1024"
> >                initialSize="1024"
> >                autowarmCount="0"/>
> >
> > <cache name="perSegFilter"
> >        class="solr.search.LRUCache"
> >        size="10"
> >        initialSize="0"
> >        autowarmCount="10"
> >        regenerator="solr.NoOpRegenerator" />
> >
> > <fieldValueCache class="solr.FastLRUCache"
> >                  size="512"
> >                  autowarmCount="0"
> >                  showItems="32" />
> >
> > I also have this:
> >
> > <maxWarmingSearchers>6</maxWarmingSearchers>
> >
> > The tlogs are usually between 1MB and 8MB in size.
> >
> > I thought the changes above would improve the situation, but I am not
> > 100% convinced they did, since after 15 minutes one of the nodes
> > entered recovery mode again.
> >
> > Any ideas?
> >
> > Thanks in advance.
> >
> > Cheers!

--
Lorenzo Fundaro
Backend Engineer
E-Mail: lorenzo.fund...@dawandamail.com

Fax +49 - (0)30 - 25 76 08 52
Tel +49 - (0)179 - 51 10 982

DaWanda GmbH
Windscheidstraße 18
10627 Berlin

Geschäftsführer: Claudia Helming, Michael Pütz
Amtsgericht Charlottenburg HRB 104695 B