I forgot to mention some additional details:

The Solr version is 5.0.0, and when one of the nodes enters recovery mode
the leader says this:



The current zkClientTimeout is 15 seconds. I am going to try increasing it to
30 seconds. The process is running like this:

/usr/lib/jvm/java-8-oracle/bin/java -server -Xss256k -Xms16g -Xmx16g
-XX:NewRatio=3 -XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90
-XX:MaxTenuringThreshold=8 -XX:+UseConcMarkSweepGC
-XX:+UseParNewGC -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4
-XX:+CMSScavengeBeforeRemark -XX:PretenureSizeThreshold=64m
-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=50
-XX:CMSMaxAbortablePrecleanTime=6000 -XX:+CMSParallelRemarkEnabled
-XX:+ParallelRefProcEnabled -verbose:gc -XX:+PrintHeapAtGC
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
-XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime
-DzkClientTimeout=15000
-DzkHost=zk1.dawanda.services,zk2.dawanda.services,zk3.dawanda.services
-DSTOP.PORT=7983 -DSTOP.KEY=solrrocks -Djetty.port=8983 -Duser.timezone=UTC
-Djava.net.preferIPv4Stack=true -jar start.jar

I didn't set up any special JVM params except for -Xms16g and -Xmx16g.
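
For the timeout change, the only flag that should differ from the command
above is the zkClientTimeout system property, i.e. something like:

-DzkClientTimeout=30000

with everything else left as is.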



On 23 September 2015 at 18:08, Erick Erickson <erickerick...@gmail.com>
wrote:

> Wow, this is not expected at all. There's no
> way you should, on the face of it, get
> overlapping on-deck searchers.
>
> I recommend you put your maxWarmingSearchers
> back to 2; that's a fail-safe that is there to make
> people look at why they're warming a bunch of
> searchers at once.
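> In solrconfig.xml that's the same element you currently have at 6, i.e.
> something like:
>
>   <maxWarmingSearchers>2</maxWarmingSearchers>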
>
> With those settings, it's saying that autowarming is
> taking over 10 minutes. This is absurdly long, so either
> something is pathologically wrong with your Solr
> or you're really committing more often than you think.
> Possibly you have a client issuing commits? You
> can look at your Solr logs and see commits, just
> look for the word "commit". When reading those lines,
> it'll say whether it has openSearcher true or false.
> Are the timestamps when openSearcher=true really
> 10 minutes apart?
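> For example, something along these lines (the path is just a guess, so
> point it at wherever your solr.log actually lives):
>
>   grep -i "commit" server/logs/solr.log | grep "openSearcher=true"
>
> and then compare the timestamps on the matching lines.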
>
> You'll also see autowarm times in your logs, see how
> long they really take. If they really take 10 minutes,
> we need to get to the bottom of that because the
> autowarm counts you're showing in your cache
> configurations don't indicate any problem here.
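> A rough way to eyeball that (exact log wording aside) is something like:
>
>   grep -i "autowarming" server/logs/solr.log
>
> and compare those timestamps against the subsequent "Registered new
> searcher" entries.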
>
> Bottom line:
> 1> you shouldn't be seeing nodes go into recovery in the
> first place. Are your Solr logs showing any ERROR
> level messages?
>
> 2> it's extremely surprising that you're getting any
> overlapping on-deck searchers. If it turns out that
> your autowarming is really taking more than a few
> seconds, getting a stack trace to see where Solr is
> spending all the time is warranted.
>
> 3> Any clues from the logs _why_ they're going
> into recovery? Also look at your leader's log file
> and see if there are any messages about "leader
> initiated recovery". If you see that, then perhaps
> one of the timeouts is too short.
>
> 4> the tlog size is quite reasonable. It's only relevant
> when a node goes down for some reason anyway,
> so I wouldn't expend too much energy worrying about
> them until we get to the bottom of overlapping
> searchers and nodes going into recovery.
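> For 1> and 3>, a quick pass over the logs could look something like this
> (the log path is a guess, so use your actual solr.log, and for 3> run it
> against the leader's log; since the exact recovery message wording can
> vary, grepping broadly on "recovery" is a reasonable start):
>
>   grep "ERROR" server/logs/solr.log
>   grep -i "recovery" server/logs/solr.log
>
> And for the stack trace in 2>, jstack against the Solr pid is usually
> enough:
>
>   jstack <solr-pid> > /tmp/solr-threads.txt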
>
> BTW, nice job of laying out the relevant issues and
> adding supporting information! I wish more problem
> statements were as complete. If your Solr is 4.7.0,
> there was a memory problem and you should definitely
> go to 4.7.2. The symptom here is that you'll see
> Out of Memory errors...
>
>
> Best,
> Erick
>
> On Wed, Sep 23, 2015 at 8:48 AM, Lorenzo Fundaró <
> lorenzo.fund...@dawandamail.com> wrote:
>
> > Hi !,
> >
> > I keep getting nodes that fall into recovery mode and then issue the
> > following log WARN every 10 seconds:
> >
> > WARN   Stopping recovery for core=xxxx coreNodeName=core_node7
> >
> > and sometimes this appears as well:
> >
> > PERFORMANCE WARNING: Overlapping onDeckSearchers=2
> >
> > At higher traffic times this gets worse, and out of 4 nodes only 1 is up.
> > I have 4 Solr nodes, each running two cores, A and B, of 13GB and 1.5GB
> > respectively. Core A gets a lot of index updates and higher query traffic
> > compared to core B. Core A is going through active/recovery/down states
> > very often.
> > Nodes are coordinated via ZooKeeper; we have three, running on different
> > machines than Solr.
> > Each machine has around 24 cores and between 38 and 48 GB of RAM, with each
> > Solr getting 16GB of heap memory.
> > I read this article:
> >
> > https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> >
> > and decided to apply:
> >
> >      <autoCommit>
> >        <!-- Every 15 seconds -->
> >        <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
> >        <openSearcher>false</openSearcher>
> >      </autoCommit>
> >
> > and
> >
> >      <autoSoftCommit>
> >        <!-- Every 10 minutes -->
> >        <maxTime>${solr.autoSoftCommit.maxTime:600000}</maxTime>
> >      </autoSoftCommit>
> >
> > I also have these cache configurations:
> >
> >     <filterCache class="solr.LFUCache"
> >                  size="64"
> >                  initialSize="64"
> >                  autowarmCount="32"/>
> >
> >     <queryResultCache class="solr.LRUCache"
> >                      size="512"
> >                      initialSize="512"
> >                      autowarmCount="0"/>
> >
> >     <documentCache class="solr.LRUCache"
> >                    size="1024"
> >                    initialSize="1024"
> >                    autowarmCount="0"/>
> >
> >     <cache name="perSegFilter"
> >       class="solr.search.LRUCache"
> >       size="10"
> >       initialSize="0"
> >       autowarmCount="10"
> >       regenerator="solr.NoOpRegenerator" />
> >
> >        <fieldValueCache class="solr.FastLRUCache"
> >                         size="512"
> >                         autowarmCount="0"
> >                         showItems="32" />
> >
> > I also have this:
> > <maxWarmingSearchers>6</maxWarmingSearchers>
> > The sizes of the tlogs are usually between 1MB and 8MB.
> > I thought the changes above could improve the situation, but I am not 100%
> > convinced they did, since after 15 min one of the nodes entered recovery
> > mode again.
> >
> > any ideas ?
> >
> > Thanks in advance.
> >
> > Cheers !
> >
>



-- 

-- 
Lorenzo Fundaro
Backend Engineer
E-Mail: lorenzo.fund...@dawandamail.com

Fax       + 49 - (0)30 - 25 76 08 52
Tel        + 49 - (0)179 - 51 10 982

DaWanda GmbH
Windscheidstraße 18
10627 Berlin

Geschäftsführer: Claudia Helming, Michael Pütz
Amtsgericht Charlottenburg HRB 104695 B
