bq: 9/23/2015, 7:14:49 PM WARN ZkController Leader is publishing core=dawanda
coreNodeName=core_node10 state=down on behalf of un-reachable replica

OK, this brings up a different possibility. If you happen to be indexing at
a very high rate, the followers can get so busy that they use up a great
many threads and stop responding to the leader. See:
https://lucidworks.com/blog/indexing-performance-solr-5-2-now-twice-fast/
where Tim Potter describes the problem. I've seen this happen in cases
where 150 client threads were indexing to a Solr cluster at once, so for
this to be the issue your indexing rate would have to be very high.

Another possibility is that your servlet container doesn't have enough
threads available to service incoming requests. It would be interesting to
see how many threads are active on the leader, perhaps with jstack. It's
quite common to see dozens of threads on a leader while indexing, but
hundreds of threads is not common.
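
For reference, the jetty.xml that ships with Solr configures the request
thread pool roughly like this (a sketch from memory, assuming the stock
Jetty bundled with Solr 5.x; check your own install for the real values):

    <New id="threadpool" class="org.eclipse.jetty.util.thread.QueuedThreadPool">
      <Set name="minThreads">10</Set>
      <Set name="maxThreads">10000</Set>
    </New>

If maxThreads has been lowered from that kind of default, requests back up
in the container, and that can surface as exactly the pool timeouts you
pasted.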

Best,
Erick

On Wed, Sep 23, 2015 at 10:40 AM, Lorenzo Fundaró <lorenzo.fund...@dawandamail.com> wrote:

> Here are the logs that didn't make it through in the image (sorry for the
> misalignment in the logs):
>
> 9/23/2015, 7:14:49 PM ERROR StreamingSolrClients error
>
> org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting
> for connection from pool
>         at
> org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226)
>         at
> org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195)
>         at
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:422)
>         at
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
>         at
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>         at
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
>         at
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
>         at
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:233)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
>
> 9/23/2015, 7:14:49 PM WARN DistributedUpdateProcessor Error sending update to
> http://solr8.dawanda.services:8983/solr
>
> org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting
> for connection from pool
>         at
> org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226)
>         at
> org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195)
>         at
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:422)
>         at
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
>         at
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>         at
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
>         at
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
>         at
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:233)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
>
> 9/23/2015, 7:14:49 PM WARN ZkController Leader is publishing core=dawanda
> coreNodeName=core_node10 state=down on behalf of un-reachable replica
> http://solr8.dawanda.services:8983/solr/dawanda/; forcePublishState? false
>
> 9/23/2015, 7:14:49 PM ERROR DistributedUpdateProcessor Setting up to try to
> start recovery on replica http://solr8.dawanda.services:8983/solr/dawanda/
> after:
> org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for
> connection from pool
>
> 9/23/2015, 7:14:49 PM ERROR StreamingSolrClients error
>
> 9/23/2015, 7:14:49 PM WARN DistributedUpdateProcessor Error sending update to
> http://solr8.dawanda.services:8983/solr
>
>
> I am pretty sure I saw the overlapping searchers, but let's set that aside for
> now, since it hasn't appeared again. Maybe it was an old log entry that I
> mixed up.
>
> I am gonna check the GC logs, and get back with some information.
>
> From the logs, I think the obvious action to take would be to increase
> zkClientTimeout to 30s (it is currently 15s).
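>
> If I read the docs right, that is set in solr.xml, something like this
> (just a sketch; the property name and default assume the stock solr.xml
> layout):
>
>     <solrcloud>
>       <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
>     </solrcloud>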
>
> On 23 September 2015 at 18:43, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > bq: and when one of the nodes enter recovery mode the leader says this:
> >
> > Hmmm, nothing came through, the mail filter is pretty aggressive
> > about stripping attachments though.
> >
> > bq: You mean 10 seconds apart ?
> >
> > Hmmm, no, I mean 10 minutes. That would
> > explain the overlapping searchers, since the only
> > time you should be opening a new searcher is
> > when you do a soft commit. But from the log
> > fragment you pasted, that isn't happening, so
> > I'm totally mystified as to why you were getting those.
> > Or was that some time ago, and things have
> > changed since then? I don't
> > understand why you'd be seeing overlapping
> > searchers, so I'm grasping at straws. Given
> > what you've added, I suspect this is totally
> > off base.
> >
> > I see you're printing GC stats; are any of those
> > GC times exceeding 15 seconds for stop-the-world
> > pauses? That can trigger ZK marking the node as
> > down and the node going into recovery. With
> > 16G heaps that is unlikely, but you can check in
> > your logs. If you are getting long stop-the-world
> > pauses, perhaps you're very close to your JVM limit?
> > In _some_ cases I've seen adding more memory
> > decrease GC thrashing; the scenario is this:
> >
> > You're running very close to the limit of your JVM
> > and each collection only reclaims a very little bit of
> > memory (say 0.05%), so it goes immediately into
> > another GC cycle. This is totally a shot in the dark,
> > though.
> >
> > Anyway, it's mystifying so far, but from everything
> > you've written your system should not be going into
> > recovery and you shouldn't be seeing
> > overlapping searchers, so I'm missing something.
> >
> > Best,
> > Erick
> >
> > On Wed, Sep 23, 2015 at 9:25 AM, Lorenzo Fundaró <
> > lorenzo.fund...@dawandamail.com> wrote:
> >
> > > On 23 September 2015 at 18:08, Erick Erickson <erickerick...@gmail.com>
> > > wrote:
> > >
> > > > Wow, this is not expected at all. There's no
> > > > way you should, on the face of it, get
> > > > overlapping on-deck searchers.
> > > >
> > > > I recommend you put your maxWarmingSearchers
> > > > back to 2; that's a fail-safe that is there to make
> > > > people look at why they're warming a bunch of
> > > > searchers at once.
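> > > >
> > > > In solrconfig.xml that is the same element you already have, just set
> > > > back to 2:
> > > >
> > > >   <maxWarmingSearchers>2</maxWarmingSearchers>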
> > > >
> > >
> > > Ok, will do.
> > >
> > >
> > > >
> > > > With those settings, it's saying that autowarming is
> > > > taking over 10 minutes.
> > >
> > >
> > > What do you mean? I don't think the autowarming is taking 10 minutes
> > > here. Do you maybe mean 10 seconds apart?
> > > Every 10 minutes I issue a soft commit, which creates a new searcher
> > > with several autowarms, one for each cache. Looking at the stats, the
> > > slowest autowarm is the filter cache, and it takes around 2 seconds.
> > >
> > >
> > > > This is absurdly long, so either
> > > > something is pathologically wrong with your Solr
> > > > or you're really committing more often than you think.
> > > > Possibly you have a client issuing commits?
> > >
> > >
> > > Hmm, no. I checked my client and it is not issuing any commits.
> > > Here are some logs from the leader:
> > >
> > > INFO  - 2015-09-23 16:21:59.803; org.apache.solr.update.DirectUpdateHandler2; start
> > > commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> > > INFO  - 2015-09-23 16:22:00.981; org.apache.solr.core.SolrDeletionPolicy; SolrDeletionPolicy.onCommit: commits: num=2
> > > commit{dir=NRTCachingDirectory(MMapDirectory@/srv/loveos/solr/server/solr/dawanda/data/index.20150818133228229 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@66cd8efb; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_ilp,generation=24109}
> > > commit{dir=NRTCachingDirectory(MMapDirectory@/srv/loveos/solr/server/solr/dawanda/data/index.20150818133228229 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@66cd8efb; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_ilq,generation=24110}
> > > INFO  - 2015-09-23 16:22:00.981; org.apache.solr.core.SolrDeletionPolicy; newest commit generation = 24110
> > > INFO  - 2015-09-23 16:22:01.010; org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
> > > INFO  - 2015-09-23 16:22:16.967; org.apache.solr.update.DirectUpdateHandler2; start
> > > commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> > > INFO  - 2015-09-23 16:22:18.452; org.apache.solr.core.SolrDeletionPolicy; SolrDeletionPolicy.onCommit: commits: num=2
> > > commit{dir=NRTCachingDirectory(MMapDirectory@/srv/loveos/solr/server/solr/dawanda/data/index.20150818133228229 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@66cd8efb; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_ilq,generation=24110}
> > >
> > > > You can look at your Solr logs and see commits; just
> > > > look for the word "commit". When reading those lines,
> > > > it'll say whether it has openSearcher true or false.
> > > > Are the timestamps when openSearcher=true really
> > > > 10 minutes apart?
> > > >
> > >
> > > You mean 10 seconds apart?
> > >
> > >
> > > >
> > > > You'll also see autowarm times in your logs; see how
> > > > long they really take. If they really take 10 minutes,
> > > > we need to get to the bottom of that because the
> > > > autowarm counts you're showing in your cache
> > > > configurations don't indicate any problem here.
> > > >
> > > > Bottom line:
> > > > 1> you shouldn't be seeing nodes go into recovery in the
> > > > first place. Are your Solr logs showing any ERROR
> > > > level messages?
> > > >
> > > > 2> it's extremely surprising that you're getting any
> > > > overlapping on-deck searchers. If it turns out that
> > > > your autowarming is really taking more than a few
> > > > seconds, getting a stack trace to see where Solr is
> > > > spending all the time is warranted.
> > > >
> > >
> > > Will do.
> > >
> > > >
> > > > 3> Any clues from the logs _why_ they're going
> > > > into recovery? Also look at your leader's log file
> > > > and see if there are any messages about "leader
> > > > initiated recovery". If you see that, then perhaps
> > > > one of the timeouts is too short.
> > > >
> > > > 4> the tlog sizes are quite reasonable. They're only relevant
> > > > when a node goes down for some reason anyway,
> > > > so I wouldn't expend too much energy worrying about
> > > > them until we get to the bottom of overlapping
> > > > searchers and nodes going into recovery.
> > > >
> > > > BTW, nice job of laying out the relevant issues and
> > > > adding supporting information! I wish more problem
> > > > statements were as complete. If your Solr is 4.7.0,
> > > > there was a memory problem and you should definitely
> > > > go to 4.7.2. The symptom here is that you'll see
> > > > Out of Memory errors...
> > > >
> > > >
> > > > Best,
> > > > Erick
> > >
> > >
> > > Thank you very much!
> > >
> > > >
> > >
> > >
> > > > On Wed, Sep 23, 2015 at 8:48 AM, Lorenzo Fundaró <
> > > > lorenzo.fund...@dawandamail.com> wrote:
> > > >
> > > > > Hi!
> > > > >
> > > > > I keep getting nodes that fall into recovery mode and then issue the
> > > > > following WARN log every 10 seconds:
> > > > >
> > > > > WARN   Stopping recovery for core=xxxx coreNodeName=core_node7
> > > > >
> > > > > and sometimes this appears as well:
> > > > >
> > > > > PERFORMANCE WARNING: Overlapping onDeckSearchers=2
> > > > >
> > > > > At higher-traffic times this gets worse, and out of 4 nodes only 1 is up.
> > > > > I have 4 Solr nodes, each running two cores, A and B, of 13GB and 1.5GB
> > > > > respectively. Core A gets a lot of index updates and higher query traffic
> > > > > compared to core B. Core A goes through active/recovery/down states
> > > > > very often.
> > > > > The nodes are coordinated via ZooKeeper; we have three ZooKeeper instances,
> > > > > running on different machines than Solr.
> > > > > Each machine has around 24 CPU cores and between 38 and 48 GB of RAM, with
> > > > > each Solr getting 16GB of heap.
> > > > > I read this article:
> > > > > https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> > > > >
> > > > > and decided to apply:
> > > > >
> > > > >      <autoCommit>
> > > > >        <!-- Every 15 seconds -->
> > > > >        <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
> > > > >        <openSearcher>false</openSearcher>
> > > > >      </autoCommit>
> > > > >
> > > > > and
> > > > >
> > > > >      <autoSoftCommit>
> > > > >        <!-- Every 10 minutes -->
> > > > >        <maxTime>${solr.autoSoftCommit.maxTime:600000}</maxTime>
> > > > >      </autoSoftCommit>
> > > > >
> > > > > I also have these cache configurations:
> > > > >
> > > > >     <filterCache class="solr.LFUCache"
> > > > >                  size="64"
> > > > >                  initialSize="64"
> > > > >                  autowarmCount="32"/>
> > > > >
> > > > >     <queryResultCache class="solr.LRUCache"
> > > > >                      size="512"
> > > > >                      initialSize="512"
> > > > >                      autowarmCount="0"/>
> > > > >
> > > > >     <documentCache class="solr.LRUCache"
> > > > >                    size="1024"
> > > > >                    initialSize="1024"
> > > > >                    autowarmCount="0"/>
> > > > >
> > > > >     <cache name="perSegFilter"
> > > > >       class="solr.search.LRUCache"
> > > > >       size="10"
> > > > >       initialSize="0"
> > > > >       autowarmCount="10"
> > > > >       regenerator="solr.NoOpRegenerator" />
> > > > >
> > > > >        <fieldValueCache class="solr.FastLRUCache"
> > > > >                         size="512"
> > > > >                         autowarmCount="0"
> > > > >                         showItems="32" />
> > > > >
> > > > > I also have this:
> > > > > <maxWarmingSearchers>6</maxWarmingSearchers>
> > > > > The tlogs are usually between 1MB and 8MB in size.
> > > > > I thought the changes above would improve the situation, but I am not 100%
> > > > > convinced they did, since after 15 minutes one of the nodes entered
> > > > > recovery mode again.
> > > > >
> > > > > Any ideas?
> > > > >
> > > > > Thanks in advance.
> > > > >
> > > > > Cheers!
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > >
> >
>
>
>
> --
>
> --
> Lorenzo Fundaro
> Backend Engineer
> E-Mail: lorenzo.fund...@dawandamail.com
>
> Fax       + 49 - (0)30 - 25 76 08 52
> Tel        + 49 - (0)179 - 51 10 982
>
> DaWanda GmbH
> Windscheidstraße 18
> 10627 Berlin
>
> Geschäftsführer: Claudia Helming, Michael Pütz
> Amtsgericht Charlottenburg HRB 104695 B
>
