RE: 7.2.1 cluster dies within minutes after restart

Markus Jelsma Wed, 31 Jan 2018 03:13:48 -0800

Ah thanks, I just submitted a patch fixing it.

Anyway, in the end it appears this is not the problem we are seeing as our 
timeouts were already at 30 seconds.


All I know is that at some point nodes start to lose their ZK connections due to 
timeouts (the logs say so, but every reported gap is under 30 seconds). The logs 
are flooded with messages like these:
o.a.z.ClientCnxn Client session timed out, have not heard from server in 
10359ms for sessionid 0x160f9e723c12122
o.a.z.ClientCnxn Unable to reconnect to ZooKeeper service, session 
0x60f9e7234f05bb has expired

Then heap usage doubles and nodes become unresponsive, die, etc.

We also see those messages in other collections, but not so frequently and they 
don't cause failure in those less loaded clusters.

Ideas?

Thanks,
Markus
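
The gaps reported in log lines like the ones quoted above can be compared against the configured timeout with a small script. This is only a sketch under an assumption the thread does not state: that the ZooKeeper client logs this message once it has heard nothing for about two thirds of the negotiated session timeout. Please verify that against the ZooKeeper version in use. The function names and the 30000 default are illustrative.

```python
import re

# Matches the ZooKeeper client message quoted in this thread.
TIMEOUT_RE = re.compile(
    r"Client session timed out, have not heard from server in\s+(\d+)ms"
)


def silent_gaps_ms(log_text):
    """Return every 'have not heard from server in Nms' value, in order."""
    # Mail clients and log viewers wrap long lines, so collapse whitespace first.
    flat = " ".join(log_text.split())
    return [int(m.group(1)) for m in TIMEOUT_RE.finditer(flat)]


def max_session_timeout_ms(gap_ms):
    """Upper bound on the session timeout implied by one reported gap.

    ASSUMPTION: read timeout = session timeout * 2 / 3, so a gap of N ms
    means the session timeout was at most N * 3 / 2.
    """
    return gap_ms * 3 // 2


def summarize(log_text, configured_ms=30000):
    """Pair each gap with whether it is consistent with configured_ms."""
    threshold = configured_ms * 2 // 3
    return [(gap, gap >= threshold) for gap in silent_gaps_ms(log_text)]
```

Under that assumption, a gap of 10359ms implies a session timeout of at most 15538ms, which would fit a 15 second timeout better than a 30 second one, while 22130ms is consistent with 30 seconds.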

-----Original message-----
> From:Michael Braun <[email protected]>
> Sent: Monday 29th January 2018 21:09
> To: [email protected]
> Subject: Re: 7.2.1 cluster dies within minutes after restart
> 
> Believe this is reported in https://issues.apache.org/jira/browse/SOLR-10471
> 
> 
> On Mon, Jan 29, 2018 at 2:55 PM, Markus Jelsma <[email protected]>
> wrote:
> 
> > Hello SG,
> >
> > The default in solr.in.sh is commented out, so it falls back to the value set in
> > bin/solr, which is fifteen seconds. Just uncomment the setting in
> > solr.in.sh and your timeout will be thirty seconds.
> >
> > For Solr itself to really default to thirty seconds, Solr's bin/solr needs
> > to be patched to use the correct value.
> >
> > Regards,
> > Markus
> >
> > -----Original message-----
> > > From:S G <[email protected]>
> > > Sent: Monday 29th January 2018 20:15
> > > To: [email protected]
> > > Subject: Re: 7.2.1 cluster dies within minutes after restart
> > >
> > > Hi Markus,
> > >
> > > We are in the process of upgrading our clusters to 7.2.1 and I am not sure
> > > I quite follow the conversation here.
> > > Is there a simple workaround to set the ZK_CLIENT_TIMEOUT to a higher value
> > > in the config (and it's just a default value being wrong/overridden
> > > somewhere)?
> > > Or is it more severe in the sense that any config set for ZK_CLIENT_TIMEOUT
> > > by the user is just ignored completely by Solr in 7.2.1?
> > >
> > > Thanks
> > > SG
> > >
> > >
> > > On Mon, Jan 29, 2018 at 3:09 AM, Markus Jelsma <[email protected]>
> > > wrote:
> > >
> > > > Ok, I applied the patch and it is clear the timeout is 15000. Solr.xml
> > > > says 30000 if ZK_CLIENT_TIMEOUT is not set, which is by default unset in
> > > > solr.in.sh, but set in bin/solr to 15000. So it seems Solr's default is
> > > > still 15000, not 30000.
> > > >
> > > > But, back to my topic. I see we explicitly set it in solr.in.sh to 30000.
> > > > To be sure, I applied your patch to a production machine; all our
> > > > collections run with 30000. So how would that explain this log line?
> > > >
> > > > o.a.z.ClientCnxn Client session timed out, have not heard from server in
> > > > 22130ms
> > > >
> > > > We also see these with smaller values, seven seconds. And, is this
> > > > actually an indicator of the problems we have?
> > > >
> > > > Any ideas?
> > > >
> > > > Many thanks,
> > > > Markus
> > > >
> > > >
> > > > -----Original message-----
> > > > > From:Markus Jelsma <[email protected]>
> > > > > Sent: Saturday 27th January 2018 10:03
> > > > > To: [email protected]
> > > > > Subject: RE: 7.2.1 cluster dies within minutes after restart
> > > > >
> > > > > Hello,
> > > > >
> > > > > I grepped for it yesterday and found nothing but 30000 in the settings,
> > > > > but judging from the weird timeout value, you may be right. Let me apply
> > > > > your patch early next week and check for spurious warnings.
> > > > >
> > > > > Another noteworthy observation for those working on cloud stability and
> > > > > recovery: whenever this happens, some nodes are also absolutely sure to run
> > > > > OOM. The leaders usually live longest, the replicas don't; their heap
> > > > > usage peaks every time, consistently.
> > > > >
> > > > > Thanks,
> > > > > Markus
> > > > >
> > > > > -----Original message-----
> > > > > > From:Shawn Heisey <[email protected]>
> > > > > > Sent: Saturday 27th January 2018 0:49
> > > > > > To: [email protected]
> > > > > > Subject: Re: 7.2.1 cluster dies within minutes after restart
> > > > > >
> > > > > > On 1/26/2018 10:02 AM, Markus Jelsma wrote:
> > > > > > > o.a.z.ClientCnxn Client session timed out, have not heard from
> > > > > > > server in 22130ms (although zkClientTimeOut is 30000).
> > > > > >
> > > > > > Are you absolutely certain that there is a setting for zkClientTimeout
> > > > > > that is actually getting applied?  The default value in Solr's example
> > > > > > configs is 30 seconds, but the internal default in the code (when no
> > > > > > configuration is found) is still 15.  I have confirmed this in the
> > > > > > code.
> > > > > >
> > > > > > Looks like SolrCloud doesn't log the values it's using for things like
> > > > > > zkClientTimeout.  I think it should.
> > > > > >
> > > > > > https://issues.apache.org/jira/browse/SOLR-11915
> > > > > >
> > > > > > Thanks,
> > > > > > Shawn
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
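
The precedence described in the quoted messages (solr.in.sh, then bin/solr, then solr.xml) can be modeled in a few lines. This is a hypothetical sketch of the behavior as reported in this thread, not code from Solr itself; the function name and the `launched_via_bin_solr` flag are made up for illustration.

```python
BIN_SOLR_FALLBACK_MS = 15000  # per the thread: what bin/solr uses when the variable is unset
SOLR_XML_FALLBACK_MS = 30000  # per the thread: the fallback written in solr.xml


def effective_zk_client_timeout(solr_in_sh, launched_via_bin_solr=True):
    """Hypothetical model of which timeout wins, as described in the thread.

    solr_in_sh is a dict of the variables that are uncommented in solr.in.sh.
    """
    configured = solr_in_sh.get("ZK_CLIENT_TIMEOUT")
    if configured is not None:
        # An explicit setting in solr.in.sh takes priority.
        return int(configured)
    if launched_via_bin_solr:
        # bin/solr supplies its own value, so the solr.xml fallback is never reached.
        return BIN_SOLR_FALLBACK_MS
    return SOLR_XML_FALLBACK_MS
```

With an empty dict this returns 15000, which matches the observation that the setting left commented out in solr.in.sh yields fifteen seconds rather than the thirty that solr.xml suggests.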
