Hello Ere,

It appears that my initial e-mail [1] got lost in the thread. We don't have GC 
issues; the cluster that dies occasionally runs, in general, smoothly and 
quickly with just 2 GB allocated.

Thanks,
Markus

[1]: 
http://lucene.472066.n3.nabble.com/7-2-1-cluster-dies-within-minutes-after-restart-td4372615.html

-----Original message-----
> From:Ere Maijala <ere.maij...@helsinki.fi>
> Sent: Friday 2nd February 2018 8:49
> To: solr-user@lucene.apache.org
> Subject: Re: 7.2.1 cluster dies within minutes after restart
> 
> Markus,
> 
> I may be stating the obvious here, but I didn't notice garbage 
> collection mentioned in any of the previous messages, so here goes. In 
> our experience almost all of the ZooKeeper timeouts etc. have been 
> caused by overly long garbage collection pauses. I've summed up my 
> observations here: 
> <https://www.mail-archive.com/solr-user@lucene.apache.org/msg135857.html>
> 
> So, in my experience it's relatively easy to cause heavy memory usage 
> in SolrCloud with seemingly innocent queries, and GC can become a 
> problem really quickly even if everything seems to be running smoothly 
> otherwise.
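> 
> A quick way to rule GC in or out is to check pause times in the GC log. 
> Solr's solr.in.sh has a GC_LOG_OPTS setting; on JDK 8 the flags look 
> something like the sketch below (exact flags vary by JVM version), and 
> you can then grep the log for long application-stopped times:
> 
>   GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
>     -XX:+PrintGCApplicationStoppedTime"
> 
> Stops approaching the ZK session timeout are a strong hint.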
> 
> Regards,
> Ere
> 
> Markus Jelsma wrote on 31.1.2018 at 23.56:
> > Hello S.G.
> > 
> > We do not complain about speed improvements at all; it is clear 7.x is 
> > faster than its predecessor. The problem is stability and not recovering 
> > from weird circumstances. In general, it is our high-load cluster 
> > containing user interaction logs that suffers the most. Our main text 
> > search cluster - receiving far fewer queries - seems mostly unaffected, 
> > except last Sunday. After a very short but high burst of queries it entered 
> > the same catatonic state the logs cluster usually dies from.
> > 
> > The query burst immediately caused ZK timeouts and high heap consumption 
> > (not sure which of the two came first). The query burst lasted for 
> > 30 minutes; the excessive heap consumption continued for more than 8 hours 
> > before Solr finally realized it could relax. Most remarkable was that Solr 
> > recovered on its own: the ZK timeouts stopped and heap went back to normal.
> > 
> > There seems to be a causal link between high load and this state.
> > 
> > We really want to get this fixed for ourselves and everyone else who may 
> > encounter this problem, but I don't know how, so I need much more feedback 
> > and hints from those who have a deep understanding of the inner workings 
> > of SolrCloud and the changes since 6.x.
> > 
> > To be clear, we don't have the problem of the 15-second ZK timeout; we 
> > use 30. Is 30 still too low? Is it even remotely related to this problem? 
> > What does load have to do with it?
> > 
> > We are not able to reproduce it in lab environments. It can occur minutes 
> > after cluster startup, but also days later.
> > 
> > I've been slightly annoyed by problems that can occur over a broad time 
> > span; they make reproduction a matter of bad luck.
> > 
> > Any help getting further is much appreciated.
> > 
> > Many thanks,
> > Markus
> >   
> > -----Original message-----
> >> From:S G <sg.online.em...@gmail.com>
> >> Sent: Wednesday 31st January 2018 21:48
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: 7.2.1 cluster dies within minutes after restart
> >>
> >> We did some basic load testing on our 7.1.0 and 7.2.1 clusters.
> >> And that came out all right.
> >> We saw read latencies improve by about 30% between 6.6.0 and 7.1.0,
> >> and then a performance degradation of about 10% between 7.1.0 and
> >> 7.2.1 in many metrics.
> >> But overall, 7.2.1 still seems better than 6.6.0.
> >>
> >> I will also check the logs for the errors, but the nodes were responsive
> >> for all of the 23+ hours we ran the load test.
> >>
> >> Disclaimer: We do not test facets, pivots, or block-joins, and will add
> >> those features to our load-testing tool sometime this year.
> >>
> >> Thanks
> >> SG
> >>
> >>
> >> On Wed, Jan 31, 2018 at 3:12 AM, Markus Jelsma <markus.jel...@openindex.io>
> >> wrote:
> >>
> >>> Ah thanks, I just submitted a patch fixing it.
> >>>
> >>> Anyway, in the end it appears this is not the problem we are seeing as our
> >>> timeouts were already at 30 seconds.
> >>>
> >>> All I know is that at some point nodes start to lose ZK connections due to
> >>> timeouts (the logs say so, but all within 30 seconds); the logs are flooded
> >>> with these messages:
> >>> o.a.z.ClientCnxn Client session timed out, have not heard from server in
> >>> 10359ms for sessionid 0x160f9e723c12122
> >>> o.a.z.ClientCnxn Unable to reconnect to ZooKeeper service, session
> >>> 0x60f9e7234f05bb has expired
> >>>
> >>> Then there is a doubling in heap usage, and nodes become unresponsive,
> >>> die, etc.
> >>>
> >>> We also see those messages in other collections, but not so frequently and
> >>> they don't cause failure in those less loaded clusters.
> >>>
> >>> Ideas?
> >>>
> >>> Thanks,
> >>> Markus
> >>>
> >>> -----Original message-----
> >>>> From:Michael Braun <n3c...@gmail.com>
> >>>> Sent: Monday 29th January 2018 21:09
> >>>> To: solr-user@lucene.apache.org
> >>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
> >>>>
> >>>> Believe this is reported in 
> >>>> https://issues.apache.org/jira/browse/SOLR-10471
> >>>>
> >>>>
> >>>> On Mon, Jan 29, 2018 at 2:55 PM, Markus Jelsma 
> >>>> <markus.jel...@openindex.io>
> >>>> wrote:
> >>>>
> >>>>> Hello SG,
> >>>>>
> >>>>> The default in solr.in.sh is commented out, so it falls back to the 
> >>>>> value set in bin/solr, which is fifteen seconds. Just uncomment the 
> >>>>> setting in solr.in.sh and your timeout will be thirty seconds.
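> >>>>>
> >>>>> As an illustration (a sketch - the exact commented line in your 
> >>>>> solr.in.sh may differ per version), the change is just uncommenting:
> >>>>>
> >>>>>   # before:
> >>>>>   #ZK_CLIENT_TIMEOUT="30000"
> >>>>>   # after:
> >>>>>   ZK_CLIENT_TIMEOUT="30000"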
> >>>>>
> >>>>> For Solr itself to really default to thirty seconds, Solr's bin/solr 
> >>>>> needs to be patched to use the correct value.
> >>>>>
> >>>>> Regards,
> >>>>> Markus
> >>>>>
> >>>>> -----Original message-----
> >>>>>> From:S G <sg.online.em...@gmail.com>
> >>>>>> Sent: Monday 29th January 2018 20:15
> >>>>>> To: solr-user@lucene.apache.org
> >>>>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
> >>>>>>
> >>>>>> Hi Markus,
> >>>>>>
> >>>>>> We are in the process of upgrading our clusters to 7.2.1 and I am 
> >>>>>> not sure I quite follow the conversation here.
> >>>>>> Is there a simple workaround to set the ZK_CLIENT_TIMEOUT to a 
> >>>>>> higher value in the config (and it's just a default value being 
> >>>>>> wrong/overridden somewhere)?
> >>>>>> Or is it more severe in the sense that any config set for 
> >>>>>> ZK_CLIENT_TIMEOUT by the user is just ignored completely by Solr 
> >>>>>> in 7.2.1?
> >>>>>>
> >>>>>> Thanks
> >>>>>> SG
> >>>>>>
> >>>>>>
> >>>>>> On Mon, Jan 29, 2018 at 3:09 AM, Markus Jelsma 
> >>>>>> <markus.jel...@openindex.io>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Ok, I applied the patch and it is clear the timeout is 15000. 
> >>>>>>> solr.xml says 30000 if ZK_CLIENT_TIMEOUT is not set, which is by 
> >>>>>>> default unset in solr.in.sh, but set in bin/solr to 15000. So it 
> >>>>>>> seems Solr's default is still 15000, not 30000.
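> >>>>>>>
> >>>>>>> (To verify the chain on your own install - paths are from a default 
> >>>>>>> binary install and may differ - something like:
> >>>>>>>
> >>>>>>>   grep -n ZK_CLIENT_TIMEOUT bin/solr bin/solr.in.sh
> >>>>>>>   grep -n zkClientTimeout server/solr/solr.xml
> >>>>>>>
> >>>>>>> shows where the effective value comes from.)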
> >>>>>>>
> >>>>>>> But, back to my topic. I see we explicitly set it in solr.in.sh to 
> >>>>>>> 30000. To be sure, I applied your patch to a production machine; 
> >>>>>>> all our collections run with 30000. So how would that explain this 
> >>>>>>> log line?
> >>>>>>>
> >>>>>>> o.a.z.ClientCnxn Client session timed out, have not heard from 
> >>>>>>> server in 22130ms
> >>>>>>>
> >>>>>>> We also see these with smaller values, e.g. seven seconds. And is 
> >>>>>>> this actually an indicator of the problems we have?
> >>>>>>>
> >>>>>>> Any ideas?
> >>>>>>>
> >>>>>>> Many thanks,
> >>>>>>> Markus
> >>>>>>>
> >>>>>>>
> >>>>>>> -----Original message-----
> >>>>>>>> From:Markus Jelsma <markus.jel...@openindex.io>
> >>>>>>>> Sent: Saturday 27th January 2018 10:03
> >>>>>>>> To: solr-user@lucene.apache.org
> >>>>>>>> Subject: RE: 7.2.1 cluster dies within minutes after restart
> >>>>>>>>
> >>>>>>>> Hello,
> >>>>>>>>
> >>>>>>>> I grepped for it yesterday and found nothing but 30000 in the 
> >>>>>>>> settings, but judging from the weird timeout value, you may be 
> >>>>>>>> right. Let me apply your patch early next week and check for 
> >>>>>>>> spurious warnings.
> >>>>>>>>
> >>>>>>>> Another noteworthy observation for those working on cloud 
> >>>>>>>> stability and recovery: whenever this happens, some nodes are also 
> >>>>>>>> absolutely sure to run OOM. The leaders usually live longest; the 
> >>>>>>>> replicas don't, their heap usage peaks every time, consistently.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Markus
> >>>>>>>>
> >>>>>>>> -----Original message-----
> >>>>>>>>> From:Shawn Heisey <apa...@elyograg.org>
> >>>>>>>>> Sent: Saturday 27th January 2018 0:49
> >>>>>>>>> To: solr-user@lucene.apache.org
> >>>>>>>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
> >>>>>>>>>
> >>>>>>>>> On 1/26/2018 10:02 AM, Markus Jelsma wrote:
> >>>>>>>>>> o.a.z.ClientCnxn Client session timed out, have not heard from 
> >>>>>>>>>> server in 22130ms (although zkClientTimeOut is 30000).
> >>>>>>>>>
> >>>>>>>>> Are you absolutely certain that there is a setting for 
> >>>>>>>>> zkClientTimeout that is actually getting applied?  The default 
> >>>>>>>>> value in Solr's example configs is 30 seconds, but the internal 
> >>>>>>>>> default in the code (when no configuration is found) is still 15.  
> >>>>>>>>> I have confirmed this in the code.
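> >>>>>>>>>
> >>>>>>>>> For reference, the stock solr.xml line looks something like this 
> >>>>>>>>> (your copy may differ slightly):
> >>>>>>>>>
> >>>>>>>>>   <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
> >>>>>>>>>
> >>>>>>>>> so the 30-second default only applies when that file is read; 
> >>>>>>>>> with no configuration at all, the hardcoded 15 seconds wins.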
> >>>>>>>>>
> >>>>>>>>> Looks like SolrCloud doesn't log the values it's using for things 
> >>>>>>>>> like zkClientTimeout.  I think it should.
> >>>>>>>>>
> >>>>>>>>> https://issues.apache.org/jira/browse/SOLR-11915
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Shawn
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> 
> -- 
> Ere Maijala
> Kansalliskirjasto / The National Library of Finland
> 
