Hello Ere,

It appears that my initial e-mail [1] got lost in the thread. We don't have GC issues; the cluster that occasionally dies generally runs smoothly and quickly with just 2 GB allocated.
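For what it's worth, we keep GC logging enabled so we can rule out long pauses. A minimal sketch of the relevant solr.in.sh bits (flag names assume an Oracle/OpenJDK 8 JVM, and the log file name is the stock solr_gc.log; adjust for your setup):

# solr.in.sh -- enable detailed GC logging, including stop-the-world pauses
GC_LOG_OPTS="-verbose:gc \
  -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime"

# Then scan server/logs/solr_gc.log for long safepoint pauses, e.g.:
# grep 'Total time for which application threads were stopped' solr_gc.log

Nothing in those logs comes anywhere near the multi-second gaps ZooKeeper complains about.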
Thanks,
Markus

[1]: http://lucene.472066.n3.nabble.com/7-2-1-cluster-dies-within-minutes-after-restart-td4372615.html

-----Original message-----
> From: Ere Maijala <ere.maij...@helsinki.fi>
> Sent: Friday 2nd February 2018 8:49
> To: solr-user@lucene.apache.org
> Subject: Re: 7.2.1 cluster dies within minutes after restart
>
> Markus,
>
> I may be stating the obvious here, but I didn't notice garbage
> collection mentioned in any of the previous messages, so here goes. In
> our experience almost all of the ZooKeeper timeouts etc. have been
> caused by overly long garbage collection pauses. I've summed up my
> observations here:
> <https://www.mail-archive.com/solr-user@lucene.apache.org/msg135857.html>
>
> So, in my experience it's relatively easy to cause heavy memory usage
> in SolrCloud with seemingly innocent queries, and GC can become a
> problem really quickly even if everything otherwise seems to be
> running smoothly.
>
> Regards,
> Ere
>
> Markus Jelsma wrote on 31.1.2018 at 23:56:
> > Hello S.G.,
> >
> > We do not complain about speed improvements at all; it is clear 7.x is
> > faster than its predecessor. The problem is stability and not
> > recovering from weird circumstances. In general, it is our high-load
> > cluster containing user interaction logs that suffers the most. Our
> > main text search cluster - receiving far fewer queries - seems mostly
> > unaffected, except last Sunday. After a very short but high burst of
> > queries it entered the same catatonic state the logs cluster usually
> > dies from.
> >
> > The query burst immediately caused ZK timeouts and high heap
> > consumption (not sure which of the latter two came first). The query
> > burst lasted for 30 minutes; the excessive heap consumption continued
> > for more than 8 hours before Solr finally realized it could relax.
> > Most remarkable was that Solr recovered on its own: ZK timeouts
> > stopped and heap went back to normal.
> >
> > There seems to be a causal link between high load and this state.
> >
> > We really want to get this fixed, for ourselves and everyone else who
> > may encounter this problem, but I don't know how, so I need much more
> > feedback and hints from those who have a deep understanding of the
> > inner workings of SolrCloud and the changes since 6.x.
> >
> > To be clear, we don't have the problem of the 15-second ZK timeout;
> > we use 30. Is 30 still too low? Is it even remotely related to this
> > problem? What does load have to do with it?
> >
> > We are not able to reproduce it in lab environments. It can take
> > minutes after cluster startup for it to occur, but also days.
> >
> > I've been slightly annoyed by problems that can occur across such a
> > broad time span; it is always bad luck for reproduction.
> >
> > Any help getting further is much appreciated.
> >
> > Many thanks,
> > Markus
> >
> > -----Original message-----
> >> From: S G <sg.online.em...@gmail.com>
> >> Sent: Wednesday 31st January 2018 21:48
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: 7.2.1 cluster dies within minutes after restart
> >>
> >> We did some basic load testing on our 7.1.0 and 7.2.1 clusters,
> >> and that came out all right.
> >> We saw a performance improvement of about 30% in read latencies
> >> between 6.6.0 and 7.1.0,
> >> and then a performance degradation of about 10% between 7.1.0 and
> >> 7.2.1 in many metrics.
> >> But overall, it still seems better than 6.6.0.
> >>
> >> I will check the logs for the errors too, but the nodes were
> >> responsive for all of the 23+ hours we ran the load test.
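> >>
> >> (In case it helps to know what kind of load this was, roughly the
> >> sort of loop we drive the test with -- host, collection and query
> >> are placeholders, not our actual setup:)
> >>
> >> # 20 concurrent query loops, recording per-request latency via curl
> >> for i in $(seq 1 20); do
> >>   ( while true; do
> >>       curl -s -o /dev/null -w '%{time_total}\n' \
> >>         "http://localhost:8983/solr/mycollection/select?q=*:*&rows=10"
> >>     done ) >> "latencies_$i.log" &
> >> done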
> >>
> >> Disclaimer: We do not test facets, pivots or block-joins, and will
> >> add those features to our load-testing tool sometime this year.
> >>
> >> Thanks
> >> SG
> >>
> >>
> >> On Wed, Jan 31, 2018 at 3:12 AM, Markus Jelsma <markus.jel...@openindex.io>
> >> wrote:
> >>
> >>> Ah thanks, I just submitted a patch fixing it.
> >>>
> >>> Anyway, in the end it appears this is not the problem we are seeing,
> >>> as our timeouts were already at 30 seconds.
> >>>
> >>> All I know is that at some point nodes start to lose ZK connections
> >>> due to timeouts (the logs say so, but all within 30 seconds), and the
> >>> logs are flooded with messages like these:
> >>> o.a.z.ClientCnxn Client session timed out, have not heard from server
> >>> in 10359ms for sessionid 0x160f9e723c12122
> >>> o.a.z.ClientCnxn Unable to reconnect to ZooKeeper service, session
> >>> 0x60f9e7234f05bb has expired
> >>>
> >>> Then there is a doubling in heap usage and nodes become unresponsive,
> >>> die, etc.
> >>>
> >>> We also see these messages in other collections, but not as
> >>> frequently, and they don't cause failures in those less loaded
> >>> clusters.
> >>>
> >>> Ideas?
> >>>
> >>> Thanks,
> >>> Markus
> >>>
> >>> -----Original message-----
> >>>> From: Michael Braun <n3c...@gmail.com>
> >>>> Sent: Monday 29th January 2018 21:09
> >>>> To: solr-user@lucene.apache.org
> >>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
> >>>>
> >>>> Believe this is reported in
> >>>> https://issues.apache.org/jira/browse/SOLR-10471
> >>>>
> >>>>
> >>>> On Mon, Jan 29, 2018 at 2:55 PM, Markus Jelsma <
> >>>> markus.jel...@openindex.io> wrote:
> >>>>
> >>>>> Hello SG,
> >>>>>
> >>>>> The default in solr.in.sh is commented out, so it defaults to the
> >>>>> value set in bin/solr, which is fifteen seconds. Just uncomment the
> >>>>> setting in solr.in.sh and your timeout will be thirty seconds.
> >>>>>
> >>>>> For Solr itself to really default to thirty seconds, Solr's
> >>>>> bin/solr needs to be patched to use the correct value.
> >>>>>
> >>>>> Regards,
> >>>>> Markus
> >>>>>
> >>>>> -----Original message-----
> >>>>>> From: S G <sg.online.em...@gmail.com>
> >>>>>> Sent: Monday 29th January 2018 20:15
> >>>>>> To: solr-user@lucene.apache.org
> >>>>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
> >>>>>>
> >>>>>> Hi Markus,
> >>>>>>
> >>>>>> We are in the process of upgrading our clusters to 7.2.1 and I am
> >>>>>> not sure I quite follow the conversation here.
> >>>>>> Is there a simple workaround to set ZK_CLIENT_TIMEOUT to a higher
> >>>>>> value in the config (and it's just a default value being
> >>>>>> wrong/overridden somewhere)?
> >>>>>> Or is it more severe in the sense that any config set for
> >>>>>> ZK_CLIENT_TIMEOUT by the user is just ignored completely by Solr
> >>>>>> in 7.2.1?
> >>>>>>
> >>>>>> Thanks
> >>>>>> SG
> >>>>>>
> >>>>>>
> >>>>>> On Mon, Jan 29, 2018 at 3:09 AM, Markus Jelsma <
> >>>>>> markus.jel...@openindex.io> wrote:
> >>>>>>
> >>>>>>> Ok, I applied the patch and it is clear the timeout is 15000.
> >>>>>>> solr.xml says 30000 if ZK_CLIENT_TIMEOUT is not set, which is by
> >>>>>>> default unset in solr.in.sh, but set in bin/solr to 15000. So it
> >>>>>>> seems Solr's default is still 15000, not 30000.
> >>>>>>>
> >>>>>>> But, back to my topic. I see we explicitly set it in solr.in.sh
> >>>>>>> to 30000. To be sure, I applied your patch to a production
> >>>>>>> machine; all our collections run with 30000.
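> >>>>>>>
> >>>>>>> (For reference, roughly what that looks like on our side; the
> >>>>>>> solr.xml line is the stock one with its fallback default, shown
> >>>>>>> here only to illustrate the chain:)
> >>>>>>>
> >>>>>>> # solr.in.sh -- uncommented, so it is actually exported:
> >>>>>>> ZK_CLIENT_TIMEOUT="30000"
> >>>>>>>
> >>>>>>> <!-- solr.xml, in the <solrcloud> section -->
> >>>>>>> <int name="zkClientTimeout">${zkClientTimeout:30000}</int>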
> >>>>>>> So how would that explain this log line?
> >>>>>>>
> >>>>>>> o.a.z.ClientCnxn Client session timed out, have not heard from
> >>>>>>> server in 22130ms
> >>>>>>>
> >>>>>>> We also see these with smaller values, such as seven seconds.
> >>>>>>> And is this actually an indicator of the problems we have?
> >>>>>>>
> >>>>>>> Any ideas?
> >>>>>>>
> >>>>>>> Many thanks,
> >>>>>>> Markus
> >>>>>>>
> >>>>>>>
> >>>>>>> -----Original message-----
> >>>>>>>> From: Markus Jelsma <markus.jel...@openindex.io>
> >>>>>>>> Sent: Saturday 27th January 2018 10:03
> >>>>>>>> To: solr-user@lucene.apache.org
> >>>>>>>> Subject: RE: 7.2.1 cluster dies within minutes after restart
> >>>>>>>>
> >>>>>>>> Hello,
> >>>>>>>>
> >>>>>>>> I grepped for it yesterday and found nothing but 30000 in the
> >>>>>>>> settings, but judging from the weird timeout value, you may be
> >>>>>>>> right. Let me apply your patch early next week and check for
> >>>>>>>> spurious warnings.
> >>>>>>>>
> >>>>>>>> Another noteworthy observation for those working on cloud
> >>>>>>>> stability and recovery: whenever this happens, some nodes are
> >>>>>>>> also absolutely sure to run OOM. The leaders usually live
> >>>>>>>> longest; the replicas don't, and their heap usage peaks every
> >>>>>>>> time, consistently.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Markus
> >>>>>>>>
> >>>>>>>> -----Original message-----
> >>>>>>>>> From: Shawn Heisey <apa...@elyograg.org>
> >>>>>>>>> Sent: Saturday 27th January 2018 0:49
> >>>>>>>>> To: solr-user@lucene.apache.org
> >>>>>>>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
> >>>>>>>>>
> >>>>>>>>> On 1/26/2018 10:02 AM, Markus Jelsma wrote:
> >>>>>>>>>> o.a.z.ClientCnxn Client session timed out, have not heard
> >>>>>>>>>> from server in 22130ms (although zkClientTimeout is 30000).
> >>>>>>>>>
> >>>>>>>>> Are you absolutely certain that there is a setting for
> >>>>>>>>> zkClientTimeout that is actually getting applied? The default
> >>>>>>>>> value in Solr's example configs is 30 seconds, but the
> >>>>>>>>> internal default in the code (when no configuration is found)
> >>>>>>>>> is still 15. I have confirmed this in the code.
> >>>>>>>>>
> >>>>>>>>> It looks like SolrCloud doesn't log the values it's using for
> >>>>>>>>> things like zkClientTimeout. I think it should:
> >>>>>>>>> https://issues.apache.org/jira/browse/SOLR-11915
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Shawn
> >>>>>>>>>
>
> --
> Ere Maijala
> Kansalliskirjasto / The National Library of Finland