Markus,
I may be stating the obvious here, but I didn't notice garbage
collection mentioned in any of the previous messages, so here goes. In
our experience almost all of the ZooKeeper timeouts etc. have been
caused by overly long garbage collection pauses. I've summed up my
observations here:
<https://www.mail-archive.com/solr-user@lucene.apache.org/msg135857.html>
So, in my experience it's relatively easy to cause heavy memory usage
in SolrCloud with seemingly innocent queries, and GC can become a
problem very quickly even if everything otherwise seems to be running
smoothly.
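If you want to check whether you're in the same situation, enable GC
logging and see whether the pause times line up with the ZK timeout
messages. A minimal sketch for solr.in.sh, assuming Java 8 style flags
(Java 9 and later use -Xlog:gc* instead):

    # solr.in.sh: log GC details and total application stop time so
    # pauses can be correlated with the ZooKeeper timeout messages.
    GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
      -XX:+PrintGCApplicationStoppedTime"

Any single pause that approaches your zkClientTimeout is a red flag.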
Regards,
Ere
Markus Jelsma wrote on 31.1.2018 at 23.56:
Hello S.G.
We are not complaining about speed at all; it is clear that 7.x is faster
than its predecessor. The problem is stability, and failing to recover from
unusual circumstances. In general, it is our high-load cluster containing
user interaction logs that suffers the most. Our main text search cluster,
which receives far fewer queries, seems mostly unaffected, except last
Sunday: after a very short but high burst of queries it entered the same
catatonic state the logs cluster usually dies from.
The query burst immediately caused ZK timeouts and high heap consumption
(not sure which of those two came first). The burst lasted for 30 minutes,
but the excessive heap consumption continued for more than 8 hours before
Solr finally realized it could relax. Most remarkable was that Solr
recovered on its own: the ZK timeouts stopped and heap usage went back to
normal.
There seems to be a causal link between high load and this state.
We really want to get this fixed, for ourselves and for everyone else who
may encounter this problem, but I don't know how, so I need much more
feedback and hints from those who have a deep understanding of the inner
workings of SolrCloud and the changes since 6.x.
To be clear, we don't have the problem of the 15-second ZK timeout; we use
30. Is 30 still too low? Is it even remotely related to this problem? What
does load have to do with it?
We are not able to reproduce it in lab environments. It can occur minutes
after cluster startup, but also only after days. I've always been slightly
annoyed by problems that occur across such a broad time span; it is always
bad luck for reproduction.
Any help getting further is much appreciated.
Many thanks,
Markus
-----Original message-----
From:S G <sg.online.em...@gmail.com>
Sent: Wednesday 31st January 2018 21:48
To: solr-user@lucene.apache.org
Subject: Re: 7.2.1 cluster dies within minutes after restart
We did some basic load testing on our 7.1.0 and 7.2.1 clusters, and that
came out all right.
We saw a performance improvement of about 30% in read latencies between
6.6.0 and 7.1.0, and then a degradation of about 10% between 7.1.0 and
7.2.1 in many metrics. But overall, it still seems better than 6.6.0.
I will check the logs for errors too, but the nodes were responsive for
all of the 23+ hours we ran the load test.
Disclaimer: we do not test facets, pivots, or block-joins, and will add
those features to our load-testing tool sometime this year.
Thanks
SG
On Wed, Jan 31, 2018 at 3:12 AM, Markus Jelsma <markus.jel...@openindex.io>
wrote:
Ah thanks, I just submitted a patch fixing it.
Anyway, in the end it appears this is not the problem we are seeing, as
our timeouts were already at 30 seconds.
All I know is that at some point nodes start to lose ZK connections due to
timeouts (the logs say so, but all within 30 seconds), and the logs are
flooded with these messages:
o.a.z.ClientCnxn Client session timed out, have not heard from server in
10359ms for sessionid 0x160f9e723c12122
o.a.z.ClientCnxn Unable to reconnect to ZooKeeper service, session
0x60f9e7234f05bb has expired
Then there is a doubling in heap usage and nodes become unresponsive, die,
etc.
We also see those messages in other collections, but not so frequently,
and they don't cause failures in those less loaded clusters.
Ideas?
Thanks,
Markus
-----Original message-----
From:Michael Braun <n3c...@gmail.com>
Sent: Monday 29th January 2018 21:09
To: solr-user@lucene.apache.org
Subject: Re: 7.2.1 cluster dies within minutes after restart
I believe this is reported in
https://issues.apache.org/jira/browse/SOLR-10471
On Mon, Jan 29, 2018 at 2:55 PM, Markus Jelsma <
markus.jel...@openindex.io>
wrote:
Hello SG,
The default in solr.in.sh is commented out, so it defaults to the value
set in bin/solr, which is fifteen seconds. Just uncomment the setting in
solr.in.sh and your timeout will be thirty seconds.
For Solr itself to really default to thirty seconds, Solr's bin/solr needs
to be patched to use the correct value.
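In other words, something like this in solr.in.sh (the exact commented-out
line in the shipped file may differ slightly; the value is in
milliseconds):

    # solr.in.sh: set explicitly instead of relying on the
    # bin/solr fallback of 15000.
    ZK_CLIENT_TIMEOUT="30000"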
Regards,
Markus
-----Original message-----
From:S G <sg.online.em...@gmail.com>
Sent: Monday 29th January 2018 20:15
To: solr-user@lucene.apache.org
Subject: Re: 7.2.1 cluster dies within minutes after restart
Hi Markus,
We are in the process of upgrading our clusters to 7.2.1 and I am not sure
I quite follow the conversation here.
Is there a simple workaround to set ZK_CLIENT_TIMEOUT to a higher value in
the config (i.e. it's just a default value being wrong/overridden
somewhere)? Or is it more severe, in the sense that any value the user
sets for ZK_CLIENT_TIMEOUT is completely ignored by Solr in 7.2.1?
Thanks
SG
On Mon, Jan 29, 2018 at 3:09 AM, Markus Jelsma <
markus.jel...@openindex.io>
wrote:
Ok, I applied the patch and it is clear the timeout is 15000. solr.xml
says 30000 if ZK_CLIENT_TIMEOUT is not set, which is by default unset in
solr.in.sh, but set in bin/solr to 15000. So it seems Solr's default is
still 15000, not 30000.
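For reference, the fallback being discussed looks something like this in
the stock solr.xml; the 30000 applies only when the zkClientTimeout system
property is missing entirely, and bin/solr always sets it from
ZK_CLIENT_TIMEOUT:

    <solrcloud>
      <!-- falls back to 30000 only if the system property is unset -->
      <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
      ...
    </solrcloud>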
But, back to my topic. I see we explicitly set it in solr.in.sh to 30000.
To be sure, I applied your patch to a production machine; all our
collections run with 30000. So how would that explain this log line?
o.a.z.ClientCnxn Client session timed out, have not heard from server in
22130ms
We also see these with smaller values, seven seconds. And is this actually
an indicator of the problems we have?
Any ideas?
Many thanks,
Markus
-----Original message-----
From:Markus Jelsma <markus.jel...@openindex.io>
Sent: Saturday 27th January 2018 10:03
To: solr-user@lucene.apache.org
Subject: RE: 7.2.1 cluster dies within minutes after restart
Hello,
I grepped for it yesterday and found nothing but 30000 in the settings,
but judging from the weird timeout value, you may be right. Let me apply
your patch early next week and check for spurious warnings.
Another noteworthy observation for those working on cloud stability and
recovery: whenever this happens, some nodes are also absolutely sure to
run OOM. The leaders usually live longest; the replicas don't, and their
heap usage peaks every time, consistently.
Thanks,
Markus
-----Original message-----
From:Shawn Heisey <apa...@elyograg.org>
Sent: Saturday 27th January 2018 0:49
To: solr-user@lucene.apache.org
Subject: Re: 7.2.1 cluster dies within minutes after restart
On 1/26/2018 10:02 AM, Markus Jelsma wrote:
o.a.z.ClientCnxn Client session timed out, have not heard from server in
22130ms (although zkClientTimeOut is 30000).
Are you absolutely certain that there is a setting for zkClientTimeout
that is actually getting applied? The default value in Solr's example
configs is 30 seconds, but the internal default in the code (when no
configuration is found) is still 15. I have confirmed this in the code.
Looks like SolrCloud doesn't log the values it's using for things like
zkClientTimeout. I think it should.
https://issues.apache.org/jira/browse/SOLR-11915
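In the meantime, one clue is already there at connection time: the
ZooKeeper client logs the negotiated session timeout when it connects, so
a line along these lines (illustrative, not an actual log excerpt) reveals
the value really in effect:

    o.a.z.ClientCnxn Session establishment complete on server host/10.0.0.1:2181, sessionid = 0x..., negotiated timeout = 15000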
Thanks,
Shawn
--
Ere Maijala
Kansalliskirjasto / The National Library of Finland