Hi everyone,

Regarding what @Jacob Barrett<mailto:jabarr...@vmware.com> mentioned about the geode-native timeout handling: yes, I am aware of it, and we are working on identifying any remaining problems so we can open PRs. My feeling, though, is that this PR https://github.com/apache/geode-native/pull/695 will dramatically improve things there 🙂
Regarding parametrization, we've been testing several parametrizations and things look really promising. There are some minor things to tweak, but we barely notice the impact.

Regarding allowing one to configure Geode as a non-primary/secondary (a.k.a. multi-master) distributed system: the thing is, I've been reading about this just out of curiosity, and it turns out it is feasible, e.g. Google Spanner<https://static.googleusercontent.com/media/research.google.com/es//pubs/archive/45855.pdf>. Whether it could easily be implemented in Geode, or whether it's even something we'd want, is a different matter. Still, I think that's a conversation for another forum.

So really, thanks to everyone who helped 🙂

BR,
Mario.

________________________________
From: Anthony Baker <bak...@vmware.com>
Sent: Monday, November 23, 2020 6:25 PM
To: dev@geode.apache.org <dev@geode.apache.org>
Cc: miguel.g.gar...@ericsson.com <miguel.g.gar...@ericsson.com>
Subject: Re: Requests taking too long if one member of the cluster fails

Yes, lowering the member timeout is one approach I’ve seen taken for applications that demand ultra-low latency. These workloads need to provide not just low “average” or even p99 latency, but put a hard limit on the max value.

When you do this you need to ensure coherency across all aspects of timeouts (e.g. client read timeouts and retries). You need to ensure that GC pauses don’t cause instability in the cluster. For example, if a GC pause is greater than the member timeout, you should go back and re-tune your heap settings to drive down GC. If you are running in a container or VM, you need to ensure sufficient resources so that the GemFire process is never paused. All this presupposes a stable and performant network infrastructure.
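To illustrate Anthony's coherency point, here is a minimal sketch of keeping the server-side member timeout and the client-side pool timeouts aligned, so that a client doesn't give up before the cluster has had a chance to detect the failure and fail over. The specific values are illustrative only, not recommendations, and the exact attribute names should be checked against the Geode docs for your version:

```properties
# gemfire.properties on the servers: how long membership waits before
# suspecting an unresponsive member (milliseconds; the default is 5000)
member-timeout=1000
```

```xml
<!-- Client-side pool in cache.xml (sketch): read-timeout should
     comfortably exceed member-timeout plus expected failover time,
     and retry-attempts lets the client reach the new primary. -->
<pool name="serverPool" read-timeout="5000" retry-attempts="2">
  <locator host="locator1" port="10334"/>
</pool>
```

The general idea is that if the client read timeout is shorter than the time the cluster needs to detect a dead member and promote the secondary, the client will see spurious timeouts even though the cluster recovers correctly.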
Anthony

On Nov 21, 2020, at 1:40 PM, Mario Salazar de Torres <mario.salazar.de.tor...@est.tech<mailto:mario.salazar.de.tor...@est.tech>> wrote:

So, what I've tried here is to set a really low member-timeout, which results in the server holding the secondary copy becoming the primary owner in around <600 ms. That's quite a huge improvement, but I wanted to ask you whether setting this member-timeout too low might carry unforeseen consequences.
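For reference, the member-timeout Mario experimented with can be set per member in gemfire.properties or passed at startup through gfsh; a sketch (the 1000 ms value is only an example of a "really low" setting, not the value actually used in the experiment):

```shell
# Start a server with a lowered member timeout via gfsh;
# --J passes a JVM system property through to the server process.
gfsh> start server --name=server1 --J=-Dgemfire.member-timeout=1000
```

As Anthony notes above, a value this low only makes sense if GC pauses and VM/container scheduling stalls are reliably shorter than the timeout, otherwise healthy members will be suspected and kicked out of the cluster.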