Re: Monitor the neighbour JVM using neihbour's member-timeout

Michael Stolz Wed, 17 Jan 2018 15:40:14 -0800

Pardon my ignorance, but is this something that should be brought up with
the JGroups community?


--
Mike Stolz
Principal Engineer, GemFire Product Lead
Mobile: +1-631-835-4771

On Wed, Jan 17, 2018 at 2:37 AM, Aravind Musigumpula <
aravind.musigump...@amdocs.com> wrote:

> Hi Everyone,
>
> Consider a Geode cluster in which some nodes host a large amount of data
> that is critical to the business, while other nodes host a smaller amount
> of data that is not critical.
>
> If both types of node are going through some operation that makes them
> unresponsive, a node of the former type may take a couple of seconds
> longer to respond than one of the latter type.
>
> In this scenario, is it fair to give the same member-timeout to all the
> members?
> What if we want to wait a little longer for such nodes?
>
> In the present configuration in Geode, we cannot wait a little longer for
> some nodes than for others, even though we can configure a different
> member-timeout on each node. I think no one will ever configure different
> timeouts for each node, because each node's member-timeout is used to
> monitor its neighbors rather than the node itself.
>
> In this solution, all we do is wait for the suspected member's
> member-timeout, instead of the checking member's own timeout, during the
> final check.
> It also has no backward-compatibility implications: anybody who wants the
> existing behavior can continue to use the same member-timeout for all the
> nodes, so the behavior of the system is preserved.
>
> If you have any concerns about this solution, please let me know.
>
>
> Thanks,
> Aravind Musigumpula
>
>
> -----Original Message-----
> From: Aravind Musigumpula
> Sent: Monday, December 18, 2017 6:55 PM
> To: dev@geode.apache.org
> Subject: RE: Monitor the neighbour JVM using neihbour's member-timeout
>
> Hi Community,
>
> Could you please give your suggestions on the solution below?
>
> I have raised a pull request for it:
> https://github.com/apache/geode/pull/1075
>
>
> Thanks,
> Aravind Musigumpula
>
> -----Original Message-----
> From: Aravind Musigumpula
> Sent: Friday, November 03, 2017 3:23 PM
> To: dev@geode.apache.org
> Subject: RE: Monitor the neighbour JVM using neihbour's member-timeout
>
> Thanks, Bruce, for the suggestions. I will move the new variables from
> InternalDistributedMember to NetView and make the changes related to
> backward compatibility.
>
> I now know that there is another way a member can be removed from the
> view: if a member sends a message and waits for ack-wait-threshold without
> a response from the target, the sender does a final check and removes the
> target from the view if there is still no response.
> But I don't understand how deprecating member-timeout, ack-wait-threshold
> and ack-severe-alert-threshold in favor of a single setting will solve the
> problem. The main problem is that we want a member to survive in the view
> for a longer time than the others.
>
> If we merge the settings into one and pass that setting to the monitoring
> member (say A), then A will use the timeout of the target member (say B,
> which we want to survive in the view for longer) for health monitoring,
> and ack-wait-threshold when waiting for a response to any message before
> doing the final check.
> But what if some other member (say C), which is monitoring a different
> member (say D), has smaller values for member-timeout and
> ack-wait-threshold? If C sends a message to B, C uses the smaller
> ack-wait-threshold (that of member D) while waiting for a response, and
> then does the final check on the basis of the smaller member-timeout. So
> member B can still be kicked out of the view after a short time.
>
> I think this can be solved simply by using the member-timeout of the
> suspected member in the final check, where we establish a TCP connection.
> We don't need to combine those three settings either. We can set the
> member-timeout of a particular member to a higher value; the member that
> monitors it uses its own member-timeout as it does now, but during the
> final check it uses the suspected member's member-timeout (the greater
> value).
> The final check is the common step in both the no-heartbeat scenario and
> the no-response-to-a-message scenario.
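
A minimal sketch of the rule proposed above, assuming each member's
member-timeout is advertised to the rest of the cluster. The class, the
method and the map of advertised values are hypothetical and are not Geode
APIs. Falling back to the checker's own timeout for members that advertise
nothing is also an assumption.

```java
import java.util.Map;

// Hypothetical helper: none of these names are Geode APIs.
final class FinalCheckTimeout {

    /**
     * Returns how long the checking member waits during the final check.
     * Uses the suspect's advertised member-timeout when one is known;
     * otherwise falls back to the checker's own member-timeout (an assumed
     * policy for members that do not advertise a value).
     */
    static long finalCheckTimeoutMs(long checkerTimeoutMs,
                                    Map<String, Long> advertisedTimeoutsMs,
                                    String suspectId) {
        Long advertised = advertisedTimeoutsMs.get(suspectId);
        return advertised != null ? advertised : checkerTimeoutMs;
    }

    public static void main(String[] args) {
        // B advertises a longer timeout; D advertises nothing.
        Map<String, Long> advertised = Map.of("B", 15_000L);
        System.out.println(finalCheckTimeoutMs(5_000L, advertised, "B")); // 15000
        System.out.println(finalCheckTimeoutMs(5_000L, advertised, "D")); // 5000
    }
}
```

Because the value is looked up by suspect rather than by checker, member C
from the example above would wait just as long for B as member A does.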
>
> Are there any concerns about this new proposal?
>
>
> Thanks,
> Aravind Musigumpula
>
> -----Original Message-----
> From: Bruce Schuchardt [mailto:bschucha...@pivotal.io]
> Sent: Thursday, September 07, 2017 10:42 PM
> To: dev@geode.apache.org
> Subject: Re: Monitor the neighbour JVM using neihbour's member-timeout
>
> I think this might be an acceptable change though I doubt many people
> would find it useful.
>
> It's already possible to set different member-timeouts on each node of the
> distributed system but the meaning of the setting is the inverse of what's
> proposed here, so having the current setting be different in each node is
> pretty useless.
>
> I think the initiation of suspect processing ought to be addressed if we
> make this change.  The ack-wait-threshold and ack-severe-alert-threshold
> aren't based on the member-timeout but ought to be.  This would make it
> possible to initiate suspect processing with different timing for different
> nodes.  It would still leave the question of slow backup operations
> hanging:  If you're waiting for one node that's blocked waiting for a
> response from another node (say a node holding a backup bucket) you are
> going to initiate suspect processing on the node you're waiting on & not
> those other (backup) nodes.
>
> Rolling upgrade will also be a problem since old members aren't going to
> cough up their member-timeout settings.  What should be used as a
> membership timeout for the old members during an upgrade?
>
> If we proceed with this idea I'd prefer that we deprecate member-timeout,
> ack-wait-threshold and ack-severe-alert-threshold and have new settings
> with the "ack" settings being multiples of the new membership timeout
> setting.
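
One way to read "multiples of the new membership timeout" is sketched
below. The class name, field names and the example multiples are all
hypothetical; nothing here is an existing Geode setting or API.

```java
// Hypothetical value class: derives both "ack" waits from one base timeout.
final class DerivedAckThresholds {
    final long membershipTimeoutMs;
    final long ackWaitMs;
    final long ackSevereAlertMs;

    DerivedAckThresholds(long membershipTimeoutMs,
                         int ackWaitMultiple,
                         int ackSevereAlertMultiple) {
        if (membershipTimeoutMs <= 0 || ackWaitMultiple <= 0 || ackSevereAlertMultiple <= 0) {
            throw new IllegalArgumentException("timeout and multiples must be positive");
        }
        this.membershipTimeoutMs = membershipTimeoutMs;
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        this.ackWaitMs = Math.multiplyExact(membershipTimeoutMs, (long) ackWaitMultiple);
        this.ackSevereAlertMs = Math.multiplyExact(membershipTimeoutMs, (long) ackSevereAlertMultiple);
    }

    public static void main(String[] args) {
        // Example multiples (3 and 6) are illustrative only.
        DerivedAckThresholds t = new DerivedAckThresholds(5_000L, 3, 6);
        System.out.println(t.ackWaitMs);        // 15000
        System.out.println(t.ackSevereAlertMs); // 30000
    }
}
```

With this shape, raising one node's membership timeout stretches the point
at which suspect processing starts for that node by the same factor.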
>
> Concerning the PR, it isn't acceptable in its current form.
> InternalDistributedMember identifiers are often transmitted in messages
> and increasing their size affects performance.  Any new member attributes
> need to be added to NetView instead of InternalDistributedMember.
>
>
> On 8/22/17 12:35 AM, Aravind Musigumpula wrote:
> > Hi Team,
> >
> > We have a requirement to configure different member-timeouts for
> > different members, as we need some members to survive in the view for
> > longer than the others before being kicked out of the view when they
> > aren't responding.
> >
> >
> > 1. With the current monitoring system it is not possible to determine
> >    when a member will be kicked out of the view if we configure
> >    different member-timeouts for some members.
> >
> > 2. This is because, if a member is not responding to heartbeat
> >    requests, the member that is monitoring it initiates a check-member
> >    request.
> >
> > 3. In this check-member request the monitoring member pings the
> >    non-responding member and waits for the monitoring member's
> >    member-timeout for a response.
> >
> > 4. If there is still no response, it sends a final suspect request to
> >    the coordinator, and the coordinator does the final check, waiting
> >    for the coordinator's member-timeout.
> >
> > 5. If the coordinator does not get any response, it removes the
> >    non-responding member from the view and publishes the new view.
> >
> > 6. So the time taken to remove a member depends on the timeouts of its
> >    monitoring member and of the coordinator. But which member monitors
> >    it depends on the view, which may change from time to time.
> >
> > So, if the monitoring member, when checking a member, waits for the
> > non-responding member's timeout instead of its own member-timeout, then
> > the time at which the non-responding member is removed from the view
> > depends on its own member-timeout and the coordinator's member-timeout.
> > Hence we can configure different member-timeouts for the members that
> > need them.
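
The difference can be put into numbers with a rough sketch. It counts only
the two waits described in steps 3 and 4 above and ignores the time taken
to notice the missing heartbeats in the first place. The class and method
names are hypothetical, not Geode code.

```java
// Hypothetical arithmetic helper for the two waits in steps 3 and 4.
final class RemovalWindow {

    /** Current behaviour: the monitor waits its own timeout, then the coordinator waits its own. */
    static long currentMs(long monitorTimeoutMs, long coordinatorTimeoutMs) {
        return monitorTimeoutMs + coordinatorTimeoutMs;
    }

    /** Proposed behaviour: the monitor waits the suspect's timeout, then the coordinator waits its own. */
    static long proposedMs(long suspectTimeoutMs, long coordinatorTimeoutMs) {
        return suspectTimeoutMs + coordinatorTimeoutMs;
    }

    public static void main(String[] args) {
        long suspect = 15_000L;     // the member we want to keep in the view longer
        long coordinator = 5_000L;

        // Today the result changes with whichever neighbour happens to be the monitor.
        System.out.println(currentMs(5_000L, coordinator));  // 10000
        System.out.println(currentMs(8_000L, coordinator));  // 13000

        // Under the proposal the monitor's own setting no longer matters.
        System.out.println(proposedMs(suspect, coordinator)); // 20000
    }
}
```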
> >
> > I created a pull request based on the above scenario:
> > https://github.com/apache/geode/pull/717
> >
> > Is the above approach correct? Do we have any concerns around this area?
> > Please give your insights on this issue.
> >
> > Thanks,
> > Aravind Musigumpula
> >
> > This message and the information contained herein is proprietary and
> > confidential and subject to the Amdocs policy statement,
> >
> > you may review at https://www.amdocs.com/about/email-disclaimer
> > <https://www.amdocs.com/about/email-disclaimer>
> >
>
