Re: Monitor the neighbour JVM using neighbour's member-timeout

Bruce Schuchardt Thu, 07 Sep 2017 10:12:50 -0700

I think this might be an acceptable change, though I doubt many people would find it useful.

It's already possible to set different member-timeouts on each node of the distributed system, but the meaning of the setting is the inverse of what's proposed here, so having the current setting be different in each node is pretty useless.

I think the initiation of suspect processing ought to be addressed if we make this change. The ack-wait-threshold and ack-severe-alert-threshold aren't based on the member-timeout but ought to be. This would make it possible to initiate suspect processing with different timing for different nodes. It would still leave the question of slow backup operations hanging: if you're waiting for one node that's blocked waiting for a response from another node (say a node holding a backup bucket), you are going to initiate suspect processing on the node you're waiting on and not on those other (backup) nodes.

Rolling upgrade will also be a problem since old members aren't going to cough up their member-timeout settings. What should be used as a membership timeout for the old members during an upgrade?

If we proceed with this idea, I'd prefer that we deprecate member-timeout, ack-wait-threshold and ack-severe-alert-threshold and have new settings, with the "ack" settings being multiples of the new membership timeout setting.
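
To make the "multiples" idea concrete, here is a minimal sketch of how the two ack thresholds could be derived from a single membership timeout. The class name, the constructor parameters and the multiplier values are all hypothetical; none of them are existing Geode settings or APIs.

```java
import java.time.Duration;

// Hypothetical sketch: ack thresholds expressed as multiples of one
// membership timeout. Not a Geode class; all names are illustrative.
public class DerivedAckSettingsSketch {
    private final Duration membershipTimeout;
    private final int ackWaitMultiple;
    private final int ackSevereAlertMultiple;

    public DerivedAckSettingsSketch(Duration membershipTimeout,
                                    int ackWaitMultiple,
                                    int ackSevereAlertMultiple) {
        if (membershipTimeout.isNegative() || membershipTimeout.isZero()) {
            throw new IllegalArgumentException("membership timeout must be positive");
        }
        if (ackWaitMultiple < 1 || ackSevereAlertMultiple < ackWaitMultiple) {
            throw new IllegalArgumentException("multiples must satisfy 1 <= ackWait <= ackSevereAlert");
        }
        this.membershipTimeout = membershipTimeout;
        this.ackWaitMultiple = ackWaitMultiple;
        this.ackSevereAlertMultiple = ackSevereAlertMultiple;
    }

    /** Time to wait for an ack before suspect processing could start. */
    public Duration ackWait() {
        return membershipTimeout.multipliedBy(ackWaitMultiple);
    }

    /** Time to wait for an ack before a severe alert would be raised. */
    public Duration ackSevereAlert() {
        return membershipTimeout.multipliedBy(ackSevereAlertMultiple);
    }
}
```

With a 5 second membership timeout and multiples of 3 and 6, this gives a 15 second ack wait and a 30 second severe-alert threshold. Changing the membership timeout on a node then scales both thresholds with it.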

Concerning the PR, it isn't acceptable in its current form. InternalDistributedMember identifiers are often transmitted in messages and increasing their size affects performance. Any new member attributes need to be added to NetView instead of InternalDistributedMember.
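
As a rough illustration of that last point, the sketch below keeps the per-member timeout in a view-level map keyed by member id, so the identifier that travels in messages does not grow. The class is a hypothetical stand-in, not Geode's NetView or InternalDistributedMember, and the fallback to a default value for members with no recorded timeout is only a placeholder assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a membership view; not Geode's NetView.
// Member ids are plain strings here to keep the sketch self-contained.
public class ViewAttributeSketch {
    private final Map<String, Integer> memberTimeoutMillis = new HashMap<>();
    private final int defaultTimeoutMillis;

    public ViewAttributeSketch(int defaultTimeoutMillis) {
        this.defaultTimeoutMillis = defaultTimeoutMillis;
    }

    /** Records a member's timeout in the view, not in its identifier. */
    public void putTimeout(String memberId, int timeoutMillis) {
        memberTimeoutMillis.put(memberId, timeoutMillis);
    }

    /** Looks up a member's timeout; unknown members get the default (an assumption). */
    public int timeoutFor(String memberId) {
        return memberTimeoutMillis.getOrDefault(memberId, defaultTimeoutMillis);
    }
}
```

The attribute is then distributed once per view change instead of with every message that carries a member id.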



On 8/22/17 12:35 AM, Aravind Musigumpula wrote:

Hi Team,

We have a requirement to configure different member-timeouts for different
members, as we need some members to survive in the view for a longer time than
the other members before being kicked out of the view when they aren't
responding.


1. With the current monitoring system, it is not possible to determine when a
member will be kicked out of the view if we configure different
member-timeouts for some required members.

2. If a member is not responding to heartbeat requests, the member that is
monitoring it initiates a check-member request.

3. In this check-member request, the monitoring member pings the
non-responding member and waits for the monitoring member's member-timeout for
a response.

4. If there is still no response, it sends a final suspect request to the
coordinator, and the coordinator does the final check, waiting for the
coordinator's member-timeout.

5. If the coordinator does not get a response, it removes the non-responding
member from the view and publishes the new view.

6. So the time taken to remove a member depends on the member-timeouts of its
monitoring member and of the coordinator. But which member does the monitoring
depends on the view, and the view may change from time to time.

So, if the monitoring member waits for the non-responding member's
member-timeout instead of its own when doing the check, then the time at which
the non-responding member is removed from the view depends only on its own
member-timeout and the coordinator's member-timeout.
Hence we can configure different member-timeouts for the required members.
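
The difference between the two schemes can be sketched as simple arithmetic. This is a simplified model, not Geode code: it assumes the removal delay is just the check-member wait plus the coordinator's final-check wait, and it ignores the time taken to notice the missed heartbeats. The class and method names are hypothetical.

```java
// Hypothetical model of the two check waits described above; not Geode code.
public class RemovalTimingSketch {

    /** Current behaviour: the monitor waits its own member-timeout, then the coordinator waits its own. */
    public static long currentCheckWaitMillis(long monitorTimeout, long coordinatorTimeout) {
        return monitorTimeout + coordinatorTimeout;
    }

    /** Proposed behaviour: the monitor waits the non-responding member's member-timeout instead. */
    public static long proposedCheckWaitMillis(long suspectTimeout, long coordinatorTimeout) {
        return suspectTimeout + coordinatorTimeout;
    }

    public static void main(String[] args) {
        long coordinatorTimeout = 5000;
        long suspectTimeout = 20000; // member that should survive longer

        // Current scheme: the result depends on whichever member happens to be monitoring.
        System.out.println(currentCheckWaitMillis(5000, coordinatorTimeout));
        System.out.println(currentCheckWaitMillis(10000, coordinatorTimeout));

        // Proposed scheme: the result depends only on the suspect and the coordinator.
        System.out.println(proposedCheckWaitMillis(suspectTimeout, coordinatorTimeout));
    }
}
```

In this model the current scheme gives a different delay for the same member each time its monitor changes, while the proposed scheme gives the same delay regardless of which member is monitoring.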

I created a pull request based on the above scenario: 
https://github.com/apache/geode/pull/717

Is the above approach correct? Do we have any concerns around this area?
Please give your insights on this issue.

Thanks,
Aravind Musigumpula
