Pardon my ignorance, but is this something that should be brought up on the JGroups community?
--
Mike Stolz
Principal Engineer, GemFire Product Lead
Mobile: +1-631-835-4771

Download the new GemFire book here.
<https://content.pivotal.io/ebooks/scaling-data-services-with-pivotal-gemfire>

On Wed, Jan 17, 2018 at 2:37 AM, Aravind Musigumpula <aravind.musigump...@amdocs.com> wrote:

> Hi Everyone,
>
> Consider a Geode cluster in which some nodes host a large amount of data that is critical to the business, while other nodes host a smaller amount of data that is not critical.
>
> If both types of node are going through some operation that makes them unresponsive, the former type may take a couple of seconds longer than the latter to respond.
>
> In this scenario, is it fair to give the same member-timeout to all the members? What if we want to wait a little longer for such nodes?
>
> In the present configuration in Geode, we cannot wait a little longer for some nodes than for others, although we can configure a different member-timeout for each node. But I think no one will ever configure different timeouts for each node, because those member-timeouts are used to monitor their neighbors.
>
> In this solution, all we do is wait for the suspected member's member-timeout, instead of the checking member's own timeout, during the final check.
> It also has no backward-compatibility implications: anybody who wants the existing behavior can continue to use the same member-timeout for all the nodes, so the behavior of the system is preserved.
>
> If you have any concerns about this solution, please let me know.
>
>
> Thanks,
> Aravind Musigumpula
>
>
> -----Original Message-----
> From: Aravind Musigumpula
> Sent: Monday, December 18, 2017 6:55 PM
> To: dev@geode.apache.org
> Subject: RE: Monitor the neighbour JVM using neihbour's member-timeout
>
> Hi Community,
>
> Can you please give your suggestions on the below solution.
>
> I have raised a pull request for the same: https://github.com/apache/geode/pull/1075 .
>
>
> Thanks,
> Aravind Musigumpula
>
> -----Original Message-----
> From: Aravind Musigumpula
> Sent: Friday, November 03, 2017 3:23 PM
> To: dev@geode.apache.org
> Subject: RE: Monitor the neighbour JVM using neihbour's member-timeout
>
> Thanks Bruce for the suggestions. I will move the new variables from InternalDistributedMember to NetView and make the changes related to backward compatibility.
>
> Now I know that there is another way a member can be removed from the view: if a member sends a message and waits for ack-wait-threshold with no response from the target, the sender will do a final check and remove the target from the view if there is still no response.
> But I don't understand how deprecating the settings member-timeout, ack-wait-threshold and ack-severe-alert-threshold in favor of one will solve the problem. The main problem is that we want a member to survive in the view for a longer time than others.
>
> If we deprecate the settings in favor of a single setting and pass that setting to the monitoring member (say A), then A will use the timeout of the target member (say B, which we want to survive in the view for longer) for health monitoring, and the ack-wait-threshold to wait for the response to any message before doing the final check.
> But what if some other member (say C), which is monitoring another member (say D), has smaller values for member-timeout and ack-wait-threshold? If member C sends a message to B, C uses the smaller ack-wait-threshold (which is that of member D) to wait for a response, and again does the final check on the basis of the smaller member-timeout. So member B can still be kicked out of the view in a short amount of time.
>
> I think this can be solved simply if we use the member-timeout of the suspected member in the final check, where we establish a TCP connection. We don't need to combine those three settings either.
> We can set the member-timeout of a particular member to a higher value; the member that monitors it uses its own member-timeout as it does now, but during the final check it uses the suspected member's member-timeout (which is the greater value).
> The final check is the common step in both the no-heartbeat scenario and the no-response-to-a-message scenario.
>
> Are there any concerns around this new proposal?
>
>
> Thanks,
> Aravind Musigumpula
>
> -----Original Message-----
> From: Bruce Schuchardt [mailto:bschucha...@pivotal.io]
> Sent: Thursday, September 07, 2017 10:42 PM
> To: dev@geode.apache.org
> Subject: Re: Monitor the neighbour JVM using neihbour's member-timeout
>
> I think this might be an acceptable change though I doubt many people would find it useful.
>
> It's already possible to set different member-timeouts on each node of the distributed system but the meaning of the setting is the inverse of what's proposed here, so having the current setting be different in each node is pretty useless.
>
> I think the initiation of suspect processing ought to be addressed if we make this change. The ack-wait-threshold and ack-severe-alert-threshold aren't based on the member-timeout but ought to be. This would make it possible to initiate suspect processing with different timing for different nodes. It would still leave the question of slow backup operations hanging: If you're waiting for one node that's blocked waiting for a response from another node (say a node holding a backup bucket) you are going to initiate suspect processing on the node you're waiting on & not those other (backup) nodes.
>
> Rolling upgrade will also be a problem since old members aren't going to cough up their member-timeout settings. What should be used as a membership timeout for the old members during an upgrade?
>
> If we proceed with this idea I'd prefer that we deprecate member-timeout, ack-wait-threshold and ack-severe-alert-threshold and have new settings with the "ack" settings being multiples of the new membership timeout setting.
>
> Concerning the PR, it isn't acceptable in its current form. InternalDistributedMember identifiers are often transmitted in messages and increasing their size affects performance. Any new member attributes need to be added to NetView instead of InternalDistributedMember.
>
>
> On 8/22/17 12:35 AM, Aravind Musigumpula wrote:
> > Hi Team,
> >
> > We have a requirement to configure a different member-timeout for different members, as we need some members to survive in the view longer than the others before being kicked out of the view in case they aren't responding.
> >
> > 1. With the current monitoring system, it is not possible to determine when a member will be kicked out of the view if we configure different member-timeouts for some members.
> >
> > 2. If a member is not responding to heartbeat requests, the member that is monitoring it will initiate a check-member request.
> >
> > 3. In this check-member request, the monitoring member pings the non-responding member and waits for the monitoring member's member-timeout for a response.
> >
> > 4. If there is still no response, it will send a final suspect request to the coordinator, and the coordinator does the final check, waiting for the coordinator's member-timeout.
> >
> > 5. If the coordinator does not get any response, it will remove the non-responding member from the view and publish the new view.
> >
> > 6. So the time taken to remove a member depends on the timeouts of its monitoring member and of the coordinator. But which member is the monitoring member depends on the view, which may change from time to time.
> >
> > So when a monitoring member is doing the check on a member, if we wait for the non-responding member's member-timeout instead of the monitoring member's, then the time at which the non-responding member is removed from the view depends on its own member-timeout and the coordinator's member-timeout.
> > Hence we can configure a different member-timeout for the required members.
> >
> > I created a pull request based on the above scenario:
> > https://github.com/apache/geode/pull/717
> >
> > Is the above approach correct? Do we have any concerns around this area?
> > Please give your insights on this issue.
> >
> > Thanks,
> > Aravind Musigumpula
> >
> > This message and the information contained herein is proprietary and confidential and subject to the Amdocs policy statement, you may review at https://www.amdocs.com/about/email-disclaimer <https://www.amdocs.com/about/email-disclaimer>
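
The timing argument in the original message (steps 3 and 4 of the numbered list, and the proposed change that follows it) can be illustrated with a rough sketch. This is not Geode code: the `Member` class, the function names and the millisecond values are all hypothetical, and the model deliberately ignores the heartbeat period that elapses before the check-member request is sent. It only compares how long a suspect survives once the check has started, under the current behavior and under the proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    name: str
    member_timeout_ms: int  # hypothetical field mirroring the member-timeout setting


def check_wait_ms(monitor: Member, suspect: Member, use_suspect_timeout: bool) -> int:
    """How long the monitoring member waits for a reply to its check-member ping."""
    # Current behavior: the monitor's own timeout. Proposal: the suspect's timeout.
    return suspect.member_timeout_ms if use_suspect_timeout else monitor.member_timeout_ms


def removal_wait_ms(monitor: Member, suspect: Member, coordinator: Member,
                    use_suspect_timeout: bool) -> int:
    """Check-member wait plus the coordinator's final-check wait (steps 3 and 4)."""
    return (check_wait_ms(monitor, suspect, use_suspect_timeout)
            + coordinator.member_timeout_ms)


if __name__ == "__main__":
    coordinator = Member("coordinator", 5000)   # illustrative values only
    critical = Member("critical", 15000)
    for monitor in (Member("m1", 5000), Member("m2", 8000)):
        current = removal_wait_ms(monitor, critical, coordinator, False)
        proposed = removal_wait_ms(monitor, critical, coordinator, True)
        print(monitor.name, current, proposed)
```

With these made-up values, the current behavior gives a wait that varies with whichever member happens to be monitoring (10000 ms with m1, 13000 ms with m2), while the proposal gives 20000 ms in both cases, because the result no longer depends on the monitoring member's setting.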