RE: Monitor the neighbour JVM using neighbour's member-timeout

2018-01-17 Thread Aravind Musigumpula
Hi Everyone,

Consider a Geode cluster in which some nodes host a large amount of data that 
is critical to the business, while other nodes host a smaller amount of data 
that is not critical.

If both types of node are going through some operation that makes them 
unresponsive, the former may take a couple of seconds longer than the latter 
to respond.

In this scenario, is it fair to give the same member-timeout to all the members?
What if we want to wait a little longer for such nodes?

In the present Geode configuration we cannot wait a little longer for some 
nodes than for others, even though we can configure a different member-timeout 
on each node. I think no one will ever configure different timeouts per node, 
because each node's member-timeout is used to monitor its neighbours, not the 
node itself.

In this solution, all we do is wait for the suspected member's member-timeout, 
instead of the monitoring member's own timeout, during the final check.
It also has no backward-compatibility implications: anyone who wants the 
existing behavior can continue to use the same member-timeout on all the 
nodes, so the behavior of the system is preserved.
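
To make the difference concrete, here is a minimal sketch of the timeout 
selection only. The class and method names are hypothetical and are not taken 
from the Geode code base or from the pull request; the only real names are the 
"member-timeout" property and its documented default of 5000 ms.

```java
import java.util.Properties;

/** Hypothetical sketch of the timeout selection; not Geode's membership code. */
final class FinalCheckTimeoutSketch {

    // Real Geode property name; the value is in milliseconds.
    static final String MEMBER_TIMEOUT = "member-timeout";
    // Documented Geode default for member-timeout.
    static final int DEFAULT_MEMBER_TIMEOUT_MS = 5000;

    /** Reads member-timeout from a member's properties, falling back to the default. */
    static int memberTimeoutMillis(Properties config) {
        String value = config.getProperty(MEMBER_TIMEOUT,
                String.valueOf(DEFAULT_MEMBER_TIMEOUT_MS));
        return Integer.parseInt(value);
    }

    /** Existing behavior: the monitoring member applies its own timeout. */
    static int currentFinalCheckTimeout(Properties monitor, Properties suspect) {
        return memberTimeoutMillis(monitor);
    }

    /** Proposed behavior: the final check waits for the suspect's timeout. */
    static int proposedFinalCheckTimeout(Properties monitor, Properties suspect) {
        return memberTimeoutMillis(suspect);
    }

    public static void main(String[] args) {
        Properties ordinaryNode = new Properties();
        ordinaryNode.setProperty(MEMBER_TIMEOUT, "5000");

        Properties criticalNode = new Properties();
        criticalNode.setProperty(MEMBER_TIMEOUT, "8000");

        // The ordinary node monitors the critical node.
        System.out.println("current:  "
                + currentFinalCheckTimeout(ordinaryNode, criticalNode) + " ms");
        System.out.println("proposed: "
                + proposedFinalCheckTimeout(ordinaryNode, criticalNode) + " ms");
    }
}
```

When every node uses the same member-timeout, both methods return the same 
value, which is why the existing behavior is preserved.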

If you have any concerns about this solution, please let me know.


Thanks,
Aravind Musigumpula 


-Original Message-
From: Aravind Musigumpula 
Sent: Monday, December 18, 2017 6:55 PM
To: dev@geode.apache.org
Subject: RE: Monitor the neighbour JVM using neighbour's member-timeout

Hi Community,

Can you please give your suggestions on the solution below?

I have raised a pull request for it: 
https://github.com/apache/geode/pull/1075


Thanks,
Aravind Musigumpula 

-Original Message-
From: Aravind Musigumpula
Sent: Friday, November 03, 2017 3:23 PM
To: dev@geode.apache.org
Subject: RE: Monitor the neighbour JVM using neighbour's member-timeout

Thanks, Bruce, for the suggestions. I will move the new variables from 
InternalDistributedMember to NetView and make the changes needed for backward 
compatibility.

Now I know that there is another way a member can be removed from the view: 
if a member sends a message and waits for ack-wait-threshold without a 
response from the target, the sender does a final check and removes the target 
from the view if there is still no response. 
But I don't understand how deprecating member-timeout, ack-wait-threshold and 
ack-severe-alert-threshold in favor of a single setting will solve the problem. 
The main problem is that we want one member to survive in the view for a 
longer time than the others.

If we merge the settings into one and pass it to the monitoring member (say 
A), then A will use the timeout of the target member (say B, which we want to 
survive in the view for longer) for health monitoring, and its 
ack-wait-threshold to wait for a response to any message before doing the 
final check.
But what if some other member (say C), which monitors a different member (say 
D), has smaller values for member-timeout and ack-wait-threshold? If C sends a 
message to B, C uses the smaller ack-wait-threshold (that of member D) while 
waiting for a response, and then does the final check on the basis of the 
smaller member-timeout. So member B can still be kicked out of the view after 
a short time.
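
The A/B/C/D scenario can be put into numbers with a toy timing model. 
Everything in this sketch is an assumption made for illustration: the names 
are hypothetical, the values are made up, and the time until removal is 
simplified to "ack wait, then one final check". It only reflects that 
ack-wait-threshold is given in seconds and member-timeout in milliseconds.

```java
/** Toy timing model; every name here is hypothetical, not Geode API. */
final class RemovalTimingSketch {

    /** Per-member settings: member-timeout in ms, ack-wait-threshold in seconds. */
    static final class Settings {
        final int memberTimeoutMillis;
        final int ackWaitThresholdSeconds;

        Settings(int memberTimeoutMillis, int ackWaitThresholdSeconds) {
            this.memberTimeoutMillis = memberTimeoutMillis;
            this.ackWaitThresholdSeconds = ackWaitThresholdSeconds;
        }
    }

    /** Assumed model: wait for the ack, then run one final check. */
    static long millisUntilRemoval(int ackWaitSeconds, int finalCheckTimeoutMillis) {
        return ackWaitSeconds * 1000L + finalCheckTimeoutMillis;
    }

    /**
     * Merged-setting idea as described above: a sender applies the settings
     * of the neighbour it monitors to whichever member it is talking to.
     */
    static long mergedSetting(Settings monitoredNeighbour) {
        return millisUntilRemoval(monitoredNeighbour.ackWaitThresholdSeconds,
                monitoredNeighbour.memberTimeoutMillis);
    }

    /** Alternative: the final check uses the suspect's own member-timeout. */
    static long suspectTimeoutInFinalCheck(Settings sender, Settings suspect) {
        return millisUntilRemoval(sender.ackWaitThresholdSeconds,
                suspect.memberTimeoutMillis);
    }

    public static void main(String[] args) {
        Settings b = new Settings(10_000, 30); // the member we want to keep longer
        Settings c = new Settings(5_000, 15);
        Settings d = new Settings(5_000, 15);  // C's monitored neighbour

        System.out.println("A -> B, merged setting:  " + mergedSetting(b) + " ms");
        System.out.println("C -> B, merged setting:  " + mergedSetting(d) + " ms");
        System.out.println("C -> B, suspect timeout: "
                + suspectTimeoutInFinalCheck(c, b) + " ms");
    }
}
```

With these made-up values A tolerates B for 40000 ms, but C gives up on B 
after 20000 ms under the merged setting, which is the gap described above. 
If the final check used B's own member-timeout, C would wait 25000 ms.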

I think this can be solved simply by using the suspected member's 
member-timeout in the final check, where we establish a TCP connection. We 
don't need to combine those three settings either. We can set the 
member-timeout of a particular member to a higher value; the member that 
monitors it uses its own member-timeout as it does now, but during the final 
check it uses the suspected member's member-timeout (the greater value). The 
final check is common to both the no-heartbeat scenario and the 
no-response-to-a-message scenario.
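
As a rough illustration of a TCP-based final check that is bounded by the 
suspect's timeout, consider the sketch below. It is not the Geode 
implementation: the class and method names are hypothetical, and the check is 
reduced to a plain connection attempt using the standard 
java.net.Socket.connect(SocketAddress, int) call.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical sketch of a final check; not Geode's health monitor. */
final class SuspectFinalCheckSketch {

    /**
     * Tries to open a TCP connection to the suspect, waiting at most the
     * suspect's own member-timeout (in milliseconds) instead of the
     * monitoring member's timeout.
     *
     * @return true if the connection was established within the timeout
     */
    static boolean respondsWithin(InetSocketAddress suspectAddress,
                                  int suspectMemberTimeoutMillis) {
        if (suspectMemberTimeoutMillis <= 0) {
            // Socket.connect treats 0 as "wait forever", which a failure
            // detector should never do.
            throw new IllegalArgumentException("timeout must be positive");
        }
        try (Socket socket = new Socket()) {
            socket.connect(suspectAddress, suspectMemberTimeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers connection refused and SocketTimeoutException alike.
            return false;
        }
    }
}
```

Because both the no-heartbeat path and the no-response path would end in the 
same call, passing the suspect's timeout here covers both scenarios.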

Are there any concerns around this new proposal?


Thanks,
Aravind Musigumpula 

-Original Message-
From: Bruce Schuchardt [mailto:bschucha...@pivotal.io]
Sent: Thursday, September 07, 2017 10:42 PM
To: dev@geode.apache.org
Subject: Re: Monitor the neighbour JVM using neighbour's member-timeout

I think this might be an acceptable change though I doubt many people would 
find it useful.

It's already possible to set different member-timeouts on each node of the 
distributed system but the meaning of the setting is the inverse of what's 
proposed here, so having the current setting be different in each node is 
pretty useless.

I think the initiation of suspect processing ought to be addressed if we make 
this change.  The ack-wait-threshold and ack-severe-alert-threshold aren't 
based on the member-timeout but ought to be.  This would make it possible to 
initiate suspect processing with different timing for different nodes.  It 
would still leave the question of slow backup operations hanging:  If you're 
waiting for one node that's blocked waiting for a response from another

[Spring CI] Spring Data GemFire > Nightly-ApacheGeode > #800 was SUCCESSFUL (with 2324 tests)

2018-01-17 Thread Spring CI

---
Spring Data GemFire > Nightly-ApacheGeode > #800 was successful.
---
Scheduled
2326 tests in total.

https://build.spring.io/browse/SGF-NAG-800/





--
This message is automatically generated by Atlassian Bamboo

Re: Monitor the neighbour JVM using neighbour's member-timeout

2018-01-17 Thread Michael Stolz
Pardon my ignorance, but is this something that should be brought up with the
JGroups community?

--
Mike Stolz
Principal Engineer, GemFire Product Lead
Mobile: +1-631-835-4771


On Wed, Jan 17, 2018 at 2:37 AM, Aravind Musigumpula <
aravind.musigump...@amdocs.com> wrote:

> [...]

Geode unit tests 'develop/DistributedTest' took too long to execute

2018-01-17 Thread apachegeodeci
Pipeline results can be found at:

Concourse: 
https://concourse.apachegeode-ci.info/teams/main/pipelines/develop/jobs/DistributedTest/builds/83