Re: Unable to get behavior described in documentation when using durable native client

2020-04-17 Thread Jakov Varenina

Hi Jacob,

Thanks for your response!

Regarding GEODE-7944, "Unable to deserialize *membership id* 
java.io.EOFException" is not logged but thrown, and it breaks processing 
of QueueConnectionRequest in locator. This reflects in native client 
with "No locators found" even though they are available. Happens only 
when native client with subscription is enabled and locator started with 
--log-level=debug.


I haven't had time to test and analyze the native durable client in detail 
yet. So far I could only confirm that, when a native durable client is used, 
the locator behaves differently from what is described in the 
documentation (see previous mail) and from how the Java client works:


It seems that the native durable client always requests all available 
servers from the locator (redundant=-1, findDurable=false) with a 
QueueConnectionRequest. The locator returns them in a QueueConnectionResponse 
ordered by load (best...worst). For the Java durable client, by contrast, the 
locator uses the *membership id* from the QueueConnectionRequest to locate 
the servers that host the client's queue and sends those back in the 
QueueConnectionResponse, as described in the previous mail. I expect that the 
native durable client also handles re-connection to the same servers' queues 
somehow, but this still has to be investigated. Any hints or comments related 
to this would be really appreciated.
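
To make the contrast concrete, here is a purely illustrative sketch of the 
locator-side choice. Only QueueConnectionRequest/QueueConnectionResponse, 
findDurable, the membership id and the redundancy value come from this thread; 
every class, field and method name below is invented for the example and is 
not Geode's actual locator code.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only -- stand-in types, not Geode internals.
public class QueueServerSelectionSketch {

  static class Request {          // stand-in for QueueConnectionRequest
    boolean findDurable;          // Java durable client: true, native client: false
    String membershipId;          // null when it could not be deserialized
    int redundancy;               // -1 observed from the native client
  }

  // Decide which servers end up in the QueueConnectionResponse.
  static List<String> pickServers(Request req,
                                  List<String> serversOrderedByLoad,
                                  List<String> serversHostingClientQueue) {
    if (req.findDurable && req.membershipId != null
        && !serversHostingClientQueue.isEmpty()) {
      // Behavior described in the docs for durable clients: return the
      // servers that already host this client's subscription queue.
      return new ArrayList<>(serversHostingClientQueue);
    }
    // Behavior observed with the native client (findDurable=false,
    // redundancy -1): return all available servers, best load first.
    return new ArrayList<>(serversOrderedByLoad);
  }

  public static void main(String[] args) {
    Request nativeClient = new Request();   // findDurable stays false
    nativeClient.redundancy = -1;
    System.out.println(pickServers(nativeClient,
        Arrays.asList("s1", "s2", "s3"), Arrays.asList("s2")));  // [s1, s2, s3]
  }
}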


BRs,

Jakov

On 15. 04. 2020. 10:07, Jacob Barrett wrote:

Looking back at history the native library has always only ever set that 
findDurable flag to false. I traced it back to its initial commit. Aside from 
the annoying log message, does client durable connection work correctly?


On Apr 14, 2020, at 10:56 PM, Jakov Varenina  wrote:

Hi all,

Could you please help me understand the behavior of the native client when 
it is configured as durable?

I have been working on bug GEODE-7944, which results in the exception 
"Unable to deserialize membership id java.io.EOFException" on the locator, but only when debug 
logging is enabled. This happens because the native client, only when subscription is enabled, sends 
the locator a QueueConnectionRequest that doesn't encapsulate a ClientProxyMembershipID (it is not 
properly serialized), and therefore the exception occurs when the locator tries to deserialize the 
membership id in order to log it at debug level.
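
To illustrate that failure mode in isolation (this is generic Java, not 
Geode's locator code; the class, method and payload below are made up for the 
example): a value that is deserialized only so it can be logged at debug level 
turns a garbled payload into a request-aborting error unless the exception is 
contained.

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Generic illustration only -- not Geode code.
public class DebugLogDeserializationSketch {

  static void handleRequest(byte[] membershipIdBytes, boolean debugEnabled) {
    if (debugEnabled) {
      try (DataInputStream in =
               new DataInputStream(new ByteArrayInputStream(membershipIdBytes))) {
        // Pretend the id is a UTF string; a truncated payload throws EOFException.
        System.out.println("request from " + in.readUTF());
      } catch (IOException e) {
        // Containing the error here keeps request processing alive; letting it
        // propagate is what the client would see as a failed locator request.
        System.out.println("could not deserialize membership id for logging: " + e);
      }
    }
    System.out.println("...continue locating servers for the client...");
  }

  public static void main(String[] args) {
    handleRequest(new byte[] {0, 5, 'a'}, true); // length says 5, only 1 byte follows
  }
}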

I was trying to figure out why the locator would need the ClientProxyMembershipID from 
the native client and found the following paragraph in the documentation (copied from 
https://geode.apache.org/docs/geode-native/cpp/112/connection-pools/subscription-properties.html):

   For durable subscriptions, the server locator must be able to
   locate the servers that host the queues for the durable client. When
   a durable client sends a request, the server locator queries all the
   available servers to see if they are hosting the subscription region
   queue for the durable client. If the server is located, the client
   is connected to the server hosting the subscription region queue.

The locator behaves as described in the above paragraph only when it receives a 
QueueConnectionRequest with the findDurable flag set to "true" and with a valid 
membership id. I noticed that, unlike the Java client, the native client always 
sets findDurable to "false", and therefore the locator will never behave as 
described in the above paragraph when a native client is used.

Does anybody know why the native client always sets findDurable=false?

BRs,

Jakov


[PROPOSAL]: GEODE-7940 to support/1.12

2020-04-17 Thread Ju@N
Hello devs,

I'd like to propose bringing *GEODE-7940 [1]* to the *support/1.12* branch.
The bug is not new (it seems quite old, actually), but it is still pretty
critical as it can lead to data loss in WAN topologies.
Long story short: when there are multiple parallel *gateway-senders* attached
to the same region and the user decides to stop, detach and *destroy* one
of the senders (the destroy part is what actually causes the problem), the
rest of the attached senders will silently stop replicating events to the
remote clusters and all these events will be lost.
The fix has been merged into develop through commit *bfbb398 [2]*.
Best regards.

[1]: https://issues.apache.org/jira/browse/GEODE-7940
[2]:
https://github.com/apache/geode/commit/bfbb398891c5d96fa3a5975365b29d71bd849ad6

-- 
Ju@N


Re: [PROPOSAL]: GEODE-7940 to support/1.12

2020-04-17 Thread Owen Nichols
Hi Juan, this looks like a great fix and definitely meets the "critical fix" 
standard.  Also thanks for the detailed description.

I noticed it was just merged to develop very recently.  I would like to see 
that it gets through all tests and spends a couple days on develop, then I will 
be happy to give this a +1 on Monday if no surprises turn up.

> On Apr 17, 2020, at 1:41 AM, Ju@N  wrote:
> 
> Hello devs,
> 
> I'd like to propose bringing *GEODE-7940 [1]* to the *support/1.12* branch.
> The bug is not new, seems quite old actually, but still seems pretty
> critical as it can lead to data loss in WAN topologies.
> Long story short: when there are multiple parallel *gateway-senders* attached
> to the same region and the user decides to stop, detach and *destroy* one
> of the senders (the destroy part is what actually causes the problem), the
> rest of the senders attached will silently stop replicating events to the
> remote clusters and all these events will be lost.
> The fix has been merged into develop through commit *bfbb398 [2]*.
> Best regards.
> 
> [1]: https://issues.apache.org/jira/browse/GEODE-7940
> [2]:
> https://github.com/apache/geode/commit/bfbb398891c5d96fa3a5975365b29d71bd849ad6
> 
> -- 
> Ju@N



Re: [PROPOSAL]: GEODE-7940 to support/1.12

2020-04-17 Thread Ju@N
Sounds good Owen, thanks!

On Fri, 17 Apr 2020 at 09:50, Owen Nichols  wrote:

> Hi Juan, this looks like a great fix and definitely meets the "critical
> fix" standard.  Also thanks for the detailed description.
>
> I noticed it was just merged to develop very recently.  I would like to
> see that it gets through all tests and spends a couple days on develop,
> then I will be happy to give this a +1 on Monday if no surprises turn up.
>
> > On Apr 17, 2020, at 1:41 AM, Ju@N  wrote:
> >
> > Hello devs,
> >
> > I'd like to propose bringing *GEODE-7940 [1]* to the *support/1.12*
> branch.
> > The bug is not new, seems quite old actually, but still seems pretty
> > critical as it can lead to data loss in WAN topologies.
> > Long story short: when there are multiple parallel *gateway-senders*
> attached
> > to the same region and the user decides to stop, detach and *destroy* one
> > of the senders (the destroy part is what actually causes the problem),
> the
> > rest of the senders attached will silently stop replicating events to the
> > remote clusters and all these events will be lost.
> > The fix has been merged into develop through commit *bfbb398 [2]*.
> > Best regards.
> >
> > [1]: https://issues.apache.org/jira/browse/GEODE-7940
> > [2]:
> >
> https://github.com/apache/geode/commit/bfbb398891c5d96fa3a5975365b29d71bd849ad6
> >
> > --
> > Ju@N
>
>

-- 
Ju@N


Re: [PROPOSAL]: GEODE-7940 to support/1.12

2020-04-17 Thread Joris Melchior
Sounds like something we should include.

+1

On Fri, Apr 17, 2020 at 4:41 AM Ju@N  wrote:

> Hello devs,
>
> I'd like to propose bringing *GEODE-7940 [1]* to the *support/1.12* branch.
> The bug is not new, seems quite old actually, but still seems pretty
> critical as it can lead to data loss in WAN topologies.
> Long story short: when there are multiple parallel *gateway-senders*
> attached
> to the same region and the user decides to stop, detach and *destroy* one
> of the senders (the destroy part is what actually causes the problem), the
> rest of the senders attached will silently stop replicating events to the
> remote clusters and all these events will be lost.
> The fix has been merged into develop through commit *bfbb398 [2]*.
> Best regards.
>
> [1]: https://issues.apache.org/jira/browse/GEODE-7940
> [2]:
>
> https://github.com/apache/geode/commit/bfbb398891c5d96fa3a5975365b29d71bd849ad6
>
> --
> Ju@N
>


-- 
*Joris Melchior *
CF Engineering
Pivotal Toronto
416 877 5427

“Programs must be written for people to read, and only incidentally for
machines to execute.” – *Hal Abelson*



Re: Unable to get behavior described in documentation when using durable native client

2020-04-17 Thread Jacob Barrett
Can you confirm that when the log level is less than debug the IOException goes 
away and the client appears to function?

-Jake


> On Apr 17, 2020, at 1:12 AM, Jakov Varenina  wrote:
> 
> Hi Jacob,
> 
> Thanks for your response!
> 
> Regarding GEODE-7944, "Unable to deserialize *membership id* 
> java.io.EOFException" is not logged but thrown, and it breaks processing of 
> QueueConnectionRequest in locator. This reflects in native client with "No 
> locators found" even though they are available. Happens only when native 
> client with subscription is enabled and locator started with 
> --log-level=debug.
> 
> I haven't had time to test and analyze in detail native durable client yet. 
> So far I could only confirm that when using native durable client then 
> locator behaves differently than how it is described in documentation (see 
> previous mail) and how java client works:
> 
> It seems that native durable client always requests from locator all 
> available servers (redundant= -1, findDurable=false) with 
> QueueConnectionRequest. Locator returns them in QueueConnectionResponse 
> ordered by load (best...worst). While for java durable client, locator use 
> *membership id *from QueueConnectionRequest to locate servers that host 
> client queue and send them back in QueueConnectionResponse as described in 
> previous mail. I expect that native durable client is handling re-connection 
> to same servers queue somehow also, but this has to be investigated yet. Any 
> hints or comments related to this would be really appreciated.
> 
> BRs,
> 
> Jakov
> 
> On 15. 04. 2020. 10:07, Jacob Barrett wrote:
>> Looking back at history the native library has always only ever set that 
>> findDurable flag to false. I traced it back to its initial commit. Aside 
>> from the annoying log message, does client durable connection work correctly?
>> 
>>> On Apr 14, 2020, at 10:56 PM, Jakov Varenina  
>>> wrote:
>>> 
>>> Hi all,
>>> 
>>> Could you please help me understand behavior of the native client when 
>>> configured as durable?
>>> 
>>> I have been working on a bug GEODE-7944 
>>>  which results with 
>>> exception "Unable to deserialize membership id java.io.EOFException" on 
>>> locator only when debug is enabled. This happens because native client, 
>>> only when subscription is enabled, sends towards locator 
>>> QueueConnectionRequest that doesn't encapsulate ClientProxyMembershipID 
>>> (not properly serialized) and therefore exception occurs when locator tries 
>>> to deserialize membership id to log it at debug level.
>>> 
>>> I was trying to figure out why would locator need ClientProxyMembershipID 
>>> from native client and found following paragraph in the documentation 
>>> (copied from 
>>> https://geode.apache.org/docs/geode-native/cpp/112/connection-pools/subscription-properties.html):
>>> 
>>>   For durable subscriptions, the server locator must be able to
>>>   locate the servers that host the queues for the durable client. When
>>>   a durable client sends a request, the server locator queries all the
>>>   available servers to see if they are hosting the subscription region
>>>   queue for the durable client. If the server is located, the client
>>>   is connected to the server hosting the subscription region queue.
>>> 
>>> Locator behaves as described in above paragraph only when it receives 
>>> QueueConnectionRequest with findDurable flag set to "true" and with 
>>> valid membership id. I noticed that unlike java client, the native 
>>> client always sets findDurable to "false" and therefore locator 
>>> will never behave as described in above paragraph when native client is 
>>> used.
>>> 
>>> Does anybody know why native client always sets findDurable=false?
>>> 
>>> BRs,
>>> 
>>> Jakov



Re: [PROPOSAL]: GEODE-7940 to support/1.12

2020-04-17 Thread Udo Kohlmeyer
Agreed… this definitely meets the inclusion requirements.

+1
On Apr 17, 2020, 1:50 AM -0700, Owen Nichols , wrote:
Hi Juan, this looks like a great fix and definitely meets the "critical fix" 
standard. Also thanks for the detailed description.

I noticed it was just merged to develop very recently. I would like to see that 
it gets through all tests and spends a couple days on develop, then I will be 
happy to give this a +1 on Monday if no surprises turn up.

On Apr 17, 2020, at 1:41 AM, Ju@N  wrote:

Hello devs,

I'd like to propose bringing *GEODE-7940 [1]* to the *support/1.12* branch.
The bug is not new, seems quite old actually, but still seems pretty
critical as it can lead to data loss in WAN topologies.
Long story short: when there are multiple parallel *gateway-senders* attached
to the same region and the user decides to stop, detach and *destroy* one
of the senders (the destroy part is what actually causes the problem), the
rest of the senders attached will silently stop replicating events to the
remote clusters and all these events will be lost.
The fix has been merged into develop through commit *bfbb398 [2]*.
Best regards.

[1]: 
https://issues.apache.org/jira/browse/GEODE-7940
[2]:
https://github.com/apache/geode/commit/bfbb398891c5d96fa3a5975365b29d71bd849ad6

--
Ju@N



Re: Unable to get behavior described in documentation when using durable native client

2020-04-17 Thread Jacob Barrett
This log message about the GFE version is suspicious too. That is a VERY old 
version with ordinal value 1. The C++ client is supposed to be sending 45 which 
is GFE 9.0.0 / Geode 1.0.0.

Looks like there might be a handshake protocol misalignment somewhere. Will 
keep digging.

-Jake

> On Apr 17, 2020, at 7:24 AM, Jacob Barrett  wrote:
> 
> Can you confirm that when log level less than debug that the IOException goes 
> away and the client appears to function?
> 
> -Jake
> 
> 
>> On Apr 17, 2020, at 1:12 AM, Jakov Varenina  wrote:
>> 
>> Hi Jacob,
>> 
>> Thanks for your response!
>> 
>> Regarding GEODE-7944, "Unable to deserialize *membership id* 
>> java.io.EOFException" is not logged but thrown, and it breaks processing of 
>> QueueConnectionRequest in locator. This reflects in native client with "No 
>> locators found" even though they are available. Happens only when native 
>> client with subscription is enabled and locator started with 
>> --log-level=debug.
>> 
>> I haven't had time to test and analyze in detail native durable client yet. 
>> So far I could only confirm that when using native durable client then 
>> locator behaves differently than how it is described in documentation (see 
>> previous mail) and how java client works:
>> 
>> It seems that native durable client always requests from locator all 
>> available servers (redundant= -1, findDurable=false) with 
>> QueueConnectionRequest. Locator returns them in QueueConnectionResponse 
>> ordered by load (best...worst). While for java durable client, locator use 
>> *membership id *from QueueConnectionRequest to locate servers that host 
>> client queue and send them back in QueueConnectionResponse as described in 
>> previous mail. I expect that native durable client is handling re-connection 
>> to same servers queue somehow also, but this has to be investigated yet. Any 
>> hints or comments related to this would be really appreciated.
>> 
>> BRs,
>> 
>> Jakov
>> 
>> On 15. 04. 2020. 10:07, Jacob Barrett wrote:
>>> Looking back at history the native library has always only ever set that 
>>> findDurable flag to false. I traced it back to its initial commit. Aside 
>>> from the annoying log message, does client durable connection work 
>>> correctly?
>>> 
 On Apr 14, 2020, at 10:56 PM, Jakov Varenina  
 wrote:
 
 Hi all,
 
 Could you please help me understand behavior of the native client when 
 configured as durable?
 
 I have been working on a bug GEODE-7944 
  which results with 
 exception "Unable to deserialize membership id java.io.EOFException" on 
 locator only when debug is enabled. This happens because native client, 
 only when subscription is enabled, sends towards locator 
 QueueConnectionRequest that doesn't encapsulate ClientProxyMembershipID 
 (not properly serialized) and therefore exception occurs when locator 
 tries to deserialize membership id to log it at debug level.
 
 I was trying to figure out why would locator need ClientProxyMembershipID 
 from native client and found following paragraph in the documentation 
 (copied from 
 https://geode.apache.org/docs/geode-native/cpp/112/connection-pools/subscription-properties.html):
 
  For durable subscriptions, the server locator must be able to
  locate the servers that host the queues for the durable client. When
  a durable client sends a request, the server locator queries all the
  available servers to see if they are hosting the subscription region
  queue for the durable client. If the server is located, the client
  is connected to the server hosting the subscription region queue.
 
 Locator behaves as described in above paragraph only when it receives 
 QueueConnectionRequest with findDurable flag set to "true" and 
 with valid membership id. I noticed that unlike java client, the 
 native client always sets findDurable to "false" and therefore 
 locator will never behave as described in above paragraph when native 
 client is used.
 
 Does anybody know why native client always sets findDurable=false?
 
 BRs,
 
 Jakov
> 



Re: [Discuss] Cache.close synchronous is not synchronous, but code still expects it to be....

2020-04-17 Thread Kirk Lund
Memcached IntegrationJUnitTest hangs the PR IntegrationTest job because
Cache.close() calls GeodeMemcachedService.close() which again calls
Cache.close(). Looks like the code base has lots of Cache.close() calls --
all of them could theoretically cause issues. I hate to add a
ThreadLocal isClosingThread flag or something like it just to allow reentrant
calls to Cache.close().
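
A minimal sketch of the kind of ThreadLocal guard being described (illustration 
only, not the actual cache code): the reentrant close() coming from a service 
that is being closed becomes a no-op on the same thread instead of recursing.

// Illustration only -- not GemFireCacheImpl.
public class ReentrantCloseGuardSketch {

  private static final ThreadLocal<Boolean> CLOSING =
      ThreadLocal.withInitial(() -> Boolean.FALSE);

  public void close() {
    if (CLOSING.get()) {
      return; // reentrant call from a service being closed; ignore it
    }
    CLOSING.set(Boolean.TRUE);
    try {
      closeServices(); // may call back into close(), e.g. a memcached service
    } finally {
      CLOSING.remove();
    }
  }

  private void closeServices() {
    // a closing service calling back into the cache, as described above
    close();
  }

  public static void main(String[] args) {
    new ReentrantCloseGuardSketch().close();
    System.out.println("closed once without recursing");
  }
}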

Mark let the IntegrationTest job run for 7+ hours which shows the hang in
the Memcached IntegrationJUnitTest. (Thanks Mark!)

On Thu, Apr 16, 2020 at 1:38 PM Kirk Lund  wrote:

> It timed out while running OldFreeListOffHeapRegionJUnitTest but I think
> the tests before it were responsible for the timeout being exceeded. I
> looked through all of the previously run tests and how long each took, but
> without having some sort of database with how long each test takes, it's
> impossible to know which test or tests take longer in any given PR.
>
> The IntegrationTest job that exceeded the timeout:
> https://concourse.apachegeode-ci.info/builds/147866
>
> The Test Summary for the above IntegrationTest job with Duration for each
> test:
> http://files.apachegeode-ci.info/builds/apache-develop-pr/geode-pr-4963/test-results/integrationTest/1587061092/
>
> Unless we want to start tracking each test class/method and its Duration
> in a database, I don't see how we could look for trends or changes to
> identify test(s) that suddenly start taking longer. All of the tests take
> less than 3 minutes each, so unless one suddenly spikes to 10 minutes or
> more, there's really no way to find the test(s).
>
> On Thu, Apr 16, 2020 at 12:52 PM Owen Nichols  wrote:
>
>> Kirk, most IntegrationTest jobs run in 25-30 minutes, but I did see one <
>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-pr/jobs/IntegrationTestOpenJDK11/builds/7202>
>> that came in just under 45 minutes but did succeed.  It would be nice to
>> know what test is occasionally taking longer and why…
>>
>> Here’s an example of a previous timeout increase (Note that both the job
>> timeout and the callstack timeout should be increased by the same amount):
>> https://github.com/apache/geode/pull/4231
>>
>> > On Apr 16, 2020, at 10:47 AM, Kirk Lund  wrote:
>> >
>> > Unfortunately, IntegrationTest exceeds timeout every time I trigger it.
>> The
>> > cause does not appear to be a specific test or hang. I
>> > think IntegrationTest has already been running very close to the timeout
>> > and is exceeding it fairly often even without my changes in #4963.
>> >
>> > Should we increase the timeout for IntegrationTest? (Anyone know how to
>> > increase it?)
>>
>>


Re: About Geode rolling downgrade

2020-04-17 Thread Alberto Gomez
Hi Bruce,

Thanks a lot for your answer. We had not thought about the changes in 
distributed algorithms when analyzing rolling downgrades.

Rolling downgrade is a pretty important requirement for our customers, so we 
would not like to close the discussion here; instead, we would like to see whether 
it is still reasonable to propose it for Geode, maybe relaxing the expectations a 
bit and clarifying some things.

First, I think supporting rolling downgrade does not mean making it impossible 
to upgrade distributed algorithms. It means that you need to support the new 
and the old algorithms (just as it is done today with rolling upgrades) in the 
upgraded version and also support the possibility of switching to the old 
algorithm in a fully upgraded system.

Second, I would say it is not very common to upgrade distributed 
algorithms, or at least it does not seem to have been the case so far in 
Geode. Therefore, the burden of adding the logic to support the rolling 
downgrade would not be something to be carried in every release. In my opinion, 
it would be a modest amount of extra work on top of the work to support the 
rolling upgrade of the algorithm, as the rolling downgrade would probably 
reuse the mechanisms implemented for the rolling upgrade.

Third, we do not need to support the rolling downgrade from any release to any 
other older release. We could just support the rolling downgrade (at least when 
distributed algorithms are changed) between consecutive versions. These could be 
considered special cases, like those where it is required to provide a tool to 
convert files in order to ensure compatibility.

-Alberto



From: Bruce Schuchardt 
Sent: Thursday, April 16, 2020 5:04 PM
To: dev@geode.apache.org 
Subject: Re: About Geode rolling downgrade

-1

Another reason that we should not support rolling downgrade is that it makes it 
impossible to upgrade distributed algorithms.

When we added rolling upgrade support we pretty much immediately ran into a 
distributed hang when a test started a Locator using an older version.  In that 
release we also introduced the cluster configuration service and along with 
that we needed to upgrade the distributed lock service's notion of the "elder" 
member of the cluster.  Prior to that change a Locator could not fill this 
role, but the CCS needed to be able to use locking and needed a Locator to be 
able to fill this role.  During upgrade we used the old "elder" algorithm but 
once the upgrade was finished we switched to the new algorithm.  If you 
introduced an older Locator into this upgraded cluster it wouldn't think that 
it should be the "elder" but the rest of the cluster would expect it to be the 
elder.

You could support rolling downgrade in this scenario with extra logic and extra 
testing, but I don't think that will always be the case.  Rolling downgrade 
support would place an immense burden on developers in extra development and 
testing in order to ensure that older algorithms could always be brought back 
on-line.

On 4/16/20, 4:24 AM, "Alberto Gomez"  wrote:

Hi,

Some months ago I posted a question on this list (see [1]) about the 
possibility of supporting "rolling downgrade" in Geode in order to downgrade a 
Geode system to an older version, similar to the "rolling upgrade" currently 
supported.
With your answers and my investigations my conclusion was that the main 
stumbling block to support "rolling downgrades" was the compatibility of 
persistent files which was very hard to achieve because old members would 
require to be prepared to support newer versions of persistent files.

We have come up with a new approach to support rolling downgrades in Geode 
which consists of the following procedure:

- For each locator:
  - Stop locator
  - Remove locator files
  - Start locator in older version

- For each server:
  - Stop server
  - Remove server files
  - Revoke missing-disk-stores for server
  - Start server in older version

Some extra details about this procedure:
- The starting and stopping of processes may not be able to be done using 
gfsh as gfsh does not allow to manage members in a different version than its 
own.
- Redundancy in servers is required
- More than one locator is required
- The allow_old_members_to_join_for_testing needs to be passed to the 
members.

I would like to ask two questions regarding this procedure:
- Do you see any issue not considered by this procedure or any alternative 
to it?
- Would it be reasonable to make public the 
"allow_old_members_to_join_for_testing" parameter (with a new name) so that it 
might be valid option for production systems to support, for example, the 
procedure proposed?

Thanks in advance for your answers.

Best regards,

-Alberto G.


[1]
 
http://mail-archives.apache.org/mod_mbox/geode-dev/201910.mbox/%3cb080e98c-5df4-e494-dcbd-383f6d979...

Re: Concurrent tests hitting OOME, hangs, etc

2020-04-17 Thread Kirk Lund
AvailableConnectionManagerConcurrentTest is passing after changing it to
use plain Java-coded stubs (no Mockito), which required some changes to
Connection and PooledConnection (internal classes). The other
ConcurrentTests are passing on the Mockito upgrade PR branch.

Locally, AvailableConnectionManagerConcurrentTest was taking 10+ minutes to
run, while in the cloud it was causing IntegrationTest to hang and/or hit
OOME. After changing it to use plain Java-coded stubs, it passes locally in
11 seconds.
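
For context, a minimal sketch of the difference (using a made-up Connection 
interface, not Geode's internal one): the hand-coded stub is just a tiny 
immutable class, while the Mockito mock goes through the framework's proxying 
and stubbing machinery on every call.

import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

// Sketch only -- the Connection interface here is invented for the example.
public class StubVsMockSketch {

  public interface Connection {     // stand-in for the internal interface
    long getBirthDate();
    boolean isActive();
  }

  // Plain Java stub: cheap to construct, no proxying, thread-safe by being immutable.
  static final class StubConnection implements Connection {
    private final long birthDate;
    StubConnection(long birthDate) { this.birthDate = birthDate; }
    public long getBirthDate() { return birthDate; }
    public boolean isActive() { return true; }
  }

  public static void main(String[] args) {
    Connection stub = new StubConnection(42L);

    Connection mocked = mock(Connection.class);
    when(mocked.getBirthDate()).thenReturn(42L);
    when(mocked.isActive()).thenReturn(true);

    System.out.println(stub.getBirthDate() == mocked.getBirthDate());
  }
}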

The Cache.close() branch is unrelated. I mistook the 10+ minute run
of AvailableConnectionManagerConcurrentTest to indicate that the
ConcurrentTests run a looong time even on develop (but they don't). The hang
on that branch is also in IntegrationTest but it's caused by reentrant
calls to Cache.close().

After Jake's PR is merged to develop, I'll rebase the Mockito upgrade on
develop -- so far it looks good after these latest changes
for AvailableConnectionManagerConcurrentTest.

On Thu, Apr 16, 2020 at 4:24 PM Dan Smith  wrote:

> I like the idea of a separate concurrency test suite. We have other tests
> of concurrency that aren't using this runner that could also go there -
> maybe we could establish some good conventions and figure out how to give
> more CPU time to these tests. I'd actually like to see *more* tests of
> concurrency at a lower level, so we can catch race conditions sooner.
>
> > Should I disable these tests?
>
> I don't think we should be marking any tests as ignored. Let's figure out
> what is going on and make the appropriate fixes.
>
> Can you clarify why these tests are blocking your Cache.close PR? Also,
> which tests are getting an OOME with the mockito PR?
>
> For a little clarity on the ConcurrentTestRunner, it just runs a test with
> multiple threads some number of times. For example
>
> FilterProfileConcurrencyTest.serializationOfFilterProfileWithConcurrentUpdateShouldSucceed
> is testing a bug some of our customers hit where serializing an object that
> was being concurrently modified failed. I think there is definitely value
> in having those sorts of tests given the type of code we have - a lot of
> classes that are mutable and need to be thread safe.
>
> -Dan
>
>
> On Thu, Apr 16, 2020 at 2:47 PM Jacob Barrett  wrote:
>
> >
> >
> > > On Apr 16, 2020, at 2:16 PM, Kirk Lund  wrote:
> >
> > > Anyone else up for moving them to new src sets? Unfortunately, this
> alone
> > > won't make the tests work with the newer versions of Mockito -- and
> guess
> > > what, they only OOME in the cloud (not locally).
> >
> > I could take a stab at it since I split up all the test types originally.
> > I should be able to brush off those cobwebs.
> >
> > > Should I disable these tests?
> >
> > In an effort to get the Mockito stuff through, yes disable them.
> >
> > > Should I remove the use of ConcurrentTestRunner from these tests
> > > (converting them to simple junit tests)?
> > >
> > > I spent some time studying ConcurrentTestRunner and I'm sorry, but I
> > don't
> > > really see the value in this runner, which is why I want to remove the
> > use
> > > of it and convert these tests to plain junit tests.
> >
> > They have less value as unit tests since there are already unit tests
> that
> > cover them. It does raise a good questions about the value and
> > implementation of micro concurrent testing like this. I think isolating
> > them away is a good step before starting this conversation.
> >
> >
> >
>
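
Not the actual ConcurrentTestRunner, but as a rough illustration of the idea 
Dan describes above (run the same test body from several threads, several 
times, and surface any failure), a minimal harness could look like this:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Illustration only -- not the ConcurrentTestRunner used in Geode.
public class TinyConcurrentHarness {

  static void runConcurrently(Runnable testBody, int threads, int iterations)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<?>> results = new ArrayList<>();
      for (int i = 0; i < threads * iterations; i++) {
        results.add(pool.submit(testBody));
      }
      for (Future<?> result : results) {
        result.get(); // rethrows any assertion error or exception from a worker
      }
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    // e.g. exercise an operation that must stay safe under concurrent callers
    AtomicInteger counter = new AtomicInteger();
    runConcurrently(counter::incrementAndGet, 4, 100);
    System.out.println("invocations: " + counter.get()); // 400
  }
}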


Re: About Geode rolling downgrade

2020-04-17 Thread Bruce Schuchardt
Hi Alberto, 

I think that if we want to support limited rolling downgrade, some other version 
interchange needs to be done, and there need to be tests that prove that the 
downgrade works.  That would let us document which versions are compatible for 
a downgrade and enforce that no one attempts it between incompatible versions.

For instance, there is work going on right now that introduces communications 
changes to remove UDP messaging.  Once the rolling upgrade completes it will shut 
down unsecured UDP communications.  At that point there is no way to go back.  
If you tried it, the old servers would try to communicate with UDP but the new 
servers would not have UDP sockets open, for security reasons.

As a side note, clients would all have to be rolled back before starting in on 
the servers.  Clients aren't equipped to talk to an older version server, and 
servers will reject the client's attempts to create connections.

On 4/17/20, 10:14 AM, "Alberto Gomez"  wrote:

Hi Bruce,

Thanks a lot for your answer. We had not thought about the changes in 
distributed algorithms when analyzing rolling downgrades.

Rolling downgrade is a pretty important requirement for our customers so we 
would not like to close the discussion here and instead try to see if it is 
still reasonable to propose it for Geode maybe relaxing a bit the expectations 
and clarifying some things.

First, I think supporting rolling downgrade does not mean making it 
impossible to upgrade distributed algorithms. It means that you need to support 
the new and the old algorithms (just as it is done today with rolling upgrades) 
in the upgraded version and also support the possibility of switching to the 
old algorithm in a fully upgraded system.

Second of all, I would say it is not very common to upgrade distributed 
algorithms, or at least, it does not seem to have been the case so far in 
Geode. Therefore, the burden of adding the logic to support the rolling 
downgrade would not be something to be carried in every release. In my opinion, 
it will be some extra percentage of work to be added to the work to support the 
rolling upgrade of the algorithm as the rolling downgrade will probably be 
using the mechanisms implemented for the rolling upgrade.

Third, we do not need to support the rolling downgrade from any release to 
any other older release. We could just support the rolling downgrade (at least 
when distributed algorithms are changed) between consecutive versions. They 
could be considered special cases like those when it is required to provide a 
tool to convert files in order to assure compatibility.

-Alberto



From: Bruce Schuchardt 
Sent: Thursday, April 16, 2020 5:04 PM
To: dev@geode.apache.org 
Subject: Re: About Geode rolling downgrade

-1

Another reason that we should not support rolling downgrade is that it 
makes it impossible to upgrade distributed algorithms.

When we added rolling upgrade support we pretty much immediately ran into a 
distributed hang when a test started a Locator using an older version.  In that 
release we also introduced the cluster configuration service and along with 
that we needed to upgrade the distributed lock service's notion of the "elder" 
member of the cluster.  Prior to that change a Locator could not fill this 
role, but the CCS needed to be able to use locking and needed a Locator to be 
able to fill this role.  During upgrade we used the old "elder" algorithm but 
once the upgrade was finished we switched to the new algorithm.  If you 
introduced an older Locator into this upgraded cluster it wouldn't think that 
it should be the "elder" but the rest of the cluster would expect it to be the 
elder.

You could support rolling downgrade in this scenario with extra logic and 
extra testing, but I don't think that will always be the case.  Rolling 
downgrade support would place an immense burden on developers in extra 
development and testing in order to ensure that older algorithms could always 
be brought back on-line.

On 4/16/20, 4:24 AM, "Alberto Gomez"  wrote:

Hi,

Some months ago I posted a question on this list (see [1]) about the 
possibility of supporting "rolling downgrade" in Geode in order to downgrade a 
Geode system to an older version, similar to the "rolling upgrade" currently 
supported.
With your answers and my investigations my conclusion was that the main 
stumbling block to support "rolling downgrades" was the compatibility of 
persistent files which was very hard to achieve because old members would 
require to be prepared to support newer versions of persistent files.

We have come up with a new approach to support rolling downgrades in 
Geode which consists of the following procedure:

- For each locator:
  - Stop locator
  - Remove locat

IntelliJ Plugin Assertions2AssertJ

2020-04-17 Thread Kirk Lund
I just started using this IntelliJ plugin and really like it:
https://plugins.jetbrains.com/plugin/10345-assertions2assertj

To install, go to IntelliJ Preferences -> Plugins -> Marketplace. Search
for "Assertions2Assertj" and Install it. You'll have to restart IntelliJ
before you can use it.

To use, just right-click anywhere within a test class and select Refactor
-> Convert Assertions to AssertJ -> Convert current file.

You'll still need to manually convert try-fail-catch blocks to use
catchThrowable or assertThatThrownBy. You should also review any assertions
that are automatically converted.
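
For example, this is the kind of manual conversion meant above (assuming 
AssertJ and JUnit 4 on the classpath):

import static org.assertj.core.api.Assertions.assertThat;
import static org.assertj.core.api.Assertions.assertThatThrownBy;
import static org.assertj.core.api.Assertions.catchThrowable;

import org.junit.Test;

public class ThrownByExampleTest {

  // Old try-fail-catch style that the plugin does not rewrite for you:
  @Test
  public void oldStyle() {
    try {
      Integer.parseInt("not a number");
      org.junit.Assert.fail("expected NumberFormatException");
    } catch (NumberFormatException expected) {
      // expected
    }
  }

  // AssertJ equivalents:
  @Test
  public void withAssertThatThrownBy() {
    assertThatThrownBy(() -> Integer.parseInt("not a number"))
        .isInstanceOf(NumberFormatException.class);
  }

  @Test
  public void withCatchThrowable() {
    Throwable thrown = catchThrowable(() -> Integer.parseInt("not a number"));
    assertThat(thrown).isInstanceOf(NumberFormatException.class);
  }
}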

-Kirk


Checking for a member is still part of distributed system

2020-04-17 Thread Anilkumar Gingade
Is there a better way to know if a member has left the distributed system
than the following?
I am checking using:
"partitionedRegion.getDistributionManager().isCurrentMember(requester)"

This returns true, even though the AdvisorListener on the
PartitionedRegion has already processed the memberDeparted() event.

I want to know if a member has left after invoking the membershipListener.

-Anil.


Re: [PROPOSAL]: GEODE-7940 to support/1.12

2020-04-17 Thread Robert Houghton
Conditional +1 from me too, pending a few days on develop with happy
results :)

Thanks Juan!

On Fri, Apr 17, 2020 at 1:41 AM Ju@N  wrote:

> Hello devs,
>
> I'd like to propose bringing *GEODE-7940 [1]* to the *support/1.12* branch.
> The bug is not new, seems quite old actually, but still seems pretty
> critical as it can lead to data loss in WAN topologies.
> Long story short: when there are multiple parallel *gateway-senders*
> attached
> to the same region and the user decides to stop, detach and *destroy* one
> of the senders (the destroy part is what actually causes the problem), the
> rest of the senders attached will silently stop replicating events to the
> remote clusters and all these events will be lost.
> The fix has been merged into develop through commit *bfbb398 [2]*.
> Best regards.
>
> [1]: https://issues.apache.org/jira/browse/GEODE-7940
> [2]:
>
> https://github.com/apache/geode/commit/bfbb398891c5d96fa3a5975365b29d71bd849ad6
>
> --
> Ju@N
>


Re: Checking for a member is still part of distributed system

2020-04-17 Thread Kirk Lund
Any requirements for this to be a User API vs internal API?

For internal APIs, you can register a MembershipListener on
DistributionManager -- at least one flavor of which returns a
Set of current members which you could check
before relying on callbacks.
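
As a sketch of that pattern (placeholder types only; the real internal types 
are DistributionManager, MembershipListener and InternalDistributedMember, and 
the exact method names may differ): take the member snapshot and register the 
listener in one step, then check the snapshot before trusting later callbacks.

import java.util.HashSet;
import java.util.Set;

// Placeholder types only -- not the real DistributionManager/MembershipListener API.
public class DepartureTrackerSketch {

  interface Member {}

  interface MembershipSource {
    // analogous to registering a listener and getting the current members in one call
    Set<Member> addListenerAndGetMembers(DepartureListener listener);
  }

  interface DepartureListener {
    void memberDeparted(Member member);
  }

  private final Set<Member> liveMembers = new HashSet<>();

  public DepartureTrackerSketch(MembershipSource source) {
    synchronized (liveMembers) {
      // Snapshot and registration happen under the same lock, so a departure
      // delivered while we initialize is applied after the snapshot, not lost.
      liveMembers.addAll(source.addListenerAndGetMembers(this::onDeparted));
    }
  }

  private void onDeparted(Member member) {
    synchronized (liveMembers) {
      liveMembers.remove(member);
    }
  }

  public boolean hasDeparted(Member member) {
    synchronized (liveMembers) {
      return !liveMembers.contains(member);
    }
  }
}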

On Fri, Apr 17, 2020 at 3:03 PM Anilkumar Gingade 
wrote:

> Is there a better way to know if a member has left the distributed system,
> than following:
> I am checking using:
> "partitionedRegion.getDistributionManager().isCurrentMember(requester));"
>
> This returns true, even though the AdvisorListener on
> ParitionedRegion already processed memberDeparted() event.
>
> I want to know if a member has left after invoking the membershipListener.
>
> -Anil.
>


Re: Checking for a member is still part of distributed system

2020-04-17 Thread Bruce Schuchardt
AdvisorListener.memberDeparted() is invoked from paths other than membership 
view changes, such as when a Region is destroyed.   A member may still be in 
the cluster (membership view) after AdvisorListener.memberDeparted() has been 
invoked.

If isCurrentMember() returns true then the server is still a member.  It may be 
in the process of being removed but until a new view is installed it's a valid 
member.

Across the cluster there can also be a lag between one server knowing about a 
departure and another server knowing about it.  If you send a message during 
that interval that depends on the departure being known everywhere you can use 
the WaitForViewInstallation message.


On 4/17/20, 3:03 PM, "Anilkumar Gingade"  wrote:

Is there a better way to know if a member has left the distributed system,
than following:
I am checking using:
"partitionedRegion.getDistributionManager().isCurrentMember(requester));"

This returns true, even though the AdvisorListener on
ParitionedRegion already processed memberDeparted() event.

I want to know if a member has left after invoking the membershipListener.

-Anil.





Re: Checking for a member is still part of distributed system

2020-04-17 Thread Anilkumar Gingade
Thanks Kirk.
This is for PR clear; I ended up registering/adding a new membership
listener on DistributionManager (DM).

I was trying to take advantage of MembershipListener on PR region-advisor.
It turns out that this gets called even before the view is updated on DM.

-Anil

On Fri, Apr 17, 2020 at 3:36 PM Kirk Lund  wrote:

> Any requirements for this to be a User API vs internal API?
>
> For internal APIs, you can register a MembershipListener on
> DistributionManager -- at least one flavor of which returns a
> Set of current members which you could check
> before relying on callbacks.
>
> On Fri, Apr 17, 2020 at 3:03 PM Anilkumar Gingade 
> wrote:
>
> > Is there a better way to know if a member has left the distributed
> system,
> > than following:
> > I am checking using:
> > "partitionedRegion.getDistributionManager().isCurrentMember(requester));"
> >
> > This returns true, even though the AdvisorListener on
> > ParitionedRegion already processed memberDeparted() event.
> >
> > I want to know if a member has left after invoking the
> membershipListener.
> >
> > -Anil.
> >
>


Re: Checking for a member is still part of distributed system

2020-04-17 Thread Anilkumar Gingade
Thanks Bruce.
Will take a look at "WaitForViewInstallation".

-Anil.






On Fri, Apr 17, 2020 at 3:44 PM Anilkumar Gingade 
wrote:

> Thanks Kirk.
> This is for PR clear; I ended up registering/adding a new membership
> listener on DistributionManager (DM).
>
> I was trying to take advantage of MembershipListener on PR region-advisor.
> It turns out that this gets called even before the view is updated on DM.
>
> -Anil
>
> On Fri, Apr 17, 2020 at 3:36 PM Kirk Lund  wrote:
>
>> Any requirements for this to be a User API vs internal API?
>>
>> For internal APIs, you can register a MembershipListener on
>> DistributionManager -- at least one flavor of which returns a
>> Set of current members which you could check
>> before relying on callbacks.
>>
>> On Fri, Apr 17, 2020 at 3:03 PM Anilkumar Gingade 
>> wrote:
>>
>> > Is there a better way to know if a member has left the distributed
>> system,
>> > than following:
>> > I am checking using:
>> >
>> "partitionedRegion.getDistributionManager().isCurrentMember(requester));"
>> >
>> > This returns true, even though the AdvisorListener on
>> > ParitionedRegion already processed memberDeparted() event.
>> >
>> > I want to know if a member has left after invoking the
>> membershipListener.
>> >
>> > -Anil.
>> >
>>
>