Re: [Proposal] - RFC etiquette

2020-07-13 Thread Udo Kohlmeyer
Hi there Alberto,

I’m merely trying to improve the RFC process. We learn and improve. The aim is 
to get more community members to feel empowered to review an RFC, not only from 
the perspective of whether it is technically feasible but also from a process 
and “business” sense.

Having a better understanding of what we expect a component of the system to do 
is paramount here, even if just from a high level.

—Udo
On Jul 10, 2020, 1:08 AM -0700, Alberto Gomez wrote:
Hi Geode Devs,

First of all, Udo, thanks for your proposal. I am all up for what you are 
aiming at: "better round out each RFC. Causing less delays later in the process 
and allowing all community members to actively participate in the review 
process regardless of technical skill level."

Secondly, I think I am to blame for having given too little time to review the 
latest RFC I have published. I apologize for it. I felt the changes were too 
small, assumed that the solution was not problematic and as a result gave less 
than a week to review, which I now think is too little even if the RFC content 
was small. This has probably triggered Udo's proposal so, in a way, it has not 
been such a bad thing 😉.

Regarding the concrete proposal to achieve the goal, I think the 2 week minimum 
period is very reasonable. The new use case section may help to have more 
community members actively participating but I am not sure that it will be the 
definitive measure. I feel that sometimes the lack of participation comes from 
lack of time because we're busy with other things and not so much with how the 
RFC proposal has been written. Anyhow, having an example of what this new 
section should look like would be helpful for new RFCs to be written.

Alberto


From: Udo Kohlmeyer 
Sent: Thursday, July 9, 2020 10:18 PM
To: geode 
Subject: [Proposal] - RFC etiquette

Hi there Geode Devs

I would like to propose the following changes to the RFC process that we have 
in place at the moment.

1. All submitted RFCs will provide a minimum 2-week review period. This is to 
allow the community to review the RFC in a reasonable timeframe. If we rush 
things, we will miss things. I’d rather have a little more time spent on the 
RFC review and getting the approach “correct” than rushing the RFC and then at 
a later point in time (either at PR review or, worse, as a production issue) 
find out that the approach was less than optimal.
2. Add a new section to the RFC. I would like to propose this section to be 
labelled “Use Cases”. In this section I would like all submitters to describe 
the use case that this RFC is to fulfill. This would include all possible 
combinations (success and failure) and expected outcomes of each.

I hope with the additions to the RFC process and template we can better round 
out each RFC. Causing less delays later in the process and allowing all 
community members to actively participate in the review process regardless of 
technical skill level.

Thoughts or comments?

—Udo


Re: [INFO] Latest test run of 200 DistributedTestOpenJDK8 passes

2020-07-13 Thread Alexander Murmann
We continue to see these WAN tests adding a fail rate of just below 30% in
our mass test runs.

That's a very significant fail rate that impacts our ability to get our
code committed with confidence.

Can we resolve this issue? Otherwise, I think we need to consider reverting
GEODE-7458.

On Fri, Jun 19, 2020 at 3:28 PM Alexander Murmann 
wrote:

> Looking more into this, it looks like this was introduced by the changes
> for GEODE-7458 - "Adding additional option in gfsh command "start gateway
> sender" to control clearing of existing queues".
>
> That happened about a month ago, but it's inherent to those flaky tests
> that we discover them only after a while. Nonetheless, they become paper
> cuts that ultimately slow us down substantially if they persist.
>
> @Mario Ivanac If I am correct and GEODE-7458 introduced this you were the
> one making that change. Might you be able to take a look at making that
> test more reliable or reverting the change?
>
> Thank you!
>
> On Fri, Jun 19, 2020 at 7:57 AM Alexander Murmann 
> wrote:
>
>> Thank you so much for sharing this, Mark!
>>
>> It looks like there is a big cluster around WAN Gateway. Is anyone
>> already looking into the WAN issues?
>>
>> On Thu, Jun 18, 2020 at 10:06 PM Mark Hanson  wrote:
>>
>>> FYI, the build success rate was around 90% or so about two months ago.
>>>
>>> Here are the DUnit tests that are currently failing in our tests, most
>>> likely in CI, and PR pipelines.
>>>
>>> Please let me know if you have any questions.
>>>
>>> Thanks,
>>> Mark
>>>
>>>
>>>
>>> ***
>>>
>>>  Overall build success rate: 78.0% (156 of 200)
>>>
>>>
>>> ***
>>>
>>>
>>>
>>> The following test methods see failures in more than one class.  There
>>> may be a failing *TestBase class
>>>
>>>
>>>
>>> *.testReplicatedRegionPersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived:
>>> 12 failures :
>>>
>>>   SerialWANPersistenceEnabledGatewaySenderDUnitTest:  8 failures
>>> (96.000% success rate)
>>>
>>>   SerialWANPersistenceEnabledGatewaySenderOffHeapDUnitTest:  4 failures
>>> (98.000% success rate)
>>>
>>>
>>>
>>> *.testpersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived:
>>> 12 failures :
>>>
>>>   ParallelWANPersistenceEnabledGatewaySenderOffHeapDUnitTest:  5
>>> failures (97.500% success rate)
>>>
>>>   ParallelWANPersistenceEnabledGatewaySenderDUnitTest:  7 failures
>>> (96.500% success rate)
>>>
>>>
>>>
>>> *.testPingWrongServer:  4 failures :
>>>
>>>   ClientServerMiscSelectorDUnitTest:  3 failures (98.500% success rate)
>>>
>>>   ClientServerMiscDUnitTest:  1 failures (99.500% success rate)
>>>
>>>
>>>
>>>
>>> ***
>>>
>>>
>>>
>>>
>>>
>>> org.apache.geode.internal.cache.wan.serial.SerialWANPersistenceEnabledGatewaySenderDUnitTest:
>>> 8 failures (96.000% success rate)
>>>
>>>
>>>
>>>
>>>  
>>> testReplicatedRegionPersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
>>>
>>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3539
>>>
>>>
>>>  
>>> testReplicatedRegionPersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
>>>
>>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3526
>>>
>>>
>>>  
>>> testReplicatedRegionPersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
>>>
>>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3505
>>>
>>>
>>>  
>>> testReplicatedRegionPersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
>>>
>>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3435
>>>
>>>
>>>  
>>> testReplicatedRegionPersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
>>>
>>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3414
>>>
>>>
>>>  
>>> testReplicatedRegionPersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
>>>
>>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3391
>>>
>>>
>>>  
>>> testReplicatedRegionPersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
>>>
>>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3363
>>>
>>>
>>>  
>>> testReplicatedRegionPersistentWanGateway_restartSenderWithCleanQueues_expectNoEv

Re: [INFO] Latest test run of 200 DistributedTestOpenJDK8 passes

2020-07-13 Thread Mark Hanson
The previous statement about 30% didn't make sense to me, so I thought I would 
throw in this tidbit...

70 out of 75 failures in 200 runs are caused by WAN.

Thanks,
Mark
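
The percentages quoted in this thread follow from simple ratios over the 200 runs. Below is a minimal sketch in Python that reproduces them, using only figures that appear in the thread. The helper names `success_rate` and `share` are illustrative and are not part of any Geode tooling. One caveat: a count of test failures is not the same as a count of failed runs, because a single run can contain several failing tests, so the WAN share of failures does not translate directly into a build fail rate.

```python
# Figures below are copied from the messages in this thread.
# The helper names are hypothetical, chosen only for illustration.
TOTAL_RUNS = 200


def success_rate(failures: int, runs: int = TOTAL_RUNS) -> float:
    """Percentage of runs in which a given test class did not fail."""
    return 100.0 * (runs - failures) / runs


def share(part: int, whole: int) -> float:
    """Percentage of `whole` accounted for by `part`."""
    return 100.0 * part / whole


# Per-class failure counts from the June 18 report.
june_failures = {
    "SerialWANPersistenceEnabledGatewaySenderDUnitTest": 8,
    "SerialWANPersistenceEnabledGatewaySenderOffHeapDUnitTest": 4,
    "ParallelWANPersistenceEnabledGatewaySenderOffHeapDUnitTest": 5,
    "ParallelWANPersistenceEnabledGatewaySenderDUnitTest": 7,
}

for name, failures in june_failures.items():
    print(f"{name}: {failures} failures ({success_rate(failures):.3f}% success rate)")

# Overall build success rate from the June 18 report: 156 of 200 runs.
print(f"Overall build success rate: {share(156, TOTAL_RUNS):.1f}%")

# Share of all failures attributed to WAN in the July 13 message: 70 of 75.
print(f"WAN share of failures: {share(70, 75):.1f}%")
```

The per-class lines match the success rates printed in the report, and the last line expresses the 70-of-75 figure as a share of all failures rather than as a fail rate per run.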

On 7/13/20, 4:11 PM, "Alexander Murmann"  wrote:

We continue to see these WAN tests adding a fail rate of just below 30% in
our mass test runs.

That's a very significant fail rate that impacts our ability to get our
code committed with confidence.

Can we resolve this issue? Otherwise, I think we need to consider reverting
GEODE-7458.

On Fri, Jun 19, 2020 at 3:28 PM Alexander Murmann 
wrote:

> Looking more into this, it looks like this was introduced by the changes
> for GEODE-7458 - "Adding additional option in gfsh command "start gateway
> sender" to control clearing of existing queues".
>
> That happened about a month ago, but it's inherent to those flaky tests
> that we discover them only after a while. Nonetheless, they become paper
> cuts that ultimately slow us down substantially if they persist.
>
> @Mario Ivanac If I am correct and GEODE-7458 introduced this you were the
> one making that change. Might you be able to take a look at making that
> test more reliable or reverting the change?
>
> Thank you!
>
> On Fri, Jun 19, 2020 at 7:57 AM Alexander Murmann 
> wrote:
>
>> Thank you so much for sharing this, Mark!
>>
>> It looks like there is a big cluster around WAN Gateway. Is anyone
>> already looking into the WAN issues?
>>
>> On Thu, Jun 18, 2020 at 10:06 PM Mark Hanson  wrote:
>>
>>> FYI, the build success rate was around 90% or so about two months ago.
>>>
>>> Here are the DUnit tests that are currently failing in our tests, most
>>> likely in CI, and PR pipelines.
>>>
>>> Please let me know if you have any questions.
>>>
>>> Thanks,
>>> Mark
>>>
>>>
>>>
>>> ***
>>>
>>>  Overall build success rate: 78.0% (156 of 200)
>>>
>>>
>>> ***
>>>
>>>
>>>
>>> The following test methods see failures in more than one class.  There
>>> may be a failing *TestBase class
>>>
>>>
>>>
>>> *.testReplicatedRegionPersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived:
>>> 12 failures :
>>>
>>>   SerialWANPersistenceEnabledGatewaySenderDUnitTest:  8 failures
>>> (96.000% success rate)
>>>
>>>   SerialWANPersistenceEnabledGatewaySenderOffHeapDUnitTest:  4 failures
>>> (98.000% success rate)
>>>
>>>
>>>
>>> *.testpersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived:
>>> 12 failures :
>>>
>>>   ParallelWANPersistenceEnabledGatewaySenderOffHeapDUnitTest:  5
>>> failures (97.500% success rate)
>>>
>>>   ParallelWANPersistenceEnabledGatewaySenderDUnitTest:  7 failures
>>> (96.500% success rate)
>>>
>>>
>>>
>>> *.testPingWrongServer:  4 failures :
>>>
>>>   ClientServerMiscSelectorDUnitTest:  3 failures (98.500% success rate)
>>>
>>>   ClientServerMiscDUnitTest:  1 failures (99.500% success rate)
>>>
>>>
>>>
>>>
>>> ***
>>>
>>>
>>>
>>>
>>>
>>> org.apache.geode.internal.cache.wan.serial.SerialWANPersistenceEnabledGatewaySenderDUnitTest:
>>> 8 failures (96.000% success rate)
>>>
>>>
>>>
>>>
>>>  
>>> testReplicatedRegionPersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
>>>
>>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3539
>>>
>>>
>>>  
>>> testReplicatedRegionPersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
>>>
>>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3526