Re: Handling packet drop between sites

2020-04-29 Thread Anthony Baker
TCP behaves really poorly in the face of significant packet loss.  You can look 
into tcp_retries1 and tcp_retries2 [1] for some explanations and tuning.  
Eventually, TCP will give up attempting to deliver a packet but this may take 
up to 30min depending on configuration.  IIRC, it’s only at that point that the 
socket signals an error to the JVM.  On top of TCP, you can layer application 
protocols for liveness including timeouts, request/reply semantics, and 
periodic messaging.

I would expect that geode should:

1) Not lose any batch events even if packets get dropped
2) Recover quickly when the network becomes stable again

When a batch is sent to a remote site, it is not dequeued from the sender until 
the destination site sends a response that the batch was delivered without 
error [2].

Note also that the log message below does not strictly indicate a hang, it 
could be just making progress slowly.

HTH and looking forward to the results of your investigations.

Anthony


[1] https://linux.die.net/man/7/tcp
[2] There is a corner case if the destination is over the critical threshold


On Apr 28, 2020, at 6:25 AM, Mario Kevo 
mailto:mario.k...@est.tech>> wrote:

Hi geode-dev,

I have a question about how Geode handle when some packets from batch is 
dropped.
I create Geode WAN with two sites and established replication between them. 
Also modified iptables to drop all packets that comes to receiver port.
In that case I have that some threads are stucked. Seems like gw sender never 
received any response back.
[warn 2020/04/27 13:19:04.667 CEST  tid=0x11] Thread 128 (0x80) 
is stuck

[warn 2020/04/27 13:19:04.669 CEST  tid=0x11] Thread <128> 
(0x80) that was executed at <27 Apr 2020 13:18:13 CEST> has been stuck for 
<50.997 seconds> and number of thread monitor iteration <1>
Thread Name  state 
Executor Group 
Monitored metric 
Thread stack:
java.net.PlainSocketImpl.socketConnect(Native Method)
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
java.net.Socket.connect(Socket.java:607)
org.apache.geode.distributed.internal.tcpserver.AdvancedSocketCreatorImpl.connect(AdvancedSocketCreatorImpl.java:102)
org.apache.geode.internal.net.SCAdvancedSocketCreator.connect(SCAdvancedSocketCreator.java:51)
org.apache.geode.distributed.internal.tcpserver.TcpSocketCreatorImpl.connect(TcpSocketCreatorImpl.java:59)
org.apache.geode.distributed.internal.tcpserver.ClientSocketCreatorImpl.connect(ClientSocketCreatorImpl.java:54)
org.apache.geode.cache.client.internal.ConnectionImpl.connect(ConnectionImpl.java:94)
org.apache.geode.cache.client.internal.ConnectionConnector.connectClientToServer(ConnectionConnector.java:75)
org.apache.geode.cache.client.internal.ConnectionFactoryImpl.createClientToServerConnection(ConnectionFactoryImpl.java:118)
org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.createPooledConnection(ConnectionManagerImpl.java:206)
org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.forceCreateConnection(ConnectionManagerImpl.java:216)
org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.borrowConnection(ConnectionManagerImpl.java:326)
org.apache.geode.cache.client.internal.OpExecutorImpl.executeOnServer(OpExecutorImpl.java:329)
org.apache.geode.cache.client.internal.OpExecutorImpl.executeOn(OpExecutorImpl.java:303)
org.apache.geode.cache.client.internal.PoolImpl.executeOn(PoolImpl.java:839)
org.apache.geode.cache.client.internal.PingOp.execute(PingOp.java:36)
org.apache.geode.cache.client.internal.LiveServerPinger$PingTask.run2(LiveServerPinger.java:90)
org.apache.geode.cache.client.internal.PoolImpl$PoolTask.run(PoolImpl.java:1329)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
org.apache.geode.internal.ScheduledThreadPoolExecutorWithKeepAlive$DelegatingScheduledFuture.run(ScheduledThreadPoolExecutorWithKeepAlive.java:276)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)

Also, I tried to run the same test with 200K entries and drop 70% of packets 
and see that exception is again there and it takes approx. 40min to transmit 
all entries to another site.

How Geode handle dropping some packets from the batch? Does anyone made some 
tests on this behavior?

Thanks,
Mario




Re: [Discuss] Cache.close synchronous is not synchronous, but code still expects it to be....

2020-04-29 Thread Kirk Lund
Done and merged to develop. Thanks to Donal for helping me out.

On Tue, Apr 28, 2020 at 3:21 PM Kirk Lund  wrote:

> In addition to PR precheckin jobs, I've also run a full regression against
> the changes to make Cache.close() synchronous. There are failures, but I
> have no idea if they are normal or "ok" failures or not. So I'm not sure
> what to do next with this change unless someone else wants to review the
> Hydra failures. This is the problem with having a bunch of non-open tests
> that we can't really discuss on dev list. Let me know what you guys want to
> do!
>
> On Tue, Apr 21, 2020 at 2:27 PM Kirk Lund  wrote:
>
>> PR #4963 https://github.com/apache/geode/pull/4963 is ready for review.
>> It has passed precheckin once. After self-review, I reverted a couple small
>> changes that weren't needed so it needs to go through precheckin again.
>>
>> On Fri, Apr 17, 2020 at 9:42 AM Kirk Lund  wrote:
>>
>>> Memcached IntegrationJUnitTest hangs the PR IntegrationTest job because
>>> Cache.close() calls GeodeMemcachedService.close() which again calls
>>> Cache.close(). Looks like the code base has lots of Cache.close() calls
>>> -- all of them could theoretically cause issues. I hate to add 
>>> ThreadLocal
>>> isClosingThread or something like it just to allow reentrant calls to
>>> Cache.close().
>>>
>>> Mark let the IntegrationTest job run for 7+ hours which shows the hang
>>> in the Memcached IntegrationJUnitTest. (Thanks Mark!)
>>>
>>> On Thu, Apr 16, 2020 at 1:38 PM Kirk Lund  wrote:
>>>
 It timed out while running OldFreeListOffHeapRegionJUnitTest but I
 think the tests before it were responsible for the timeout being exceeded.
 I looked through all of the previously run tests and how long each but
 without having some sort of database with how long each test takes, it's
 impossible to know which test or tests take longer in any given PR.

 The IntegrationTest job that exceeded the timeout:
 https://concourse.apachegeode-ci.info/builds/147866

 The Test Summary for the above IntegrationTest job with Duration for
 each test:
 http://files.apachegeode-ci.info/builds/apache-develop-pr/geode-pr-4963/test-results/integrationTest/1587061092/

 Unless we want to start tracking each test class/method and its
 Duration in a database, I don't see how we could look for trends or changes
 to identify test(s) that suddenly start taking longer. All of the tests
 take less than 3 minutes each, so unless one suddenly spikes to 10 minutes
 or more, there's really no way to find the test(s).

 On Thu, Apr 16, 2020 at 12:52 PM Owen Nichols 
 wrote:

> Kirk, most IntegrationTest jobs run in 25-30 minutes, but I did see
> one <
> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-pr/jobs/IntegrationTestOpenJDK11/builds/7202>
> that came in just under 45 minutes but did succeed.  It would be nice to
> know what test is occasionally taking longer and why…
>
> Here’s an example of a previous timeout increase (Note that both the
> job timeout and the callstack timeout should be increased by the same
> amount): https://github.com/apache/geode/pull/4231
>
> > On Apr 16, 2020, at 10:47 AM, Kirk Lund  wrote:
> >
> > Unfortunately, IntegrationTest exceeds timeout every time I trigger
> it. The
> > cause does not appear to be a specific test or hang. I
> > think IntegrationTest has already been running very close to the
> timeout
> > and is exceeding it fairly often even without my changes in #4963.
> >
> > Should we increase the timeout for IntegrationTest? (Anyone know how
> to
> > increase it?)
>
>


Wiki edit permissions

2020-04-29 Thread Raymond Ingles
My colleague and I are hoping to add a new RFC to the Geode wiki, but we need 
edit permissions to do it. I created an account on the wiki in the name of 
ring...@vmware.com, and Sarah did so with the 
address sabbe...@gmail.com. Who should we contact 
about getting edit permissions to add a new RFC for discussion? Thanks!


Re: Wiki edit permissions

2020-04-29 Thread Kirk Lund
Granting you permissions now... hang on a minute or two.

On Wed, Apr 29, 2020 at 11:20 AM Raymond Ingles  wrote:

> My colleague and I are hoping to add a new RFC to the Geode wiki, but we
> need edit permissions to do it. I created an account on the wiki in the
> name of ring...@vmware.com, and Sarah did so
> with the address sabbe...@gmail.com. Who
> should we contact about getting edit permissions to add a new RFC for
> discussion? Thanks!
>


Re: Wiki edit permissions

2020-04-29 Thread Kirk Lund
You should both have permissions for creating and editing now. Thanks!

On Wed, Apr 29, 2020 at 2:29 PM Kirk Lund  wrote:

> Granting you permissions now... hang on a minute or two.
>
> On Wed, Apr 29, 2020 at 11:20 AM Raymond Ingles 
> wrote:
>
>> My colleague and I are hoping to add a new RFC to the Geode wiki, but we
>> need edit permissions to do it. I created an account on the wiki in the
>> name of ring...@vmware.com, and Sarah did so
>> with the address sabbe...@gmail.com. Who
>> should we contact about getting edit permissions to add a new RFC for
>> discussion? Thanks!
>>
>


[DISCUSS] etiquette around breaking the pipeline

2020-04-29 Thread Owen Nichols
There are many ways a commit can break the Geode CI pipeline[1] despite our 
required PR checks, such as:
- merge a PR that failed some non-required PR checks
- merge a PR that last ran checks awhile ago and now interacts unexpectedly 
with other recent changes
- merge a PR that fails Windows tests
- merge a PR that fails Benchmarks tests

When a bad commit gets through for whatever reason, what should happen next?  
For example:
- bring it up on the dev list
- someone should revert the change as soon as possible, allowing the pipeline 
to remain green while the issue is addressed
- a reasonable amount of time should be allowed for someone to make a quick 
fix, otherwise revert.
- the pipeline should remain broken for as long as it takes to fix, as long as 
there is clear communication that someone is working on it
- everyone look the other way and hope it fixes itself

I’m sure there are other ideas as well.  A related question is *who* can or 
should revert a bad commit?  Only the person that merged the PR?  Only the 
original author of the PR?  The first person to notice the problem?

What is a reasonable amount of time for this to happen?  2 hours? 2 days? 2 
weeks? Does it depend whether the bad commit is also affecting PR checks for 
everyone else vs only depriving downstream consumers of new Geode -SNAPSHOTs?

Would you take offense if someone else reverted your commit?  Does is make a 
difference if your commit is exposing an existing issue (as opposed to 
introducing a new bug)?

Is there a perception that reverts create a lot of extra work? (they 
shouldn’t--just start your rework PR with a revert of the revert, then add 
additional commits that resolve the issue, so reviewers can easily see what was 
missing the first time)

This is a discussion thread, not a vote.  We trust committers to do what’s 
best.  Would embracing a “anyone can revert, no shame” revert-first-then-fix 
culture benefit our community, or is our current easygoing approach working 
just fine?

[1] 
https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main

Re: [DISCUSS] etiquette around breaking the pipeline

2020-04-29 Thread Robert Houghton
I am very pro-revert for breaking changes. The Benchmark or Windows tests
being a culprit is unfortunate, since the PR pipeline cannot tell us about
those, but thats the cost of our excellence :)

On Wed, Apr 29, 2020 at 4:06 PM Owen Nichols  wrote:

> There are many ways a commit can break the Geode CI pipeline[1] despite
> our required PR checks, such as:
> - merge a PR that failed some non-required PR checks
> - merge a PR that last ran checks awhile ago and now interacts
> unexpectedly with other recent changes
> - merge a PR that fails Windows tests
> - merge a PR that fails Benchmarks tests
>
> When a bad commit gets through for whatever reason, what should happen
> next?  For example:
> - bring it up on the dev list
> - someone should revert the change as soon as possible, allowing the
> pipeline to remain green while the issue is addressed
> - a reasonable amount of time should be allowed for someone to make a
> quick fix, otherwise revert.
> - the pipeline should remain broken for as long as it takes to fix, as
> long as there is clear communication that someone is working on it
> - everyone look the other way and hope it fixes itself
>
> I’m sure there are other ideas as well.  A related question is *who* can
> or should revert a bad commit?  Only the person that merged the PR?  Only
> the original author of the PR?  The first person to notice the problem?
>
> What is a reasonable amount of time for this to happen?  2 hours? 2 days?
> 2 weeks? Does it depend whether the bad commit is also affecting PR checks
> for everyone else vs only depriving downstream consumers of new Geode
> -SNAPSHOTs?
>
> Would you take offense if someone else reverted your commit?  Does is make
> a difference if your commit is exposing an existing issue (as opposed to
> introducing a new bug)?
>
> Is there a perception that reverts create a lot of extra work? (they
> shouldn’t--just start your rework PR with a revert of the revert, then add
> additional commits that resolve the issue, so reviewers can easily see what
> was missing the first time)
>
> This is a discussion thread, not a vote.  We trust committers to do what’s
> best.  Would embracing a “anyone can revert, no shame”
> revert-first-then-fix culture benefit our community, or is our current
> easygoing approach working just fine?
>
> [1]
> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main


Re: [DISCUSS] etiquette around breaking the pipeline

2020-04-29 Thread Donal Evans
+1 for a quick revert from me. Preferably by the person whose commit is
causing the issue, since I think it's healthy to encourage a
stigma-free revert process and a sense of individual responsibility for the
health of the pipeline, but theoretically by anyone, as long there's a
concrete idea of who has ownership of the problem following the revert.

As far as the (perceived?) pain of reverting, my recent personal experience
with reverting a troublesome commit has made me even more in favour of
doing it. I hit an unanticipated test ordering issue in the Windows unit
tests after merging a commit and thanks to good communication and
coordination from people who were monitoring the pipeline was able to get
it reverted within 2 hours. The fix was very straightforward, and I was
tempted to leave the problem commit in place and just apply the fix on top
of it, but by swiftly reverting the commit that was exposing the problem, I
stopped several subsequent commits from other committers from getting
blocked in the brief time it took to get the fix verified and merged. The
extra work involved consisted of a few messages asking for PR approvals and
30 seconds in git, so in this instance, at least, it's difficult for me to
see a justification in leaving the pipeline red while working on a fix.

On Wed, Apr 29, 2020 at 4:14 PM Robert Houghton 
wrote:

> I am very pro-revert for breaking changes. The Benchmark or Windows tests
> being a culprit is unfortunate, since the PR pipeline cannot tell us about
> those, but thats the cost of our excellence :)
>
> On Wed, Apr 29, 2020 at 4:06 PM Owen Nichols  wrote:
>
> > There are many ways a commit can break the Geode CI pipeline[1] despite
> > our required PR checks, such as:
> > - merge a PR that failed some non-required PR checks
> > - merge a PR that last ran checks awhile ago and now interacts
> > unexpectedly with other recent changes
> > - merge a PR that fails Windows tests
> > - merge a PR that fails Benchmarks tests
> >
> > When a bad commit gets through for whatever reason, what should happen
> > next?  For example:
> > - bring it up on the dev list
> > - someone should revert the change as soon as possible, allowing the
> > pipeline to remain green while the issue is addressed
> > - a reasonable amount of time should be allowed for someone to make a
> > quick fix, otherwise revert.
> > - the pipeline should remain broken for as long as it takes to fix, as
> > long as there is clear communication that someone is working on it
> > - everyone look the other way and hope it fixes itself
> >
> > I’m sure there are other ideas as well.  A related question is *who* can
> > or should revert a bad commit?  Only the person that merged the PR?  Only
> > the original author of the PR?  The first person to notice the problem?
> >
> > What is a reasonable amount of time for this to happen?  2 hours? 2 days?
> > 2 weeks? Does it depend whether the bad commit is also affecting PR
> checks
> > for everyone else vs only depriving downstream consumers of new Geode
> > -SNAPSHOTs?
> >
> > Would you take offense if someone else reverted your commit?  Does is
> make
> > a difference if your commit is exposing an existing issue (as opposed to
> > introducing a new bug)?
> >
> > Is there a perception that reverts create a lot of extra work? (they
> > shouldn’t--just start your rework PR with a revert of the revert, then
> add
> > additional commits that resolve the issue, so reviewers can easily see
> what
> > was missing the first time)
> >
> > This is a discussion thread, not a vote.  We trust committers to do
> what’s
> > best.  Would embracing a “anyone can revert, no shame”
> > revert-first-then-fix culture benefit our community, or is our current
> > easygoing approach working just fine?
> >
> > [1]
> >
> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main
>


Re: [DISCUSS] etiquette around breaking the pipeline

2020-04-29 Thread Joris Melchior
Recent experience makes me lean towards quick revert as well. Takes the 
pressure off when trying to produce a fix and if done soon enough it is usually 
quite straightforward.

From: Owen Nichols 
Sent: April 29, 2020 7:06 PM
To: dev@geode.apache.org 
Subject: [DISCUSS] etiquette around breaking the pipeline

There are many ways a commit can break the Geode CI pipeline[1] despite our 
required PR checks, such as:
- merge a PR that failed some non-required PR checks
- merge a PR that last ran checks awhile ago and now interacts unexpectedly 
with other recent changes
- merge a PR that fails Windows tests
- merge a PR that fails Benchmarks tests

When a bad commit gets through for whatever reason, what should happen next?  
For example:
- bring it up on the dev list
- someone should revert the change as soon as possible, allowing the pipeline 
to remain green while the issue is addressed
- a reasonable amount of time should be allowed for someone to make a quick 
fix, otherwise revert.
- the pipeline should remain broken for as long as it takes to fix, as long as 
there is clear communication that someone is working on it
- everyone look the other way and hope it fixes itself

I’m sure there are other ideas as well.  A related question is *who* can or 
should revert a bad commit?  Only the person that merged the PR?  Only the 
original author of the PR?  The first person to notice the problem?

What is a reasonable amount of time for this to happen?  2 hours? 2 days? 2 
weeks? Does it depend whether the bad commit is also affecting PR checks for 
everyone else vs only depriving downstream consumers of new Geode -SNAPSHOTs?

Would you take offense if someone else reverted your commit?  Does is make a 
difference if your commit is exposing an existing issue (as opposed to 
introducing a new bug)?

Is there a perception that reverts create a lot of extra work? (they 
shouldn’t--just start your rework PR with a revert of the revert, then add 
additional commits that resolve the issue, so reviewers can easily see what was 
missing the first time)

This is a discussion thread, not a vote.  We trust committers to do what’s 
best.  Would embracing a “anyone can revert, no shame” revert-first-then-fix 
culture benefit our community, or is our current easygoing approach working 
just fine?

[1] 
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fconcourse.apachegeode-ci.info%2Fteams%2Fmain%2Fpipelines%2Fapache-develop-main&data=02%7C01%7Cjmelchior%40vmware.com%7C147db9dcc029489fb48a08d7ec91f5c1%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637237983950472313&sdata=iEWXjnfwt1tUx6s2ODrF5aAoaNuFAphKjIo0ozE0AXU%3D&reserved=0


Re: [DISCUSS] etiquette around breaking the pipeline

2020-04-29 Thread Mark Hanson
As far as I am aware, I think this is already a settled decision. The decision 
was quick revert. 
In almost every case the build is more important than the change.

Thanks,
Mark

> On Apr 29, 2020, at 4:14 PM, Robert Houghton  wrote:
> 
> I am very pro-revert for breaking changes. The Benchmark or Windows tests
> being a culprit is unfortunate, since the PR pipeline cannot tell us about
> those, but thats the cost of our excellence :)
> 
> On Wed, Apr 29, 2020 at 4:06 PM Owen Nichols  > wrote:
> 
>> There are many ways a commit can break the Geode CI pipeline[1] despite
>> our required PR checks, such as:
>> - merge a PR that failed some non-required PR checks
>> - merge a PR that last ran checks awhile ago and now interacts
>> unexpectedly with other recent changes
>> - merge a PR that fails Windows tests
>> - merge a PR that fails Benchmarks tests
>> 
>> When a bad commit gets through for whatever reason, what should happen
>> next?  For example:
>> - bring it up on the dev list
>> - someone should revert the change as soon as possible, allowing the
>> pipeline to remain green while the issue is addressed
>> - a reasonable amount of time should be allowed for someone to make a
>> quick fix, otherwise revert.
>> - the pipeline should remain broken for as long as it takes to fix, as
>> long as there is clear communication that someone is working on it
>> - everyone look the other way and hope it fixes itself
>> 
>> I’m sure there are other ideas as well.  A related question is *who* can
>> or should revert a bad commit?  Only the person that merged the PR?  Only
>> the original author of the PR?  The first person to notice the problem?
>> 
>> What is a reasonable amount of time for this to happen?  2 hours? 2 days?
>> 2 weeks? Does it depend whether the bad commit is also affecting PR checks
>> for everyone else vs only depriving downstream consumers of new Geode
>> -SNAPSHOTs?
>> 
>> Would you take offense if someone else reverted your commit?  Does is make
>> a difference if your commit is exposing an existing issue (as opposed to
>> introducing a new bug)?
>> 
>> Is there a perception that reverts create a lot of extra work? (they
>> shouldn’t--just start your rework PR with a revert of the revert, then add
>> additional commits that resolve the issue, so reviewers can easily see what
>> was missing the first time)
>> 
>> This is a discussion thread, not a vote.  We trust committers to do what’s
>> best.  Would embracing a “anyone can revert, no shame”
>> revert-first-then-fix culture benefit our community, or is our current
>> easygoing approach working just fine?
>> 
>> [1]
>> https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fconcourse.apachegeode-ci.info%2Fteams%2Fmain%2Fpipelines%2Fapache-develop-main&data=02%7C01%7Chansonm%40vmware.com%7C631cbe6f65c14bbb030108d7ec931267%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637237988725318710&sdata=2ecuh535cj7G9i8STCT32zyunsnrcupg4c8dowrGwvo%3D&reserved=0
>>  
>> 


Re: [DISCUSS] etiquette around breaking the pipeline

2020-04-29 Thread Jinmei Liao
1. create the revert PR ASAP
2. work on the fix properly and create the fix PR
3. wait and merge whichever goes green and approved first.

On Wed, Apr 29, 2020 at 7:13 PM Joris Melchior  wrote:

> Recent experience makes me lean towards quick revert as well. Takes the
> pressure off when trying to produce a fix and if done soon enough it is
> usually quite straightforward.
> 
> From: Owen Nichols 
> Sent: April 29, 2020 7:06 PM
> To: dev@geode.apache.org 
> Subject: [DISCUSS] etiquette around breaking the pipeline
>
> There are many ways a commit can break the Geode CI pipeline[1] despite
> our required PR checks, such as:
> - merge a PR that failed some non-required PR checks
> - merge a PR that last ran checks awhile ago and now interacts
> unexpectedly with other recent changes
> - merge a PR that fails Windows tests
> - merge a PR that fails Benchmarks tests
>
> When a bad commit gets through for whatever reason, what should happen
> next?  For example:
> - bring it up on the dev list
> - someone should revert the change as soon as possible, allowing the
> pipeline to remain green while the issue is addressed
> - a reasonable amount of time should be allowed for someone to make a
> quick fix, otherwise revert.
> - the pipeline should remain broken for as long as it takes to fix, as
> long as there is clear communication that someone is working on it
> - everyone look the other way and hope it fixes itself
>
> I’m sure there are other ideas as well.  A related question is *who* can
> or should revert a bad commit?  Only the person that merged the PR?  Only
> the original author of the PR?  The first person to notice the problem?
>
> What is a reasonable amount of time for this to happen?  2 hours? 2 days?
> 2 weeks? Does it depend whether the bad commit is also affecting PR checks
> for everyone else vs only depriving downstream consumers of new Geode
> -SNAPSHOTs?
>
> Would you take offense if someone else reverted your commit?  Does is make
> a difference if your commit is exposing an existing issue (as opposed to
> introducing a new bug)?
>
> Is there a perception that reverts create a lot of extra work? (they
> shouldn’t--just start your rework PR with a revert of the revert, then add
> additional commits that resolve the issue, so reviewers can easily see what
> was missing the first time)
>
> This is a discussion thread, not a vote.  We trust committers to do what’s
> best.  Would embracing a “anyone can revert, no shame”
> revert-first-then-fix culture benefit our community, or is our current
> easygoing approach working just fine?
>
> [1]
> https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fconcourse.apachegeode-ci.info%2Fteams%2Fmain%2Fpipelines%2Fapache-develop-main&data=02%7C01%7Cjmelchior%40vmware.com%7C147db9dcc029489fb48a08d7ec91f5c1%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637237983950472313&sdata=iEWXjnfwt1tUx6s2ODrF5aAoaNuFAphKjIo0ozE0AXU%3D&reserved=0
>


-- 
Cheers

Jinmei


Re: [DISCUSS] etiquette around breaking the pipeline

2020-04-29 Thread Owen Nichols
Hi Mark, I’ve noticed some voluntarily quick reverts, which is awesome, but 
other times CI has stayed broken for days, so it doesn’t feel like we’re all on 
the same page.  I cannot find anything in the wiki or dev list archives to 
suggest this was “settled” previously, but please share a link if I missed 
something.  Consensus that "quick reverts are good” sounds nice, but the 
details of who can notice and initiate the revert make a difference (do we 
really expect every contributor to stare at CI for the next 10 hours after 
their PR is merged?).

> On Apr 29, 2020, at 8:12 PM, Mark Hanson  wrote:
> 
> As far as I am aware, I think this is already a settled decision. The 
> decision was quick revert. 
> In almost every case the build is more important than the change.
> 
> Thanks,
> Mark
> 
>> On Apr 29, 2020, at 4:14 PM, Robert Houghton  wrote:
>> 
>> I am very pro-revert for breaking changes. The Benchmark or Windows tests
>> being a culprit is unfortunate, since the PR pipeline cannot tell us about
>> those, but thats the cost of our excellence :)
>> 
>> On Wed, Apr 29, 2020 at 4:06 PM Owen Nichols > > wrote:
>> 
>>> There are many ways a commit can break the Geode CI pipeline[1] despite
>>> our required PR checks, such as:
>>> - merge a PR that failed some non-required PR checks
>>> - merge a PR that last ran checks awhile ago and now interacts
>>> unexpectedly with other recent changes
>>> - merge a PR that fails Windows tests
>>> - merge a PR that fails Benchmarks tests
>>> 
>>> When a bad commit gets through for whatever reason, what should happen
>>> next?  For example:
>>> - bring it up on the dev list
>>> - someone should revert the change as soon as possible, allowing the
>>> pipeline to remain green while the issue is addressed
>>> - a reasonable amount of time should be allowed for someone to make a
>>> quick fix, otherwise revert.
>>> - the pipeline should remain broken for as long as it takes to fix, as
>>> long as there is clear communication that someone is working on it
>>> - everyone look the other way and hope it fixes itself
>>> 
>>> I’m sure there are other ideas as well.  A related question is *who* can
>>> or should revert a bad commit?  Only the person that merged the PR?  Only
>>> the original author of the PR?  The first person to notice the problem?
>>> 
>>> What is a reasonable amount of time for this to happen?  2 hours? 2 days?
>>> 2 weeks? Does it depend whether the bad commit is also affecting PR checks
>>> for everyone else vs only depriving downstream consumers of new Geode
>>> -SNAPSHOTs?
>>> 
>>> Would you take offense if someone else reverted your commit?  Does is make
>>> a difference if your commit is exposing an existing issue (as opposed to
>>> introducing a new bug)?
>>> 
>>> Is there a perception that reverts create a lot of extra work? (they
>>> shouldn’t--just start your rework PR with a revert of the revert, then add
>>> additional commits that resolve the issue, so reviewers can easily see what
>>> was missing the first time)
>>> 
>>> This is a discussion thread, not a vote.  We trust committers to do what’s
>>> best.  Would embracing a “anyone can revert, no shame”
>>> revert-first-then-fix culture benefit our community, or is our current
>>> easygoing approach working just fine?
>>> 
>>> [1]
>>> https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fconcourse.apachegeode-ci.info%2Fteams%2Fmain%2Fpipelines%2Fapache-develop-main&data=02%7C01%7Chansonm%40vmware.com%7C631cbe6f65c14bbb030108d7ec931267%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637237988725318710&sdata=2ecuh535cj7G9i8STCT32zyunsnrcupg4c8dowrGwvo%3D&reserved=0
>>>  
>>>