Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Mark Hanson Wed, 09 Jun 2021 11:42:05 -0700

I am agreeing with Kirk's original desire to remote stress-new-test as a 
requirement. All others would remain.


Thanks,
Mark

On 6/9/21, 11:39 AM, "Owen Nichols" <onich...@vmware.com> wrote:

    The concern of holding the discussion via PR comments, rather than the dev 
list, is just that many people will be excluded.  In the past the Geode 
community has viewed overrides as highly exceptional, and perhaps indicative of 
a larger problem, so I feel that any override situation is of interest to the 
entire community.  But at minimum, as long as there's a public recorder of 
decision *somewhere* that can be cited, the individual performing the override 
can prove that they are acting with concensus.

    Marks seems to be proposing something that goes far beyond that, 
essentially arguing that since we now have codeowners, we don't need gating PR 
checks at all, so there would be nothing to override.

    On 6/9/21, 11:22 AM, "Dale Emery" <dem...@vmware.com> wrote:

        Here’s the kind of scenario I’m imagining:

          *   Code owners and other reviewers review the PR in the usual way 
(either before or after the tests finish).
          *   Stress new test fails, perhaps multiple times.
          *   The committer investigates and, upon concluding that the failed 
tests are not related to the change, requests an override from the code owners.
          *   Code owners review the failures and the committer’s 
justification, and decide whether to override the failures.

        In this scenario, the extra burden on code owners arises only at the 
committer’s request.

        Dale

        From: Owen Nichols <onich...@vmware.com>
        Date: Wednesday, June 9, 2021 at 11:15 AM
        To: dev@geode.apache.org <dev@geode.apache.org>
        Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement 
from PRs
        This would substantially increase the burden on codeowners, because now 
in addition to looking at the code itself, they would have to wait for any 
running PR checks to complete, as well as remember to come back and look at the 
PR after any subsequent commits to make sure the checks are still passing.

        On 6/9/21, 10:56 AM, "Mark Hanson" <hans...@vmware.com> wrote:

            I think that process is a bit much. PRs are already a challenge. 
What do people think about code owners being the gate. We can accept by custom 
that there should be no stress-new-test failures. If there is a failure, a code 
owner can request a change or decide to let it go. I think that is sufficient 
all things considered.

            Thanks,
            Mark

            On 6/9/21, 10:43 AM, "Owen Nichols" <onich...@vmware.com> wrote:

                I feel that a traditional [DISCUSS] and [VOTE] on the dev list 
would be sufficient and proper to grant approval for an override.  Any PR 
already needs approval from 1 codeowner per area to merge, so codeowners 
already have a little more say because they hold veto power over the PR.

                In terms of "practicalities of how this would actually work":
                Step 1: start a [DISCUSS] thread explaining the problem and why 
you think an override is justified
                Step 2: if there is concensus, [VOTE]
                Step 3: Myself (or whoever performs the override) must cite a 
link to the vote thread


                On 6/9/21, 10:16 AM, "Dale Emery" <dem...@vmware.com> wrote:

                    I too like #1 best for now… assuming it’s possible to give 
code owners this ability.

                    Coincidentally, about option #3, II was reading the git 
release notes just now, and noticed there’s a new “trailers” feature. It gives 
git the ability to parse “key: value” pairs at the end of a commit message. We 
could potentially (with a sufficiently current version of git) use that to 
exclude a test from a PR stress test run.

                    And, yeah, option #2 brings back the @FlakyTest annotation 
that we worked so hard to eliminate.

                    As Mark said, none of this fixes the underlying problem, 
which I’d frame as: We have too many tests whose results we don’t trust.

                    Dale

                    From: Kirk Lund <kl...@apache.org>
                    Date: Wednesday, June 9, 2021 at 9:59 AM
                    To: dev@geode.apache.org <dev@geode.apache.org>
                    Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 
requirement from PRs
                    I do like the suggestions offered up by Dale and would 
encourage (or even
                    plead) with my fellow contributors to consider these:

                      *   Allow code owners to override the block, if they can 
be convinced the
                    > override is justified.
                    >   *   Exclude troublesome tests from stress test runs, 
either via
                    > annotations or via an `assumeThat(…)` that can detect 
that it’s running as
                    > a stress test. Whatever the mechanism for excluding, it 
would be in the
                    > code, and therefore subject to code owner review. (This, 
too, feels overly
                    > broad to me, as it would exclude the test from all stress 
test runs.)
                    >   *   A way to exclude a specific test method from 
running in the stress
                    > tests for a specific PR or commit. I don’t have any ideas 
for how to
                    > declare such an exclusion, but if it could be done via a 
file it would be
                    > subject to code owner review.


                    1) Allow code owners to override the block, if they can be 
convinced the
                    override is justified.

                    After all, if we don't trust our code owners...

                    2-3) Use a custom annotation to exclude the test method or 
test class only
                    from stress-new-test.

                    At first I really liked this idea, but then we end up with 
growing a
                    collection of flaky tests that are excluded in some way from
                    stress-new-test that still occasionally fail in 
distributedTest.

                    #1 really sounds like the best option to me. I believe that 
leaving our
                    stress-new-test process as-is will only discourage everyone 
from fixing one
                    or two flaky tests in a large dunit test. However, I also 
believe that if
                    we give code owners the authority to override 
stress-new-test, then we
                    need to encourage them not to override this too often.

                    On Tue, Jun 8, 2021 at 11:11 AM Dale Emery 
<dem...@vmware.com> wrote:

                    > Maybe we can find a way to relax the requirement, or to 
allow addressing
                    > specific situations like the tangle you find yourself in.
                    >
                    > Removing the requirement altogether feels overly broad. I 
fear it would
                    > allow us to quietly disregard all intermittent test 
failures, and I think
                    > we already quietly (or even actively) disregard way too 
many kinds of
                    > failures.
                    >
                    > I would prefer some way to explicitly disregard only the 
specific test
                    > failures that prevent us from merging, and only with some 
amount of
                    > explicit justification.
                    >
                    > I’m not sure what that would look like. Some half-baked 
possibilities:
                    >
                    >   *   Allow code owners to override the block, if they 
can be convinced
                    > the override is justified.
                    >   *   Exclude troublesome tests from stress test runs, 
either via
                    > annotations or via an `assumeThat(…)` that can detect 
that it’s running as
                    > a stress test. Whatever the mechanism for excluding, it 
would be in the
                    > code, and therefore subject to code owner review. (This, 
too, feels overly
                    > broad to me, as it would exclude the test from all stress 
test runs.)
                    >   *   A way to exclude a specific test method from 
running in the stress
                    > tests for a specific PR or commit. I don’t have any ideas 
for how to
                    > declare such an exclusion, but if it could be done via a 
file it would be
                    > subject to code owner review.
                    >
                    > Dale
                    >
                    > From: Kirk Lund <kl...@apache.org>
                    > Date: Tuesday, June 8, 2021 at 9:33 AM
                    > To: dev@geode.apache.org <dev@geode.apache.org>
                    > Subject: [DISCUSS] Remove stress-new-test-openjdk11 
requirement from PRs
                    > Our requirement for stress-new-test-openjdk11 to pass 
before allowing merge
                    > doesn't really work as intended for fixing distributed 
tests that contain
                    > multiple flaky test methods. In fact, I think it causes 
contributors to
                    > avoid tackling flaky tests.
                    >
                    > I've been working on GEODE-9103: CI Failure:
                    > 
PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
                    > 
<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Chansonm%40vmware.com%7C486b07d4c6e6454ed3b308d92b75e43d%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588607622620801%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=r0xpSM5ftHCPKIjjTVhgJ9WuCWb0reaH4PhAviQWJNA%3D&amp;reserved=0>
 and was able to fix it.
                    >
                    > However, stress-new-test-openjdk11 then continued to fail 
for other flaky
                    > tests in PutAllClientServerDistributedTest. I managed to 
fix GEODE-9296 and
                    > GEODE-8528 as well. I also tried but have not been able 
to fix GEODE-9242
                    > which remains flaky.
                    >
                    > Unfortunately, I cannot merge any of my fixes for
                    > PutAllClientServerDistributedTest unless every single 
flaky test in it is
                    > fixed by my PR. I think this is undesirable because it 
would be better to
                    > merge the fix for 3 flaky test methods than none.
                    >
                    > UPDATE: After running my precheckin more times, I did get
                    > stress-new-test-openjdk11 to pass once so I can merge, 
but that's more of a
                    > loophole than anything because I didn't manage to fix 
GEODE-9242.
                    >
                    > Despite having PR #6542 eventually pass, I would like to 
discuss removing
                    > or relaxing our requirement that 
stress-new-test-openjdk11 must pass in
                    > order to merge a PR...
                    >
                    > PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
                    > PutAllClientServerDistributedTest
                    > <
                    > 
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Chansonm%40vmware.com%7C486b07d4c6e6454ed3b308d92b75e43d%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588607622620801%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=XO4mOvidSJz4pRuIYg2ryesZt1oQ9oBO2YZ6ysWfwjM%3D&amp;reserved=0
                    > >
                    >
                    > Fixed in PR #6542:
                    > * GEODE-9296: CI Failure: 
PutAllClientServerDistributedTest >
                    > testPartialKeyInPRSingleHopWithRedundancy
                    > 
<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9296&amp;data=04%7C01%7Chansonm%40vmware.com%7C486b07d4c6e6454ed3b308d92b75e43d%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588607622630755%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=4CNSpaZEOk1fPxglyJoN5vFb9kmW93rBkjcusxcv15s%3D&amp;reserved=0>
                    > * GEODE-9103: CI Failure:
                    > 
PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
                    > 
<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Chansonm%40vmware.com%7C486b07d4c6e6454ed3b308d92b75e43d%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588607622630755%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=V3%2FxEILHSzBiXmbYHsblLbHZ2fNE5Sz84P%2FE2Pj0qA4%3D&amp;reserved=0>
                    > * GEODE-8528: 
PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
                    > fails due to missing disk store after server restart
                    > 
<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-8528&amp;data=04%7C01%7Chansonm%40vmware.com%7C486b07d4c6e6454ed3b308d92b75e43d%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588607622630755%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=lMmrMDntB7ydwlbm6tEAO2liSX0sWIgwyudjyfQCpv0%3D&amp;reserved=0>
                    >
                    > Still flaky:
                    > * GEODE-9242: CI failure in 
PutAllClientServerDistributedTest >
                    > testEventIdOutOfOrderInPartitionRegionSingleHop
                    > 
<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9242&amp;data=04%7C01%7Chansonm%40vmware.com%7C486b07d4c6e6454ed3b308d92b75e43d%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588607622630755%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=4%2FvQFkhPb1yn3PCW15wWWuYOqdL6KrQxB4bvF5giEd8%3D&amp;reserved=0>
                    >
                    > Thanks,
                    > Kirk
                    >

Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Reply via email to