Cassandra project biweekly status update 2021-11-08

Joshua McKenzie Mon, 08 Nov 2021 12:46:27 -0800

First off - Congrats again to Sumanth Pasupuleti on becoming a committer on
the project! Well deserved; looking forward to working with you further.


It looks like ponymail got an upgrade; I didn't even realize that was
possible at this point. :) So caveat emptor: the links I put in here to
individual email threads are different than in the past but appear to be
working.

[New contributors getting started]
There's been some discussion about whether the #cassandra-dev channel with
600 people in it is the best place for new contributors to get involved and
publicly ask beginner questions or whether we should start a new channel
with a somewhat more limited scope. Please chime in on that dev mailing
list thread if you have an opinion:
https://lists.apache.org/thread/x8fx9b22nfll3gd40w4o971cyznckxrz

If you're a new contributor, we recommend starting in one of two places:
failing tests, or starter tickets we label "lhf" (low hanging fruit).
Query for failing tests:
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=496&quickFilter=2252
Query for unassigned starter tickets:
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484&quickFilter=2162&quickFilter=2160

We're up from 18 unassigned test failures to 22 in the past couple of
weeks. David Capwell, Berenguer Blasi, and Ekaterina Dimitrova (and
others!) have been doing some great work both surfacing failures and
fixing them - thank you!

For unassigned lhf, we're up from 10 to 11 on 4.0.2 (our next minor
release) and up from 13 to 14 on 4.1.0 (our next major release). Feel free
to self-select from that list, hit up this email thread or list if you want
some guidance on where to get involved, ping in the #cassandra-dev slack
channel on the-asf.slack.com server, or email or message me directly if you
want any help.

[Dev list discussions in the past 14 days]
https://lists.apache.org/[email protected]:lte=2w:

We have an ongoing discussion about what it means to have a releasable
trunk and what steps, if any, it'd take to get there. Given the scale and
complexity of this project and its testing infrastructure, I'm curious to
hear what other experiences people have had with applying select CI and CD
principles to an ecosystem like this:
https://lists.apache.org/thread/kyyo5k3my2nx160mfgy0xkwo8xjh2qpv

As mentioned above, there's an ongoing discussion about how to make the
cassandra dev community more welcoming for newcomers:
https://lists.apache.org/thread/x8fx9b22nfll3gd40w4o971cyznckxrz

Andres surfaced CEP-3 for guardrails in which we all professed our
continued love for JMX (especially you Patrick). It'd be great to see more
operators chime in with their experience running clusters at scale and the
types of usage anti-patterns that destabilize clusters, since guardrails
would be a great way to expose protection against frequently occurring
patterns that scale poorly, among other things (tombstone heavy workloads
and thousands of tables, anyone?).


CEP-18: Improving Modularity is going to be deprecated in favor of
module-specific refactors and optional implementations.

CEP-17: SSTable format API is evolving nicely:
https://lists.apache.org/thread/boqb5trkq1q38rmb50p4lsw95hyv053m

And these are just the highlights!

[Tickets in the past 14 days]
On the 4.0.2 front we've closed out 5 tickets compared to 9 in the prior 2
weeks. Looks like permissions, some timeouts during replica failure,
website updates, etc.

For 4.1.0 we've closed out 8 issues down from 14. Some stability in schema
pulls, commit log stability during testing, a slew of test fixes, and a new
feature to allow denying access to configured partition keys for reads,
writes, or range reads based on config (CQL or JMX).

[Tickets that need attention]
Needs Reviewer:
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484&selectedIssue=CASSANDRA-16547&quickFilter=2259

I've tidied up / created a new quick filter that shows tickets that are in
progress, blocked, or patch available but lacking a reviewer. This is
slightly opinionated of me in that it implies we should have reviewers for
things as we work on them rather than once they're further along in being
written; I have a bias towards early inclusion of a 2nd pair of eyes and a
sounding board. If you see anything on this list that you're qualified to
review, or you know that area of the code-base and have a few cycles,
please take a look and help out.

Workload wise, 14 tickets on 4.0.2 need reviewers and 34 on 4.1.0 by this
definition.

I'm going to refrain from linking to stalled tickets (30d inactive) for
now; the load of that is high (80 on 4.0.2, 422 on 4.1.0) so we probably
should approach this a little differently if we want to tidy up or prune
that backlog. It's as simple as a fixversion flag so doesn't really
indicate _too_ much to worry about.

[Test Failure Trendlines]
So first off, we have a good number of tests in this project. 43,000 or so
now. It's helpful to keep that in mind when we talk about having 5, 10, or
even 50 test failures relative to the total corpus. Unfortunately,
databases are like compilers in that they're rather unforgiving of even a
.125% failure rate.
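
For a rough sense of scale, here's a minimal sketch of that arithmetic. It
assumes the approximate 43,000 test count quoted above; the function name
is purely illustrative and not part of any project tooling.

```python
# Back-of-the-envelope arithmetic only. 43,000 is the approximate test
# count quoted in this update, not an exact figure.
TOTAL_TESTS = 43_000


def failure_rate_percent(failures: int, total: int = TOTAL_TESTS) -> float:
    """Return `failures` as a percentage of `total` tests."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * failures / total


# Print the rate for the failure counts mentioned in the text.
for failures in (5, 10, 50):
    print(f"{failures:>2} failures -> {failure_rate_percent(failures):.3f}%")
```

Under that assumption, 5 failures is roughly 0.012% of the corpus, 10 is
roughly 0.023%, and 50 is roughly 0.116%.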

So what's our test failure trend? We have 2 trendlines of interest:
1) Test failures on the project that are documented in JIRA tickets:
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=496&view=reporting&chart=cumulativeFlowDiagram&swimlane=1233&swimlane=1234&column=2195&column=2196&column=2197&days=90

We can see where I got feisty creating test failure tickets when trying to
merge the Denylist patch a week ago. In general, the volume of "open
tickets for known test failures" has been growing:
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=496&view=reporting&chart=cumulativeFlowDiagram&swimlane=1233&swimlane=1234&column=2195&column=2196&column=2197&days=90

That said, this could be due to a variety of factors: more failures,
increased discipline around tracking, or even poor hygiene closing out
tickets when we fix the related tests.

2) The metric that I think is a bit cleaner and more informative is our
test failure history on our jenkins build server (assuming I can ever get
it to load /groan):

https://ci-cassandra.apache.org/job/Cassandra-trunk/lastCompletedBuild/testReport/history/

In general we've been pretty clean (meaning single digit failures) since
the 4.0 release; as discussed in another thread, the recent spate of
failures caused by dtest-api dependency changes is being addressed in
CASSANDRA-17050. Silver lining: that situation has surfaced 1) a need for a
discussion and improvement around how we work with dependent projects and
release dependencies in Cassandra (all in one IDE as subprojects vs.
separate projects, release dependencies, etc.), and we can expect to see a
DISCUSS thread about that soon, and 2) that there have been broader
failures in some of the python dtests for a while now that we need to get
to the bottom of.

And that's a wrap, folks. I call this one "The Calm Before the Storm" if our
CEPs are any indicator. :)

As always, thanks everyone for the time, effort, and collaboration on the
project.

~Josh
