Cassandra project status, 2023-01-26

Josh McKenzie Thu, 26 Jan 2023 13:41:30 -0800

After a bit of time away, I'm ready to regale you with tales of things you've 
already seen on the dev list and JIRA. ;)

Let's start with calling out that registrations for the Cassandra Summit are
open. Patrick did a better job than I ever could summarizing this in his email
poetically titled "Cassandra Summit update for 2023-01-24", which you can find
here: https://lists.apache.org/thread/7roz6z8nvj9cz8o2jwwo1httl85mwjcs. If you
haven't registered yet and are in the area or receptive to travel, you should
seriously consider going - it's always great to be at a conference with other
people brainstorming, lamenting, and celebrating our shared experiences with
this software project.

From a technical perspective, there are 2 things I want to call out. One: I
want to draw everyone's attention to the epic Mick has put together for an
effort to make ASF CI not only stable, but also repeatable on other
containerized cloud-native environments:
https://issues.apache.org/jira/browse/CASSANDRA-18137

There's a lot of context there, but the 4 high-level goals are:
1. Reproducible reference ASF CI environment so contributors can clone it.
2. An accepted “test result output” format that will certify a commit
regardless of CI env.
3. Turnaround times as fast as circleci (cloned environment scales to capacity).
4. Intuitive CI implementation accessible to new contributors.

Ultimately, the ideal best-case would be that we could get away from having 2
CI systems, one of which is a paid-for service, and have a "reproducible
runnable CI" deterministic Thing contributors can run to get insight into their
contributions and their stability. Taking this a logical step further, those of
us that are currently spending money on a paid-for CI system could potentially
better spend that money on a shared CI infrastructure that the entire project
could use and benefit from.

Quite a bit of work has fallen out from that epic and is linked from the
ticket; please take 5 minutes to scan through the ticket and some of the
sub-tasks so it's at least on your radar. Stable public CI is something we've
struggled with for _years_, but we've made huge strides in the past year and my
intuition tells me there's a light at the end of this tunnel.

Mick also hit up the dev ML w/a thread on this offering more context:
https://lists.apache.org/thread/fqdvqkjmz6w8c864vw98ymvb1995lcy4

The second thing: The Build Lead role! We need volunteers:
https://cwiki.apache.org/confluence/display/CASSANDRA/Build+Lead. So the TL;DR
on this and why you should consider it: it takes 30-60 minutes *for the entire
week*, it helps us stay on top of our CI infrastructure and test failures,
you'll receive the undying gratitude of many of us on the project, and you also
get some insight into interesting dark corners of the CI infra and testing
system you might otherwise never have known about. You don't need to triage or
attribute failures in the role unless you really want to; getting them
reflected in JIRA is the low hanging fruit here.

[New Contributors Getting Started]
(Unassigned) (Starter Tickets): this is the set of filters you want to pull
from on our project's Kanban board:
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484&quickFilter=2162&quickFilter=2160

We have 26 issues in 4.x (22 really; looks like there's 4 either in progress or
review that need assignee tidied up). 8 issues in 4.0.x, and another 5 floating
around there.

If any of those catch your fancy, join us in the #cassandra-dev channel on
https://the-asf.slack.com (reply to me on this email if you need an invite for
your account), and hit up the @cassandra_mentors alias to reach 13 of us just
waiting with bated breath to help you get oriented. :)

And hey, if any of these _don't_ catch your fancy but you're still interested
in the project and are looking for something interesting to get involved with,
just hop in the slack channel and raise the :batsignal:

[Dev mailing list]
So it's been... a bit. Since I sent the last project status update. Thankfully
it's the holiday season so we didn't accumulate a crushing load of things I
have to summarize for us here:
https://lists.apache.org/list?dev@cassandra.apache.org:dfr=2022-12-19|dto=2023-1-23:

The vote for the Trie-indexed SSTable format passed about a month ago -
congratulations Branimir and team!
https://lists.apache.org/thread/d4sr3jkt4xjn86xrf9h708y6s7lc53v5

I sent out an email discussing taking the smallest conceivable baby steps in
formalizing performance testing for the project here:
https://lists.apache.org/thread/kzbv632tm0j99mg10z24wb8f09z0r81z. It seems like
the general consensus is that there's _a lot_ of appetite to engage on this
topic and interesting ideas, and most people aren't all that interested in (nor
disagreeing with) the bare bones v1 of "get a repeatable test with a repeatable
runtime env setup and iterate from there". I think the real challenge here will
be us having discipline in keeping the "community wants to run fast with this"
and "our reference workloads are going to be simple and boring and minimal"
axes of this separate. I have faith in us.

Stefan Miklosovic hit up the list about including mockito-inline in our testing
deps: https://lists.apache.org/thread/tpswxpgdpvj4lfovk4gj9dxyqyjtwv6w. I
immediately had a vision of stubbed in mocks of all our static state and unit
tests that tested actual units and weren't integration tests. It was blinding,
and it was glorious.

Abe Ratnofsky inquired about whether or not the driver donation is still in
flight for CEP-22:
https://lists.apache.org/thread/fsc7gzdhtm2lcb268556jqorkm8l6mmq. Jeremy has
indicated that it is, in fact, still in flight. :)

G1GC is the new default! See:
https://issues.apache.org/jira/browse/CASSANDRA-18027, and the ML thread here:
https://lists.apache.org/thread/8y84ncg51y77g302zp6y9dnp83fnw9rl

The topic of downgrading SSTables has come up from Claude and now Jacek:
https://lists.apache.org/thread/mwkvlno69mzb4thzhvd6gkntdqc6oypk. Not much
engagement on the ML thread here yet; I'd be interested if there's other
opinions out there in the community about this as it seems like a key part of
our "safe to upgrade" story for users.

Benedict Elliott Smith hit up the dev ML about how we handle intra-project
dependencies: https://lists.apache.org/thread/tdvxhogy4m3hrm08421211kwvf5y1c1n.
There's a lot of context and nuance on this thread; I'm not going to even
_attempt_ to summarize anything here. Suffice to say, we're all wrestling with
determining what the Best Worst Option is for us as a project. As with most
build tooling (and technology in general for that matter), there's rough edges
and aspects of solutions that don't quite fit right and are going to be ongoing
niggling annoyances. Seems like we're narrowing down on a next step and
direction on that thread.

When and how to merge the feature branch for CEP-15 has come up here:
https://lists.apache.org/thread/wd8cbc1ox3okxxy2m7322x4bgmrt8068. It's an
excellent question; this is the first CEP-based work where we're facing this
question of "do we wait until the entire thing is 'complete' on the feature
branch to merge it or do we merge it in earlier and then work it on trunk"?
There's definite tradeoffs to either approach; please hit up the thread if you
have an opinion on the topic that hasn't yet been voiced.

Looks like we're growing up and we're going to have a Publicity and Marketing
group: https://lists.apache.org/thread/cd75fx95l827bdf7om9v52zp414vjmv1. Most
engineers I know aren't that interested in the topic and at best find it mildly
distasteful, but if you're the rare bird that finds the topic of marketing,
messaging, and storytelling about technology interesting, give this thread a
read and consider participating in the group!

[Checking in on CI]
https://butler.cassandra.apache.org/#/

It's been a month. So what's going on?

3.0: 14 -> 13
3.11: 24 -> 13
4.0: 2 -> 4
4.1: 5 -> 6
trunk: 3 -> 12

Pretty good on all branches excepting trunk. If we zoom in on trunk a bit:
https://butler.cassandra.apache.org/#/ci/upstream/compare/Cassandra-trunk/trunk

Lots of timeouts on run 1436 that probably skewed things.

[What's been closed out]
Here's a custom quick filter to give us an overview of 30 days:
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484&quickFilter=2278

5.0: 2 issues:
- Support for CAS and serial read on Accord (CASSANDRA-18100)
- Strict Serializability verification on Accord (CASSANDRA-17112)

4.0.x: 10 issues:
- Prefer snakeyaml's SafeConstructor over Constructor; not a concern for us
w/current usage env but no need not to harden the server to it (CASSANDRA-18150)
- A variety of test fixes
- Make it safe to run nodetool cleanup during bootstrap or decommission
(CASSANDRA-16418)
- Change rat targets to adhere to build.dir property (CASSANDRA-18183) (go go
gadget ram disks?)
- Fix an IllegalArgumentException in query code paths while upgrading C*
(CASSANDRA-17507) - looks like it applies to all supported versions.
- A variety of documentation fixes (17742, 17241)
- Add support for python 3.11 in dtests (CASSANDRA-18121) (CASSANDRA-18088)
- Fix inadvertent revert of user credentials on addition of new nodes to
cluster (CASSANDRA-12525)
- Fix cqlsh formatting duration incorrectly (CASSANDRA-18141)
- Fix sstable loading of keyspaces named snapshots or backups (CASSANDRA-14013)
- generate.sh needs to stop lying with its return code (CASSANDRA-18032)
- Stop leaking 2015 memtable synthetic Epoch (CASSANDRA-18118)
- Augment intellij git window navigation links to Cassandra's JIRA
(CASSANDRA-18126)

4.x: 11 issues:
- Reverted changes in units output in FileUtils#stringifyFileSize
(CASSANDRA-18139)
- Give clear error when certain nodetool commands are issued before server is
ready (CASSANDRA-11537)
- The source code must obey the avoid star import checkstyle rule
(CASSANDRA-18089). IT MUST OBEY. :D
- Use G1GC as default (CASSANDRA-18027)
- CASTest failure fix (CASSANDRA-18164)
- Allow SimpleSeedProvider to resolve multiple IPs per DNS name
(CASSANDRA-14361)
- Add compaction type output result for nodetool compactionhistory
(CASSANDRA-18061)
- mockito-inline causes tests to fail because
o.a.c.distributed.mock.nodetool.InternalNodeProbe spies on StorageServiceMBean
(CASSANDRA-18152)
- Remove ProtocolVersion entirely from the CollectionSerializer ecosystem
(CASSANDRA-18114). Deleting code makes me happy.
- Streaming progress virtual table lock contention can trigger TCP_USER_TIMEOUT
and fail streaming (CASSANDRA-18110)

Phew. So the downside to waiting this long is when we close a lot of tickets,
this last section takes a while. Thanks for sticking with me to the end, folks!

~Josh
