Sankalp, thanks for sending the spreadsheet, and Josh, thanks for preparing this analysis (pending the image issues; I look forward to reading it)!
I'd encourage everyone involved in the project to review the list of tickets captured here. These issues aren't theoretical; they represent real scenarios that result in data loss, data corruption, incorrect responses to queries, and other violations of fundamental properties of the database.

As a community, we've made great progress over the past two years. The focus on quality has dramatically improved the safety of Cassandra as a database -- especially in the most recent patch-level releases of the 3.0.x and 3.11.x series. That said, we're not out of the woods yet. The following three issues have been reported and confirmed genuine in the past week:

– CASSANDRA-15789: Rows can get duplicated in mixed major-version clusters and after full upgrade
– CASSANDRA-15778: CorruptSSTableException after a 2.1 SSTable is upgraded to 3.0, failing reads
– CASSANDRA-15790: EmptyType doesn't override writeValue, so it could attempt to write bytes when expected not to

Regarding Dinesh's point on regression tests: we're beginning to go even further. In response to the issues in this spreadsheet, we're evolving new approaches toward *active assertion* of data integrity. CASSANDRA-15789 adds detection of primary key duplication on the read, repair, and compaction paths -- a great way to audit and remediate instances of corruption detected in a cluster. Repaired data tracking, introduced in CASSANDRA-14145, and improvements to Preview Repair are also great examples, enabling Cassandra to assert the consistency of repaired data (something we'd previously taken for granted).

Active assertion of data integrity invariants in Cassandra is an important frontier -- and one we need to explore further. Previously adopted methodologies -- property-based testing, large-scale diff tests asserting identity of data between 2.1 and 3.0.x clusters post-upgrade via billions of randomized queries, fault injection, model-based tests, CI improvements, and flaky test reduction -- have helped us make huge progress toward quality and continue to pay dividends.

I want to thank everyone for their work on safety and stability. It's clear we have more work ahead, but that work is critical to Apache Cassandra's future and to shipping a 4.0 release that users can trust and adopt quickly.

– Scott
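(For readers unfamiliar with the diff-test approach Scott describes, here is a minimal sketch of the idea, assuming the DataStax Java driver 3.x API. The contact points, keyspace "ks", table "tbl", and key range are illustrative placeholders, not the project's actual tooling; a real harness would also need token-range scans, paging, and retry handling.)

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.Iterator;
import java.util.Random;

/**
 * Issues identical randomized primary-key reads against a pre-upgrade and a
 * post-upgrade cluster and fails on the first divergent result.
 */
public class UpgradeDiffSketch {
    public static void main(String[] args) {
        try (Cluster before = Cluster.builder().addContactPoint("10.0.0.1").build();
             Cluster after = Cluster.builder().addContactPoint("10.0.1.1").build();
             Session oldSession = before.connect("ks");
             Session newSession = after.connect("ks")) {
            Random rng = new Random(42); // fixed seed makes failures reproducible
            for (long i = 0; i < 1_000_000; i++) {
                long pk = Math.floorMod(rng.nextLong(), 1_000_000_000L); // random key in the loaded range
                String cql = "SELECT * FROM tbl WHERE pk = " + pk;
                assertSameRows(oldSession.execute(cql), newSession.execute(cql), cql);
            }
        }
    }

    static void assertSameRows(ResultSet a, ResultSet b, String cql) {
        Iterator<Row> ia = a.iterator(), ib = b.iterator();
        while (ia.hasNext() && ib.hasNext()) {
            // Row.toString() is a coarse comparison; real tooling compares cell by cell.
            String ra = ia.next().toString(), rb = ib.next().toString();
            if (!ra.equals(rb)) {
                throw new AssertionError("Mismatch for [" + cql + "]: " + ra + " != " + rb);
            }
        }
        if (ia.hasNext() || ib.hasNext()) {
            throw new AssertionError("Row count mismatch for [" + cql + "]");
        }
    }
}
```

The fixed seed matters: when a mismatch surfaces, the same query sequence can be replayed to reproduce and bisect it.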
________________________________________
From: Joshua McKenzie <joshua.mcken...@gmail.com>
Sent: Thursday, May 7, 2020 9:31 AM
Cc: dev@cassandra.apache.org
Subject: Re: List of serious issues fixed in 3.0.x

Hearing the images got killed by the web server. Trying from gmail (sorry for the spam). Time to see if it's the Apache SMTP server or the list culling images:
-------------------------------------------
I did a little analysis on this data (any defect marked with fixversion 4.0 that rose to the level of critical in terms of availability, correctness, or corruption/loss) and charted some things the rest of the project community might find interesting:

1: Critical (availability, correctness, corruption/loss) defects fixed per month, starting about 6 months before 3.11.0: [monthly.png]

2: Components in which critical defects arose (note: the bright red bar is the sum of the 3 dark red bars): [Total Defects by Component.png]

3: Type of defect found and fixed (bright red: cluster down or permanent loss; dark red: temporary corruption/loss; yellow: incorrect response): [Total Defects by Type.png]

My personal takeaways from this: a ton of great defect-fixing work has gone into 4.0. I'd love it if we had both code coverage analysis for testing on the codebase and data to surface defect hotspots in the code that might need further testing (caveat: many in this project community have voiced skepticism about the value of this type of data in the past, so that's probably a conversation for another thread).

Hope someone else finds the above interesting, if not useful.

--
Joshua McKenzie

On Wed, May 6, 2020 at 3:38 PM Dinesh Joshi <djo...@apache.org> wrote:

Hi Sankalp,

Thanks for bringing this up. At the very minimum, I hope we have regression tests for the specific issues we have fixed. I personally think the project should focus on building a comprehensive test suite. However, some of these issues can only be detected at scale. We need users to test* C* in their environments for their use cases. Ideally these folks stand up large clusters, tee their traffic to the new cluster, and report issues. If we had an automated test suite that everyone can run at large scale, that would be even better.

Thanks,
Dinesh

* test != starting C* on a few nodes and looking at logs.

> On May 6, 2020, at 10:11 AM, sankalp kohli <kohlisank...@gmail.com> wrote:
>
> Hi,
> I want to share some of the serious issues that were found and fixed in
> 3.0.x. I have created this list from JIRA to help us identify areas for
> validating 4.0. This will also give the dev community some insight.
>
> Let us know if anyone has suggestions on how to better use this data in
> validating 4.0. Also, this list may be missing some issues identified
> early in 3.0.x, as well as some of the latest ones.
>
> Link: https://tinyurl.com/30seriousissues
>
> Thanks,
> Sankalp
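(A minimal sketch of the traffic-teeing idea from Dinesh's note above, again assuming the DataStax Java driver 3.x API. The class is an illustrative placeholder, not existing Cassandra tooling; a production tee would also mirror writes, sample rather than compare every query, and handle paging and timeouts.)

```java
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Tees every read to a candidate cluster and counts result mismatches,
 * keeping the comparison off the production response path.
 */
public class ShadowReadTee {
    private final Session prod;      // serving cluster (e.g. 3.11.x)
    private final Session candidate; // cluster under validation (e.g. 4.0)
    private final ExecutorService shadowPool = Executors.newFixedThreadPool(4);
    private final AtomicLong mismatches = new AtomicLong();

    ShadowReadTee(Session prod, Session candidate) {
        this.prod = prod;
        this.candidate = candidate;
    }

    /** Serves the read from production; mirrors and compares in the background. */
    List<Row> read(String cql) {
        List<Row> prodRows = prod.execute(cql).all();
        String expected = prodRows.toString(); // coarse equality; real tooling compares cell by cell
        shadowPool.submit(() -> {
            String actual = candidate.execute(cql).all().toString();
            if (!actual.equals(expected)) {
                mismatches.incrementAndGet(); // a real harness would log the divergent query
            }
        });
        return prodRows;
    }

    long mismatchCount() {
        return mismatches.get();
    }
}
```

The design point is that the candidate cluster never influences the response served to clients: results are compared asynchronously, so a misbehaving candidate node can only produce a mismatch report, not an outage.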