Sankalp, thanks for sending the spreadsheet, and Josh, thanks for preparing this analysis (pending the image issues; I look forward to reading it)!
I'd encourage everyone involved in the project to review the list of tickets captured here. These issues aren't theoretical; they represent real scenarios that result in data loss, data corruption, incorrect responses to queries, and other violations of fundamental properties of the database.

As a community, we've made great progress over the past two years. The focus on quality has dramatically improved the safety of Cassandra as a database -- especially in the most recent patch-level releases of the 3.0.x and 3.11.x series. That said, we're not out of the woods yet. The following three issues have been reported and confirmed genuine in the past week:

– CASSANDRA-15789: Rows can get duplicated in mixed major-version clusters and after full upgrade
– CASSANDRA-15778: CorruptSSTableException after a 2.1 SSTable is upgraded to 3.0, failing reads
– CASSANDRA-15790: EmptyType doesn't override writeValue, so it could attempt to write bytes when expected not to

Regarding Dinesh's point on regression tests: we're beginning to go even further. In response to the issues in this spreadsheet, we're evolving new approaches toward *active assertion* of data integrity. CASSANDRA-15789 adds detection of primary key duplication on the read, repair, and compaction paths -- a great way to audit and remediate instances of corruption detected in a cluster. Repaired data tracking, introduced in CASSANDRA-14145, and improvements to Preview Repair are also great examples, enabling Cassandra to assert the consistency of repaired data (something we'd previously taken for granted).

Active assertion of data integrity invariants in Cassandra is an important frontier -- and one we need to explore further. Previously adopted methodologies -- property-based testing, large-scale diff tests asserting identity of data between 2.1 and 3.0.x clusters post-upgrade via billions of randomized queries, fault injection, model-based tests, CI improvements, and flaky test reduction -- have helped us make huge progress toward quality and continue to pay dividends.

I want to thank everyone for their work on safety and stability. It's clear we have more work ahead, but that work is critical to Apache Cassandra's future and to shipping a 4.0 release that users can trust and adopt quickly.

– Scott
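(For readers unfamiliar with the diff-test approach Scott describes, here is a minimal sketch of the idea, assuming the DataStax Java driver 3.x API. The contact points, keyspace "ks", table "tbl", and key range are illustrative placeholders, not the project's actual tooling; a real harness would also need token-range scans, paging, and retry handling.)

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.Iterator;
import java.util.Random;

/**
 * Issues identical randomized primary-key reads against a pre-upgrade and a
 * post-upgrade cluster and fails on the first divergent result.
 */
public class UpgradeDiffSketch {
    public static void main(String[] args) {
        try (Cluster before = Cluster.builder().addContactPoint("10.0.0.1").build();
             Cluster after = Cluster.builder().addContactPoint("10.0.1.1").build();
             Session oldSession = before.connect("ks");
             Session newSession = after.connect("ks")) {
            Random rng = new Random(42); // fixed seed makes failures reproducible
            for (long i = 0; i < 1_000_000; i++) {
                long pk = Math.floorMod(rng.nextLong(), 1_000_000_000L); // random key in the loaded range
                String cql = "SELECT * FROM tbl WHERE pk = " + pk;
                assertSameRows(oldSession.execute(cql), newSession.execute(cql), cql);
            }
        }
    }

    static void assertSameRows(ResultSet a, ResultSet b, String cql) {
        Iterator<Row> ia = a.iterator(), ib = b.iterator();
        while (ia.hasNext() && ib.hasNext()) {
            // Row.toString() is a coarse comparison; real tooling compares cell by cell.
            String ra = ia.next().toString(), rb = ib.next().toString();
            if (!ra.equals(rb)) {
                throw new AssertionError("Mismatch for [" + cql + "]: " + ra + " != " + rb);
            }
        }
        if (ia.hasNext() || ib.hasNext()) {
            throw new AssertionError("Row count mismatch for [" + cql + "]");
        }
    }
}
```

The fixed seed matters: when a mismatch surfaces, the same query sequence can be replayed to reproduce and bisect it.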
________________________________________
From: Joshua McKenzie <joshua.mcken...@gmail.com>
Sent: Thursday, May 7, 2020 9:31 AM
Cc: dev@cassandra.apache.org
Subject: Re: List of serious issues fixed in 3.0.x

Hearing the images got killed by the web server. Trying from gmail (sorry for the spam). Time to see if it's the Apache SMTP server or the list culling images:
-------------------------------------------
I did a little analysis on this data (any defect marked with fixversion 4.0 that rose to the level of critical in terms of availability, correctness, or corruption/loss) and charted some things the rest of the project community might find interesting:

1: Critical (availability, correctness, corruption/loss) defects fixed per month, starting about 6 months before 3.11.0: [monthly.png]

2: Components in which critical defects arose (note: the bright red bar is the sum of the 3 dark red bars): [Total Defects by Component.png]

3: Type of defect found and fixed (bright red: cluster down or permanent loss; dark red: temporary corruption/loss; yellow: incorrect response): [Total Defects by Type.png]

My personal takeaways from this: a ton of great defect-fixing work has gone into 4.0. I'd love it if we had both code coverage analysis for testing on the codebase and data to surface defect hotspots in the code that might need further testing (caveat: many in this project community have voiced skepticism about the value of this type of data in the past, so that's probably a conversation for another thread).

Hope someone else finds the above interesting, if not useful.

--
Joshua McKenzie

On Wed, May 6, 2020 at 3:38 PM Dinesh Joshi <djo...@apache.org> wrote:

Hi Sankalp,

Thanks for bringing this up. At the very minimum, I hope we have regression tests for the specific issues we have fixed. I personally think the project should focus on building a comprehensive test suite. However, some of these issues can only be detected at scale. We need users to test* C* in their environments for their use cases. Ideally these folks stand up large clusters, tee their traffic to the new cluster, and report issues. If we had an automated test suite that everyone can run at large scale, that would be even better.

Thanks,
Dinesh

* test != starting C* on a few nodes and looking at logs.

> On May 6, 2020, at 10:11 AM, sankalp kohli <kohlisank...@gmail.com> wrote:
>
> Hi,
> I want to share some of the serious issues that were found and fixed in
> 3.0.x. I have created this list from JIRA to help us identify areas for
> validating 4.0. This will also give the dev community some insight.
>
> Let us know if anyone has suggestions on how to better use this data in
> validating 4.0. Also, this list may be missing some issues identified
> early in 3.0.x, as well as some of the latest ones.
>
> Link: https://tinyurl.com/30seriousissues
>
> Thanks,
> Sankalp
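(A minimal sketch of the traffic-teeing idea from Dinesh's note above, again assuming the DataStax Java driver 3.x API. The class is an illustrative placeholder, not existing Cassandra tooling; a production tee would also mirror writes, sample rather than compare every query, and handle paging and timeouts.)

```java
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Tees every read to a candidate cluster and counts result mismatches,
 * keeping the comparison off the production response path.
 */
public class ShadowReadTee {
    private final Session prod;      // serving cluster (e.g. 3.11.x)
    private final Session candidate; // cluster under validation (e.g. 4.0)
    private final ExecutorService shadowPool = Executors.newFixedThreadPool(4);
    private final AtomicLong mismatches = new AtomicLong();

    ShadowReadTee(Session prod, Session candidate) {
        this.prod = prod;
        this.candidate = candidate;
    }

    /** Serves the read from production; mirrors and compares in the background. */
    List<Row> read(String cql) {
        List<Row> prodRows = prod.execute(cql).all();
        String expected = prodRows.toString(); // coarse equality; real tooling compares cell by cell
        shadowPool.submit(() -> {
            String actual = candidate.execute(cql).all().toString();
            if (!actual.equals(expected)) {
                mismatches.incrementAndGet(); // a real harness would log the divergent query
            }
        });
        return prodRows;
    }

    long mismatchCount() {
        return mismatches.get();
    }
}
```

The design point is that the candidate cluster never influences the response served to clients: results are compared asynchronously, so a misbehaving candidate node can only produce a mismatch report, not an outage.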