[DISCUSS] Improving the operational safety and simplicity of in-place major version upgrades

2025-06-10 Thread Paulo Motta
Hi,

One of the most important operational features of Cassandra is how easy it
is (or should be) to do an in-place upgrade. The in-place upgrade procedure
essentially consists of rolling-restarting the cluster while updating the
jar to the new version, following any additional upgrade instructions from
NEWS.txt. In practice, as new features are added and existing features
are extended, the upgrade procedure gets more complex, placing more burden
on operators to ensure a smooth upgrade process.

For example, updating storage_compatibility_mode when upgrading from 4.0 to
5.0 requires 3 cluster-wide restarts[1]. Another example is that upgrading to
Cassandra 6.0 prohibits operations like schema changes, node replacement,
bootstrap, decommission, move and assassinate until all the nodes are
migrated to CMS[2]. I don't want to focus on these particular examples;
they just illustrate that many manual steps and a lot of caution are
required to perform in-place upgrades safely and smoothly.

In order to improve this, I would like to propose extending Cassandra to
allow an operator to register an upgrade intent with the goals of:
a) Tracking the upgrade progress in a system table
b) Verifying the correctness and improving the safety of the upgrade process
c) Restricting capabilities during an upgrade
d) Performing pre- and post-upgrade actions automatically, when registered in
the upgrade plan by the operator

While there is upgrade awareness in the server, it is mostly reactive and
scattered across different modules (as far as I have seen). A potential
side goal of this effort is to centralize upgrade handling code from
different features in the same module, allowing different features to
specify upgrade pre/post actions and conditions more uniformly. This would
allow, for example, developers to specify upgrade constraints via testable
code instead of notes in NEWS.txt that we hope will be read by a careful
operator.
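
To illustrate what that could look like, here is a minimal hypothetical
sketch of a per-feature hook that a central upgrade manager could consult
(the interface, enum and method names are all made up for illustration, not
an existing Cassandra API):

    import java.util.List;
    import java.util.Set;

    // Hypothetical sketch only -- none of these names exist in Cassandra today.
    // The idea: a feature declares its upgrade constraints as testable code that
    // a central upgrade manager consults, instead of prose notes in NEWS.txt.
    public interface UpgradeConstraint
    {
        // Operations a plan may need to restrict while an upgrade is in flight.
        enum Capability { SCHEMA_CHANGES, RANGE_MOVEMENTS, NODE_REPLACEMENT, DECOMMISSION }

        /** Whether this constraint applies to an upgrade from one version to another. */
        boolean appliesTo(String fromVersion, String toVersion);

        /** Capabilities that should be disabled while the upgrade plan is active. */
        Set<Capability> restrictedCapabilities();

        /** Post-upgrade actions (e.g. "upgrade-sstables") to run once every node
         *  reports the target version. */
        List<String> postUpgradeActions();
    }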

The upgrade plan would be registered in a system table and tracked by an
upgrade manager module that would prevent certain operations (ie. range
movements/schema changes) when an upgrade plan is active, or emit
errors/warnings when anomalies are encountered. A few safety/usability
improvements can be enabled when the upgrade plan is registered in the
server, among others:
a) A node could fail startup if it tries to start in a version different
from the one specified in the currently active upgrade plan (a rough sketch
of this check follows the list below).
b) If a latency degradation or other SLO degradation is detected while an
upgrade plan is active, then warnings could be emitted allowing operators
to more easily detect upgrade issues.
c) When the upgrade is determined to be completed successfully, nodes can
coordinate running upgrade-sstables or other post-operations according to a
policy specified in the upgrade plan (ie. by rack/dc).
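
As a rough sketch of check (a) above, assuming an imaginary UpgradePlan
record read from the proposed system table (nothing here is an existing
Cassandra API, and a real check would need to be more nuanced):

    import java.util.Optional;

    // Hypothetical sketch of the startup check in item (a). The UpgradePlan record
    // and its fields are imaginary; in practice the plan would be read from the
    // proposed system table.
    public final class UpgradeStartupGuard
    {
        public record UpgradePlan(String sourceVersion, String targetVersion) {}

        /**
         * Fails startup when an upgrade plan is active and the node is starting on a
         * version that is neither the plan's source nor its target. (Assumption: a
         * real check would likely allow the source version so that not-yet-upgraded
         * nodes can still restart during the rolling upgrade.)
         */
        public static void checkStartupVersion(Optional<UpgradePlan> activePlan,
                                               String nodeVersion)
        {
            if (activePlan.isEmpty())
                return; // no upgrade in progress, nothing to enforce

            UpgradePlan plan = activePlan.get();
            if (!nodeVersion.equals(plan.targetVersion()) && !nodeVersion.equals(plan.sourceVersion()))
                throw new IllegalStateException("Node version " + nodeVersion
                        + " does not match active upgrade plan " + plan.sourceVersion()
                        + " -> " + plan.targetVersion());
        }
    }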

To give an example of what the API would look like, a user wishing to
upgrade a cluster from version 4.1 to 5.0 would register the upgrade
intent via an API, ie.: nodetool upgradeplan create --target 5.0.4
--disable-schema-changes --post-action upgrade-sstables --post-action
upgrade-storage-compatibility-mode. It would not be possible to create
another upgrade plan if there's one currently in progress.
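
Just to make the example concrete, the data such a command could record
might look roughly like this (hypothetical names, simply mirroring the
example flags above):

    import java.util.List;
    import java.util.Set;

    // Hypothetical sketch of the record a "nodetool upgradeplan create" call could
    // persist in the proposed system table. Field names mirror the example flags
    // above; none of this exists in Cassandra today.
    public record UpgradePlanRequest(String targetVersion,
                                     Set<String> disabledCapabilities,
                                     List<String> postActions)
    {
        /** Roughly what the example invocation above would translate to. */
        public static UpgradePlanRequest example()
        {
            return new UpgradePlanRequest("5.0.4",
                                          Set.of("schema-changes"),
                                          List.of("upgrade-sstables",
                                                  "upgrade-storage-compatibility-mode"));
        }
    }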

The ultimate goal is that the upgrade process to any version will be as
simple as registering an upgrade plan, and performing a cluster rolling
restart in the desired target version. Any additional actions would be
autonomously coordinated by the servers based on the upgrade progress and
according to the preferences specified in the upgrade plan.

A related, and probably broader, topic is upgrading features. A couple of
examples that come to mind are upgrading Paxos[2] or migrating to
incremental repair[3]. Like version upgrades, these feature upgrades
require a series of steps to be executed in a determined order, sometimes
with global coordination. While this suggestion focuses on version
upgrades, it could potentially be extended to track feature upgrades.

I would appreciate your feedback on this draft suggestion to check if it
makes sense before elaborating it into a more detailed proposal, as well as
pointers to other efforts or past proposals that might be related to this.

Thanks,

Paulo

[1] - https://github.com/apache/cassandra/blob/trunk/NEWS.txt#L15-L21
[2] - https://github.com/apache/cassandra/blob/trunk/NEWS.txt#L142C1-L148C19
[3] - https://lists.apache.org/thread/06bl99mt502k7lowd5ont9jtnf5p0t05


Re: [DISCUSS] How we handle JDK support

2025-06-10 Thread Josh McKenzie
Was drafting up a VOTE thread and have one thing I want to add - anyone think 
of any major issues w/this I'm missing?

• All supported branches are *built on the oldest JDK they support*. Our 
support model is: build on single JDK, run on oldest up to current LTS

On Sat, Jun 7, 2025, at 10:53 AM, Josh McKenzie wrote:
> Any other thoughts on this? We think we're ready for a [VOTE] thread?
> 
> On Tue, Jun 3, 2025, at 2:07 PM, Josh McKenzie wrote:
>>> 1) As you said, in trunk, we would go with the latest language level 
>>> features and if we backported and these language features are not present 
>>> there because of older JDK, the patch would need to be rewritten to comply.
>> We already have plenty of cases where we have to write a patch for one 
>> branch, merge -s ours, then --amend in a different patch for the subsequent 
>> branch. I would expect this process (late-discovered multi-branch applying 
>> patches) would be lighter weight than that and far less frequent.
>> 
>> My intuition is that the number of patches that:
>>  1. Are bugfixes or non-invasive improvements we think are trunk only so 
>> leverage latest language level features, that
>>  2. are non-trivial in size or scope, and
>>  3. we later determine we need to backport to older branches
>> Is probably going to be vanishingly small. In most cases when working on a 
>> bug or improvement it's pretty clear where it needs to land day 1.
>> 
>> 
>> On Tue, Jun 3, 2025, at 11:36 AM, Štefan Miklošovič wrote:
>>> Hi Josh,
>>> 
>>> One observation, you wrote:
>>> 
>>> "Backports from trunk to older branches that utilize the latest language 
>>> level would need to be refactored to work on older branches."
>>> 
>>> This is interesting. There are two possibilities
>>> 
>>> 1) As you said, in trunk, we would go with the latest language level 
>>> features and if we backported and these language features are not present 
>>> there because of older JDK, the patch would need to be rewritten to comply.
>>> 
>>> 2) The patch would be written in trunk in such a way that it only uses 
>>> language features already present in older branches, so a backport would be 
>>> easier in this regard: the code itself would not need to be reworked if it 
>>> does not otherwise differ much functionally.
>>> 
>>> It seems to me you have already identified what might happen in your older 
>>> emails here:
>>> 
>>> "Our risk would be patches going to trunk targeting new language features 
>>> we then found out we needed to back-port would require some massaging to be 
>>> compatible with older branches. I suspect that'll be a rare edge-case so 
>>> seems ok?"
>>> 
>>> Totally agree.
>>> 
>>> I do not know what approach I would use by default, but thinking about it 
>>> more, I would probably do 2) just for the sake of not rewriting it in older 
>>> branches. A patch might already be complicated enough, and keeping in mind 
>>> to use the newest features in trunk while not using them in older branches 
>>> is just too much mental juggling for me to handle.
>>> 
>>> I guess a discussion about when the usage of the newest language features 
>>> would be recommended and justified will follow. I do not think that 
>>> blindly using the latest and greatest _everywhere_ is always good when 
>>> maintainers need to take care of it for years, in light of backports and 
>>> bug fixes etc.
>>> 
>>> It probably also depends on how complex the patch is and if using the 
>>> newest language level would yield some considerable performance gains etc.
>>> 
>>> Regards
>>> 
>>> On Tue, Jun 3, 2025 at 4:38 PM Josh McKenzie  wrote:
 I was attempting to convey the language level support for a given branch 
 at any snapshot in time; my reading of Doug's email was for something that 
 shouldn't be a problem with this new paradigm (since we would be doing 
 bugfixes targeting oldest then merge up, and new improvements and features 
 should be going trunk only).
 
 Flowing through time, here's what a new release looks like at time of 
 release. I'll format it as:
 C* Version / JDK Build Version / JDK Run Version(s) / Language Level
 
 For our first release:
  • C* 1.0 / 11 / 11 / 11
 Then we release the next C* version; for sake of illustration let's assume 
 a new JDK LTS 12:
  • C* 2.0 / 12 / 12 / 12
  • C* 1.0 / 11 / 11+12 / 11
 Notice: we added the ability to *run* C* 1.0 on JDK12 and otherwise didn't 
 modify the properties of the branch re: JDK's.
 
 Then our 3rd release of C*, same assumption w/JDK LTS 13 support added:
  • C* 3.0 / 13 / 13 / 13
  • C* 2.0 / 12 / 12+13 / 12
  • C* 1.0 / 11 / 11+12+13 / 11
 The ability to run on the new JDK13 is backported to all supported 
 branches. Otherwise: no JDK-related changes.
 
 And upon our 4th release, we drop support for C*1.0:
  • C* 4.0 / 14 / 14 / 14
  • C* 3.0 / 13 / 13+14 / 13
 • C* 2.0 / 12 / 12+13+14 / 12

Re: [DISCUSS] How we handle JDK support

2025-06-10 Thread Brandon Williams
This makes sense to me. The release scripts confirm the JDK being used
now, and this was the impetus to add that.

Kind Regards,
Brandon

On Tue, Jun 10, 2025 at 8:09 AM Josh McKenzie  wrote:
>
> Was drafting up a VOTE thread and have one thing I want to add - anyone think 
> of any major issues w/this I'm missing?
>
> • All supported branches are built on the oldest JDK they support. Our 
> support model is: build on single JDK, run on oldest up to current LTS

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-10 Thread Blake Eggleston
>  Extra row in MV (assuming the tombstone is gone in the base table) — how 
> should we fix this?
> 

This would mean that the base table had either updated or deleted a row and the 
view didn't receive the corresponding delete. 

In the case of a missed update, we'll have a new value and we can send a 
tombstone to the view with the timestamp of the most recent update. Since 
timestamps issued by paxos and accord writes are monotonically increasing 
and don't have collisions, this is safe. 

In the case of a row deletion, we'd also want to send a tombstone with the same 
timestamp; however, since tombstones can be purged, we may not have that 
information and would have to treat it like the view has a higher timestamp 
than the base table.
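
To make this concrete, here is a rough sketch of the two cases above, under
the assumption that the base replica can look up the relevant timestamps
(all names are illustrative, not CEP-48 code):

    import java.util.Optional;

    // Hypothetical sketch of the two "extra row in the view" cases described above.
    // Plain longs stand in for write timestamps; nothing here is CEP-48 code.
    public final class ExtraViewRowRepair
    {
        public record ViewTombstone(String viewKey, long timestamp) {}

        /**
         * Missed update: the base row still exists with a newer value, so shadow the
         * stale view row at the most recent base update timestamp. Safe because
         * paxos/accord timestamps are monotonically increasing and collision-free.
         */
        public static ViewTombstone forMissedUpdate(String staleViewKey, long latestBaseUpdateTs)
        {
            return new ViewTombstone(staleViewKey, latestBaseUpdateTs);
        }

        /**
         * Base row deleted: reuse the base deletion timestamp if it is still known;
         * if the tombstone was already purged there is no timestamp to use, and the
         * row has to be handled as if the view had a higher timestamp than the base.
         */
        public static Optional<ViewTombstone> forDeletedBaseRow(String staleViewKey,
                                                                Optional<Long> baseDeletionTs)
        {
            return baseDeletionTs.map(ts -> new ViewTombstone(staleViewKey, ts));
        }
    }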

> Inconsistency (timestamps don’t match) — it’s easy to fix when the base table 
> has higher timestamps, but how do we resolve it when the MV columns have 
> higher timestamps?
> 

There are 2 ways this could happen. The first is that a write failed and paxos 
repair hasn't completed it, which is expected; the second is a replication 
bug or base table data loss. You'd need to compare the view timestamp to the 
paxos repair history to tell which it is. If the view timestamp is higher than 
the most recent paxos repair timestamp for the key, then it may just be a 
failed write and we should do nothing. If the view timestamp is less than the 
most recent paxos repair timestamp for that key and higher than the base 
timestamp, then something has gone wrong and we should issue a tombstone using 
the paxos repair timestamp as the tombstone timestamp. This is safe to do 
because the paxos repair timestamps act as a lower bound for ballots paxos will 
process, so it wouldn't be possible for a legitimate write to be shadowed by 
this tombstone.
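
A compact sketch of that decision logic, assuming we have the base
timestamp, the view timestamp and the most recent paxos repair timestamp
for the key at hand (illustrative only, not taken from the CEP-48 branch):

    // Hypothetical sketch of the timestamp-mismatch resolution described above.
    // Timestamps are plain longs; none of these names come from the CEP-48 branch.
    public final class ViewTimestampReconciler
    {
        public enum Action { PROPAGATE_BASE_ROW, DO_NOTHING, ISSUE_VIEW_TOMBSTONE }

        public static Action resolve(long baseTimestamp,
                                     long viewTimestamp,
                                     long latestPaxosRepairTimestamp)
        {
            // Base is newer or equal: straightforward, re-propagate the base data.
            if (baseTimestamp >= viewTimestamp)
                return Action.PROPAGATE_BASE_ROW;

            // View is newer than the latest paxos repair for this key: possibly just
            // a failed write that paxos repair hasn't completed yet, so do nothing.
            if (viewTimestamp > latestPaxosRepairTimestamp)
                return Action.DO_NOTHING;

            // View timestamp is above the base but below the paxos repair low-water
            // mark: something went wrong. Issue a tombstone at the paxos repair
            // timestamp; since that timestamp is a lower bound for ballots paxos
            // will process, no legitimate write can be shadowed by it.
            return Action.ISSUE_VIEW_TOMBSTONE;
        }
    }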

> Do we need to introduce a new kind of tombstone to shadow the rows in the MV 
> for cases 2 and 3? If yes, how will this tombstone work? If no, how should we 
> fix the MV data?
> 

No, a normal tombstone would work.


Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-10 Thread Runtian Liu
Okay, let’s put the efficiency discussion on hold for now. I want to make
sure the actual repair process after detecting inconsistencies will work
with the index-based solution.

When a mismatch is detected, the MV replica will need to stream its index
file to the base table replica. The base table will then perform a
comparison between the two files.

There are three cases we need to handle:

   1. Missing row in MV — this is straightforward; we can propagate the data
   to the MV.

   2. Extra row in MV (assuming the tombstone is gone in the base table) — how
   should we fix this?

   3. Inconsistency (timestamps don’t match) — it’s easy to fix when the base
   table has higher timestamps, but how do we resolve it when the MV columns
   have higher timestamps?

Do we need to introduce a new kind of tombstone to shadow the rows in the
MV for cases 2 and 3? If yes, how will this tombstone work? If no, how
should we fix the MV data?

On Mon, Jun 9, 2025 at 11:00 AM Blake Eggleston 
wrote:

> > hopefully we can come up with a solution that everyone agrees on.
>
> I’m sure we can, I think we’ve been making good progress
>
> > My main concern with the index-based solution is the overhead it adds to
> the hot path, as well as having to build indexes periodically.
>
> So the additional overhead of maintaining a storage attached index on the
> client write path is pretty minimal - it’s basically adding data to an in
> memory trie. It’s a little extra work and memory usage, but there isn’t any
> extra io or other blocking associated with it. I’d expect the latency
> impact to be negligible.
>
> > As mentioned earlier, this MV repair should be an infrequent operation
>
> I don’t think that’s a safe assumption. There are a lot of situations
> outside of data loss bugs where repair would need to be run.
>
> These use cases could probably be handled by repairing the view with other
> view replicas:
>
> Scrubbing corrupt sstables
> Node replacement via backup
>
> These use cases would need an actual MV repair to check consistency with
> the base table:
>
> Restoring a cluster from a backup
> Imported sstables via nodetool import
> Data loss from operator error
> Proactive consistency checks - ie preview repairs
>
> Even if it is an infrequent operation, when operators need it, it needs to
> be available and reliable.
>
> It’s a fact that there are clusters where non-incremental repairs are run
> on a cadence of a week or more to manage the overhead of validation
> compactions. Assuming the cluster doesn’t have any additional headroom,
> that would mean that any one of the above events could cause views to
> remain out of sync for up to a week while the full set of merkle trees is
> being built.
>
> This delay eliminates a lot of the value of repair as a risk mitigation
> tool. If I had to make a recommendation where a bad call could cost me my
> job, the prospect of a 7 day delay on repair would mean a strong no.
>
> Some users also run preview repair continuously to detect data consistency
> errors, so at least a subset of users will probably be running MV repairs
> continuously - at least in preview mode.
>
> That’s why I say that the replication path should be designed to never
> need repair, and MV repair should be designed to be prepared for the worst.
>
> > I’m wondering if it’s possible to enable or disable index building
> dynamically so that we don’t always incur the cost for something that’s
> rarely needed.
>
> I think this would be a really reasonable compromise as long as the
> default is on. That way it’s as safe as possible by default, but users who
> don’t care or have a separate system for repairing MVs can opt out.
>
> > I’m not sure what you mean by “data problems” here.
>
> I mean out of sync views - either due to bugs, operator error, corruption,
> etc
>
> > Also, this does scale with cluster size—I’ve compared it to full repair,
> and this MV repair should behave similarly. That means as long as full
> repair works, this repair should work as well.
>
> You could build the merkle trees at about the same cost as a full repair,
> but the actual data repair path is completely different for MV, and that’s
> the part that doesn’t scale well. As you know, with normal repair, we just
> stream data for ranges detected as out of sync. For MVs, since the data
> isn’t in base partition order, the view data for an out of sync view range
> needs to be read out and streamed to every base replica that it’s detected
> a mismatch against. So in the example I gave with the 300 node cluster,
> you’re looking at reading and transmitting the same partition at least 100
> times in the best case, and the cost of this keeps going up as the cluster
> increases in size. That's the part that doesn't scale well.
>
> This is also one of the benefits of the index design. Since it stores data in
> segments that roughly correspond to points on the grid, you’re not
> rereading the same data over and over. A repair for a g