Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-21 Thread Benedict
Depending how long the grid structure takes to build, there is perhaps anyway value in being able to update the snapshot after construction, so that when the repair is performed it is as up to date as possible. But, I don’t think this is trivial? I have some ideas how this might be done but they aren’t ideal, and could be costly or error prone. Have you sketched out a mechanism for this?

I like the idea that the same mechanism could be used to build a one-off snapshot as maintain a live one, leaving the operator to decide what they prefer. Since this seems like an extension to the original proposal, I would suggest the original proposal is advanced and live updates to the snapshot are developed in follow-up work.

On 21 May 2025, at 17:45, Blake Eggleston wrote:

1. Isn't this hybrid approach conceptually similar to the grid structure described in the proposal? The main distinction is that the original proposal involves recomputing the entire grid during each repair cycle. In contrast, the approach outlined below optimizes this by only reconstructing individual cells marked as dirty due to recent updates.

Yes, it is. FWIW I think the grid structure is the right approach conceptually, but it has some drawbacks as proposed that I think we should try to improve. The simple index approach takes the same high level approach with a different set of tradeoffs.

2. If we adopt the dirty marker approach, it may not account for cases where there were no user writes, but inconsistencies still arose between the base and the view—such as those caused by SSTable bit rot, streaming anomalies, or other low-level issues.

That's true. You could force a rebuild of the marker table if you knew there was a problem, and it might even be a good idea to have a process that slowly rebuilds the table in the background. The important part is that there's a low upfront cost to starting a repair, and that individual range pairs can be repaired quickly (by repair standards).

Another advantage to having a more lightweight repair mechanism is that we can tolerate riskier client patterns. For instance, if we went with a paxos based approach, I think LOCAL_SERIAL writes would become acceptable, whereas I think we'd only be able to allow SERIAL with the heavier repair.

On Tue, May 20, 2025, at 8:37 PM, Jaydeep Chovatia wrote:

1. Isn't this hybrid approach conceptually similar to the grid structure described in the proposal? The main distinction is that the original proposal involves recomputing the entire grid during each repair cycle. In contrast, the approach outlined below optimizes this by only reconstructing individual cells marked as dirty due to recent updates.

2. If we adopt the dirty marker approach, it may not account for cases where there were no user writes, but inconsistencies still arose between the base and the view—such as those caused by SSTable bit rot, streaming anomalies, or other low-level issues.

Jaydeep

On Tue, May 20, 2025 at 5:31 PM Blake Eggleston wrote:

I had an idea that’s a kind of a hybrid between the index approach and the merkle tree approach. Basically we keep something kind of like an index that only contains a hash of the data between a base partition and view partition intersection. So it would structure data like this:

view_range -> base_token -> view_token -> contents_hash

Both the base and view would maintain identical structures, and instead of trying to keep the hash always up to date with the data on disk, like an index, we would just mark base/view token combos as dirty when we get a write to a given base/view token combo. When we do a repair on a base/view range intersection, we recompute the content hashes for any dirty entries. Possibly with a background job that makes sure we don’t accumulate too many dirty entries if repair isn’t running often or something.

So that’s 3 longs for each base/view intersection, comparable via sequential reads, and would allow us to quickly detect any inconsistencies between the base and view.

On Tue, May 20, 2025, at 2:33 PM, Jaydeep Chovatia wrote:

>* Consistency question: In the case where a base table gets a corrupt SSTable and is scrubbed, when it repairs against the view, without tracking the deletes against the secondary table, do we end up pushing the lack of data into the MV?
>I think we'd still need to combine the output from the other replicas so that doesn't happen.

SSTable corruption is one potential issue, but there are additional scenarios to consider—such as a node missing data. For this reason, the example in the proposal includes all replicas when performing materialized view (MV) repair, as a precaution to ensure better Base<->MV consistency. Here's a snippet...

Jaydeep

On Tue, May 20, 2025 at 1:55 PM Blake Eggleston wrote:

* Consistency question: In the case where a base table gets a corrupt SSTable and is scrubbed, when it repairs against the view, without tracking the deletes against the secondary table, do we end up push
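
As a rough illustration of the marker structure Blake sketches above (this is not CEP-48 code; the class and method names are invented for the example), each base/view intersection reduces to three longs plus a dirty flag, and two token-ordered marker streams can be compared with a single sequential merge:

    // Illustrative sketch only, not CEP-48 code; names are hypothetical.
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    final class MarkerSketch
    {
        // One entry per base/view token intersection: "3 longs" plus a dirty flag.
        // Entries marked dirty would have contentsHash recomputed before comparing.
        record Entry(long baseToken, long viewToken, long contentsHash, boolean dirty) {}

        // Walk two streams of entries sorted by (baseToken, viewToken) and collect the
        // intersections whose hashes disagree, or that only one side knows about.
        static List<Entry> mismatches(Iterator<Entry> base, Iterator<Entry> view)
        {
            List<Entry> out = new ArrayList<>();
            Entry b = base.hasNext() ? base.next() : null;
            Entry v = view.hasNext() ? view.next() : null;
            while (b != null || v != null)
            {
                int cmp = b == null ? 1 : v == null ? -1 : compareKeys(b, v);
                if (cmp == 0)
                {
                    if (b.contentsHash() != v.contentsHash())
                        out.add(b); // same intersection, different contents -> repair this range pair
                    b = base.hasNext() ? base.next() : null;
                    v = view.hasNext() ? view.next() : null;
                }
                else if (cmp < 0)
                {
                    out.add(b); // base has an intersection the view doesn't know about
                    b = base.hasNext() ? base.next() : null;
                }
                else
                {
                    out.add(v); // view has an intersection the base doesn't know about
                    v = view.hasNext() ? view.next() : null;
                }
            }
            return out;
        }

        private static int compareKeys(Entry a, Entry b)
        {
            int c = Long.compare(a.baseToken(), b.baseToken());
            return c != 0 ? c : Long.compare(a.viewToken(), b.viewToken());
        }
    }

Since both sides keep the entries sorted by token, the comparison is a pair of sequential reads rather than a scan of the underlying data.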

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-21 Thread Jon Haddad
I agree with Blake.  These are perfectly reasonable discussions to have up
front.

Snapshot-based repair has a huge downside in that you're repairing data
that's days or weeks old.  There are going to be issues that arise from
that, especially since the deletes that are recorded on the MV aren't going
to be stored anywhere else.  Jeff's brought up the tombstone consistency
problem a few times; I can't recall having seen an answer to that.  If
we're trying to make these production ready, they need to actually work
correctly.  The time to complete a repair *has* to be reasonable as well,
because if it does take weeks, then we're basically saying backup / restore
and MVs can't be used together due to the time to restore.  Either that, or
we need to solve the problem of eventually consistent backups.

To repeat myself from an earlier message in the thread:

"Add acceptance criteria added to the CEP that this has been tested at a
reasonably large scale, preferably with a base table dataset of *at least*
100TB (preferably more) & significant variance in both base & MV partition
size, prior to merge. "

Having *some* acceptance criteria, for me, is non-negotiable.

I still don't think building anything new on Paxos makes any sense at all,
given that this would go into C* 7.0, not 6.0.  If Accord isn't ready for
something like this (I think it's kind of the perfect use case) then we
should be having a different discussion - one that involves removing
Accord.

If we were to bring this to a vote today, I'm afraid I'd be deciding
between a -.9 (extreme disapproval but non-blocking) and a -1 for my
concerns around correctness, use of global, definitely stale snapshots, and
the choice of Paxos over Accord.  I think there's a lot of really good
stuff in the proposal, but what I'm seeing in there doesn't look like
something I would recommend people use.

Jon

On Wed, May 21, 2025 at 1:00 PM Blake Eggleston 
wrote:

> I don’t think it’s trivial, but I also don’t think it’s any more difficult
> than adding a mechanism to snapshot and build this merkle tree grid.
> Remember, we can’t just start a full cluster wide table scan at full blast
> every time we want to start a new repair cycle. There’s going to need to be
> some gradual build coordination.
>
> I also don’t think this belongs in a follow on task. I think that the
> original proposal is incomplete without a better repair story and I’m not
> sure I’d support the cep if it proceeded as is.
>
> On Wed, May 21, 2025, at 12:22 PM, Benedict wrote:
>
>
> Depending how long the grid structure takes to build, there is perhaps
> anyway value in being able to update the snapshot after construction, so
> that when the repair is performed it is as up to date as possible. But, I
> don’t think this is trivial? I have some ideas how this might be done but
> they aren’t ideal, and could be costly or error prone. Have you sketched
> out a mechanism for this?
>
> I like the idea that the same mechanism could be used to build a one-off
> snapshot as maintain a live one, leaving the operator to decide what they
> prefer.
>
> Since this seems like an extension to the original proposal, I would
> suggest the original proposal is advanced and live updates to the snapshot
> are developed in follow-up work.
>
>
> On 21 May 2025, at 17:45, Blake Eggleston  wrote:
>
> 
>
> 1. Isn't this hybrid approach conceptually similar to the grid structure
> described in the proposal? The main distinction is that the original
> proposal involves recomputing the entire grid during each repair cycle. In
> contrast, the approach outlined below optimizes this by only reconstructing
> individual cells marked as dirty due to recent updates.
>
>
> Yes, it is. FWIW I think the grid structure is the right approach
> conceptually, but it has some drawbacks as proposed that I think we should
> try to improve. The simple index approach takes the same high level
> approach with a different set of tradeoffs.
>
> 2. If we adopt the dirty marker approach, it may not account for cases
> where there were no user writes, but inconsistencies still arose between
> the base and the view—such as those caused by SSTable bit rot, streaming
> anomalies, or other low-level issues.
>
>
> That's true. You could force a rebuild of the marker table if you knew
> there was a problem, and it might even be a good idea to have a process
> that slowly rebuilds the table in the background. The important part is
> that there's a low upfront cost to starting a repair, and that individual
> range pairs can be repaired quickly (by repair standards).
>
> Another advantage to having a more lightweight repair mechanism is that we
> can tolerate riskier client patterns. For instance, if we went with a paxos
> based approach, I think LOCAL_SERIAL writes would become acceptable,
> whereas I think we'd only be able to allow SERIAL with the heavier repair.
>
> On Tue, May 20, 2025, at 8:37 PM, Jaydeep Chovatia wrote:
>
> 1. Isn't this hybrid approach conceptually

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-21 Thread Benedict
It’s an additional piece of work. If you need to be able to rebuild this data, then you need the original proposal either way. This proposal to maintain a live updating snapshot is therefore an additional feature on top of the MVP proposed.

I don’t think this new proposal is fully fleshed out; I have a lot more questions about it than I do about the original proposal. I also don’t think it is healthy to weigh down other contributors’ proposals that move the state of the database forwards with our own design goals, without very strong justification.

I think it is fine to say to users: repair exists, it shouldn’t ordinarily need to be run, but if you do it will for now require a snapshotting process. This would be acceptable to me as an operator (workload dependent), and I’m sure it would be acceptable to others, including the contributors undertaking the work.

In the meantime there’s time to sketch out how such an online update process would work in more detail. I think it is achievable but much less obvious than building a snapshot.

On 21 May 2025, at 21:00, Blake Eggleston wrote:

I don’t think it’s trivial, but I also don’t think it’s any more difficult than adding a mechanism to snapshot and build this merkle tree grid. Remember, we can’t just start a full cluster wide table scan at full blast every time we want to start a new repair cycle. There’s going to need to be some gradual build coordination.

I also don’t think this belongs in a follow on task. I think that the original proposal is incomplete without a better repair story and I’m not sure I’d support the cep if it proceeded as is.

On Wed, May 21, 2025, at 12:22 PM, Benedict wrote:

Depending how long the grid structure takes to build, there is perhaps anyway value in being able to update the snapshot after construction, so that when the repair is performed it is as up to date as possible. But, I don’t think this is trivial? I have some ideas how this might be done but they aren’t ideal, and could be costly or error prone. Have you sketched out a mechanism for this?

I like the idea that the same mechanism could be used to build a one-off snapshot as maintain a live one, leaving the operator to decide what they prefer. Since this seems like an extension to the original proposal, I would suggest the original proposal is advanced and live updates to the snapshot are developed in follow-up work.

On 21 May 2025, at 17:45, Blake Eggleston wrote:

1. Isn't this hybrid approach conceptually similar to the grid structure described in the proposal? The main distinction is that the original proposal involves recomputing the entire grid during each repair cycle. In contrast, the approach outlined below optimizes this by only reconstructing individual cells marked as dirty due to recent updates.

Yes, it is. FWIW I think the grid structure is the right approach conceptually, but it has some drawbacks as proposed that I think we should try to improve. The simple index approach takes the same high level approach with a different set of tradeoffs.

2. If we adopt the dirty marker approach, it may not account for cases where there were no user writes, but inconsistencies still arose between the base and the view—such as those caused by SSTable bit rot, streaming anomalies, or other low-level issues.

That's true. You could force a rebuild of the marker table if you knew there was a problem, and it might even be a good idea to have a process that slowly rebuilds the table in the background. The important part is that there's a low upfront cost to starting a repair, and that individual range pairs can be repaired quickly (by repair standards).

Another advantage to having a more lightweight repair mechanism is that we can tolerate riskier client patterns. For instance, if we went with a paxos based approach, I think LOCAL_SERIAL writes would become acceptable, whereas I think we'd only be able to allow SERIAL with the heavier repair.

On Tue, May 20, 2025, at 8:37 PM, Jaydeep Chovatia wrote:

1. Isn't this hybrid approach conceptually similar to the grid structure described in the proposal? The main distinction is that the original proposal involves recomputing the entire grid during each repair cycle. In contrast, the approach outlined below optimizes this by only reconstructing individual cells marked as dirty due to recent updates.

2. If we adopt the dirty marker approach, it may not account for cases where there were no user writes, but inconsistencies still arose between the base and the view—such as those caused by SSTable bit rot, streaming anomalies, or other low-level issues.

Jaydeep

On Tue, May 20, 2025 at 5:31 PM Blake Eggleston wrote:

I had an idea that’s a kind of a hybrid between the index approach and the merkle tree approach. Basically we keep something kind of like an index that only contains a hash of the data between a base partition and view partition intersection. So it would structure data like this:

view_range -> base_token -> view_token -> contents

Re: [DISCUSS] How we handle JDK support

2025-05-21 Thread Ekaterina Dimitrova
“I'm curious what this raises for you.”

A few points that come to mind:

- every time we switch/add JDKs we also need to do a bunch of changes in CI
systems, ccm, etc, not only C* - so more work to call out. Also, if we make
older versions support newer JDK, I guess we need to ensure drivers, etc
will support it too probably? Are we discussing JDK support here only for
Cassandra repo?
- very often we need to bump library versions to support newer JDK versions
but at the same time we try not to upgrade dependencies in patch release;
only if it is bug related, in most cases
- whether it is a lot of work or not to backport, I’d say it depends. My
assumption is that if we keep our maintenance regularly going (which we
missed with the long development cycle of 4.0) - it is more feasible.
Though we know that we removed a whole feature to move to JDK17 quicker -
the scripted UDFs. If we have similar needs at any time - we can’t do such
breaking changes in a patch release.
- Benedict made a great point on performance changes with JDK upgrades - we
do not have regular performance testing so probably introducing a new JDK
in a patch version will come with a huge warning - test thoroughly and move
to prod at your own judgement or something like that.

I guess there are more things to consider but these are immediate things
that come to my mind now.

Best regards,
Ekaterina

On Wed, 21 May 2025 at 10:31, Josh McKenzie  wrote:

> Lessons learned from advancing JDK support on trunk *should* translate
> into older branches making that effort much smaller; Ekaterina you have a
> lot of experience here so I'm curious what this raises for you. I like the
> productivity implications of us being able to adopt new language features
> faster on trunk; I think this is a solid evolution of the idea, definitely.
>
> Distilling to bulleted lists to try and snapshot the state of the thread
> w/the above proposal:
>
> *[New LTS JDK Adoption]*
>
>- Trunk supports 1 JDK at a time
>- That JDK will be the GA LTS the day we cut a frozen branch for a new
>major (i.e. from moment of previous release bifurcation, trunk snapshots
>the JDK at that moment). Obviously there will be some flexibility here in
>terms of when the work lands on trunk and supporting on other branches, but
>the general pattern / intent hold - push to snapshot latest GA LTS JDK on
>trunk ASAP after branching for a major.
>- Trunk targets the language level of that JDK
>- CI on trunk is that single JDK only
>- We merge new JDK LTS support to all supported branches at the same
>time as trunk
>- We up the supported language level for all supported branches to the
>latest supported JDK at this time
>- We don't need to worry about dropping JDK support as that will
>happen naturally w/the dropping of support for a branch. Branches will
>slowly gain JDK support w/each subsequent trunk-based LTS integration.
>
> *[Branch JDK Support]*
>
>- N-2: JDK, JDK-1, JDK-2
>- N-1: JDK, JDK-1
>- N: JDK
>
> *[CI, JDK's, Upgrades]*
>
>- CI:
>   - For each branch we run per-commit CI for the latest JDK they
>   support
>   - Periodically we run all CI pipelines for older JDK's per-branch
>   (cadence TBD)
>- Upgrades
>   - N-2 -> N-1: tested on JDK and JDK-1
>   - N-2 -> N: tested on JDK
>   - N-1 -> N: tested on JDK
>
> That'd give us 4 upgrade paths we'd need to support and test which feels
> like it's in the territory of "doable on each commit" if we limit the
> upgrade tests to the in-jvm variety and let the periodic run capture the
> python upgrade tests space.
>
> On Wed, May 21, 2025, at 9:30 AM, Benedict wrote:
>
>
> Perhaps we should consider back porting support for newer Java LTS
> releases to older C* versions, and suggesting users upgrade JDK first. This
> way we can have trunk always on the latest LTS, advancing language feature
> support more quickly.
>
> That is, we would have something like
>
> N-2: JDK, JDK-1, JDK-2
> N-1: JDK, JDK-1
> N: JDK
>
> I think to assist those deploying trunk and reduce churn for development,
> we might only want to advance the LTS version for trunk after we release a
> new major, fixing the next release’s Java version at that point.
>
>
>- On 21 May 2025, at 13:57, Josh McKenzie  wrote:
>
> 
>
> You don’t have to run every suite on every commit since as folks have
> pointed out for the most part the JVM isn’t culprit. Need to run it enough
> times to catch when it is for some assumption of “enough”.
>
> So riffing on this. We could move to something like:
>
>- For each given supported C* branch, confirm it *builds *on all
>supported JDKs (pre-commit verification, post-commit reactive runs)
>- Constrain language level on any given C* branch to *lowest supported
>JDK*
>- Run all reactive post-commit CI pipelines against *the highest
>supported JDK only*
>- Once a N (day, week, month?), run all pipelines ag

Re: [DISCUSS] How we handle JDK support

2025-05-21 Thread Benedict
Yes the issue of Nashorn did spring to mind, but as I recall this was an optional feature. I don’t remember how hard it would have been to simply declare the feature unavailable if you use the newer JDK, but my vague recollection is the hard part was primarily finding a suitable replacement.

We may well hit similar issues in future, some perhaps even harder to surmount, but I’m sure we can address them as they come on a case by case basis. Worst case we have to postpone the migration by one major for any deprecation to take effect.

On 21 May 2025, at 19:57, Ekaterina Dimitrova wrote:

“I'm curious what this raises for you.”

A few points that come to mind:

- every time we switch/add JDKs we also need to do a bunch of changes in CI systems, ccm, etc, not only C* - so more work to call out. Also, if we make older versions support newer JDK, I guess we need to ensure drivers, etc will support it too probably? Are we discussing JDK support here only for Cassandra repo?
- very often we need to bump library versions to support newer JDK versions but at the same time we try not to upgrade dependencies in patch release; only if it is bug related, in most cases
- whether it is a lot of work or not to backport, I’d say it depends. My assumption is that if we keep our maintenance regularly going (which we missed with the long development cycle of 4.0) - it is more feasible. Though we know that we removed a whole feature to move to JDK17 quicker - the scripted UDFs. If we have similar needs at any time - we can’t do such breaking changes in a patch release.
- Benedict made a great point on performance changes with JDK upgrades - we do not have regular performance testing so probably introducing a new JDK in a patch version will come with a huge warning - test thoroughly and move to prod at your own judgement or something like that.

I guess there are more things to consider but these are immediate things that come to my mind now.

Best regards,
Ekaterina

On Wed, 21 May 2025 at 10:31, Josh McKenzie wrote:

Lessons learned from advancing JDK support on trunk should translate into older branches making that effort much smaller; Ekaterina you have a lot of experience here so I'm curious what this raises for you. I like the productivity implications of us being able to adopt new language features faster on trunk; I think this is a solid evolution of the idea, definitely.

Distilling to bulleted lists to try and snapshot the state of the thread w/the above proposal:

[New LTS JDK Adoption]
- Trunk supports 1 JDK at a time
- That JDK will be the GA LTS the day we cut a frozen branch for a new major (i.e. from moment of previous release bifurcation, trunk snapshots the JDK at that moment). Obviously there will be some flexibility here in terms of when the work lands on trunk and supporting on other branches, but the general pattern / intent hold - push to snapshot latest GA LTS JDK on trunk ASAP after branching for a major.
- Trunk targets the language level of that JDK
- CI on trunk is that single JDK only
- We merge new JDK LTS support to all supported branches at the same time as trunk
- We up the supported language level for all supported branches to the latest supported JDK at this time
- We don't need to worry about dropping JDK support as that will happen naturally w/the dropping of support for a branch. Branches will slowly gain JDK support w/each subsequent trunk-based LTS integration.

[Branch JDK Support]
- N-2: JDK, JDK-1, JDK-2
- N-1: JDK, JDK-1
- N: JDK

[CI, JDK's, Upgrades]
- CI:
  - For each branch we run per-commit CI for the latest JDK they support
  - Periodically we run all CI pipelines for older JDK's per-branch (cadence TBD)
- Upgrades
  - N-2 -> N-1: tested on JDK and JDK-1
  - N-2 -> N: tested on JDK
  - N-1 -> N: tested on JDK

That'd give us 4 upgrade paths we'd need to support and test which feels like it's in the territory of "doable on each commit" if we limit the upgrade tests to the in-jvm variety and let the periodic run capture the python upgrade tests space.

On Wed, May 21, 2025, at 9:30 AM, Benedict wrote:

Perhaps we should consider back porting support for newer Java LTS releases to older C* versions, and suggesting users upgrade JDK first. This way we can have trunk always on the latest LTS, advancing language feature support more quickly.

That is, we would have something like

N-2: JDK, JDK-1, JDK-2
N-1: JDK, JDK-1
N: JDK

I think to assist those deploying trunk and reduce churn for development, we might only want to advance the LTS version for trunk after we release a new major, fixing the next release’s Java version at that point.

On 21 May 2025, at 13:57, Josh McKenzie wrote:

You don’t have to run every suite on every commit since as folks have pointed out for the most part the JVM isn’t culprit. Need to run it enough times to catch when it is for some assumption of “enough”.

So riffing on this. We could move to something like:

- For each given supported C* branch, confirm it builds on all supported JDKs (pre-commit verification, pos

Re: [DISCUSS] How we handle JDK support

2025-05-21 Thread Mick Semb Wever
   .

>
> For the rare edge case where we have to stop supporting something entirely
> because it's incompatible with a JDK release (has this happened more than
> the 1 time?) - I think a reasonable fallback is to just not backport new
> JDK support and consider carrying forward the older JDK support until the
> release w/the feature in it is EoL'ed. That'd allow us to continue to run
> in-jvm upgrade dtests between the versions on the older JDK.
>


This.
I think the idea of adding new major JDKs to release branches is a good one
for a number of reasons, in theory at least.  In practice I think this would
have to be discussed and evaluated for each new JDK and release branch.

And this would mean we can drop a JDK in trunk once it's no longer the
latest in all latest maintained patch versions.


Re: [DISCUSS] How we handle JDK support

2025-05-21 Thread Ekaterina Dimitrova
Benedict, I am not sure what you mean by optional feature. FWIW we
cannot compile cassandra-4.1 until we removed the feature in cassandra-5.0.
I, as a user, would be very disappointed for a feature to be removed in a
patch release.

Yes, replacing nashorn was the unpleasant part. I did not raise the nashorn
part as if removing the scripted UDFs was a hard technical task, but more
to flag we wouldn’t want to make such breaking changes in patch releases.

“We may well hit similar issues in future, some perhaps even harder to
surmount, but I’m sure we can address them as they come on a case by case
basis. Worst case we have to postpone the migration by one major for any
deprecation to take effect.”

Agreed, though the lack of performance testing still stands for me.

I just got reminded - there was also some time format issue with JDK11 that
Scott mentioned before, if I remember correctly.

So yeah, these are the type of things we may have in front of us. Also, I
can’t wait to find a replacement for jamm so we don’t have to think of it
anymore.

On Wed, 21 May 2025 at 15:17, Benedict  wrote:

> Yes the issue of Nashorn did spring to mind, but as I recall this was an
> optional feature. I don’t remember how hard it would have been to simply
> declare the feature unavailable if you use the newer JDK, but my vague
> recollection is the hard part was primarily finding a suitable replacement.
>
> We may well hit similar issues in future, some perhaps even harder to
> surmount, but I’m sure we can address them as they come on a case by case
> basis. Worst case we have to postpone the migration by one major for any
> deprecation to take effect.
>
> On 21 May 2025, at 19:57, Ekaterina Dimitrova 
> wrote:
>
> 
>
> “I'm curious what this raises for you. “
>
> A few points that come to mind:
>
> - every time we switch/add JDKs we also need to do a bunch of changes in
> CI systems, ccm, etc, not only C* - so more work to call out. Also, if we
> make older versions support newer JDK, I guess we need to ensure drivers,
> etc will support it too probably? Are we discussing JDK support here only
> for Cassandra repo?
> - very often we need to bump library versions to support newer JDK
> versions but at the same time we try not to upgrade dependencies in patch
> release; only if it is bug related, in most cases
> - whether it is a lot of work or not to backport, I’d say it depends. My
> assumption is that if we keep our maintenance regularly going (which we
> missed with the long development cycle of 4.0) - it is more feasible.
> Though we know that we removed a whole feature to move to JDK17 quicker -
> the scripted UDFs. If we have similar needs at any time - we can’t do such
> breaking changes in a patch release.
> - Benedict made a great point on performance changes with JDK upgrades -
> we do not have regular performance testing so probably introducing a new
> JDK in a patch version will come with a huge warning - test thoroughly and
> move to prod at your own judgement or something like that.
>
> I guess there are more things to consider but these are immediate things
> that come to my mind now.
>
> Best regards,
> Ekaterina
>
> On Wed, 21 May 2025 at 10:31, Josh McKenzie  wrote:
>
>> Lessons learned from advancing JDK support on trunk *should* translate
>> into older branches making that effort much smaller; Ekaterina you have a
>> lot of experience here so I'm curious what this raises for you. I like the
>> productivity implications of us being able to adopt new language features
>> faster on trunk; I think this is a solid evolution of the idea, definitely.
>>
>> Distilling to bulleted lists to try and snapshot the state of the thread
>> w/the above proposal:
>>
>> *[New LTS JDK Adoption]*
>>
>>- Trunk supports 1 JDK at a time
>>- That JDK will be the GA LTS the day we cut a frozen branch for a
>>new major (i.e. from moment of previous release bifurcation, trunk
>>snapshots the JDK at that moment). Obviously there will be some 
>> flexibility
>>here in terms of when the work lands on trunk and supporting on other
>>branches, but the general pattern / intent hold - push to snapshot latest
>>GA LTS JDK on trunk ASAP after branching for a major.
>>- Trunk targets the language level of that JDK
>>- CI on trunk is that single JDK only
>>- We merge new JDK LTS support to all supported branches at the same
>>time as trunk
>>- We up the supported language level for all supported branches to
>>the latest supported JDK at this time
>>- We don't need to worry about dropping JDK support as that will
>>happen naturally w/the dropping of support for a branch. Branches will
>>slowly gain JDK support w/each subsequent trunk-based LTS integration.
>>
>> *[Branch JDK Support]*
>>
>>- N-2: JDK, JDK-1, JDK-2
>>- N-1: JDK, JDK-1
>>- N: JDK
>>
>> *[CI, JDK's, Upgrades]*
>>
>>- CI:
>>   - For each branch we run per-commit CI for the latest JDK they
>>   s

Re: [DISCUSS] How we handle JDK support

2025-05-21 Thread Josh McKenzie
Great context - thanks for that insight.

Operators running the older supported versions of C* will retain the *option* 
to run the older JDK; however, if they want to upgrade their JDK version and C* 
version *separately* under the above paradigm, they'd need to rev their JDK 
separately on their clusters before running the C* version upgrade.

The need to bump deps for JDK support is very real and that does concern me - 
really great point. Bumping dependencies in an older C* version because a newer 
JDK you're not interested in using needed to be supported would not be a 
positive experience for a user; you're effectively taking on risk for no new 
functionality. *In theory*, we could have conditional dependency inclusion 
based on what version of the JDK you're building a cassandra build for. A 
cursory inspection of this topic in gradle and ant both shows it's possible, if 
a bit cleaner and simpler in the former than the latter.

My recollection of JDK17 to JDK21 was under 5 differences we needed to bump, so 
maintaining a per-jdk list of conditional dependency versions wouldn't be an 
overwhelming burden - at least from that 1 example. Do you recall how many 
dependencies needed to be bumped on the JDK11 to JDK17 transition Ekaterina?

And your point about our lack of performance testing and JDK changes 
translating into perf changes is also one that resonates strongly with me as 
well. There's a straightforward fix there too. :D

For the rare edge case where we have to stop supporting something entirely 
because it's incompatible with a JDK release (has this happened more than the 1 
time?) - I think a reasonable fallback is to just not backport new JDK support 
and consider carrying forward the older JDK support until the release w/the 
feature in it is EoL'ed. That'd allow us to continue to run in-jvm upgrade 
dtests between the versions on the older JDK.

Also - we can't up language level on older branches w/newer JDK support, which I 
hand-waved at in my wishlist. They'd obviously not build on older JDKs if we 
did that, and that'd force the ecosystem that relies on cassandra-all to update 
at that time as well, which wouldn't be pretty.

On Wed, May 21, 2025, at 3:27 PM, Ekaterina Dimitrova wrote:
> Benedict, I am not sure what do you mean by optional feature. FWIW we cannot 
> compile cassandra-4.1 until we removed the feature in cassandra-5.0. I, as a 
> user would be very disappointed a feature to be removed in a patch release. 
> 
> Yes, replacing nashorn was the unpleasant part. I did not raise the nashorn 
> part as if removing the scripted UDFs was a hard technical task, but more to 
> flag we wouldn’t want to make such breaking changes in patch releases.
> 
> “We may well hit similar issues in future, some perhaps even harder to 
> surmount, but I’m sure we can address them as they come on a case by case 
> basis. Worst case we have to postpone the migration by one major for any 
> deprecation to take effect.”
> 
> Agreed, though the lack of performance testing still stands for me. 
> 
> I just got reminded - there was also some time format issue with JDK11 that 
> Scott mentioned before, if I remember correctly.
> 
> So yeah, these are the type of things we may have in front of us. Also, I 
> can’t wait to find a replacement for jamm so we don’t have to think of it 
> anymore. 
> 
> On Wed, 21 May 2025 at 15:17, Benedict  wrote:
>> 
>> Yes the issue of Nashorn did spring to mind, but as I recall this was an 
>> optional feature. I don’t remember how hard it would have been to simply 
>> declare the feature unavailable if you use the newer JDK, but my vague 
>> recollection is the hard part was primarily finding a suitable replacement.
>> 
>> We may well hit similar issues in future, some perhaps even harder to 
>> surmount, but I’m sure we can address them as they come on a case by case 
>> basis. Worst case we have to postpone the migration by one major for any 
>> deprecation to take effect.
>> 
>> 
>>> On 21 May 2025, at 19:57, Ekaterina Dimitrova  wrote:
>>> 
>>> “I'm curious what this raises for you. “
>>> 
>>> A few points that come to mind:
>>> 
>>> - every time we switch/add JDKs we also need to do a bunch of changes in CI 
>>> systems, ccm, etc, not only C* - so more work to call out. Also, if we make 
>>> older versions support newer JDK, I guess we need to ensure drivers, etc 
>>> will support it too probably? Are we discussing JDK support here only for 
>>> Cassandra repo?
>>> - very often we need to bump library versions to support newer JDK versions 
>>> but at the same time we try not to upgrade dependencies in patch release; 
>>> only if it is bug related, in most cases
>>> - whether it is a lot of work or not to backport, I’d say it depends. My 
>>> assumption is that if we keep our maintenance regularly going (which we 
>>> missed with the long development cycle of 4.0) - it is more feasible. 
>>> Though we know that we removed a whole feature to move to JDK17 quicker - 

Re: [DISCUSS] GnuParser / Posix command line argument parser in tools

2025-05-21 Thread David Capwell
> What is the official policy we have around arguments parsing?

From a style point of view I don’t think it’s something the project has taken a 
stance on; it’s something you define as the author of the CLI you are working on.

> What kind of style should we default to for tools? Posix or Gnu?

From what I can tell GNU mostly has w/e POSIX has, but has more flexibility.
There are a few cases I see where POSIX can do things that GNU does differently,
but it doesn’t really seem to matter there…

So from what I can tell, personally, defaulting to GNU makes the most sense to
me for new things.
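
For new tools that also lines up with the non-deprecated path in commons-cli: DefaultParser handles GNU-style long options (as well as the short forms) without extra configuration, as far as I can tell. A minimal sketch; the tool and option names below are made up for illustration, not an existing Cassandra tool:

    import org.apache.commons.cli.CommandLine;
    import org.apache.commons.cli.DefaultParser;
    import org.apache.commons.cli.HelpFormatter;
    import org.apache.commons.cli.Option;
    import org.apache.commons.cli.Options;
    import org.apache.commons.cli.ParseException;

    // Hypothetical example tool, for illustration only.
    public class ExampleTool
    {
        public static void main(String[] args)
        {
            Options options = new Options();
            options.addOption(Option.builder("k").longOpt("keyspace").hasArg()
                                    .desc("keyspace to operate on").build());
            options.addOption(Option.builder("v").longOpt("verbose")
                                    .desc("verbose output").build());
            try
            {
                // DefaultParser accepts both "-k ks1" and "--keyspace=ks1".
                CommandLine cmd = new DefaultParser().parse(options, args);
                String keyspace = cmd.getOptionValue("keyspace");
                boolean verbose = cmd.hasOption("verbose");
                System.out.println("keyspace=" + keyspace + " verbose=" + verbose);
            }
            catch (ParseException e)
            {
                new HelpFormatter().printHelp("exampletool [options]", options);
                System.exit(1);
            }
        }
    }

Invoked as, e.g., "exampletool --keyspace ks1 -v" or "exampletool -k ks1".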

> On May 20, 2025, at 11:22 AM, Jon Haddad  wrote:
> 
> I've written a few dozen tools over the years and have been happy with 
> JCommander.  Picocli looks to follow pretty much all the same conventions, 
> but has a few nicer features on top.  I'd love to see the entire project 
> standardize on it.
> 
> On Tue, May 20, 2025 at 2:29 AM Štefan Miklošovič wrote:
>> Hey,
>> 
>> I mapped what command line parser styles we use across the project while 
>> dealing with some ticket (20448) and it is mixed like this, I am talking 
>> about stuff we use in commons-cli for Gnu and Posix parsers:
>> 
>> GnuParser
>> 
>> StandaloneSplitter
>> BulkLoader (aka sstableloader)
>> HashPassword
>> GenerateTokens
>> AuditLogViewer
>> StandaloneVerifier
>> StandaloneSSTableUtil
>> StandaloneUpgrader
>> StandaloneScrubber
>> 
>> PosixParser
>> 
>> SSTablePartitions
>> SSTableMetadataViewer
>> SSTableExport
>> AbstractJMXClient which is not used anywhere, huh?
>> 
>> For these we use manual parsing of arguments without any library
>> 
>> SSTableRepairedAtSetter
>> SSTableOfflineRelevel
>> SSTableLevelResetter
>> SSTableExpiredBlockers
>> TransformClusterMetadataHelper
>> 
>> airline / (will be picocli-ed soon)
>> 
>> JMXTool
>> FullQueryLogTool
>> CompactionStress
>> 
>> nodetool - will be picocli-ed
>> 
>> Stress - totally custom stuff but cassandra-easy-stress just landed now 
>> which uses "com.beust.jcommander"
>> 
>> For what we have in tools, what should I base it on going forward? 
>> 
>> As a curiosity / interestingly enough, from the perspective of common-cli 
>> implementation, both GnuParser as well as PosixParser are marked as 
>> deprecated. What is not deprecated in commons-cli is "DefaultParser" which 
>> can be configured to mimic either style (gnu / posix).
>> 
>> What is the official policy we have around arguments parsing? What kind of 
>> style should we default to for tools? Posix or Gnu? (not talking about 
>> nodetool as that is a category in itself).
>> 
>> It seems like GnuParser is more prevalent, so is that one the winner here? 
>> 
>> I understand that there is a ton of legacy and we can not just change the 
>> styles easily for the code already there without a big bang. What I am 
>> trying to do here is to just know what we will go with. I do not think that 
>> mixing two styles arbitrarily is good. 
>> 
>> Regards



Re: [DISCUSS] How we handle JDK support

2025-05-21 Thread Josh McKenzie
Lessons learned from advancing JDK support on trunk *should* translate into 
older branches making that effort much smaller; Ekaterina you have a lot of 
experience here so I'm curious what this raises for you. I like the 
productivity implications of us being able to adopt new language features 
faster on trunk; I think this is a solid evolution of the idea, definitely.

Distilling to bulleted lists to try and snapshot the state of the thread w/the 
above proposal:

*[New LTS JDK Adoption]*
 • Trunk supports 1 JDK at a time
 • That JDK will be the GA LTS the day we cut a frozen branch for a new major 
(i.e. from moment of previous release bifurcation, trunk snapshots the JDK at 
that moment). Obviously there will be some flexibility here in terms of when 
the work lands on trunk and supporting on other branches, but the general 
pattern / intent hold - push to snapshot latest GA LTS JDK on trunk ASAP after 
branching for a major.
 • Trunk targets the language level of that JDK
 • CI on trunk is that single JDK only
 • We merge new JDK LTS support to all supported branches at the same time as 
trunk
 • We up the supported language level for all supported branches to the latest 
supported JDK at this time
 • We don't need to worry about dropping JDK support as that will happen 
naturally w/the dropping of support for a branch. Branches will slowly gain JDK 
support w/each subsequent trunk-based LTS integration.
*[Branch JDK Support]*
 • N-2: JDK, JDK-1, JDK-2
 • N-1: JDK, JDK-1
 • N: JDK
*[CI, JDK's, Upgrades]*
 • CI:
   • For each branch we run per-commit CI for the latest JDK they support
   • Periodically we run all CI pipelines for older JDK's per-branch (cadence 
TBD)
 • Upgrades
   • N-2 -> N-1: tested on JDK and JDK-1
   • N-2 -> N: tested on JDK
   • N-1 -> N: tested on JDK
That'd give us 4 upgrade paths we'd need to support and test which feels like 
it's in the territory of "doable on each commit" if we limit the upgrade tests 
to the in-jvm variety and let the periodic run capture the python upgrade tests 
space.

On Wed, May 21, 2025, at 9:30 AM, Benedict wrote:
> 
> Perhaps we should consider back porting support for newer Java LTS releases 
> to older C* versions, and suggesting users upgrade JDK first. This way we can 
> have trunk always on the latest LTS, advancing language feature support more 
> quickly. 
> 
> That is, we would have something like 
> 
> N-2: JDK, JDK-1, JDK-2
> N-1: JDK, JDK-1
> N: JDK
> 
> I think to assist those deploying trunk and reduce churn for development, we 
> might only want to advance the LTS version for trunk after we release a new 
> major, fixing the next release’s Java version at that point.
> 
>>  • On 21 May 2025, at 13:57, Josh McKenzie  wrote:
>> 
>>> You don’t have to run every suite on every commit since as folks have 
>>> pointed out for the most part the JVM isn’t culprit. Need to run it enough 
>>> times to catch when it is for some assumption of “enough”. 
>> So riffing on this. We could move to something like:
>>  • For each given supported C* branch, confirm it **builds **on all 
>> supported JDKs (pre-commit verification, post-commit reactive runs)
>>  • Constrain language level on any given C* branch to **lowest supported 
>> JDK**
>>  • Run all reactive post-commit CI pipelines against *the *highest supported 
>> JDK only**
>>  • Once a N (day, week, month?), run all pipelines against all supported 
>> JDKs on all branches
>>• Augment notification mechanisms so it squawks to dev list and slack on 
>> failure of non-highest JDK pipelines
>> That approach would tweak our balance towards our perception of the 
>> infrequency of per-JDK failures while allowing us to "scale up" the matrix 
>> of tests that we perform.
>> 
>> i.e. once a week we could have a heavy 9x run (3 branches, 3 JDK's) which we 
>> could then plan around and space out in terms of resource allocation, but 
>> otherwise we run a single set of pipelines per branch post-commit.
>> 
>> That'd give us the confidence to say "we tested the upgrade path we're 
>> recommending for you" without having to pay the tax of doing it on every 
>> commit or allowing potential defects to pile up to a once-a-year 
>> JDK-specific bug-bash.
>> 
>> In terms of JDK support when bumping (mapping of relative C* version and 
>> relative JDK version):
>>  • N-2: JDK-2, JDK-3, JDK-4
>>  • N-1: JDK-1, JDK-2, JDK-3 
>>  • N: JDK, JDK-1, JDK-2
>> So we'd have 3 supported LTS per branch, be able to adhere to "you can 
>> upgrade from N-2 to N using the same JDK", and allow us to balance our CI 
>> coverage to our expected surfacing of defects.
>> 
>> Then if we rev JDK we support on any given N+1, we end up with (keeping with 
>> N above as reference):
>>  • N-1: JDK-1, JDK-2, JDK-3
>>  • N: JDK, JDK-1, JDK-2
>>  • N+1: JDK+1, JDK, JDK-1
>> So shared JDK across all 3 on that rev is JDK-1.
>> 
>> I think 3 LTS per branch gives us the ability to both add / drop a JDK per 
>> major and test / provide for

Re: [DISCUSS] How we handle JDK support

2025-05-21 Thread Benedict
Perhaps we should consider back porting support for newer Java LTS releases to 
older C* versions, and suggesting users upgrade JDK first. This way we can have 
trunk always on the latest LTS, advancing language feature support more 
quickly. 

That is, we would have something like 

N-2: JDK, JDK-1, JDK-2
N-1: JDK, JDK-1
N: JDK

I think to assist those deploying trunk and reduce churn for development, we 
might only want to advance the LTS version for trunk after we release a new 
major, fixing the next release’s Java version at that point.

> On 21 May 2025, at 13:57, Josh McKenzie  wrote:
> 
>> 
>> You don’t have to run every suite on every commit since as folks have 
>> pointed out for the most part the JVM isn’t culprit. Need to run it enough 
>> times to catch when it is for some assumption of “enough”. 
> So riffing on this. We could move to something like:
> For each given supported C* branch, confirm it builds on all supported JDKs 
> (pre-commit verification, post-commit reactive runs)
> Constrain language level on any given C* branch to lowest supported JDK
> Run all reactive post-commit CI pipelines against the highest supported JDK 
> only
> Once a N (day, week, month?), run all pipelines against all supported JDKs on 
> all branches
> Augment notification mechanisms so it squawks to dev list and slack on 
> failure of non-highest JDK pipelines
> That approach would tweak our balance towards our perception of the 
> infrequency of per-JDK failures while allowing us to "scale up" the matrix of 
> tests that we perform.
> 
> i.e. once a week we could have a heavy 9x run (3 branches, 3 JDK's) which we 
> could then plan around and space out in terms of resource allocation, but 
> otherwise we run a single set of pipelines per branch post-commit.
> 
> That'd give us the confidence to say "we tested the upgrade path we're 
> recommending for you" without having to pay the tax of doing it on every 
> commit or allowing potential defects to pile up to a once-a-year JDK-specific 
> bug-bash.
> 
> In terms of JDK support when bumping (mapping of relative C* version and 
> relative JDK version):
> N-2: JDK-2, JDK-3, JDK-4
> N-1: JDK-1, JDK-2, JDK-3 
> N: JDK, JDK-1, JDK-2
> So we'd have 3 supported LTS per branch, be able to adhere to "you can 
> upgrade from N-2 to N using the same JDK", and allow us to balance our CI 
> coverage to our expected surfacing of defects.
> 
> Then if we rev JDK we support on any given N+1, we end up with (keeping with 
> N above as reference):
> N-1: JDK-1, JDK-2, JDK-3
> N: JDK, JDK-1, JDK-2
> N+1: JDK+1, JDK, JDK-1
> So shared JDK across all 3 on that rev is JDK-1.
> 
> I think 3 LTS per branch gives us the ability to both add / drop a JDK per 
> major and test / provide for upgrades from N-2 to N w/out requiring a new JDK 
> cert too.
> 
>> On Wed, May 21, 2025, at 3:27 AM, Mick Semb Wever wrote:
>>.
>>   
>> So yeah. I think we'll need to figure out how much coverage is reasonable to 
>> call something "tested". I don't think it's sustainable for us to have, at 
>> any given time, 3 branches we test across 3 JDK's each with all our in-jvm 
>> test suites is it?
>> 
>> 
>> 
>> Correct.
>> For non-upgrade tests, where testing against more than one jdk exists, we 
>> should start the conversation of the value of running more than one JDK for 
>> all tests per-commit CI, before we go adding a third.
>> 
>> I'm not against weekly/fortnightly CI runs, just that it deserves the 
>> discussion of cost (it's not necessarily cheaper due to saturation, nor are 
>> we a team that has assigned build barons).  The actual change is relatively 
>> easy, just adding a profile and a jdk element here: 
>> https://github.com/apache/cassandra/blob/trunk/.jenkins/Jenkinsfile#L126-L135
>>  
> 


Re: [DISCUSS] How we handle JDK support

2025-05-21 Thread Josh McKenzie
> You don’t have to run every suite on every commit since as folks have pointed 
> out for the most part the JVM isn’t culprit. Need to run it enough times to 
> catch when it is for some assumption of “enough”. 
So riffing on this. We could move to something like:
 • For each given supported C* branch, confirm it *builds* on all supported 
JDKs (pre-commit verification, post-commit reactive runs)
 • Constrain language level on any given C* branch to *lowest supported JDK*
 • Run all reactive post-commit CI pipelines against *the highest supported 
JDK only*
 • Once a N (day, week, month?), run all pipelines against all supported JDKs 
on all branches
   • Augment notification mechanisms so it squawks to dev list and slack on 
failure of non-highest JDK pipelines
That approach would tweak our balance towards our perception of the infrequency 
of per-JDK failures while allowing us to "scale up" the matrix of tests that we 
perform.

i.e. once a week we could have a heavy 9x run (3 branches, 3 JDK's) which we 
could then plan around and space out in terms of resource allocation, but 
otherwise we run a single set of pipelines per branch post-commit.

That'd give us the confidence to say "we tested the upgrade path we're 
recommending for you" without having to pay the tax of doing it on every commit 
or allowing potential defects to pile up to a once-a-year JDK-specific bug-bash.

In terms of JDK support when bumping (mapping of relative C* version and 
relative JDK version):
 • N-2: JDK-2, JDK-3, JDK-4
 • N-1: JDK-1, JDK-2, JDK-3 
 • N: JDK, JDK-1, JDK-2
So we'd have 3 supported LTS per branch, be able to adhere to "you can upgrade 
from N-2 to N using the same JDK", and allow us to balance our CI coverage to 
our expected surfacing of defects.

Then if we rev JDK we support on any given N+1, we end up with (keeping with N 
above as reference):
 • N-1: JDK-1, JDK-2, JDK-3
 • N: JDK, JDK-1, JDK-2
 • N+1: JDK+1, JDK, JDK-1
So shared JDK across all 3 on that rev is JDK-1.

I think 3 LTS per branch gives us the ability to both add / drop a JDK per 
major and test / provide for upgrades from N-2 to N w/out requiring a new JDK 
cert too.

On Wed, May 21, 2025, at 3:27 AM, Mick Semb Wever wrote:
>.
>   
>>> So yeah. I think we'll need to figure out how much coverage is reasonable 
>>> to call something "tested". I don't think it's sustainable for us to have, 
>>> at any given time, 3 branches we test across 3 JDK's each with all our 
>>> in-jvm test suites is it?
>>> 
> 
> 
> Correct.
> For non-upgrade tests, where testing against more than one jdk exists, we 
> should start the conversation of the value of running more than one JDK for 
> all tests per-commit CI, before we go adding a third.
> 
> I'm not against weekly/fortnightly CI runs, just that it deserves the 
> discussion of cost (it's not necessarily cheaper due to saturation, nor are 
> we a team that has assigned build barons).  The actual change is relatively 
> easy, just adding a profile and a jdk element here: 
> https://github.com/apache/cassandra/blob/trunk/.jenkins/Jenkinsfile#L126-L135 


Re: [DISCUSS] How we handle JDK support

2025-05-21 Thread Mick Semb Wever
   .


> So yeah. I think we'll need to figure out how much coverage is reasonable
>> to call something "tested". I don't think it's sustainable for us to have,
>> at any given time, 3 branches we test across 3 JDK's each with all our
>> in-jvm test suites is it?
>>
>

Correct.
For non-upgrade tests, where testing against more than one JDK exists, we
should start the conversation about the value of running more than one JDK for
all tests in per-commit CI, before we go adding a third.

I'm not against weekly/fortnightly CI runs, just that it deserves the
discussion of cost (it's not necessarily cheaper due to saturation, nor are
we a team that has assigned build barons).  The actual change is relatively
easy, just adding a profile and a jdk element here:
https://github.com/apache/cassandra/blob/trunk/.jenkins/Jenkinsfile#L126-L135