Re: [DISCUSS][CASSANDRA-20681] Mark JDK 17 as production ready for Cassandra 5.0

2025-06-02 Thread Štefan Miklošovič
Updated 5.0 and trunk branches to reflect this under CASSANDRA-20681. Let
us know if this is, for some reason, not enough and you want to propagate
this information further.

On Wed, May 28, 2025 at 11:32 PM Jeremiah Jordan wrote:

> +1
>
> On May 28, 2025 at 4:28:22 PM, Mick Semb Wever  wrote:
>
>> Do it.
>>
>> Four patch releases and eight months in, we're safe.
>>
>>
>> On Mon, 26 May 2025 at 21:00, Dmitry Konstantinov wrote:
>>
>>> Hi all,
>>>
>>> I've created a task to mark JDK 17 as production-ready for Cassandra 5.0
>>> in our documentation - CASSANDRA-20681
>>> 
>>>
>>> Reasons:
>>>
>>>- Cassandra 5.0.x has had four bugfix releases and is stable (5.0.0
>>>was released in September 2024, so it's been out for eight months).
>>>- I'm not aware of any open Cassandra issues specific to JDK 17.
>>>- Our CI has been running tests with JDK 17 on every commit for over
>>>a year.
>>>- JDK 17 is a mature LTS version. We already have a newer LTS (JDK
>>>21), and JDK 24 has already been released.
>>>- There might be a vicious cycle here: we’re waiting for more user
>>>feedback, while users are waiting for the feature to be marked as
>>>non-experimental before adopting it more widely.
>>>
>>>
>>> Any objections to marking JDK 17 as production-ready for 5.0?
>>>
>>> Related threads where the topic of JDK 17 status has been raised:
>>>
>>>- https://the-asf.slack.com/archives/CK23JSY2K/p1744313849439569
>>>- https://the-asf.slack.com/archives/CJZLTM05A/p1746787244618429
>>>- https://lists.apache.org/thread/np70b8ck21k0ojsjnotg3j9p2rrj29dp
>>>- https://stackoverflow.com/questions/79563058/java-17-support-for-cassandra-5
>>>
>>>
>>>
>>> --
>>> Dmitry Konstantinov
>>>
>>


Re: [DISCUSS] How we handle JDK support

2025-06-02 Thread Doug Rohrer
Only thing I’d suggest changing here is “Trunk targets the language level of 
that JDK” shouldn’t happen until after we’ve confirmed the back port of the new 
JDK LTS changes to previous versions - otherwise, you have folks starting to 
use new language features and then having to rip them all out when you find 
that some previously supported Cassandra release can’t use that JDK.

Doug

> On May 27, 2025, at 10:37 AM, Josh McKenzie  wrote:
> 
> revised snapshot of the state of conversation here:
> 
> [New LTS JDK Adoption]
> - Trunk supports 1 JDK at a time
> - After a branch is cut for a release, we push to get trunk to support the 
> latest LTS JDK version available at that time
> - Trunk targets the language level of that JDK
> - CI on trunk is that single JDK only
> - We merge new JDK LTS support to all supported branches at the same time as 
> trunk
>   - In the very rare case a feature would have to be removed due to JDK change 
> (think UDF's scripting engine), we instead keep the maximum allowable JDK for 
> that feature supported on trunk and subsequent releases. We then drop that 
> JDK across all branches once the oldest C* w/that feature ages out of support.
> - Otherwise, we don't need to worry about dropping JDK support as that will 
> happen naturally w/the dropping of support for a branch. Branches will slowly 
> gain JDK support w/each subsequent trunk-based LTS integration.
> [Branch JDK Support]
> - N-2: JDK, JDK-1, JDK-2
> - N-1: JDK, JDK-1
> - N: JDK
> [CI, JDK's, Upgrades]
> - CI:
>   - For each branch we run per-commit CI for the latest JDK they support
>   - TODO: Periodically we run all CI pipelines for older JDK's per-branch 
> (cadence TBD)
>   - TODO: We add basic perf testing across all GA branches with reference 
> workloads (easy-cass-stress workloads?)
> - Upgrades
>   - N-2 -> N-1: tested on JDK and JDK-1
>   - N-2 -> N: tested on JDK
>   - N-1 -> N: tested on JDK
> 
> ---
> The above has 2 non-trivial CI orchestration investments:
> 1. Running all CI across all supported JDK on a cadence
> 2. Adding some basic perf smoke tests
> Both seem reasonable to me.
> 
> On Fri, May 23, 2025, at 7:39 AM, Mick Semb Wever wrote:
>>  
>>.
>>   
>>   
>> For the rare edge case where we have to stop supporting something entirely 
>> because it's incompatible with a JDK release (has this happened more than 
>> the 1 time?) - I think a reasonable fallback is to just not backport new JDK 
>> support and consider carrying forward the older JDK support until the 
>> release w/the feature in it is EoL'ed. That'd allow us to continue to run 
>> in-jvm upgrade dtests between the versions on the older JDK.
>> 
>> 
>> 
>> This.
>> I think the idea of adding new major JDKs to release branches for a number 
>> of reasons, in theory at least.  …
>> 
>> 
>> I *like* the idea … :) 
> 



Re: Accepting AI generated contributions

2025-06-02 Thread David Capwell
> To clarify are you saying that we should not accept AI generated code until 
> it has been looked at by a human

I think AI code would normally go through the same process as normal code: the 
author and reviewers all review the code. I am not against AI code in this 
context.

>  then written again with different "wording" to ensure that it doesn't 
> directly copy anything?

Personally I feel it’s fine to leave this to authors / reviewers for the 
moment.  There are many cases where we can be confident that it’s not taking 
wording or code from others (such as linking internal code together), so adding 
a policy makes the valid use cases harder.  Right now the Apache policy is that 
you must be explicit that you used AI, but it might be best to expand on that 
and ask at what level; this isn’t a disqualifying question but more to help 
guide reviewers to look at things with a different eye.

For example, if a new indexing algorithm was generated using “vibe coding” the 
burden to show that this patch can be contributed lies on the author and 
reviewers, and I am not likely to want to review that patch for that reason.  
But if the author “vibe coded” a refactor migrating Repair to its own 
serializer I won’t have issues reviewing that; one patch feels legally harder 
to prove than the other.

> Or do you mean something else about the quality of "vibe coding" and how we 
> shouldn't allow it because it makes bad code?

I would hope our normal review process can handle the “bad code” part of it.  
At the end of the day we need 2 reviewers to sign off on the quality.  I guess 
the question that really matters is when the author who did this is also a 
committer; they still need to review the patch as a normal reviewer and can’t 
drop their duties.

> On Jun 2, 2025, at 3:51 PM, Ariel Weisberg  wrote:
> 
> Hi,
> 
> To clarify are you saying that we should not accept AI generated code until 
> it has been looked at by a human and then written again with different 
> "wording" to ensure that it doesn't directly copy anything?
> 
> Or do you mean something else about the quality of "vibe coding" and how we 
> shouldn't allow it because it makes bad code? Ultimately it's the 
> contributor's (and committer's) job to ensure that their contributions meet 
> the bar for acceptance and I don't think we should tell them how to go about 
> meeting that bar beyond what is needed to address the copyright concern.
> 
> I agree that the bar set by the Apache guidelines is pretty high. The 
> guidelines are simultaneously impossible and trivial to meet depending on how 
> you interpret them, and we are not very well equipped to interpret them.
> 
> It would have been more straightforward for them to simply say no, but they 
> didn't opt to do that, implying there is some way for PMCs to acceptably take 
> AI generated contributions.
> 
> Ariel
> 
> On Mon, Jun 2, 2025, at 5:03 PM, David Capwell wrote:
>>> fine tuning encourage not reproducing things verbatim
>>> I think not producing copyrighted output from your training data is a 
>>> technically feasible achievement for these vendors so I have a moderate 
>>> level of trust they will succeed at it if they say they do it.
>> 
>> Some team members and I discussed this in the context of my documentation 
>> patch (which utilized Claude during composition). I conducted an experiment 
>> to pose high-level Cassandra-related questions to a model without additional 
>> context, while adjusting the temperature parameter (tested at 0.2, 0.5, and 
>> 0.8). The results revealed that each test generated content copied verbatim 
>> from a specific non-Apache (and non-DSE) website. I did not verify whether 
>> this content was copyrighted, though it was easily identifiable through a 
>> simple Google search. This occurred as a single sentence within the 
>> generated document, and as I am not a legal expert, I cannot determine 
>> whether this constitutes a significant issue.
>> 
>> The complexity increases when considering models trained on different 
>> languages, which may translate content into English. In such cases, a Google 
>> search would fail to detect the origin. Is this still considered plagiarism? 
>> Does it violate copyright laws? I am uncertain.
>> 
>> Similar challenges arise with code generation. For instance, if a model is 
>> trained on a GPL-licensed Python library that implements a novel data 
>> structure, and the model subsequently rewrites this structure in Java, a 
>> Google search is unlikely to identify the source.
>> 
>> Personally, I do not assume these models will avoid producing copyrighted 
>> material. This doesn’t mean I am against AI at all, but rather reflects my 
>> belief that the requirements set by Apache are not easily “provable” in such 
>> scenarios.
>> 
>> 
>>> My personal opinion is that we should at least consider allow listing a few 
>>> specific sources (any vendor that scans output for infringement) and add 
>>> that to the PR template and in other locations (readme, web site). 

Re: Accepting AI generated contributions

2025-06-02 Thread Jeremiah Jordan
> Ultimately it's the contributor's (and committer's) job to ensure that
their contributions meet the bar for acceptance

To me this is the key point. Given how pervasive this stuff is becoming, I
don’t think it’s feasible to make some list of tools and enforce it.  Even
without getting into extra tools, IDEs (including IntelliJ) are doing more
and more LLM based code suggestion as time goes on.
I think we should point people to the ASF Guidelines around such tools, and
the guidelines around copyrighted code, and then continue to review patches
with the high standards we have always had in this project.

-Jeremiah

On Mon, Jun 2, 2025 at 5:52 PM Ariel Weisberg  wrote:

> Hi,
>
> To clarify are you saying that we should not accept AI generated code
> until it has been looked at by a human and then written again with
> different "wording" to ensure that it doesn't directly copy anything?
>
> Or do you mean something else about the quality of "vibe coding" and how
> we shouldn't allow it because it makes bad code? Ultimately it's the
> contributor's (and committer's) job to ensure that their contributions meet
> the bar for acceptance and I don't think we should tell them how to go
> about meeting that bar beyond what is needed to address the copyright
> concern.
>
> I agree that the bar set by the Apache guidelines is pretty high. The
> guidelines are simultaneously impossible and trivial to meet depending on how
> you interpret them, and we are not very well equipped to interpret them.
>
> It would have been more straightforward for them to simply say no, but they
> didn't opt to do that, implying there is some way for PMCs to acceptably take
> AI generated contributions.
>
> Ariel
>
> On Mon, Jun 2, 2025, at 5:03 PM, David Capwell wrote:
>
> fine tuning encourage not reproducing things verbatim
>
> I think not producing copyrighted output from your training data is a
> technically feasible achievement for these vendors so I have a moderate
> level of trust they will succeed at it if they say they do it.
>
>
> Some team members and I discussed this in the context of my documentation
> patch (which utilized Claude during composition). I conducted an experiment
> to pose high-level Cassandra-related questions to a model without
> additional context, while adjusting the temperature parameter (tested at
> 0.2, 0.5, and 0.8). The results revealed that each test generated content
> copied verbatim from a specific non-Apache (and non-DSE) website. I did not
> verify whether this content was copyrighted, though it was easily
> identifiable through a simple Google search. This occurred as a single
> sentence within the generated document, and as I am not a legal expert, I
> cannot determine whether this constitutes a significant issue.
>
> The complexity increases when considering models trained on different
> languages, which may translate content into English. In such cases, a
> Google search would fail to detect the origin. Is this still considered
> plagiarism? Does it violate copyright laws? I am uncertain.
>
> Similar challenges arise with code generation. For instance, if a model is
> trained on a GPL-licensed Python library that implements a novel data
> structure, and the model subsequently rewrites this structure in Java, a
> Google search is unlikely to identify the source.
>
> Personally, I do not assume these models will avoid producing copyrighted
> material. This doesn’t mean I am against AI at all, but rather reflects
> my belief that the requirements set by Apache are not easily “provable” in
> such scenarios.
>
>
> My personal opinion is that we should at least consider allow listing a
> few specific sources (any vendor that scans output for infringement) and
> add that to the PR template and in other locations (readme, web site).
> Bonus points if we can set up code scanning (useful for non-AI
> contributions!).
>
>
> My perspective, after trying to see what AI can do, is the following:
>
> Strengths
> * Generating a preliminary draft of a document and assisting with
> iterative revisions
> * Documenting individual methods
> * Generation of “simple” methods and scripts, provided the underlying
> libraries are well-documented in public repositories
> * Managing repetitive or procedural tasks, such as “migrating from X to Y”
> or “converting serializations to the X interface”
>
> Limitations
> * Producing a fully functional document in a single attempt that meets
> merge standards. When documenting Gens.java and Property.java, the output
> appeared plausible but contained frequent inaccuracies.
> * Addressing complex or ambiguous scenarios (“gossip”), though this
> challenge is not unique to AI—Matt Byrd and I tested Claude for
> CASSANDRA-20659, where it could identify relevant code but proposed
> solutions that risked corrupting production clusters.
> * Interpreting large-scale codebases. Beyond approximately 300 lines of
> actual code (excluding formatting), performance degrades significantly,
> leading to a marked decline in output quality.

Re: Cassandra 5+ JDK Minimum Compatibility Requirement

2025-06-02 Thread Vivekanand Koya
Hello Everyone,

I was debugging
https://lists.apache.org/thread/ykkwhjdpgyqzw5xtol4v5ysz664bxxl3 and found
the issue. The Result inner class has a circular dependency on its inner
classes. (
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundConnectionInitiator.java#L457).
I have refactored the Result class into an individual file (Result.java).
The refactored code compiles successfully with Ant. I am now working to
resolve the dependency between the superclass and subclass. (
https://github.com/vivekkoya/cassandra/commit/1e5178dd8a8a523eb490c753ee28ff966abe9fc3
)
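For readers following along, here is a minimal, hypothetical sketch of the kind of extraction described above. The class and field names only loosely echo the nested Result types in OutboundConnectionInitiator and are not the actual Cassandra code; the point is just that moving the hierarchy into its own top-level file makes the dependencies one-directional (subtypes depend on Result, and the former outer class depends on this file).

```java
// Sketch only: a simplified stand-in for a Result hierarchy extracted to a
// top-level class so the original outer class no longer depends on its own
// inner classes (the circular shape described above).
abstract class Result {
    enum Outcome { SUCCESS, RETRY, INCOMPATIBLE }

    final Outcome outcome;

    private Result(Outcome outcome) { this.outcome = outcome; }

    boolean isSuccess() { return outcome == Outcome.SUCCESS; }

    // Subtypes are nested inside Result, so they depend only on Result.
    static final class Success extends Result {
        Success() { super(Outcome.SUCCESS); }
    }

    static final class Incompatible extends Result {
        final int maxMessagingVersion; // illustrative field, not the real one
        Incompatible(int maxMessagingVersion) {
            super(Outcome.INCOMPATIBLE);
            this.maxMessagingVersion = maxMessagingVersion;
        }
    }

    static Result success() { return new Success(); }
    static Result incompatible(int v) { return new Incompatible(v); }
}

public class ResultDemo {
    public static void main(String[] args) {
        Result ok = Result.success();
        Result bad = Result.incompatible(12);
        System.out.println(ok.isSuccess());  // true
        System.out.println(bad.isSuccess()); // false
    }
}
```

Since all subtypes stay nested within Result, the private constructor remains reachable by them while nothing outside the file can add new result kinds.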

At the same time, I found the deprecation warnings from using
SecurityManager annoying, so I commented out all the code in src/ and test/
directories which caused those warnings and got the code to compile with
only one warning, from an Ant jar file.
(
https://github.com/vivekkoya/cassandra/commit/8b7f5e150ddb678a46bd661f595a8875d1329451
)
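For context on those warnings: since JDK 17 (JEP 411) the SecurityManager APIs are deprecated for removal, so every reference trips a "removal" warning under -Xlint. A hedged sketch of a less invasive alternative to commenting out the call sites is to suppress the warning at the narrowest possible scope (class and method names here are illustrative):

```java
// Sketch: JDK 17 marks SecurityManager as deprecated for removal (JEP 411),
// so any reference warns under -Xlint:removal. Rather than deleting the call
// sites, the warning can be silenced locally on the member that needs it.
public class SecurityManagerCheck {
    @SuppressWarnings("removal") // System.getSecurityManager() is deprecated for removal
    static boolean securityManagerInstalled() {
        return System.getSecurityManager() != null;
    }

    public static void main(String[] args) {
        // Prints false on a stock JVM, where no SecurityManager is installed.
        System.out.println(securityManagerInstalled());
    }
}
```

This keeps the behavior intact while documenting exactly which references are affected, which makes the eventual removal easier to find.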

Appreciate any feedback

Thanks,
Vivekanand K.


On Sun, May 11, 2025 at 5:27 AM Josh McKenzie  wrote:

> breaking it up into several smaller patches.
>
> This immediately made me think of poor Blake and the "remove singletons"
> sisyphean task. That was 700 smaller patches! :D  Which to be fair isn't
> "several" by any measure, but...
>
> All of which is to say - it's a continuum between big-banging it in one go and
> death by 700 cuts. Not to grind an axe (I'm totally grinding an axe), but
> if we had cleanly delineated modular boundaries in this codebase with
> clear separation of concerns, we could tackle a refactor like this on a
> subsystem by subsystem basis (i.e. batch: the middle ground between big
> bang and death-by-a-thousand) to strike a sweet spot between the two extremes.
>
> I'm sympathetic to the pragmatic reality of the disruption sweeping
> changes to modernize cause to a project ecosystem, but my opinion is that
> the march of time and evolution of our language ecosystem is *really*
> leaving us behind without some batched, focused work on modernization. This
> codebase has some jekyll-and-hyde vibes; when you git blame and see <= 2010
> and svn import commit messages all over a file it's very much a red flag
> that you're probably in shark-infested waters.
>
> On Sat, May 10, 2025, at 11:53 AM, Vivekanand Koya wrote:
>
> It looks like there is a potential solution to the non-deterministic
> ByteBuffer behavior:
> https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/lang/foreign/MemoryLayout.html
> & https://archive.fosdem.org/2020/schedule/event/bytebuffers/
>
> Thanks,
> Vivekanand K.
>
>
> On Fri, May 9, 2025, 8:59 PM Vivekanand Koya <13vivekk...@gmail.com>
> wrote:
>
> Made some progress. After adding 
> throughout build.xml and compiling the 5.03 branch with openjdk 17.0.15
> 2025-04-15
> OpenJDK Runtime Environment Temurin-17.0.15+6 (build 17.0.15+6) I got a
> build Failed error at the same position in exception. Please see:
> https://github.com/apache/cassandra/pull/4152
>
> While debugging, it appears there is an idiosyncrasy in how Netty was used
> for efficient network operations. The unsafe casting was highlighted by the
> compiler and eventually made its way to runtime. I drew a dependency graph
> between types. It appears Java natively supports such functionality with
> Project Loom (https://openjdk.org/jeps/444) (
> https://inside.java/2021/05/10/networking-io-with-virtual-threads/). I
> understand that this is only part of the story. Please correct me if my
> reasoning is wrong; I wish to learn from your experience and welcome your
> insights.
>
> Thanks,
> Vivekanand K.
>
> On Fri, May 9, 2025 at 1:30 PM Brandon Williams  wrote:
>
> We thought we had this figured out when we did the big bang switch to
> ByteBuffers, then spent years finding subtle bugs that the tests
> didn't.
>
> Kind Regards,
> Brandon
>
> On Fri, May 9, 2025 at 3:24 PM Jon Haddad  wrote:
> >
> > There’s a pretty simple solution here - breaking it up into several
> smaller patches.
> >
> > * Any changes should include tests that validate the checks are used
> correctly.
> > * It should also alleviate any issues with code conflicts and rebasing
> as the merges would happen slowly over time rather than all at once.
> > * If there’s two committers willing to spend time and work with OP on
> this, that should be enough to move it forward.
> > * There's a thread on user@ right now [1] where someone *just* ran into
> this issue, so I'd say addressing that one is a reasonable starting point.
> >
> > [1] https://lists.apache.org/thread/ykkwhjdpgyqzw5xtol4v5ysz664bxxl3
> >
> >
> >
> > Jon
> >
> >
> On Fri, May 9, 2025 at 12:16 PM C. Scott Andreas wrote:
> >>
> >> My thinking is most closely aligned with Blake and Benedict’s views
> here.
> >>
> >> For the specific refactor in question, I support adoption of the
> language feature for new code or to cut existing code over to the new
> syntax as changes are made to the respective areas of the codebase. But I
> don’t support a sweeping p

Re: Accepting AI generated contributions

2025-06-02 Thread Ariel Weisberg
Hi,

To clarify are you saying that we should not accept AI generated code until it 
has been looked at by a human and then written again with different "wording" 
to ensure that it doesn't directly copy anything?

Or do you mean something else about the quality of "vibe coding" and how we 
shouldn't allow it because it makes bad code? Ultimately it's the contributor's 
(and committer's) job to ensure that their contributions meet the bar for 
acceptance and I don't think we should tell them how to go about meeting that 
bar beyond what is needed to address the copyright concern.

I agree that the bar set by the Apache guidelines is pretty high. The 
guidelines are simultaneously impossible and trivial to meet depending on how 
you interpret them, and we are not very well equipped to interpret them.

It would have been more straightforward for them to simply say no, but they 
didn't opt to do that, implying there is some way for PMCs to acceptably take 
AI generated contributions.

Ariel

On Mon, Jun 2, 2025, at 5:03 PM, David Capwell wrote:
>> fine tuning encourage not reproducing things verbatim
>> I think not producing copyrighted output from your training data is a
>> technically feasible achievement for these vendors so I have a moderate
>> level of trust they will succeed at it if they say they do it.
> 
> Some team members and I discussed this in the context of my documentation 
> patch (which utilized Claude during composition). I conducted an experiment 
> to pose high-level Cassandra-related questions to a model without additional 
> context, while adjusting the temperature parameter (tested at 0.2, 0.5, and 
> 0.8). The results revealed that each test generated content copied verbatim 
> from a specific non-Apache (and non-DSE) website. I did not verify whether 
> this content was copyrighted, though it was easily identifiable through a 
> simple Google search. This occurred as a single sentence within the generated 
> document, and as I am not a legal expert, I cannot determine whether this 
> constitutes a significant issue.
> 
> The complexity increases when considering models trained on different 
> languages, which may translate content into English. In such cases, a Google 
> search would fail to detect the origin. Is this still considered plagiarism? 
> Does it violate copyright laws? I am uncertain.
> 
> Similar challenges arise with code generation. For instance, if a model is 
> trained on a GPL-licensed Python library that implements a novel data 
> structure, and the model subsequently rewrites this structure in Java, a 
> Google search is unlikely to identify the source.
> 
> Personally, I do not assume these models will avoid producing copyrighted 
> material. This doesn’t mean I am against AI at all, but rather reflects my 
> belief that the requirements set by Apache are not easily “provable” in such 
> scenarios.
> 
> 
>> My personal opinion is that we should at least consider allow listing a few 
>> specific sources (any vendor that scans output for infringement) and add 
>> that to the PR template and in other locations (readme, web site). Bonus 
>> points if we can set up code scanning (useful for non-AI contributions!).
> 
> My perspective, after trying to see what AI can do, is the following:
> 
> Strengths
> * Generating a preliminary draft of a document and assisting with iterative
> revisions
> * Documenting individual methods
> * Generation of “simple” methods and scripts, provided the underlying
> libraries are well-documented in public repositories
> * Managing repetitive or procedural tasks, such as “migrating from X to Y” or
> “converting serializations to the X interface”
> 
> Limitations
> * Producing a fully functional document in a single attempt that meets merge
> standards. When documenting Gens.java and Property.java, the output appeared
> plausible but contained frequent inaccuracies.
> * Addressing complex or ambiguous scenarios (“gossip”), though this challenge
> is not unique to AI—Matt Byrd and I tested Claude for CASSANDRA-20659, where
> it could identify relevant code but proposed solutions that risked corrupting
> production clusters.
> * Interpreting large-scale codebases. Beyond approximately 300 lines of
> actual code (excluding formatting), performance degrades significantly,
> leading to a marked decline in output quality.
> 
> Note: When referring to AI/LLMs, I am not discussing interactions with a user 
> interface to execute specific tasks, but rather leveraging code agents like 
> Roo and Aider to provide contextual information to the LLM.
> 
> Given these observations, it remains challenging to determine optimal
> practices. In some contexts it’s very clear that nothing was taken from
> external work (e.g., “create a test using our BTree class that inserts a
> row with a null column,” “analyze this function’s purpose”). However, for
> substantial tasks, the situation becomes more complex. If the author employed
> AI as a collab

Re: [DISCUSS] How we handle JDK support

2025-06-02 Thread Ekaterina Dimitrova
I think the risk outlined here is real only if we don’t run the upgrade
tests to trunk, no?

That should probably be the ultimate safeguard. And while pre-commit we do
not run all tests for every patch - I think that does not apply to JDK
addition/removal, where we would normally run all suites pre-commit due to
the nature of the change and the blast radius it has.

On Mon, 2 Jun 2025 at 18:17, Josh McKenzie  wrote:

> I originally had "everyone supports highest language level whee" which of
> course would fail to build on older branches.
>
> So this new paradigm would give us the following branch-to-language-level
> support (assuming a JDK bump on each release, which also won't always happen):
> - trunk: latest
> - trunk-1: latest lang - 1
> - trunk-2: latest lang - 2
>
> So while trunk-1 and trunk-2 would both *support* the newest JDK
> (wherever possible) for runtime, they wouldn't be switched to the new
> language level. That'd leave us able to use the newest language features on
> trunk much more rapidly while *effectively snapshotting the supported
> language on older branches to the lowest JDK they support* (which, when
> they're last in line and about to fall off, is the JDK that was newest at
> the time they came about).
>
> Our risk would be that patches going to trunk targeting new language
> features, which we then found out we needed to back-port, would require some
> massaging to be compatible with older branches. I suspect that'll be a rare
> edge case, so seems ok?
>
> Unless I'm completely missing something. I was the one who originally just
> wanted to "latest JDK All The Things" for a hot minute there. =/
>
> On Mon, Jun 2, 2025, at 9:40 AM, Doug Rohrer wrote:
>
> Only thing I’d suggest changing here is “Trunk targets the language level
> of that JDK” shouldn’t happen until after we’ve confirmed the back port of
> the new JDK LTS changes to previous versions - otherwise, you have folks
> starting to use new language features and then having to rip them all out
> when you find that some previously supported Cassandra release can’t use that
> JDK.
>
> Doug
>
> On May 27, 2025, at 10:37 AM, Josh McKenzie  wrote:
>
> revised snapshot of the state of conversation here:
>
> *[New LTS JDK Adoption]*
>
>- Trunk supports 1 JDK at a time
>- After a branch is cut for a release, we push to get trunk to support
>latest LTS JDK version available at that time
>- Trunk targets the language level of that JDK
>- CI on trunk is that single JDK only
>- We merge new JDK LTS support to all supported branches at the same
>time as trunk
>   - In the very rare case a feature would have to be removed due to
>   JDK change (think UDF's scripting engine), we instead keep the maximum
>   allowable JDK for that feature supported on trunk and subsequent 
> releases.
>   We then drop that JDK across all branches once the oldest C* w/that 
> feature
>   ages out of support.
>- Otherwise, we don't need to worry about dropping JDK support as that
>will happen naturally w/the dropping of support for a branch. Branches will
>slowly gain JDK support w/each subsequent trunk-based LTS integration.
>
> *[Branch JDK Support]*
>
>- N-2: JDK, JDK-1, JDK-2
>- N-1: JDK, JDK-1
>- N: JDK
>
> *[CI, JDK's, Upgrades]*
>
>- CI:
>   - For each branch we run per-commit CI for the latest JDK they
>   support
>   - *TODO: *Periodically we run all CI pipelines for older JDK's
>   per-branch (cadence TBD)
>   - *TODO: *We add basic perf testing across all GA branches with
>   reference workloads (easy-cass-stress workloads?
>   
> 
>   )
>- Upgrades
>   - N-2 -> N-1: tested on JDK and JDK-1
>   - N-2 -> N: tested on JDK
>   - N-1 -> N: tested on JDK
>
>
> ---
> The above has 2 non-trivial CI orchestration investments:
>
>1. Running all CI across all supported JDK on a cadence
>2. Adding some basic perf smoke tests
>
> Both seem reasonable to me.
>
> On Fri, May 23, 2025, at 7:39 AM, Mick Semb Wever wrote:
>
>
>.
>
>
>
> For the rare edge case where we have to stop supporting something entirely
> because it's incompatible with a JDK release (has this happened more than
> the 1 time?) - I think a reasonable fallback is to just not backport new
> JDK support and consider carrying forward the older JDK support until the
> release w/the feature in it is EoL'ed. That'd allow us to continue to run
> in-jvm upgrade dtests between the versions on the older JDK.
>
>
>
> This.
> I think the idea of adding new major JDKs to release branches for a number
> of reasons, in theory at least.  …
>
>
>
> I *like* the idea … :)
>
>
>
>


Re: Accepting AI generated contributions

2025-06-02 Thread Jeremiah Jordan
I don’t think I said we should abdicate responsibility?  I said the key
point is that contributors, and more importantly reviewers and committers,
understand the ASF guidelines and hold all code to those standards. Any
suspect code should be blocked during review. As Roman says in your quote,
this isn’t about AI, it’s about copyright. If someone submits copyrighted
code to the project, whether an AI generated it or they just grabbed it
from a Google search, it’s on the project to try not to accept it.
I don’t think anyone is going to be able to maintain and enforce a list of
acceptable tools for contributors to the project to stick to. We can’t know
what someone did on their laptop; all we can do is evaluate the code they
submit.

-Jeremiah

On Mon, Jun 2, 2025 at 6:39 PM Ariel Weisberg  wrote:

> Hi,
>
> As PMC members/committers we aren't supposed to abdicate this to legal or
> to contributors. Despite the fact that we aren't equipped to solve this
> problem, we are supposed to be making sure that code contributed is
> non-infringing.
>
> This is a quotation from Roman Shaposhnik from this legal thread
> https://lists.apache.org/thread/f6k93xx67pc33o0yhm24j3dpq0323gyd
>
> Yes, because you have to. Again -- forget about AI -- if a drive-by
> contributor submits a patch that has huge amounts of code stolen from some
> existing copyright holder -- it is very much ON YOU as a committer/PMC to
> prevent that from happening.
>
>
> We aren't supposed to knowingly allow people to use AI tools that are
> known to generate infringing contributions or contributions which are not
> license compatible (such as OpenAI terms of use).
>
> Ariel
> On Mon, Jun 2, 2025, at 7:08 PM, Jeremiah Jordan wrote:
>
> > Ultimately it's the contributor's (and committer's) job to ensure that
> their contributions meet the bar for acceptance
> To me this is the key point. Given how pervasive this stuff is becoming, I
> don’t think it’s feasible to make some list of tools and enforce it.  Even
> without getting into extra tools, IDEs (including IntelliJ) are doing more
> and more LLM based code suggestion as time goes on.
> I think we should point people to the ASF Guidelines around such tools,
> and the guidelines around copyrighted code, and then continue to review
> patches with the high standards we have always had in this project.
>
> -Jeremiah
>
> On Mon, Jun 2, 2025 at 5:52 PM Ariel Weisberg  wrote:
>
>
> Hi,
>
> To clarify are you saying that we should not accept AI generated code
> until it has been looked at by a human and then written again with
> different "wording" to ensure that it doesn't directly copy anything?
>
> Or do you mean something else about the quality of "vibe coding" and how
> we shouldn't allow it because it makes bad code? Ultimately it's the
> contributor's (and committer's) job to ensure that their contributions meet
> the bar for acceptance and I don't think we should tell them how to go
> about meeting that bar beyond what is needed to address the copyright
> concern.
>
> I agree that the bar set by the Apache guidelines are pretty high. They
> are simultaneously impossible and trivial to meet depending on how you
> interpret them and we are not very well equipped to interpret them.
>
> It would have been more straightforward for them to simply say no, but
> they didn't opt to do that as if there is some way for PMCs to acceptably
> take AI generated contributions.
>
> Ariel
>
> On Mon, Jun 2, 2025, at 5:03 PM, David Capwell wrote:
>
> fine tuning encourage not reproducing things verbatim
>
> I think not producing copyrighted output from your training data is a
> technically feasible achievement for these vendors so I have a moderate
> level of trust they will succeed at it if they say they do it.
>
>
> Some team members and I discussed this in the context of my documentation
> patch (which utilized Claude during composition). I conducted an experiment
> to pose high-level Cassandra-related questions to a model without
> additional context, while adjusting the temperature parameter (tested at
> 0.2, 0.5, and 0.8). The results revealed that each test generated content
> copied verbatim from a specific non-Apache (and non-DSE) website. I did not
> verify whether this content was copyrighted, though it was easily
> identifiable through a simple Google search. This occurred as a single
> sentence within the generated document, and as I am not a legal expert, I
> cannot determine whether this constitutes a significant issue.
>
> The complexity increases when considering models trained on different
> languages, which may translate content into English. In such cases, a
> Google search would fail to detect the origin. Is this still considered
> plagiarism? Does it violate copyright laws? I am uncertain.
>
> Similar challenges arise with code generation. For instance, if a model is
> trained on a GPL-licensed Python library that implements a novel data
> structure, and the model subsequently rewrites this structure in Java, a
> Google search is unlikely to identify the source.

Re: Accepting AI generated contributions

2025-06-02 Thread Ariel Weisberg
Hi,

As PMC members/committers we aren't supposed to abdicate this to legal or to 
contributors. Despite the fact that we aren't equipped to solve this problem we 
are supposed to be making sure that code contributed is non-infringing.

This is a quotation from Roman Shaposhnik from this legal thread 
https://lists.apache.org/thread/f6k93xx67pc33o0yhm24j3dpq0323gyd

> Yes, because you have to. Again -- forget about AI -- if a drive-by 
> contributor submits a patch that has huge amounts of code stolen from some 
> existing copyright holder -- it is very much ON YOU as a committer/PMC to 
> prevent that from happening.

We aren't supposed to knowingly allow people to use AI tools that are known to 
generate infringing contributions or contributions which are not license 
compatible (such as OpenAI terms of use).

Ariel
On Mon, Jun 2, 2025, at 7:08 PM, Jeremiah Jordan wrote:
> > Ultimately it's the contributor's (and committer's) job to ensure that 
> > their contributions meet the bar for acceptance
> To me this is the key point. Given how pervasive this stuff is becoming, I 
> don’t think it’s feasible to make some list of tools and enforce it.  Even 
> without getting into extra tools, IDEs (including IntelliJ) are doing more 
> and more LLM based code suggestion as time goes on.
> I think we should point people to the ASF Guidelines around such tools, and 
> the guidelines around copyrighted code, and then continue to review patches 
> with the high standards we have always had in this project.
> 
> -Jeremiah
> 
> On Mon, Jun 2, 2025 at 5:52 PM Ariel Weisberg  wrote:
>> __
>> Hi,
>> 
>> To clarify are you saying that we should not accept AI generated code until 
>> it has been looked at by a human and then written again with different 
>> "wording" to ensure that it doesn't directly copy anything?
>> 
>> Or do you mean something else about the quality of "vibe coding" and how we 
>> shouldn't allow it because it makes bad code? Ultimately it's the 
>> contributor's (and committer's) job to ensure that their contributions meet 
>> the bar for acceptance and I don't think we should tell them how to go about 
>> meeting that bar beyond what is needed to address the copyright concern.
>> 
>> I agree that the bar set by the Apache guidelines are pretty high. They are 
>> simultaneously impossible and trivial to meet depending on how you interpret 
>> them and we are not very well equipped to interpret them.
>> 
>> It would have been more straightforward for them to simply say no, but they 
>> didn't opt to do that as if there is some way for PMCs to acceptably take AI 
>> generated contributions.
>> 
>> Ariel
>> 
>> On Mon, Jun 2, 2025, at 5:03 PM, David Capwell wrote:
 fine tuning encourage not reproducing things verbatim. I think not producing 
 copyrighted output from your training data is a technically feasible 
 achievement for these vendors so I have a moderate level of trust they 
 will succeed at it if they say they do it.
>>> 
>>> Some team members and I discussed this in the context of my documentation 
>>> patch (which utilized Claude during composition). I conducted an experiment 
>>> to pose high-level Cassandra-related questions to a model without 
>>> additional context, while adjusting the temperature parameter (tested at 
>>> 0.2, 0.5, and 0.8). The results revealed that each test generated content 
>>> copied verbatim from a specific non-Apache (and non-DSE) website. I did not 
>>> verify whether this content was copyrighted, though it was easily 
>>> identifiable through a simple Google search. This occurred as a single 
>>> sentence within the generated document, and as I am not a legal expert, I 
>>> cannot determine whether this constitutes a significant issue.
>>> 
>>> The complexity increases when considering models trained on different 
>>> languages, which may translate content into English. In such cases, a 
>>> Google search would fail to detect the origin. Is this still considered 
>>> plagiarism? Does it violate copyright laws? I am uncertain.
>>> 
>>> Similar challenges arise with code generation. For instance, if a model is 
>>> trained on a GPL-licensed Python library that implements a novel data 
>>> structure, and the model subsequently rewrites this structure in Java, a 
>>> Google search is unlikely to identify the source.
>>> 
>>> Personally, I do not assume these models will avoid producing copyrighted 
>>> material. This doesn’t mean I am against AI at all, but rather reflects my 
>>> belief that the requirements set by Apache are not easily “provable” in 
>>> such scenarios.
>>> 
>>> 
 My personal opinion is that we should at least consider allow listing a 
 few specific sources (any vendor that scans output for infringement) and 
 add that to the PR template and in other locations (readme, web site). 
 Bonus points if we can set up code scanning (useful for non-AI 
 contributions!).
>>> 
>>> My perspective, after trying to see what AI

Re: Accepting AI generated contributions

2025-06-02 Thread David Capwell
> fine tuning encourage not reproducing things verbatim
> I think not producing copyrighted output from your training data is a 
> technically feasible achievement for these vendors so I have a moderate level 
> of trust they will succeed at it if they say they do it.

Some team members and I discussed this in the context of my documentation patch 
(which utilized Claude during composition). I conducted an experiment to pose 
high-level Cassandra-related questions to a model without additional context, 
while adjusting the temperature parameter (tested at 0.2, 0.5, and 0.8). The 
results revealed that each test generated content copied verbatim from a 
specific non-Apache (and non-DSE) website. I did not verify whether this 
content was copyrighted, though it was easily identifiable through a simple 
Google search. This occurred as a single sentence within the generated 
document, and as I am not a legal expert, I cannot determine whether this 
constitutes a significant issue.
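For what it's worth, the kind of verbatim reuse this experiment surfaced can be caught mechanically with a simple n-gram overlap scan. A minimal sketch (purely illustrative — not the tooling used in the experiment, and real scanning would compare against a corpus of known sources, not one reference string):

```python
def ngram_overlap(generated: str, reference: str, n: int = 8) -> float:
    """Fraction of n-word shingles in `generated` that also appear verbatim in `reference`."""
    def shingles(text):
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    gen = shingles(generated)
    if not gen:
        return 0.0
    return len(gen & shingles(reference)) / len(gen)

# A long run of shared words is a strong copying signal; short overlaps are noise.
copied = "the quick brown fox jumps over the lazy dog every single morning"
assert ngram_overlap(copied, copied) == 1.0
assert ngram_overlap(copied, "an unrelated sentence about cassandra compaction") == 0.0
```

A translated or paraphrased borrowing, as noted above, would sail right past a check like this — which is exactly why the "provability" concern remains.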

The complexity increases when considering models trained on different 
languages, which may translate content into English. In such cases, a Google 
search would fail to detect the origin. Is this still considered plagiarism? 
Does it violate copyright laws? I am uncertain.

Similar challenges arise with code generation. For instance, if a model is 
trained on a GPL-licensed Python library that implements a novel data 
structure, and the model subsequently rewrites this structure in Java, a Google 
search is unlikely to identify the source.

Personally, I do not assume these models will avoid producing copyrighted 
material. This doesn’t mean I am against AI at all, but rather reflects my 
belief that the requirements set by Apache are not easily “provable” in such 
scenarios.


> My personal opinion is that we should at least consider allow listing a few 
> specific sources (any vendor that scans output for infringement) and add that 
> to the PR template and in other locations (readme, web site). Bonus points if 
> we can set up code scanning (useful for non-AI contributions!).


My perspective, after trying to see what AI can do is the following:

Strengths
* Generating a preliminary draft of a document and assisting with iterative 
revisions
* Documenting individual methods
* Generation of “simple” methods and scripts, provided the underlying libraries 
are well-documented in public repositories
* Managing repetitive or procedural tasks, such as “migrating from X to Y” or 
“converting serializations to the X interface”

Limitations
* Producing a fully functional document in a single attempt that meets merge 
standards. When documenting Gens.java and Property.java, the output appeared 
plausible but contained frequent inaccuracies.
* Addressing complex or ambiguous scenarios (“gossip”), though this challenge 
is not unique to AI—Matt Byrd and I tested Claude for CASSANDRA-20659, where it 
could identify relevant code but proposed solutions that risked corrupting 
production clusters.
* Interpreting large-scale codebases. Beyond approximately 300 lines of actual 
code (excluding formatting), performance degrades significantly, leading to a 
marked decline in output quality.

Note: When referring to AI/LLMs, I am not discussing interactions with a user 
interface to execute specific tasks, but rather leveraging code agents like Roo 
and Aider to provide contextual information to the LLM.

Given these observations, it remains challenging to determine optimal 
practices. In some contexts it is very clear that nothing was taken from 
external work (e.g., “create a test using our BTree class that inserts a row 
with a null column,” “analyze this function’s purpose”). However, for 
substantial tasks, the situation becomes more complex. If the author employed 
AI as a collaborative tool during “pair programming,” the concerns are not 
really different from those raised by Google searches (unless the work involves 
unique elements like introducing new data structures or indexes). Conversely, 
if the author “vibe coded” the entire patch, two primary concerns arise: 
whether the author has the rights to the code, and whether its quality aligns 
with requirements.


TL;DR - I am not against AI contributions, but strongly prefer it be done as 
“pair programming”.  My experience with “vibe coding” makes me worry about the 
quality of the code, and that the author is less likely to validate that the 
code generated is safe to donate.

This email was generated with the help of AI =)


> On May 30, 2025, at 3:00 PM, Ariel Weisberg  wrote:
> 
> Hi all,
> 
> It looks like we haven't discussed this much and haven't settled on a policy 
> for what kinds of AI generated contributions we accept and what vetting is 
> required for them.
> 
> https://www.apache.org/legal/generative-tooling.html#:~:text=Given%20the%20above,code%20scanning%20results.
> 
> ```
> Given the above, code generated in whole or in part using AI can be 
> contributed if the contr

Re: [DISCUSS] How we handle JDK support

2025-06-02 Thread Josh McKenzie
I originally had "everyone supports highest language level whee" which of 
course would fail to build on older branches.

So this new paradigm would give us the following branch:language level support 
(assuming JDK bump on each release which also won't always happen):
- trunk: latest
- trunk-1: latest lang - 1
- trunk-2: latest lang - 2

So while trunk-1 and trunk-2 would both *support* the newest JDK (wherever 
possible) for runtime, they wouldn't be switched to the new language level. 
That'd leave us able to use the newest language features on trunk much more 
rapidly while *effectively snapshotting the supported language on older 
branches to the lowest JDK they support* (which, when they're last in line and 
about to fall off, is the JDK that was newest at the time they came about).
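The branch-to-JDK mapping above can be sketched as a couple of tiny functions (branch names and LTS versions here are hypothetical placeholders, not a proposal):

```python
def supported_jdks(branch, branches_oldest_first, lts_adopted):
    """A branch supports its own LTS plus every LTS adopted by newer branches,
    since new LTS support is merged to all supported branches at once."""
    i = branches_oldest_first.index(branch)
    return [lts_adopted[b] for b in branches_oldest_first[i:]]

def language_level(branch, lts_adopted):
    """The language level stays pinned to the LTS the branch was cut with."""
    return lts_adopted[branch]

# Hypothetical release train and LTS adoption, purely for illustration.
branches = ["N-2", "N-1", "trunk"]
lts = {"N-2": 11, "N-1": 17, "trunk": 21}

assert supported_jdks("N-2", branches, lts) == [11, 17, 21]  # runs on everything
assert supported_jdks("trunk", branches, lts) == [21]        # latest only
assert language_level("N-2", lts) == 11                      # language level frozen
```

The asymmetry is the point: runtime support grows on old branches as new LTS integrations land on trunk, while language level never moves after the branch is cut.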

Our risk would be patches going to trunk targeting new language features we 
then found out we needed to back-port would require some massaging to be 
compatible with older branches. I suspect that'll be a rare edge-case so seems 
ok?

Unless I'm completely missing something. I was the one who originally just 
wanted to "latest JDK All The Things" for a hot minute there. =/
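One way to read the upgrade matrix in the quoted proposal below: the JDKs an upgrade path is tested on are just the intersection of the two branches' supported sets. A minimal sketch, with hypothetical branch/LTS names:

```python
def upgrade_test_jdks(src, dst, branches_oldest_first, lts_adopted):
    """An in-JVM upgrade test can only run on JDKs both branches support."""
    def supported(branch):
        i = branches_oldest_first.index(branch)
        return {lts_adopted[b] for b in branches_oldest_first[i:]}
    return sorted(supported(src) & supported(dst))

# Hypothetical branches and LTS adoption, mirroring the quoted matrix.
branches = ["N-2", "N-1", "N"]
lts = {"N-2": 11, "N-1": 17, "N": 21}

assert upgrade_test_jdks("N-2", "N-1", branches, lts) == [17, 21]  # JDK and JDK-1
assert upgrade_test_jdks("N-2", "N", branches, lts) == [21]        # JDK only
assert upgrade_test_jdks("N-1", "N", branches, lts) == [21]        # JDK only
```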

On Mon, Jun 2, 2025, at 9:40 AM, Doug Rohrer wrote:
> Only thing I’d suggest changing here is “Trunk targets the language level of 
> that JDK” shouldn’t happen until after we’ve confirmed the back port of the 
> new JDK LTS changes to previous versions - otherwise, you have folks starting 
> to use new language features and then have to rip them all out when you find 
> that some previous supported Cassandra release can’t use that JDK.
> 
> Doug
> 
>> On May 27, 2025, at 10:37 AM, Josh McKenzie  wrote:
>> 
>> revised snapshot of the state of conversation here:
>> 
>> *[New LTS JDK Adoption]*
>>  • Trunk supports 1 JDK at a time
>>  • After a branch is cut for a release, we push to get trunk to support 
>> latest LTS JDK version available at that time
>>  • Trunk targets the language level of that JDK
>>  • CI on trunk is that single JDK only
>>  • We merge new JDK LTS support to all supported branches at the same time 
>> as trunk
>>• In the very rare case a feature would have to be removed due to JDK 
>> change (think UDF's scripting engine), we instead keep the maximum allowable 
>> JDK for that feature supported on trunk and subsequent releases. We then 
>> drop that JDK across all branches once the oldest C* w/that feature ages out 
>> of support.
>>  • Otherwise, we don't need to worry about dropping JDK support as that will 
>> happen naturally w/the dropping of support for a branch. Branches will 
>> slowly gain JDK support w/each subsequent trunk-based LTS integration.
>> *[Branch JDK Support]*
>>  • N-2: JDK, JDK-1, JDK-2
>>  • N-1: JDK, JDK-1
>>  • N: JDK
>> *[CI, JDK's, Upgrades]*
>>  • CI:
>>• For each branch we run per-commit CI for the latest JDK they support
>>• *TODO: *Periodically we run all CI pipelines for older JDK's per-branch 
>> (cadence TBD)
>>• *TODO: *We add basic perf testing across all GA branches with reference 
>> workloads (easy-cass-stress workloads?)
>>  • Upgrades
>>• N-2 -> N-1: tested on JDK and JDK-1
>>• N-2 -> N: tested on JDK
>>• N-1 -> N: tested on JDK
>> 
>> ---
>> The above has 2 non-trivial CI orchestration investments:
>>  1. Running all CI across all supported JDK on a cadence
>>  2. Adding some basic perf smoke tests
>> Both seem reasonable to me.
>> 
>> On Fri, May 23, 2025, at 7:39 AM, Mick Semb Wever wrote:
>>>  
> For the rare edge case where we have to stop supporting something 
> entirely because it's incompatible with a JDK release (has this happened 
> more than the 1 time?) - I think a reasonable fallback is to just not 
> backport new JDK support and consider carrying forward the older JDK 
> support until the release w/the feature in it is EoL'ed. That'd allow us 
> to continue to run in-jvm upgrade dtests between the versions on the 
> older JDK.
> 
 
 
 This.
 I think the idea of adding new major JDKs to release branches for a number 
 of reasons, in theory at least.  …
>>> 
>>> 
>>> I *like* the idea … :) 
>>