[ANNOUNCE] Apache Cassandra Analytics 0.1.0 test artifact available

2025-06-24 Thread Bernardo Botella
The test build of Cassandra Analytics 0.1.0 is available.

sha1: 9c948eab9356f5d166c26bb7a155b99ee0a8f9db
Git: https://github.com/apache/cassandra-analytics/tree/0.1.0-tentative
Maven Artifacts:
https://repository.apache.org/content/repositories/orgapachecassandra-1402/org/apache/cassandra/spark/analytics-cassandra-analytics-cdc-codec_spark3_2.12/0.1.0/
https://repository.apache.org/content/repositories/orgapachecassandra-1402/org/apache/cassandra/spark/analytics-cassandra-analytics-cdc-sidecar_spark3_2.12/0.1.0/
https://repository.apache.org/content/repositories/orgapachecassandra-1402/org/apache/cassandra/spark/analytics-cassandra-analytics-cdc_spark3_2.12/0.1.0/
https://repository.apache.org/content/repositories/orgapachecassandra-1402/org/apache/cassandra/spark/analytics-cassandra-analytics-common_spark3_2.12/0.1.0/
https://repository.apache.org/content/repositories/orgapachecassandra-1402/org/apache/cassandra/spark/analytics-cassandra-analytics-core-example_spark3_2.12/0.1.0/
https://repository.apache.org/content/repositories/orgapachecassandra-1402/org/apache/cassandra/spark/analytics-cassandra-analytics-core_spark3_2.12/0.1.0/
https://repository.apache.org/content/repositories/orgapachecassandra-1402/org/apache/cassandra/spark/analytics-cassandra-analytics-spark-converter_spark3_2.12/0.1.0/
https://repository.apache.org/content/repositories/orgapachecassandra-1402/org/apache/cassandra/spark/analytics-cassandra-bridge_spark3_2.12/0.1.0/
https://repository.apache.org/content/repositories/orgapachecassandra-1402/org/apache/cassandra/spark/analytics-cassandra-analytics-cdc-codec_spark3_2.13/0.1.0/
https://repository.apache.org/content/repositories/orgapachecassandra-1402/org/apache/cassandra/spark/analytics-cassandra-analytics-cdc-sidecar_spark3_2.13/0.1.0/
https://repository.apache.org/content/repositories/orgapachecassandra-1402/org/apache/cassandra/spark/analytics-cassandra-analytics-cdc_spark3_2.13/0.1.0/
https://repository.apache.org/content/repositories/orgapachecassandra-1402/org/apache/cassandra/spark/analytics-cassandra-analytics-common_spark3_2.13/0.1.0/
https://repository.apache.org/content/repositories/orgapachecassandra-1402/org/apache/cassandra/spark/analytics-cassandra-analytics-core-example_spark3_2.13/0.1.0/
https://repository.apache.org/content/repositories/orgapachecassandra-1402/org/apache/cassandra/spark/analytics-cassandra-analytics-core_spark3_2.13/0.1.0/
https://repository.apache.org/content/repositories/orgapachecassandra-1402/org/apache/cassandra/spark/analytics-cassandra-analytics-spark-converter_spark3_2.13/0.1.0/
https://repository.apache.org/content/repositories/orgapachecassandra-1402/org/apache/cassandra/spark/analytics-cassandra-bridge_spark3_2.13/0.1.0/

Re: Accepting AI generated contributions

2025-06-24 Thread David Capwell
> It's not clear from that thread precisely what they are objecting to and
> whether it has changed (another challenge!)

That thread was last updated in 2023 and the current stance is just "tell
people which one you used, and make sure the output follows the 3 main
points".

> We can make a best effort to just vet the ones that people actually want
> to widely use and refuse everything else and be better off than allowing
> people to use tools that are known not to be license compatible or make
> little/no effort to avoid reproducing large amounts of copyrighted code.

How often are we going to "vet" new tools?  These change very often and
it's a constant moving target.  Are we going to expect someone to do this
vetting, give the pros/cons of what has changed since the last vote, then
revote every 6 months?  What does "vet" even mean?

> allowing people to use tools that are known not to be license compatible

Which tools are you referring to?  The major providers all document that
the output is owned by the entity that requested it.

> make little/no effort to avoid reproducing large amounts of copyrighted
> code.

How do you go about qualifying that?  Which tools / services are you
referring to?  How do you go about evaluating them?

> If someone submits copyrighted code to the project, whether an AI
> generated it or they just grabbed it from a Google search, it’s on the
> project to try not to accept it.

I am in this camp at the moment, AI vs Human has the same problem for the
reviewer; we are supposed to be doing this, and blocking AI or putting new
rules around AI doesn't really change anything, we are still supposed to do
this work.

> What would you want?

My vote would be on 2/3 given the list from Ariel.  But I am personally in
the stance that disclosure (which is the ASF policy) is best for the time
being; nothing in this thread has motivated me to change the current policy.

On Mon, Jun 16, 2025 at 4:21 PM Patrick McFadin  wrote:

> I'm on with the allow list(1) or option 2.  3 just isn't realistic
> anymore.
>
> Patrick
>
>
>
> On Mon, Jun 16, 2025 at 3:09 PM Caleb Rackliffe 
> wrote:
>
>> I haven't participated much here, but my vote would be basically #1, i.e.
>> an "allow list" with a clear procedure for expansion.
>>
>> On Mon, Jun 16, 2025 at 4:05 PM Ariel Weisberg  wrote:
>>
>>> Hi,
>>>
>>> We could, but if the allow list is binding then it's still an allow list
>>> with some guidance on how to expand the allow list.
>>>
>>> If it isn't binding then it's guidance so still option 2 really.
>>>
>>> I think the key distinction to find some early consensus on is if we do a
>>> binding allow list or guidance, and then we can iron out the guidance, but
>>> I think that will be less controversial to work out.
>>>
>>> Or option 3 which is not accepting AI generated contributions. I think
>>> there are some with healthy skepticism of AI generated code, but so far I
>>> haven't met anyone who wants to forbid it entirely.
>>>
>>> Ariel
>>>
>>> On Mon, Jun 16, 2025, at 4:54 PM, Josh McKenzie wrote:
>>>
>>> Couldn't our official stance be a combination of 1 and 2? i.e. "Here's
>>> an allow list. If you're using something not on that allow list, here's
>>> some basic guidance and maybe let us know how you tried to mitigate some of
>>> this risk so we can update our allow list w/some nuance".
>>>
>>> On Mon, Jun 16, 2025, at 4:39 PM, Ariel Weisberg wrote:
>>>
>>> Hi,
>>>
>>> On Wed, Jun 11, 2025, at 3:48 PM, Jeremiah Jordan wrote:
>>>
>>> Where are you getting this from?  From the OpenAI terms of use:
>>> https://openai.com/policies/terms-of-use/
>>>
>>> Direct from the ASF legal mailing list discussion I linked to in my
>>> original email calling this out
>>> https://lists.apache.org/thread/lcvxnpf39v22lc3f9t5fo07p19237d16 It's
>>> not clear from that thread precisely what they are objecting to and whether
>>> it has changed (another challenge!), but I believe it's restrictions on
>>> what you are allowed to do with the output of OpenAI models. And if you get
>>> the output via other services, it's under a different license and it's fine!
>>>
>>> Already we are demonstrating that it is not trivial to understand what is
>>> and isn't allowed.
>>>
>>> On Wed, Jun 11, 2025, at 3:48 PM, Jeremiah Jordan wrote:
>>>
>>> I still maintain that trying to publish an exhaustive list of acceptable
>>> tools does not seem reasonable.
>>> But I agree that giving people guidance is possible.  Maybe having a
>>> statement in the contribution guidelines along the lines of:
>>>
>>> The list doesn't need to be exhaustive. We are not required to accept AI
>>> generated code at all!
>>>
>>> We can make a best effort to just vet the ones that people actually want
>>> to widely use and refuse everything else and be better off than allowing
>>> people to use tools that are known not to be license compatible or make
>>> little/no effort to avoid reproducing large amounts of copyrighted code.
>>>
>>> “Make sure your tools do X, here are some that

Re: Accepting AI generated contributions

2025-06-24 Thread Josh McKenzie
> These change very often and it's a constant moving target.
This is not hyperbole. This area is moving faster than anything I've seen 
before.

> I am in this camp at the moment, AI vs Human has the same problem for the 
> reviewer; we are supposed to be doing this, and blocking AI or putting new 
> rules around AI doesn't really change anything, we are still supposed to do 
> this work.  
+1. 

> I am personally in the stance that disclosure (which is the ASF policy) is 
> best for the time being; nothing in this thread has motivated me to change 
> the current policy.
Yep. Option 2 - guidance and disclosure makes the most sense to me after 
reading this thread.

On Tue, Jun 24, 2025, at 5:09 PM, David Capwell wrote:
> > It's not clear from that thread precisely what they are objecting to and 
> > whether it has changed (another challenge!)
> 
> That thread was last updated in 2023 and the current stance is just "tell 
> people which one you used, and make sure the output follows the 3 main 
> points".  
> 
> > We can make a best effort to just vet the ones that people actually want to 
> > widely use and refuse everything else and be better off than allowing 
> > people to use tools that are known not to be license compatible or make 
> > little/no effort to avoid reproducing large amounts of copyrighted code.
> 
> How often are we going to "vet" new tools?  These change very often and it's 
> a constant moving target.  Are we going to expect someone to do this vetting, 
> give the pros/cons of what has changed since the last vote, then revote every 
> 6 months?  What does "vet" even mean?  
> 
> > allowing people to use tools that are known not to be license compatible
> 
> Which tools are you referring to?  The major providers all document that the 
> output is owned by the entity that requested it.  
> 
> > make little/no effort to avoid reproducing large amounts of copyrighted 
> > code.
> 
> How do you go about qualifying that?  Which tools / services are you 
> referring to?  How do you go about evaluating them?
> 
> > If someone submits copyrighted code to the project, whether an AI generated 
> > it or they just grabbed it from a Google search, it’s on the project to try 
> > not to accept it.
> 
> I am in this camp at the moment, AI vs Human has the same problem for the 
> reviewer; we are supposed to be doing this, and blocking AI or putting new 
> rules around AI doesn't really change anything, we are still supposed to do 
> this work.  
> 
> > What would you want?
> 
> My vote would be on 2/3 given the list from Ariel.  But I am personally in 
> the stance that disclosure (which is the ASF policy) is best for the time 
> being; nothing in this thread has motivated me to change the current policy.
> 
> On Mon, Jun 16, 2025 at 4:21 PM Patrick McFadin  wrote:
>> I'm on with the allow list(1) or option 2.  3 just isn't realistic anymore. 
>> 
>> Patrick
>> 
>> 
>> 
>> On Mon, Jun 16, 2025 at 3:09 PM Caleb Rackliffe  
>> wrote:
>>> I haven't participated much here, but my vote would be basically #1, i.e. 
>>> an "allow list" with a clear procedure for expansion.
>>> 
>>> On Mon, Jun 16, 2025 at 4:05 PM Ariel Weisberg  wrote:
 Hi,
 
 We could, but if the allow list is binding then it's still an allow list 
 with some guidance on how to expand the allow list.
 
 If it isn't binding then it's guidance so still option 2 really.
 
 I think the key distinction to find some early consensus on is if we do a 
 binding allow list or guidance, and then we can iron out the guidance, but 
 I think that will be less controversial to work out.
 
 Or option 3 which is not accepting AI generated contributions. I think 
 there are some with healthy skepticism of AI generated code, but so far I 
 haven't met anyone who wants to forbid it entirely.
 
 Ariel
 
 On Mon, Jun 16, 2025, at 4:54 PM, Josh McKenzie wrote:
> Couldn't our official stance be a combination of 1 and 2? i.e. "Here's an 
> allow list. If you're using something not on that allow list, here's some 
> basic guidance and maybe let us know how you tried to mitigate some of 
> this risk so we can update our allow list w/some nuance".
> 
> On Mon, Jun 16, 2025, at 4:39 PM, Ariel Weisberg wrote:
>> Hi,
>> 
>> On Wed, Jun 11, 2025, at 3:48 PM, Jeremiah Jordan wrote:
>>> Where are you getting this from?  From the OpenAI terms of use: 
>>> https://openai.com/policies/terms-of-use/
>> Direct from the ASF legal mailing list discussion I linked to in my 
>> original email calling this out 
>> https://lists.apache.org/thread/lcvxnpf39v22lc3f9t5fo07p19237d16 It's 
>> not clear from that thread precisely what they are objecting to and 
>> whether it has changed (another challenge!), but I believe it's 
>> restrictions on what you are allowed to do with the output of OpenAI 
>> models. And if you get the output via oth

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-24 Thread Blake Eggleston
Those are both fair points. Once you start dealing with data loss though, 
maintaining guarantees is often impossible, so I’m not sure that torn writes or 
updated timestamps are dealbreakers, but I’m fine with tabling option 2 for now 
and seeing if we can figure out something better.

Regarding the assassin cells, if you wanted to avoid explicitly agreeing on a 
value, you might be able to only issue them for repaired base data, which has 
been implicitly agreed upon.

I think that or something like it is worth exploring. The idea would be to 
solve this issue as completely as anti-compaction would - but without having to 
rewrite sstables. I’d be interested to hear any ideas you have about how that 
might work.

You basically need a mechanism to erase some piece of data that was written 
before a given wall clock time - regardless of cell timestamp, and without 
precluding future updates (in wall clock time) with earlier timestamps.
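
To make that concrete for discussion, here is a minimal standalone sketch of the 
reconciliation rule such a marker would need (hypothetical types, not Cassandra's 
actual Cell or tombstone classes; it also assumes a replica can recover the wall 
clock time at which a piece of data was written, e.g. from sstable metadata, 
which is itself part of what would need designing):

    // Hypothetical sketch only; these are not Cassandra's real Cell or tombstone types.
    public class AssassinCellSketch {

        /** A cell carries its logical timestamp plus the wall clock time it was written. */
        record Cell(long timestamp, long localWriteTimeMillis, String value) {}

        /** Erases data written (wall clock) before the cutoff, regardless of cell timestamp. */
        record Assassin(long wallClockCutoffMillis) {
            boolean shadows(Cell c) {
                return c.localWriteTimeMillis() < wallClockCutoffMillis;
            }
        }

        /** Returns the surviving cell, or null if the assassin erased it. */
        static Cell reconcile(Cell c, Assassin a) {
            return a.shadows(c) ? null : c;
        }

        public static void main(String[] args) {
            Assassin repair = new Assassin(2_000L); // wall clock cutoff chosen at repair time

            // Orphaned view data: written before the cutoff, high cell timestamp -> erased.
            Cell orphaned = new Cell(9_999L, 1_500L, "extra view cell");
            // Later base update: written after the cutoff, lower cell timestamp -> survives.
            Cell later = new Cell(1_000L, 2_500L, "new base write");

            System.out.println(reconcile(orphaned, repair)); // null
            System.out.println(reconcile(later, repair));    // Cell[timestamp=1000, ...]
        }
    }

The parts this glosses over - how the wall clock write time and the markers 
themselves get stored, streamed, and eventually purged without rewriting 
sstables - are exactly where the comparison with anti-compaction comes in.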

On Mon, Jun 23, 2025, at 4:28 PM, Runtian Liu wrote:
> In the second option, we use the repair timestamp to re-update any cell or 
> row we want to fix in the base table. This approach is problematic because it 
> alters the write time of user-supplied data. Although Cassandra does not 
> allow users to set timestamps for LWT writes, users may still rely on the 
> update time. A key limitation of this approach is that it cannot fix cases 
> where a view cell ends up in a future state while the base table remains 
> correct. I now understand your point that Cassandra cannot handle this 
> scenario today. However, as I mentioned earlier, the important distinction is 
> that when this issue occurs in the base table, we accept the "incorrect" data 
> as valid—but this is not acceptable for materialized views, since the source 
> of truth (the base table) still holds the correct data.
> 
> On Mon, Jun 23, 2025 at 12:05 PM Blake Eggleston  wrote:
>> > Sorry, Blake—I was traveling last week and couldn’t reply to your email 
>> > sooner.
>> 
>> No worries, I’ll be taking some time off soon as well.
>> 
>> > I don’t think the first or second option is ideal. We should treat the 
>> > base table as the source of truth. Modifying it—or forcing an update on 
>> > it, even if it’s just a timestamp change—is not a good approach and won’t 
>> > solve all problems.
>> 
>> I agree the first option probably isn’t the right way to go. Could you say a 
>> bit more about why the second option is not a good approach and which 
>> problems it won’t solve?
>> 
>> On Sun, Jun 22, 2025, at 6:09 PM, Runtian Liu wrote:
>>> Sorry, Blake—I was traveling last week and couldn’t reply to your email 
>>> sooner.
>>> 
>>> > First - we interpret view data with higher timestamps than the base table 
>>> > as data that’s missing from the base and replicate it into the base 
>>> > table. The timestamp of the missing data may be below the paxos timestamp 
>>> > low bound so we’d have to adjust the paxos coordination logic to allow 
>>> > that in this case. Depending on how the view got this way it may also 
>>> > tear writes to the base table, breaking the write atomicity promise.
>>> 
>>> As discussed earlier, we want this MV repair mechanism to handle all edge 
>>> cases. However, it would be difficult to design it in a way that detects 
>>> the root cause of each mismatch and repairs it accordingly. Additionally, 
>>> as you mentioned, this approach could introduce other issues, such as 
>>> violating the write atomicity guarantee.
>>> 
>>> > Second - If this happens it means that we’ve either lost base table data 
>>> > or paxos metadata. If that happened, we could force a base table update 
>>> > that rewrites the current base state with new timestamps making the extra 
>>> > view data removable. However this wouldn’t fix the case where the view 
>>> > cell has a timestamp from the future - although that’s not a case that C* 
>>> > can fix today either.
>>> 
>>> I don’t think the first or second option is ideal. We should treat the base 
>>> table as the source of truth. Modifying it—or forcing an update on it, even 
>>> if it’s just a timestamp change—is not a good approach and won’t solve all 
>>> problems.
>>> 
>>> > the idea to use anti-compaction makes a lot more sense now (in principle 
>>> > - I don’t think it’s workable in practice)
>>> 
>>> I have one question regarding anti-compaction. Is the main concern that 
>>> processing too much data during anti-compaction could cause issues for the 
>>> cluster? 
>>> 
>>> > I guess you could add some sort of assassin cell that is meant to remove 
>>> > a cell with a specific timestamp and value, but is otherwise invisible. 
>>> 
>>> The idea of the assassination cell is interesting. To prevent data from 
>>> being incorrectly removed during the repair process, we need to ensure a 
>>> quorum of nodes is available and agrees on the same value before repairing 
>>> a materialized view (MV) row or cell. However, this could be very 
>>> expensive, as it re

Re: Accepting AI generated contributions

2025-06-24 Thread David Capwell
Spoke with Ariel in slack.

https://lists.apache.org/thread/lcvxnpf39v22lc3f9t5fo07p19237d16


Not sure if I am missing more statements, but the one I can find was that the 
OpenAI terms of use contain the following words: "Use Output to develop models 
that compete with OpenAI” (see https://openai.com/policies/terms-of-use/). That 
thread then updated https://www.apache.org/legal/generative-tooling.html with 
the wording "The terms and conditions of the generative AI tool do not place 
any restrictions on use of the output that would be inconsistent with the Open 
Source Definition.” The argument is that that wording is what causes the issue.


I am fine having an exclude list, where we can add tools / services and 
reference exactly what in their terms is in violation; that way changes to the 
terms can trigger removal from the list.  The “why” and how to get removed 
should be very clear and non-debatable.


> On Jun 24, 2025, at 2:19 PM, Josh McKenzie  wrote:
> 
>> These change very often and it's a constant moving target.
> This is not hyperbole. This area is moving faster than anything I've seen 
> before.
> 
>> I am in this camp at the moment, AI vs Human has the same problem for the 
>> reviewer; we are supposed to be doing this, and blocking AI or putting new 
>> rules around AI doesn't really change anything, we are still supposed to do 
>> this work.  
> +1. 
> 
>> I am personally in the stance that disclosure (which is the ASF policy) is 
>> best for the time being; nothing in this thread has motivated me to change 
>> the current policy.
> Yep. Option 2 - guidance and disclosure makes the most sense to me after 
> reading this thread.
> 
> On Tue, Jun 24, 2025, at 5:09 PM, David Capwell wrote:
>> > It's not clear from that thread precisely what they are objecting to and 
>> > whether it has changed (another challenge!)
>> 
>> That thread was last updated in 2023 and the current stance is just "tell 
>> people which one you used, and make sure the output follows the 3 main 
>> points".  
>> 
>> > We can make a best effort to just vet the ones that people actually want 
>> > to widely use and refuse everything else and be better off than allowing 
>> > people to use tools that are known not to be license compatible or make 
>> > little/no effort to avoid reproducing large amounts of copyrighted code.
>> 
>> How often are we going to "vet" new tools?  These change very often and it's 
>> a constant moving target.  Are we going to expect someone to do this 
>> vetting, give the pros/cons of what has changed since the last vote, then 
>> revote every 6 months?  What does "vet" even mean?  
>> 
>> > allowing people to use tools that are known not to be license compatible
>> 
>> Which tools are you referring to?  The major providers all document that the 
>> output is owned by the entity that requested it.  
>> 
>> > make little/no effort to avoid reproducing large amounts of copyrighted 
>> > code.
>> 
>> How do you go about qualifying that?  Which tools / services are you 
>> referring to?  How do you go about evaluating them?
>> 
>> > If someone submits copyrighted code to the project, whether an AI 
>> > generated it or they just grabbed it from a Google search, it’s on the 
>> > project to try not to accept it.
>> 
>> I am in this camp at the moment, AI vs Human has the same problem for the 
>> reviewer; we are supposed to be doing this, and blocking AI or putting new 
>> rules around AI doesn't really change anything, we are still supposed to do 
>> this work.  
>> 
>> > What would you want?
>> 
>> My vote would be on 2/3 given the list from Ariel.  But I am personally in 
>> the stance that disclosure (which is the ASF policy) is best for the time 
>> being; nothing in this thread has motivated me to change the current policy.
>> 
>> On Mon, Jun 16, 2025 at 4:21 PM Patrick McFadin wrote:
>> I'm on with the allow list(1) or option 2.  3 just isn't realistic anymore. 
>> 
>> Patrick
>> 
>> 
>> 
>> On Mon, Jun 16, 2025 at 3:09 PM Caleb Rackliffe wrote:
>> I haven't participated much here, but my vote would be basically #1, i.e. an 
>> "allow list" with a clear procedure for expansion.
>> 
>> On Mon, Jun 16, 2025 at 4:05 PM Ariel Weisberg wrote:
>> 
>> Hi,
>> 
>> We could, but if the allow list is binding then it's still an allow list 
>> with some guidance on how to expand the allow list.
>> 
>> If it isn't binding then it's guidance so still option 2 really.
>> 
>> I think the key distinction to find some early consensus on is if we do a 
>> binding allow list or guidance, and then we can iron out the guidance, but I 
>> think that will be less controversial to work out.
>> 
>> Or option 3 which is not accepting AI generated contributions. I think there 
>> are some with healthy skepticism of AI generated code, but so far I haven't 
>> met anyone who wants to forbid it entirely.
>> 
>> Ariel
>> 
>> On Mon, Jun 16