[DISCUSS] spark-cassandra-connector donation to Analytics subproject

2024-06-24 Thread Mick Semb Wever
What are folks' thoughts on accepting a donation of
the spark-cassandra-connector project into the Analytics subproject?

A number of folks have requested this, stating that they cannot contribute
to the project while it is under DataStax.  The project has largely been in
maintenance mode the past few years.  Under ASF I believe that it will
attract more attention and contributions, and offline discussions I have
had indicate that the spark-cassandra-connector remains an important
complement to the bulk analytics component.


Re: [DISCUSS] spark-cassandra-connector donation to Analytics subproject

2024-06-24 Thread Dinesh Joshi
This would be a great contribution to have for the Analytics subproject.
The current bulk functionality in the Analytics subproject complements the
spark-cassandra-connector so I see it as a good fit for donation.

On Mon, Jun 24, 2024 at 12:32 AM Mick Semb Wever  wrote:

>
> What are folks' thoughts on accepting a donation of
> the spark-cassandra-connector project into the Analytics subproject?
>
> A number of folks have requested this, stating that they cannot contribute
> to the project while it is under DataStax.  The project has largely been in
> maintenance mode the past few years.  Under ASF I believe that it will
> attract more attention and contributions, and offline discussions I have
> had indicate that the spark-cassandra-connector remains an important
> complement to the bulk analytics component.
>


Re: [DISCUSS] spark-cassandra-connector donation to Analytics subproject

2024-06-24 Thread Jon Haddad
I also think it would be a great contribution, especially since the bulk
analytics library can’t be used by the majority of teams, since it’s
hard-coded to only work with single-token clusters.



On Mon, Jun 24, 2024 at 9:51 AM Dinesh Joshi  wrote:

> This would be a great contribution to have for the Analytics subproject.
> The current bulk functionality in the Analytics subproject complements the
> spark-cassandra-connector so I see it as a good fit for donation.
>
> On Mon, Jun 24, 2024 at 12:32 AM Mick Semb Wever  wrote:
>
>>
>> What are folks' thoughts on accepting a donation of
>> the spark-cassandra-connector project into the Analytics subproject?
>>
>> A number of folks have requested this, stating that they cannot
>> contribute to the project while it is under DataStax.  The project has
>> largely been in maintenance mode the past few years.  Under ASF I believe
>> that it will attract more attention and contributions, and offline
>> discussions I have had indicate that the spark-cassandra-connector remains
>> an important complement to the bulk analytics component.
>>
>


Re: [DISCUSS] CEP-42: Constraints Framework

2024-06-24 Thread Bernardo Botella
Thanks for the comments Jordan.

Completely agreed that we will need to be careful not to accept constraints
that require a read before a write. It is called out in the CEP itself, and
will have to be enforced in the future.

After all the feedback and discussion, I think we are ready to move to a voting 
thread for CEP-42. I will be posting the thread today.

Thanks everyone who participated in the discussion!
Bernardo

> On Jun 23, 2024, at 2:38 PM, Jordan West  wrote:
> 
> I am generally for this CEP, particularly the sizeOf guardrail. For example, 
> we recently had an incident caused by a client who wrote outside of the 
> contract we had verbally established. The constraint would have let us encode 
> that contract into the database. In this case, clients are writing large 
> blobs at the application layer and internally the client performs chunking.  
> We had established a chunk size of 64k, for example. However, the application 
> team wanted to use a different programming language than the ones we provide 
> clients for so they wrote their own. The new client had a bug that did not 
> honor the agreed upon chunk size and wrote chunks that were MBs in size. This 
> eventually led to a production incident and the issue was discovered as a 
> result of a bunch of analysis (dumping sstables, etc). Had we had the sizeOf 
> guardrail it would have turned a production incident with hours of 
> investigation into a bug found immediately during development. Could this be 
> done with a node-level guardrail? Likely. But config has the issues described 
> above and it's possible to have two tables with different constraints around 
> similar fields (for example, two different chunk size configs due to data 
> shape). Could it be done at the client layer? Yes that's what we are doing 
> now, but this incident highlights the weakness with that approach (having to 
> implement the contract everywhere and having disjoint features across 
> clients).
>  
> I also think there is benefit to application owners. Encoding constraints in 
> the database ensures continuity as ownership and contributors change and 
> reduces the need for comments or documentation as the means to enforce or 
> share this knowledge. 
> 
> I think enforcing them at write time makes sense. Thinking about it in the 
> scope of compaction for example reminds me of a data loss incident where 
> someone ran a validation in an older version (like 2.0 or 2.1) and a bunch of 
> 4 byte ints were thrown away because the field expected an 8 byte long. 
> 
> My primary concern would be ensuring that we don't implement constraints that 
> require a read before write (not inList comes to mind as an example of one 
> that could imply reading before writing and could confuse a user if it 
> doesn't). 
> 
> Regarding the conflict with existing guardrails, I do think that is tougher. 
> On one hand I find this feature to be more evolved than those guardrails and 
> would be fine to see them be replaced by it. On the other, the guardrails 
> provide sole control to the operator which is nice but adds some complexity 
> that has been rightly called out.  But I don't see that as a reason not to go 
> forward with this feature. We should pick a path and accept the tradeoffs. 
>   
> Jordan
> 
> 
> On Thu, Jun 13, 2024 at 2:39 PM Bernardo Botella 
> <conta...@bernardobotella.com> wrote:
>> Thanks a lot for your comments Abe!
>> 
>> I do agree that the Constraint clause should be as simple as possible. I 
>> will add a note on the CEP along with some specifics about the proposed 
>> constraints (removing the ones that are contentious, and adding them to a 
>> possible future additions section). And yeah, I also think that these 
>> constraints will help different Cassandra operating paradigms (multi-tenant 
>> clusters and diverse workflows).
>> 
>> Besides that, I hope that I’ve addressed all the potential concerns and 
>> feedback on the thread. Let’s allow a bit more time for others to chime in 
>> (any further feedback will be more than welcome), but I’d like to move 
>> forward with a vote soon if no other concerns are pointed out.
>> 
>> All in all, thanks a lot to everyone who participated in the thread and 
>> added to the discussion!
>> Bernardo
>> 
>> 
>> 
>> > On Jun 12, 2024, at 2:37 PM, Abe Ratnofsky wrote:
>> > 
>> > I've thought about this some more. It would be useful for Cassandra to 
>> > support user-defined "guardrails" (or constraints, whatever you want to 
>> > call them), that could be applied per keyspace or table. Whether a user or 
>> > an operator is considered the owner of a table depends on the organization 
>> > deploying Cassandra, so allowing both parties to protect their tables 
>> > against mis-use seems good to me, especially for large multi-tenant 
>> > clusters with diverse workloads.
>> > 
>> > For example, it would be really useful if a user could set the 
>> > Guardrails.{read,write}C

[VOTE] CEP-42: Constraints Framework

2024-06-24 Thread Bernardo Botella
Hi everyone,

I would like to start the voting for CEP-42.

Proposal: 
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-42%3A+Constraints+Framework
Discussion: https://lists.apache.org/thread/xc2phmxgsc7t3y9b23079vbflrhyyywj

The vote will be open for 72 hours. A vote passes if there are at least 3 
binding +1s and no binding vetoes.

Thanks,
Bernardo Botella

Re: [DISCUSS] CEP-42: Constraints Framework

2024-06-24 Thread Doug Rohrer
To your point about Guardrails vs. Constraints, I do think the distinct roles 
of “cluster operator” and “application developer” help show how these two 
frameworks are both valuable. I don’t think I’d expect a cluster operator to be 
involved in every table design decision, but being able to set warning and 
error-level guardrails allows an operator to set absolute limits on what the 
database itself accepts. Table-level constraints allow application developers 
(hopefully in concert with operators, where they are two distinct 
people/groups) to add additional, application-layer constraints that are likely 
to be app specific. To restate what I think you were getting at, your example 
of a production issue caused by the development team missing a key verbal 
agreement probably helps illustrate why both table-level constraints and 
guardrails are valuable. 

Imagine that, as an operator, you are generally comfortable with individual 
values in rows being, say, 256k, but because of the way in which this 
particular use case works, 64k chunks needed to be enforced. Your cluster-level 
guardrails could be set at 256k, but the table-level constraints could have 
enforced this 64k chunk size rule.
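
To make that concrete, a rough sketch of the two layers side by side. I'm going
from memory on the guardrail names in cassandra.yaml, and the CEP's constraint
syntax is still under discussion, so treat this as illustrative rather than
working DDL:

# cassandra.yaml (operator-owned, cluster-wide):
# column_value_size_warn_threshold: 128KiB
# column_value_size_fail_threshold: 256KiB

-- Application-owned table, encoding the 64k chunk contract (CEP-42 sketch):
CREATE TABLE app.chunks (
    object_id text,
    chunk_id int,
    chunk blob,
    PRIMARY KEY (object_id, chunk_id),
    CONSTRAINT sizeOf(chunk) <= 65536
);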

Doug

> On Jun 23, 2024, at 5:38 PM, Jordan West  wrote:
> 
> I am generally for this CEP, particularly the sizeOf guardrail. For example, 
> we recently had an incident caused by a client who wrote outside of the 
> contract we had verbally established. The constraint would have let us encode 
> that contract into the database. In this case, clients are writing large 
> blobs at the application layer and internally the client performs chunking.  
> We had established a chunk size of 64k, for example. However, the application 
> team wanted to use a different programming language than the ones we provide 
> clients for so they wrote their own. The new client had a bug that did not 
> honor the agreed upon chunk size and wrote chunks that were MBs in size. This 
> eventually led to a production incident and the issue was discovered as a 
> result of a bunch of analysis (dumping sstables, etc). Had we had the sizeOf 
> guardrail it would have turned a production incident with hours of 
> investigation into a bug found immediately during development. Could this be 
> done with a node-level guardrail? Likely. But config has the issues described 
> above and it's possible to have two tables with different constraints around 
> similar fields (for example, two different chunk size configs due to data 
> shape). Could it be done at the client layer? Yes that's what we are doing 
> now, but this incident highlights the weakness with that approach (having to 
> implement the contract everywhere and having disjoint features across 
> clients).
>  
> I also think there is benefit to application owners. Encoding constraints in 
> the database ensures continuity as ownership and contributors change and 
> reduces the need for comments or documentation as the means to enforce or 
> share this knowledge. 
> 
> I think enforcing them at write time makes sense. Thinking about it in the 
> scope of compaction for example reminds me of a data loss incident where 
> someone ran a validation in an older version (like 2.0 or 2.1) and a bunch of 
> 4 byte ints were thrown away because the field expected an 8 byte long. 
> 
> My primary concern would be ensuring that we don't implement constraints that 
> require a read before write (not inList comes to mind as an example of one 
> that could imply reading before writing and could confuse a user if it 
> doesn't). 
> 
> Regarding the conflict with existing guardrails, I do think that is tougher. 
> On one hand I find this feature to be more evolved than those guardrails and 
> would be fine to see them be replaced by it. On the other, the guardrails 
> provide sole control to the operator which is nice but adds some complexity 
> that has been rightly called out.  But I don't see that as a reason not to go 
> forward with this feature. We should pick a path and accept the tradeoffs. 
>   
> Jordan
> 
> 
> On Thu, Jun 13, 2024 at 2:39 PM Bernardo Botella 
> <conta...@bernardobotella.com> wrote:
>> Thanks a lot for your comments Abe!
>> 
>> I do agree that the Constraint clause should be as simple as possible. I 
>> will add a note on the CEP along with some specifics about the proposed 
>> constraints (removing the ones that are contentious, and adding them to a 
>> possible future additions section). And yeah, I also think that these 
>> constraints will help different Cassandra operating paradigms (multi-tenant 
>> clusters and diverse workflows).
>> 
>> Besides that, I hope that I’ve addressed all the potential concerns and 
>> feedback on the thread. Let’s allow a bit more time for others to chime in 
>> (any further feedback will be more than welcome), but I’d like to move 
>> forward with a vote soon if no other concerns are pointed out.
>> 
>> All in all, thanks a lot to everyo

Re: [DISCUSS] CEP-42: Constraints Framework

2024-06-24 Thread Jon Haddad
I love where this is going. I have one question, however. I think it would
be more consistent if these were table-level guardrails.  Is there anything
that prevents us from utilizing the same underlying system and terminology
for both the node-level guardrails and the table ones?

If we can avoid duplicate concepts we should.

—
Jon Haddad
Rustyrazorblade Consulting
rustyrazorblade.com


On Mon, Jun 24, 2024 at 4:19 PM Doug Rohrer  wrote:

> To your point about Guardrails vs. Constraints, I do think the distinct
> roles of “cluster operator” and “application developer” help show how these
> two frameworks are both valuable. I don’t think I’d expect a cluster
> operator to be involved in every table design decision, but being able to
> set warning and error-level guardrails allows an operator to set absolute
> limits on what the database itself accepts. Table-level constraints allow
> application developers (hopefully in concert with operators, where they are
> two distinct people/groups) to add *additional*, application-layer
> constraints that are likely to be app specific. To restate what I think you
> were getting at, your example of a production issue caused by the
> development team missing a key verbal agreement probably helps illustrate
> why both table-level constraints *and* guardrails are valuable.
>
> Imagine that, as an operator, you are *generally* comfortable with
> individual values in rows being, say, 256k, but because of the way in which
> this *particular* use case works, 64k chunks needed to be enforced. Your
> cluster-level *guardrails* could be set at 256k, but the table-level
> *constraints* could have enforced this 64k chunk size rule.
>
> Doug
>
> On Jun 23, 2024, at 5:38 PM, Jordan West  wrote:
>
> I am generally for this CEP, particularly the sizeOf guardrail. For
> example, we recently had an incident caused by a client who wrote outside
> of the contract we had verbally established. The constraint would have let
> us encode that contract into the database. In this case, clients are
> writing large blobs at the application layer and internally the client
> performs chunking.  We had established a chunk size of 64k, for example.
> However, the application team wanted to use a different programming
> language than the ones we provide clients for so they wrote their own. The
> new client had a bug that did not honor the agreed upon chunk size and
> wrote chunks that were MBs in size. This eventually led to a production
> incident and the issue was discovered as a result of a bunch of analysis
> (dumping sstables, etc). Had we had the sizeOf guardrail it would have
> turned a production incident with hours of investigation into a bug found
> immediately during development. Could this be done with a node-level
> guardrail? Likely. But config has the issues described above and it's
> possible to have two tables with different constraints around similar
> fields (for example, two different chunk size configs due to data shape).
> Could it be done at the client layer? Yes that's what we are doing now, but
> this incident highlights the weakness with that approach (having to
> implement the contract everywhere and having disjoint features across
> clients).
>
> I also think there is benefit to application owners. Encoding constraints
> in the database ensures continuity as ownership and contributors change and
> reduces the need for comments or documentation as the means to enforce or
> share this knowledge.
>
> I think enforcing them at write time makes sense. Thinking about it in the
> scope of compaction for example reminds me of a data loss incident where
> someone ran a validation in an older version (like 2.0 or 2.1) and a bunch
> of 4 byte ints were thrown away because the field expected an 8 byte long.
>
> My primary concern would be ensuring that we don't implement constraints
> that require a read before write (not inList comes to mind as an example of
> one that could imply reading before writing and could confuse a user if it
> doesn't).
>
> Regarding the conflict with existing guardrails, I do think that is
> tougher. On one hand I find this feature to be more evolved than those
> guardrails and would be fine to see them be replaced by it. On the other,
> the guardrails provide sole control to the operator which is nice but adds
> some complexity that has been rightly called out.  But I don't see that as
> a reason not to go forward with this feature. We should pick a path and
> accept the tradeoffs.
>
> Jordan
>
>
> On Thu, Jun 13, 2024 at 2:39 PM Bernardo Botella <
> conta...@bernardobotella.com> wrote:
>
>> Thanks a lot for your comments Abe!
>>
>> I do agree that the Constraint clause should be as simple as possible. I
>> will add a note on the CEP along with some specifics about the proposed
>> constraints (removing the ones that are contentious, and adding them to a
>> possible future additions section). And yeah, I also think that these
>> constraints will help differe

Re: [DISCUSS] Increments on non-existent rows in Accord

2024-06-24 Thread Ariel Weisberg
Hi,

I think the current behavior maps to SQL more than CQL. In SQL, an update 
doesn't generate an error if the row being updated doesn't exist; it just 
returns 0 rows updated. 

If someone wanted an upsert or increment behavior in their transaction, could 
they accomplish it with the current transaction CQL at all?

We could support a more optimal syntax later, but I suspect that with our 
one-shot behavior it would get mixed up by multiple attempts to insert if not 
exists and then update the same row to achieve upsert.
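
To illustrate, here is the kind of one-shot transaction someone might try
(assuming an IF NOT EXISTS condition were even accepted inside BEGIN
TRANSACTION), which I would not expect to behave as an upsert:

BEGIN TRANSACTION
  INSERT INTO accord.accounts (partition, account_id, balance) VALUES ('default', 3, 0) IF NOT EXISTS;
  -- If the row did not exist before the transaction, the UPDATE below still
  -- sees balance as null, and null + 10 == null.
  UPDATE accord.accounts SET balance += 10 WHERE partition = 'default' AND account_id = 3;
COMMIT TRANSACTION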

Ariel
On Thu, Jun 20, 2024, at 4:54 PM, Caleb Rackliffe wrote:
> We had a bug report a while back from Luis E Fernandez and team in 
> CASSANDRA-18988  
> around the behavior of increments/decrements on numeric fields for 
> non-existent rows. Consider the following, which can be run on the 
> cep-15-accord branch:
> 
> CREATE KEYSPACE accord WITH replication = {'class': 'SimpleStrategy', 
> 'replication_factor': '1'} AND durable_writes = true
> 
> CREATE TABLE accord.accounts (
> partition text,
> account_id int,
> balance int,
> PRIMARY KEY (partition, account_id)
> ) WITH CLUSTERING ORDER BY (account_id ASC) AND transactional_mode='full'
> 
> BEGIN TRANSACTION
> INSERT INTO accord.accounts (partition, account_id, balance) VALUES 
> ('default', 0, 100);
> INSERT INTO accord.accounts (partition, account_id, balance) VALUES 
> ('default', 1, 100);
> COMMIT TRANSACTION
> 
> BEGIN TRANSACTION
> UPDATE accord.accounts SET balance -= 10 WHERE partition = 'default' AND 
> account_id = 1;
> UPDATE accord.accounts SET balance += 10 WHERE partition = 'default' AND 
> account_id = 3;
> COMMIT TRANSACTION
> 
> Reading the 'default' partition will produce the following result.
> 
>  partition | account_id | balance
> -----------+------------+---------
>    default |          0 |     100
>    default |          1 |      90
> 
> As you will notice, we have not implicitly inserted a row for account_id 3, 
> which does not exist when we request that its balance be incremented by 10. 
> This is by design, as null + 10 == null.
> 
> Before I close CASSANDRA-18988, *I'd like to confirm 
> with everyone reading this that the behavior above is reasonable*. The only 
> other option I've seen proposed that would make sense is perhaps producing a 
> result like:
> 
>  partition | account_id | balance
> -----------+------------+---------
>    default |          0 |     100
>    default |          1 |      90
>    default |          3 |    null
> 
> Note however that this is exactly what we would produce if we had first 
> inserted a row w/ no value for balance:
> 
> INSERT INTO accord.accounts (partition, account_id) VALUES ('default', 3);


Re: [DISCUSS] spark-cassandra-connector donation to Analytics subproject

2024-06-24 Thread C. Scott Andreas

Supportive of accepting a donation of the Spark Cassandra Connector under the
project's umbrella as well - I think that would be very welcome and
appreciated. Spark Cassandra Connector and the Analytics library are also
suited to slightly different usage patterns. SCC can be a good fit for Spark
jobs that operate with a high degree of selectivity, vs. larger bulk scoops.

– Scott

On Jun 24, 2024, at 1:29 AM, Jon Haddad wrote:

> I also think it would be a great contribution, especially since the bulk
> analytics library can’t be used by the majority of teams, since it’s
> hard-coded to only work with single-token clusters.
>
> On Mon, Jun 24, 2024 at 9:51 AM Dinesh Joshi <djo...@apache.org> wrote:
>
>> This would be a great contribution to have for the Analytics subproject.
>> The current bulk functionality in the Analytics subproject complements the
>> spark-cassandra-connector so I see it as a good fit for donation.
>>
>> On Mon, Jun 24, 2024 at 12:32 AM Mick Semb Wever <m...@apache.org> wrote:
>>
>>> What are folks' thoughts on accepting a donation of the
>>> spark-cassandra-connector project into the Analytics subproject?
>>>
>>> A number of folks have requested this, stating that they cannot contribute
>>> to the project while it is under DataStax. The project has largely been in
>>> maintenance mode the past few years. Under ASF I believe that it will
>>> attract more attention and contributions, and offline discussions I have
>>> had indicate that the spark-cassandra-connector remains an important
>>> complement to the bulk analytics component.

Re: [DISCUSS] Increments on non-existent rows in Accord

2024-06-24 Thread Caleb Rackliffe
It sounds like the best course of action for now would be to keep the
current behavior.

However, we might want to fold this into CASSANDRA-18107 as a specific
concern around what we return when an explicit SELECT isn't present in the
transaction.

i.e. For any update, we'll have something like (courtesy of David) UPDATED,
SKIPPED (condition was met but couldn't update a non-existent row), or
CONDITION_NOT_MET


On Mon, Jun 24, 2024 at 11:42 AM Ariel Weisberg  wrote:

> Hi,
>
> I think the current behavior maps to SQL more than CQL. In SQL, an update
> doesn't generate an error if the row being updated doesn't exist; it just
> returns 0 rows updated.
>
> If someone wanted an upsert or increment behavior in their transaction,
> could they accomplish it with the current transaction CQL at all?
>
> We could support a more optimal syntax later, but I suspect that with our
> one-shot behavior it would get mixed up by multiple attempts to insert if
> not exists and then update the same row to achieve upsert.
>
> Ariel
> On Thu, Jun 20, 2024, at 4:54 PM, Caleb Rackliffe wrote:
>
> We had a bug report a while back from Luis E Fernandez and team in
> CASSANDRA-18988 
> around the behavior of increments/decrements on numeric fields for
> non-existent rows. Consider the following, which can be run on the
> cep-15-accord branch:
>
> CREATE KEYSPACE accord WITH replication = {'class': 'SimpleStrategy', 
> 'replication_factor': '1'} AND durable_writes = true
>
>
> CREATE TABLE accord.accounts (
> partition text,
> account_id int,
> balance int,
> PRIMARY KEY (partition, account_id)
> ) WITH CLUSTERING ORDER BY (account_id ASC) AND transactional_mode='full'
>
>
> BEGIN TRANSACTION
> INSERT INTO accord.accounts (partition, account_id, balance) VALUES 
> ('default', 0, 100);
> INSERT INTO accord.accounts (partition, account_id, balance) VALUES 
> ('default', 1, 100);
> COMMIT TRANSACTION
>
>
> BEGIN TRANSACTION
> UPDATE accord.accounts SET balance -= 10 WHERE partition = 'default' AND 
> account_id = 1;
> UPDATE accord.accounts SET balance += 10 WHERE partition = 'default' AND 
> account_id = 3;
> COMMIT TRANSACTION
>
>
> Reading the 'default' partition will produce the following result.
>
>
>  partition | account_id | balance
> -----------+------------+---------
>    default |          0 |     100
>    default |          1 |      90
>
>
> As you will notice, we have not implicitly inserted a row for account_id 3, 
> which does not exist when we request that its balance be incremented by 10. 
> This is by design, as null + 10 == null.
>
>
> Before I close CASSANDRA-18988, *I'd like to confirm 
> with everyone reading this that the behavior above is reasonable*. The only 
> other option I've seen proposed that would make sense is perhaps producing a 
> result like:
>
>
>  partition | account_id | balance
> -----------+------------+---------
>    default |          0 |     100
>    default |          1 |      90
>    default |          3 |    null
>
>
> Note however that this is exactly what we would produce if we had first 
> inserted a row w/ no value for balance:
>
>
> INSERT INTO accord.accounts (partition, account_id) VALUES ('default', 3);
>
>
>


Re: [DISCUSS] Increments on non-existent rows in Accord

2024-06-24 Thread Ariel Weisberg
Hi,

SGTM. It's not just what we return, though; it's also whether we support UPSERT 
for RMR updates. Because our transactions are one-shot, I don't think you could 
do that: the statement that does INSERT IF NOT EXISTS would not generate a row 
that is visible to a later UPDATE statement in the same transaction that 
increments the row.

We might also have a restriction somewhere that limits us to one update per 
clustering.

Ariel
On Mon, Jun 24, 2024, at 1:30 PM, Caleb Rackliffe wrote:
> It sounds like the best course of action for now would be to keep the current 
> behavior.
> 
> However, we might want to fold this into CASSANDRA-18107 as a specific 
> concern around what we return when an explicit SELECT isn't present in the 
> transaction.
> 
> i.e. For any update, we'll have something like (courtesy of David) UPDATED, 
> SKIPPED (condition was met but couldn't update a non-existent row), or 
> CONDITION_NOT_MET
> 
> 
> On Mon, Jun 24, 2024 at 11:42 AM Ariel Weisberg  wrote:
>> Hi,
>> 
>> I think the current behavior maps to SQL more than CQL. In SQL, an update 
>> doesn't generate an error if the row being updated doesn't exist; it just 
>> returns 0 rows updated.
>> 
>> If someone wanted an upsert or increment behavior in their transaction, could 
>> they accomplish it with the current transaction CQL at all?
>> 
>> We could support a more optimal syntax later, but I suspect that with our 
>> one-shot behavior it would get mixed up by multiple attempts to insert if 
>> not exists and then update the same row to achieve upsert.
>> 
>> Ariel
>> On Thu, Jun 20, 2024, at 4:54 PM, Caleb Rackliffe wrote:
>>> We had a bug report a while back from Luis E Fernandez and team in 
>>> CASSANDRA-18988  
>>> around the behavior of increments/decrements on numeric fields for 
>>> non-existent rows. Consider the following, which can be run on the 
>>> cep-15-accord branch:
>>> 
>>> CREATE KEYSPACE accord WITH replication = {'class': 'SimpleStrategy', 
>>> 'replication_factor': '1'} AND durable_writes = true
>>> 
>>> CREATE TABLE accord.accounts (
>>> partition text,
>>> account_id int,
>>> balance int,
>>> PRIMARY KEY (partition, account_id)
>>> ) WITH CLUSTERING ORDER BY (account_id ASC) AND transactional_mode='full'
>>> 
>>> BEGIN TRANSACTION
>>> INSERT INTO accord.accounts (partition, account_id, balance) VALUES 
>>> ('default', 0, 100);
>>> INSERT INTO accord.accounts (partition, account_id, balance) VALUES 
>>> ('default', 1, 100);
>>> COMMIT TRANSACTION
>>> 
>>> BEGIN TRANSACTION
>>> UPDATE accord.accounts SET balance -= 10 WHERE partition = 'default' 
>>> AND account_id = 1;
>>> UPDATE accord.accounts SET balance += 10 WHERE partition = 'default' 
>>> AND account_id = 3;
>>> COMMIT TRANSACTION
>>> 
>>> Reading the 'default' partition will produce the following result.
>>> 
>>>  partition | account_id | balance
>>> -----------+------------+---------
>>>    default |          0 |     100
>>>    default |          1 |      90
>>> 
>>> As you will notice, we have not implicitly inserted a row for account_id 3, 
>>> which does not exist when we request that its balance be incremented by 10. 
>>> This is by design, as null + 10 == null.
>>> 
>>> Before I close CASSANDRA-18988, *I'd like to 
>>> confirm with everyone reading this that the behavior above is reasonable*. 
>>> The only other option I've seen proposed that would make sense is perhaps 
>>> producing a result like:
>>> 
>>>  partition | account_id | balance
>>> -----------+------------+---------
>>>    default |          0 |     100
>>>    default |          1 |      90
>>>    default |          3 |    null
>>> 
>>> Note however that this is exactly what we would produce if we had first 
>>> inserted a row w/ no value for balance:
>>> 
>>> INSERT INTO accord.accounts (partition, account_id) VALUES ('default', 3);
>> 


Re: [DISCUSS] CEP-42: Constraints Framework

2024-06-24 Thread Ariel Weisberg
Hi,

I see a vote for this has been called. I should have provided feedback sooner.

I am a strong +1 on adding column-level constraints. I'm not too concerned 
about row/partition/table-level constraints, but I would like to change the 
syntax before I would be +1 on this CEP.

It would be good to align the syntax as closely as possible to our existing 
syntax, and if not that then MySQL/Postgres. For example it looks like we don't 
have a string length function so maybe add `LENGTH` (consistent with 
MySQL/Postgres) to also use with column level constraints.

It looks like there are generally two forms of constraint syntax, one is 
expressed as part of the column definition, and the other is a named or 
anonymous constraint on the table. https://www.w3schools.com/sql/sql_check.asp

Can we align with having these column level ones as `CHECK` constraints like in 
SQL, and `CONSTRAINT [constraint_name] CHECK` would be used if creating a named 
or multi-column constraint?

Will column level check constraints support `AND` so that you can specify 
multiple constraints on the column? I am not sure if that is supported in other 
databases, but it would be good to align on that as well.
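
To make the suggestion concrete, roughly the shape I have in mind, borrowing
the SQL forms (illustrative syntax only, not valid CQL today):

CREATE TABLE ks.accounts (
  id int PRIMARY KEY,
  -- anonymous column-level constraint, with AND combining two checks:
  name text CHECK (LENGTH(name) > 0 AND LENGTH(name) < 256),
  balance int,
  -- named constraint, which could also span multiple columns:
  CONSTRAINT positive_balance CHECK (balance >= 0)
);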

RE some implementation things to keep in mind:

If TCM is in use and the constraints are defined in the schema data structure 
this should work fine with Accord because all coordinators (regular, recovery) 
will deterministically agree on the constraints being enforced BUT... this also 
has to map to how/when constraints are enforced.

Both Accord and Paxos work best when the constraints are enforced when the 
final mutation to be applied is created and not later when it is being applied 
to the CFS. This also reduces duplication of enforcement checking work to just 
the coordinator for the write.

Ariel

On Fri, May 31, 2024, at 5:23 PM, Bernardo Botella wrote:
> Hello everyone,
> 
> I am proposing this CEP:
> CEP-42: Constraints Framework - CASSANDRA - Apache Software Foundation
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-42%3A+Constraints+Framework
> 
> 
> And I’m looking for feedback from the community.
> 
> Thanks a lot!
> Bernardo


Re: [DISCUSS] spark-cassandra-connector donation to Analytics subproject

2024-06-24 Thread Francisco Guerrero
Yeah, having the connector will enhance the Cassandra ecosystem. I'm looking 
forward to this contribution.

On 2024/06/24 17:28:48 "C. Scott Andreas" wrote:
> Supportive of accepting a donation of the Spark Cassandra Connector under the
> project's umbrella as well - I think that would be very welcome and
> appreciated. Spark Cassandra Connector and the Analytics library are also
> suited to slightly different usage patterns. SCC can be a good fit for Spark
> jobs that operate with a high degree of selectivity, vs. larger bulk scoops.
>
> – Scott
>
> On Jun 24, 2024, at 1:29 AM, Jon Haddad wrote:
>
>> I also think it would be a great contribution, especially since the bulk
>> analytics library can’t be used by the majority of teams, since it’s
>> hard-coded to only work with single-token clusters.
>>
>> On Mon, Jun 24, 2024 at 9:51 AM Dinesh Joshi <djo...@apache.org> wrote:
>>
>>> This would be a great contribution to have for the Analytics subproject.
>>> The current bulk functionality in the Analytics subproject complements the
>>> spark-cassandra-connector so I see it as a good fit for donation.
>>>
>>> On Mon, Jun 24, 2024 at 12:32 AM Mick Semb Wever <m...@apache.org> wrote:
>>>
>>>> What are folks' thoughts on accepting a donation of the
>>>> spark-cassandra-connector project into the Analytics subproject?
>>>>
>>>> A number of folks have requested this, stating that they cannot contribute
>>>> to the project while it is under DataStax. The project has largely been in
>>>> maintenance mode the past few years. Under ASF I believe that it will
>>>> attract more attention and contributions, and offline discussions I have
>>>> had indicate that the spark-cassandra-connector remains an important
>>>> complement to the bulk analytics component.


Re: [DISCUSS] spark-cassandra-connector donation to Analytics subproject

2024-06-24 Thread Abe Ratnofsky
Likewise - another vote in favor of bringing in this subproject.

Any thoughts on bringing in dsbulk as well? dsbulk has a lower barrier to entry 
than Spark Cassandra Connector, addresses a real need for users, and appears to 
be at a similar place in its project lifecycle.

Abe

> On Jun 24, 2024, at 4:36 PM, Francisco Guerrero  wrote:
> 
> Yeah, having the connector will enhance the Cassandra ecosystem. I'm looking 
> forward to this contribution.
> 
> On 2024/06/24 17:28:48 "C. Scott Andreas" wrote:
>> Supportive of accepting a donation of the Spark Cassandra Connector under
>> the project's umbrella as well - I think that would be very welcome and
>> appreciated. Spark Cassandra Connector and the Analytics library are also
>> suited to slightly different usage patterns. SCC can be a good fit for Spark
>> jobs that operate with a high degree of selectivity, vs. larger bulk scoops.
>>
>> – Scott
>>
>> On Jun 24, 2024, at 1:29 AM, Jon Haddad wrote:
>>
>>> I also think it would be a great contribution, especially since the bulk
>>> analytics library can’t be used by the majority of teams, since it’s
>>> hard-coded to only work with single-token clusters.
>>>
>>> On Mon, Jun 24, 2024 at 9:51 AM Dinesh Joshi <djo...@apache.org> wrote:
>>>
>>>> This would be a great contribution to have for the Analytics subproject.
>>>> The current bulk functionality in the Analytics subproject complements the
>>>> spark-cassandra-connector so I see it as a good fit for donation.
>>>>
>>>> On Mon, Jun 24, 2024 at 12:32 AM Mick Semb Wever <m...@apache.org> wrote:
>>>>
>>>>> What are folks' thoughts on accepting a donation of the
>>>>> spark-cassandra-connector project into the Analytics subproject?
>>>>>
>>>>> A number of folks have requested this, stating that they cannot contribute
>>>>> to the project while it is under DataStax. The project has largely been in
>>>>> maintenance mode the past few years. Under ASF I believe that it will
>>>>> attract more attention and contributions, and offline discussions I have
>>>>> had indicate that the spark-cassandra-connector remains an important
>>>>> complement to the bulk analytics component.



Re: [DISCUSS] spark-cassandra-connector donation to Analytics subproject

2024-06-24 Thread Jeremy Hanna
+1 nb.  I too see these tools (bulk analytics and scc) as complementary as has 
been said.  SCC also does some nice things to support Spark Streaming that I 
don't think are addressed by the bulk analytics subproject today.

Regarding dsbulk, I think that's another thread but it's something we're 
looking at as well.  It has a lower barrier to entry for sure, but it doesn't 
plug into the full Spark ecosystem for those that need it.

> On Jun 24, 2024, at 3:40 PM, Abe Ratnofsky  wrote:
> 
> Likewise - another vote in favor of bringing in this subproject.
> 
> Any thoughts on bringing in dsbulk as well? dsbulk has a lower barrier to 
> entry than Spark Cassandra Connector, addresses a real need for users, and 
> appears to be at a similar place in its project lifecycle.
> 
> Abe
> 
>> On Jun 24, 2024, at 4:36 PM, Francisco Guerrero  wrote:
>> 
>> Yeah, having the connector will enhance the Cassandra ecosystem. I'm looking 
>> forward to this contribution.
>> 
>> On 2024/06/24 17:28:48 "C. Scott Andreas" wrote:
>>> Supportive of accepting a donation of the Spark Cassandra Connector under
>>> the project's umbrella as well - I think that would be very welcome and
>>> appreciated. Spark Cassandra Connector and the Analytics library are also
>>> suited to slightly different usage patterns. SCC can be a good fit for
>>> Spark jobs that operate with a high degree of selectivity, vs. larger bulk
>>> scoops.
>>>
>>> – Scott
>>>
>>> On Jun 24, 2024, at 1:29 AM, Jon Haddad wrote:
>>>
>>>> I also think it would be a great contribution, especially since the bulk
>>>> analytics library can’t be used by the majority of teams, since it’s
>>>> hard-coded to only work with single-token clusters.
>>>>
>>>> On Mon, Jun 24, 2024 at 9:51 AM Dinesh Joshi <djo...@apache.org> wrote:
>>>>
>>>>> This would be a great contribution to have for the Analytics subproject.
>>>>> The current bulk functionality in the Analytics subproject complements
>>>>> the spark-cassandra-connector so I see it as a good fit for donation.
>>>>>
>>>>> On Mon, Jun 24, 2024 at 12:32 AM Mick Semb Wever <m...@apache.org> wrote:
>>>>>
>>>>>> What are folks' thoughts on accepting a donation of the
>>>>>> spark-cassandra-connector project into the Analytics subproject?
>>>>>>
>>>>>> A number of folks have requested this, stating that they cannot
>>>>>> contribute to the project while it is under DataStax. The project has
>>>>>> largely been in maintenance mode the past few years. Under ASF I believe
>>>>>> that it will attract more attention and contributions, and offline
>>>>>> discussions I have had indicate that the spark-cassandra-connector
>>>>>> remains an important complement to the bulk analytics component.
> 



Re: [DISCUSS] CEP-42: Constraints Framework

2024-06-24 Thread Bernardo Botella
Hi Ariel and Jon,

Let me address your question first. Yes, AND is supported in the proposal. 
Below you can find some examples of different constraints applied to the same 
column.

As for the LENGTH name instead of sizeOf as in the proposal, I am also not 
opposed to it if it is more consistent with terminology in the wider database 
world.

So, to recap, there seems to be general agreement on the usefulness of the 
Constraints Framework.
Now, from the feedback that has arrived after the voting has been called, I see 
there are three different proposals for syntax:

1.-
The syntax currently described in the CEP. Example:
CREATE TYPE keyspace.cidr_address_ipv4 (
  ip_adress inet,
  subnet_mask int,
  CONSTRAINT subnet_mask > 0,
  CONSTRAINT subnet_mask < 32
)

2.-
As Jon suggested, leaving these definitions to more specific guardrails at the 
table level. For example, something like:
column_min_int_value_size_threshold_keyspace_address_ipv4_ip_adress = 0
column_max_int_value_size_threshold_keyspace_address_ipv4_ip_adress = 32

3.-
As Ariel suggested, having the CHECK keyword added to align consistency with 
SQL. Example:
CREATE TYPE keyspace.cidr_address_ipv4 (
  ip_adress inet,
  subnet_mask int,
  CONSTRAINT CHECK subnet_mask > 0,
  CONSTRAINT CHECK subnet_mask < 32
)
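
And since Ariel asked about AND, the two range checks in the examples above
could also be combined into a single constraint, shown here with the proposal 3
syntax (illustrative only):

CREATE TYPE keyspace.cidr_address_ipv4 (
  ip_adress inet,
  subnet_mask int,
  CONSTRAINT CHECK subnet_mask > 0 AND subnet_mask < 32
)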

For the guardrails vs. CQL syntax question, I think that keeping the conceptual 
separation that has been explored in this thread, and perfectly recapped by 
Doug, is closer to what we are trying to achieve with this framework. In my 
opinion, having them in the CQL schema definition provides those 
application-level constraints that Doug mentions in a more accessible way than 
having to configure such specific guardrails.

For the addition of the CHECK keyword, I'm definitely not opposed to it if it 
helps Cassandra users coming from other databases understand concepts that were 
already familiar to them.

I hope this helps move the conversation forward,
Bernardo



> On Jun 24, 2024, at 12:17 PM, Ariel Weisberg  wrote:
> 
> Hi,
> 
> I see a vote for this has been called. I should have provided feedback sooner.
> 
> I am a strong +1 on adding column-level constraints. I'm not too concerned 
> about row/partition/table-level constraints, but I would like to change the 
> syntax before I would be +1 on this CEP.
> 
> It would be good to align the syntax as closely as possible to our existing 
> syntax, and if not that then MySQL/Postgres. For example it looks like we 
> don't have a string length function so maybe add `LENGTH` (consistent with 
> MySQL/Postgres) to also use with column level constraints.
> 
> It looks like there are generally two forms of constraint syntax, one is 
> expressed as part of the column definition, and the other is a named or 
> anonymous constraint on the table. https://www.w3schools.com/sql/sql_check.asp
> 
> Can we align with having these column level ones as `CHECK` constraints like 
> in SQL, and `CONSTRAINT [constraint_name] CHECK` would be used if creating a 
> named or multi-column constraint?
> 
> Will column level check constraints support `AND` so that you can specify 
> multiple constraints on the column? I am not sure if that is supported in 
> other databases, but it would be good to align on that as well.
> 
> RE some implementation things to keep in mind:
> 
> If TCM is in use and the constraints are defined in the schema data structure 
> this should work fine with Accord because all coordinators (regular, 
> recovery) will deterministically agree on the constraints being enforced 
> BUT... this also has to map to how/when constraints are enforced.
> 
> Both Accord and Paxos work best when the constraints are enforced when the 
> final mutation to be applied is created and not later when it is being 
> applied to the CFS. This also reduces duplication of enforcement checking 
> work to just the coordinator for the write.
> 
> Ariel
> 
> On Fri, May 31, 2024, at 5:23 PM, Bernardo Botella wrote:
>> Hello everyone,
>> 
>> I am proposing this CEP:
>> CEP-42: Constraints Framework - CASSANDRA - Apache Software Foundation
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-42%3A+Constraints+Framework
>> 
>> 
>> And I’m looking for feedback from the community.
>> 
>> Thanks a lot!
>> Bernardo



Re: [DISCUSS] CEP-42: Constraints Framework

2024-06-24 Thread Jon Haddad
I think my suggestion was unclear. I was referring to the name "guardrail",
using the same infrastructure as guardrails, rather than a separate concept.
Not applying it like we do table options.



On Tue, Jun 25, 2024 at 12:44 AM Bernardo Botella <
conta...@bernardobotella.com> wrote:

> Hi Ariel and Jon,
>
> Let me address your question first. Yes, AND is supported in the proposal.
> Below you can find some examples of different constraints applied to the
> same column.
>
> As for the LENGTH name instead of sizeOf as in the proposal, I am also not
> opposed to it if it is more consistent with terminology in the wider database
> world.
>
> So, to recap, there seems to be general agreement on the usefulness of the
> Constraints Framework.
> Now, from the feedback that has arrived after the voting has been called,
> I see there are three different proposals for syntax:
>
> 1.-
> The syntax currently described in the CEP. Example:
> CREATE TYPE keyspace.cidr_address_ipv4 (
>   ip_adress inet,
>   subnet_mask int,
>   CONSTRAINT subnet_mask > 0,
>   CONSTRAINT subnet_mask < 32
> )
>
> 2.-
> As Jon suggested, leaving these definitions to more specific guardrails at
> the table level. For example, something like:
> column_min_int_value_size_threshold_keyspace_address_ipv4_ip_adress = 0
> column_max_int_value_size_threshold_keyspace_address_ipv4_ip_adress = 32
>
> 3.-
> As Ariel suggested, having the CHECK keyword added to align consistency
> with SQL. Example:
> CREATE TYPE keyspace.cidr_address_ipv4 (
>   ip_adress inet,
>   subnet_mask int,
>   CONSTRAINT CHECK subnet_mask > 0,
>   CONSTRAINT CHECK subnet_mask < 32
> )
>
> For the guardrails vs. CQL syntax question, I think that keeping the
> conceptual separation that has been explored in this thread, and perfectly
> recapped by Doug, is closer to what we are trying to achieve with this
> framework. In my opinion, having them in the CQL schema definition provides
> those application-level constraints that Doug mentions in a more accessible
> way than having to configure such specific guardrails.
>
> For the addition of the CHECK keyword, I'm definitely not opposed to it if
> it helps Cassandra users coming from other databases understand concepts
> that were already familiar to them.
>
> I hope this helps move the conversation forward,
> Bernardo
>
>
>
> On Jun 24, 2024, at 12:17 PM, Ariel Weisberg  wrote:
>
> Hi,
>
> I see a vote for this has been called. I should have provided feedback
> sooner.
>
> I am a strong +1 on adding column-level constraints. I'm not too concerned
> about row/partition/table-level constraints, but I would like to change the
> syntax before I would be +1 on this CEP.
>
> It would be good to align the syntax as closely as possible to our
> existing syntax, and if not that then MySQL/Postgres. For example it looks
> like we don't have a string length function so maybe add `LENGTH`
> (consistent with MySQL/Postgres) to also use with column level constraints.
>
> It looks like there are generally two forms of constraint syntax, one is
> expressed as part of the column definition, and the other is a named or
> anonymous constraint on the table.
> https://www.w3schools.com/sql/sql_check.asp
>
> Can we align with having these column level ones as `CHECK` constraints
> like in SQL, and `CONSTRAINT [constraint_name] CHECK` would be used if
> creating a named or multi-column constraint?
>
> Will column level check constraints support `AND` so that you can specify
> multiple constraints on the column? I am not sure if that is supported in
> other databases, but it would be good to align on that as well.
>
> RE some implementation things to keep in mind:
>
> If TCM is in use and the constraints are defined in the schema data
> structure this should work fine with Accord because all coordinators
> (regular, recovery) will deterministically agree on the constraints being
> enforced BUT... this also has to map to how/when constraints are enforced.
>
> Both Accord and Paxos work best when the constraints are enforced when the
> final mutation to be applied is created and not later when it is being
> applied to the CFS. This also reduces duplication of enforcement checking
> work to just the coordinator for the write.
>
> Ariel
>
> On Fri, May 31, 2024, at 5:23 PM, Bernardo Botella wrote:
>
> Hello everyone,
>
> I am proposing this CEP:
> CEP-42: Constraints Framework - CASSANDRA - Apache Software Foundation
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-42%3A+Constraints+Framework
> 
>
>
> And I’m looking for feedback from the community.
>
> Thanks a lot!
> Bernardo
>
>
>