Re: [VOTE] CEP-8 Datastax Drivers Donation

2023-06-14 Thread Sam Tunnicliffe
+1

> On 13 Jun 2023, at 15:14, Jeremy Hanna  wrote:
> 
> Calling for a vote on CEP-8 [1].
> 
> To clarify the intent, as Benjamin said in the discussion thread [2], the 
> goal of this vote is simply to ensure that the community is in favor of the 
> donation. Nothing more.
> The plan is to introduce the drivers, one by one. Each driver donation will 
> need to be accepted first by the PMC members, as it is the case for any 
> donation. Therefore the PMC should have full control on the pace at which new 
> drivers are accepted.
> 
> If this vote passes, we can start this process for the Java driver under the 
> direction of the PMC.
> 
> Jeremy
> 
> 1. 
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-8%3A+Datastax+Drivers+Donation
> 2. https://lists.apache.org/thread/opt630do09phh7hlt28odztxdv6g58dp



Re: [VOTE] CEP-8 Datastax Drivers Donation

2023-06-14 Thread Jorge Bay Gondra
+1 nb

On Wed, Jun 14, 2023 at 9:13 AM Sam Tunnicliffe  wrote:

> +1
>
> On 13 Jun 2023, at 15:14, Jeremy Hanna  wrote:
>
> Calling for a vote on CEP-8 [1].
>
> To clarify the intent, as Benjamin said in the discussion thread [2], the
> goal of this vote is simply to ensure that the community is in favor of
> the donation. Nothing more.
> The plan is to introduce the drivers, one by one. Each driver donation
> will need to be accepted first by the PMC members, as it is the case for
> any donation. Therefore the PMC should have full control on the pace at
> which new drivers are accepted.
>
> If this vote passes, we can start this process for the Java driver under
> the direction of the PMC.
>
> Jeremy
>
> 1.
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-8%3A+Datastax+Drivers+Donation
> 2. https://lists.apache.org/thread/opt630do09phh7hlt28odztxdv6g58dp
>
>
>


Re: [DISCUSS] Remove deprecated keyspace_count_warn_threshold and table_count_warn_threshold

2023-06-14 Thread Andrés de la Peña
>
> > Default value I agree with you; features should be off by default!  If
> we remove the default then we disable the feature by default (which im cool
> with) and for anyone who changed the config, they would keep their behavior


I'm glad we agree on at least removing the default value if we keep the
deprecated properties.

> With that, I kinda don’t agree that including system tables is a mistake,
> as we add more we allow less for user tables before we start to have
> issues….


That's problematic because the new thresholds we added in CASSANDRA-17147
don't include system tables. Do you think we should change that?

I still think it's better not to include the system tables in the count.
The thresholds on the number of keyspaces/tables/rows/columns/tombstones
are just guidance since they cannot be exactly related to exact resource
consumption. The main purpose of those thresholds is to prevent obvious
antipatterns such as creating thousands of tables. A benefit of expressing
the guardrails in terms of the number of schema entities, rather than
counting the memory usage of those entities, is that they are easy to
understand and reason about. In my opinion including system tables defeats
that purpose because it forces users to know details about the system
tables. The fact that those details change between versions doesn't help.
Including system tables is not going to make the thresholds precise in
terms of measuring memory consumption because that depends on other
factors, such as the columns they store.

Including system tables also imposes a minimum threshold value, like in 5.0
you cannot set a threshold value under 45 tables without triggering it with
an empty db. For other thresholds, this can be more tricky. That would be
the case of the guardrail on the number of columns in a partition, where
you would need to know the size of the widest row in the system tables,
which can change over time.

I guess that if system tables were to be counted, a recommendation for the
threshold would say something like "It's recommended not to have more than
150 tables. The system already includes 45 tables for internal usage, so
you shouldn't create more than 105 user tables". I find it's better for
usability to not count the system tables and just say "It's recommended not
to have more than 100 tables. This doesn't include system tables."
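
To make the two counting options concrete, here is a minimal Python sketch of the difference. It is not Cassandra's guardrail code: the function names, the keyspace subset and the schema layout are assumptions made up for illustration, and the figures 45, 150 and 105 are the ones used in the paragraphs above.

```python
# Illustrative sketch only; names here are hypothetical, not Cassandra's API.

# Subset of system keyspace names, used to tell system tables from user tables.
SYSTEM_KEYSPACES = {"system", "system_schema", "system_auth",
                    "system_distributed", "system_traces"}


def count_tables(schema, include_system):
    """Count tables in `schema`, a dict of keyspace name -> list of table names."""
    return sum(
        len(tables)
        for keyspace, tables in schema.items()
        if include_system or keyspace not in SYSTEM_KEYSPACES
    )


def should_warn(schema, threshold, include_system):
    """Warn when the counted tables exceed the threshold."""
    return count_tables(schema, include_system) > threshold


def user_table_budget(total_recommended, system_tables):
    """User tables left once system tables are subtracted from a total."""
    return total_recommended - system_tables


# An empty database holding only the 45 system tables mentioned for 5.0.
empty_db = {"system": [f"table_{i}" for i in range(45)]}

print(should_warn(empty_db, 44, include_system=True))   # True
print(should_warn(empty_db, 44, include_system=False))  # False
print(user_table_budget(150, 45))                       # 105
```

With system tables counted, a threshold under 45 warns on an empty database. With them excluded, the same threshold stays quiet until users create tables.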

On Tue, 13 Jun 2023 at 23:51, Josh McKenzie  wrote:

> Warning that too many tables (including system) may have negative behavior
> I think is fine
>
> This reminds me of the current situation with our tests where we just keep
> adding more and more without really considering the value of the current
> set and the costs of that body of work as it keeps growing.
>
> Having some kind of signal that we need to do some housekeeping with our
> system tables, or *something* in the feedback loop that helps us keep on
> top of this hygiene over time, seems like a clear benefit to me.
>
> On Tue, Jun 13, 2023, at 1:42 PM, David Capwell wrote:
>
> I think that the combined decision of using a default value and counting
> system tables was a mistake
>
>
> Default value I agree with you; features should be off by default!  If we
> remove the default then we disable the feature by default (which im cool
> with) and for anyone who changed the config, they would keep their behavior
>
> As for system tables… each table adds a cost to our bookkeeping, so when
> we add new tables the cost grows and the memory per table decreases, does
> it not?  Warning that too many tables (including system) may have negative
> behavior I think is fine, its only if we start to fail is when things
> become a problem (upgrading to 5.0 can’t happen due to too many tables
> added in the release?); think the feature was warn only, so that should be
> fine.  With that, I kinda don’t agree that including system tables is a
> mistake, as we add more we allow less for user tables before we start to
> have issues…. At the same time, if we have improvements in newer versions
> that allows higher number of tables, the user then has to update their
> configs (well, as long as we don’t make things worse a smaller limit than
> needed is fine…)
>
> we would need to know how many system keyspaces/tables were on the version
> we are upgrading from
>
>
> Do we?  The logic was pulling from local schema, so to keep the same
> behavior we would need to do the same; being version dependent would
> actually break the semantics as far as I can tell.
>
> On Jun 13, 2023, at 9:50 AM, Andrés de la Peña 
> wrote:
>
> Indeed "keyspace_count_warn_threshold" and "table_count_warn_threshold"
> include system keyspaces and tables. Also, differently to the newer
> guardrails, they are enabled by default.
>
> I find that problematic because users need to know how many system
> keyspaces/tables there are to know if they need to set the threshold value.
> Moreover, if a new release adds some system tables, the threshold can start
> to be triggered

Re: [VOTE] CEP-30 ANN Vector Search

2023-06-14 Thread Andrew Cobley (Staff)
Hi All,

Great news that this has gone through. I’m wondering if we have a timescale for
this making it to Beta or release? I’m asking because we have a project that
would benefit from this approach.

Andy


From: Jonathan Ellis 
Date: Tuesday, 30 May 2023 at 14:44
To: dev 
Subject: Re: [VOTE] CEP-30 ANN Vector Search

Thanks, all.  Closing the vote as accepted with 8 binding +1 (including me) and 
11 non-binding votes.

On Thu, May 25, 2023 at 10:45 AM Jonathan Ellis wrote:
Let's make this official.

CEP: 
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes

POC that demonstrates all the big rocks, including distributed queries: 
https://github.com/datastax/cassandra/tree/cep-vsearch

--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced


--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

The University of Dundee is a registered Scottish Charity, No: SC015096


Re: [DISCUSS] Remove deprecated keyspace_count_warn_threshold and table_count_warn_threshold

2023-06-14 Thread Josh McKenzie
> In my opinion including system tables defeats that purpose because it forces 
> users to know details about the system tables.
Perhaps having a unit test that caps our system tables at some value and 
keeping the guardrail user-scope specific would be a better approach. I see 
your point about leaking internal details to users, specifically on things they 
can't control at this point.
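
As a rough illustration of that idea, a cap test could look like the Python sketch below. Everything in it is hypothetical: `MAX_SYSTEM_TABLES`, the value 50 and the `system_tables()` stand-in are assumptions, and a real test would read the table list from the actual schema.

```python
import unittest

# Hypothetical cap; the real value would be agreed on by the project.
MAX_SYSTEM_TABLES = 50


def system_tables():
    """Stand-in for reading system tables from the schema (assumed helper)."""
    return [f"system.table_{i}" for i in range(45)]


class SystemTableCapTest(unittest.TestCase):
    def test_system_table_count_is_capped(self):
        count = len(system_tables())
        # Fails when a change pushes the number of system tables over the cap,
        # prompting a deliberate decision instead of silent growth.
        self.assertLessEqual(
            count,
            MAX_SYSTEM_TABLES,
            f"{count} system tables exceed the cap of {MAX_SYSTEM_TABLES}",
        )
```

Saved to a file, this can be run with `python -m unittest <file>`. The user-facing guardrail would then count only user tables, and the cap would live in the project's own test suite.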

On Wed, Jun 14, 2023, at 8:19 AM, Andrés de la Peña wrote:
>> > Default value I agree with you; features should be off by default!  If we 
>> > remove the default then we disable the feature by default (which im cool 
>> > with) and for anyone who changed the config, they would keep their behavior
> 
> I'm glad we agree on at least removing the default value if we keep the 
> deprecated properties.
> 
>> > With that, I kinda don’t agree that including system tables is a mistake, 
>> > as we add more we allow less for user tables before we start to have 
>> > issues….
> 
> That's problematic because the new thresholds we added in CASSANDRA-17147 
> don't include system tables. Do you think we should change that?
> 
> I still think it's better not to include the system tables in the count. The 
> thresholds on the number of keyspaces/tables/rows/columns/tombstones are just 
> guidance since they cannot be exactly related to exact resource consumption. 
> The main purpose of those thresholds is to prevent obvious antipatterns such 
> as creating thousands of tables. A benefit of expressing the guardrails in 
> terms of the number of schema entities, rather than counting the memory usage 
> of those entities, is that they are easy to understand and reason about. In 
> my opinion including system tables defeats that purpose because it forces 
> users to know details about the system tables. The fact that those details 
> change between versions doesn't help. Including system tables is not going to 
> make the thresholds precise in terms of measuring memory consumption because 
> that depends on other factors, such as the columns they store.
> 
> Including system tables also imposes a minimum threshold value, like in 5.0 
> you cannot set a threshold value under 45 tables without triggering it with 
> an empty db. For other thresholds, this can be more tricky. That would be the 
> case of the guardrail on the number of columns in a partition, where you 
> would need to know the size of the widest row in the system tables, which can 
> change over time.
> 
> I guess that if system tables were to be counted, a recommendation for the 
> threshold would say something like "It's recommended not to have more than 
> 150 tables. The system already includes 45 tables for internal usage, so you 
> shouldn't create more than 105 user tables". I find it's better for usability 
> to not count the system tables and just say "It's recommended not to have 
> more than 100 tables. This doesn't include system tables."
> 
> On Tue, 13 Jun 2023 at 23:51, Josh McKenzie  wrote:
>>> Warning that too many tables (including system) may have negative behavior 
>>> I think is fine
>> This reminds me of the current situation with our tests where we just keep 
>> adding more and more without really considering the value of the current set 
>> and the costs of that body of work as it keeps growing.
>> 
>> Having some kind of signal that we need to do some housekeeping with our 
>> system tables, or *something* in the feedback loop that helps us keep on top 
>> of this hygiene over time, seems like a clear benefit to me.
>> 
>> On Tue, Jun 13, 2023, at 1:42 PM, David Capwell wrote:
>>>> I think that the combined decision of using a default value and counting 
>>>> system tables was a mistake
>>> 
>>> Default value I agree with you; features should be off by default!  If we 
>>> remove the default then we disable the feature by default (which im cool 
>>> with) and for anyone who changed the config, they would keep their behavior
>>> 
>>> As for system tables… each table adds a cost to our bookkeeping, so when we 
>>> add new tables the cost grows and the memory per table decreases, does it 
>>> not?  Warning that too many tables (including system) may have negative 
>>> behavior I think is fine, its only if we start to fail is when things 
>>> become a problem (upgrading to 5.0 can’t happen due to too many tables 
>>> added in the release?); think the feature was warn only, so that should be 
>>> fine.  With that, I kinda don’t agree that including system tables is a 
>>> mistake, as we add more we allow less for user tables before we start to 
>>> have issues…. At the same time, if we have improvements in newer versions 
>>> that allows higher number of tables, the user then has to update their 
>>> configs (well, as long as we don’t make things worse a smaller limit than 
>>> needed is fine…)
>>> 
>>>> we would need to know how many system keyspaces/tables were on the version 
>>>> we are upgrading from
>>> 
>>> Do we?  The logic was pulling from loc

Re: [DISCUSS] Remove deprecated keyspace_count_warn_threshold and table_count_warn_threshold

2023-06-14 Thread David Capwell
> That's problematic because the new thresholds we added in CASSANDRA-17147 
> don't include system tables. Do you think we should change that?

I wouldn’t change the semantics of the config as it’s already live.  I guess 
where I am coming from is that logically we have to think about the system 
tables, so to your point, if we think 150 is too much and the system already 
exposes 50… then we should recommend no more than 100…. 

> I find it's better for usability to not count the system tables and just say 
> "It's recommended not to have more than 100 tables. This doesn't include 
> system tables.”


I am fine with this framing… internally we think about 150 but publicly speak 
100 (due to our 50 tables)...


> On Jun 14, 2023, at 8:29 AM, Josh McKenzie  wrote:
> 
>> In my opinion including system tables defeats that purpose because it forces 
>> users to know details about the system tables.
> Perhaps having a unit test that caps our system tables at some value and 
> keeping the guardrail user-scope specific would be a better approach. I see 
> your point about leaking internal details to users, specifically on things 
> they can't control at this point.
> 
> On Wed, Jun 14, 2023, at 8:19 AM, Andrés de la Peña wrote:
>> > Default value I agree with you; features should be off by default!  If we 
>> > remove the default then we disable the feature by default (which im cool 
>> > with) and for anyone who changed the config, they would keep their behavior
>> 
>> I'm glad we agree on at least removing the default value if we keep the 
>> deprecated properties.
>> 
>> > With that, I kinda don’t agree that including system tables is a mistake, 
>> > as we add more we allow less for user tables before we start to have 
>> > issues….
>> 
>> That's problematic because the new thresholds we added in CASSANDRA-17147 
>> don't include system tables. Do you think we should change that?
>> 
>> I still think it's better not to include the system tables in the count. The 
>> thresholds on the number of keyspaces/tables/rows/columns/tombstones are 
>> just guidance since they cannot be exactly related to exact resource 
>> consumption. The main purpose of those thresholds is to prevent obvious 
>> antipatterns such as creating thousands of tables. A benefit of expressing 
>> the guardrails in terms of the number of schema entities, rather than 
>> counting the memory usage of those entities, is that they are easy to 
>> understand and reason about. In my opinion including system tables defeats 
>> that purpose because it forces users to know details about the system 
>> tables. The fact that those details change between versions doesn't help. 
>> Including system tables is not going to make the thresholds precise in terms 
>> of measuring memory consumption because that depends on other factors, such 
>> as the columns they store.
>> 
>> Including system tables also imposes a minimum threshold value, like in 5.0 
>> you cannot set a threshold value under 45 tables without triggering it with 
>> an empty db. For other thresholds, this can be more tricky. That would be 
>> the case of the guardrail on the number of columns in a partition, where you 
>> would need to know the size of the widest row in the system tables, which 
>> can change over time.
>> 
>> I guess that if system tables were to be counted, a recommendation for the 
>> threshold would say something like "It's recommended not to have more than 
>> 150 tables. The system already includes 45 tables for internal usage, so you 
>> shouldn't create more than 105 user tables". I find it's better for 
>> usability to not count the system tables and just say "It's recommended not 
>> to have more than 100 tables. This doesn't include system tables."
>> 
>> On Tue, 13 Jun 2023 at 23:51, Josh McKenzie wrote:
>> 
>>> Warning that too many tables (including system) may have negative behavior 
>>> I think is fine
>> This reminds me of the current situation with our tests where we just keep 
>> adding more and more without really considering the value of the current set 
>> and the costs of that body of work as it keeps growing.
>> 
>> Having some kind of signal that we need to do some housekeeping with our 
>> system tables, or something in the feedback loop that helps us keep on top 
>> of this hygiene over time, seems like a clear benefit to me.
>> 
>> On Tue, Jun 13, 2023, at 1:42 PM, David Capwell wrote:
>>>> I think that the combined decision of using a default value and counting 
>>>> system tables was a mistake
>>> 
>>> Default value I agree with you; features should be off by default!  If we 
>>> remove the default then we disable the feature by default (which im cool 
>>> with) and for anyone who changed the config, they would keep their behavior
>>> 
>>> As for system tables… each table adds a cost to our bookkeeping, so when we 
>>> add new tables the cost grows and the memory per table decreases, does it 
>>> not?  Warning 

CEP 33 - CIDR filtering authorizer

2023-06-14 Thread Shailaja Koppu
Hi Team,

I have created CEP 33 - CIDR filtering authorizer:
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-33%3A+CIDR+filtering+authorizer

The purpose of this feature is to add the ability to restrict user access based 
on the client’s IP (or region). We can map sets of CIDRs to CIDR groups (aka 
regions), and then enable or disable access for roles from certain CIDR groups. 
The CEP page details why we are doing this and how. Please go through it, 
comment here on the discussion thread, and vote. 

For your reference, the code for this feature is at 
https://github.com/apache/cassandra/pull/2414. The PR description contains an 
example usage.
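
For readers who want a feel for the idea before opening the CEP, the mapping from CIDRs to CIDR groups and from roles to allowed groups can be sketched in a few lines of Python. This is not the code from the PR: the group names, role names, address ranges and function names below are all invented for illustration, and only the standard `ipaddress` module is used.

```python
import ipaddress

# Hypothetical CIDR groups (regions), each a set of CIDR blocks.
CIDR_GROUPS = {
    "us_east": ["10.1.0.0/16", "192.168.10.0/24"],
    "eu_west": ["10.2.0.0/16"],
}

# Hypothetical mapping of roles to the CIDR groups they may connect from.
ROLE_ALLOWED_GROUPS = {
    "analytics": {"us_east"},
    "admin": {"us_east", "eu_west"},
}


def groups_for_ip(client_ip):
    """Return the names of all CIDR groups that contain the client IP."""
    addr = ipaddress.ip_address(client_ip)
    return {
        group
        for group, cidrs in CIDR_GROUPS.items()
        if any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)
    }


def is_access_allowed(role, client_ip):
    """Allow access only if the client IP falls in a group enabled for the role."""
    allowed = ROLE_ALLOWED_GROUPS.get(role, set())
    return bool(groups_for_ip(client_ip) & allowed)


print(is_access_allowed("analytics", "10.1.5.9"))  # True
print(is_access_allowed("analytics", "10.2.0.7"))  # False
print(is_access_allowed("admin", "10.2.0.7"))      # True
```

In this sketch an IP that matches no group is denied for every role. Whether the actual feature behaves that way is for the CEP page to say.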


Thanks,
Shailaja

New episode of The Apache Cassandra (R) Corner podcast!

2023-06-14 Thread Aaron Ploetz
Link to the next episode:
https://drive.google.com/file/d/1Rzmvj3db_chj6ZLvQ5G2bBLxIF_QaPCV/view?usp=sharing

s2e6 - Mary Grygleski (DataStax)
(You may have to download it to play)

It will remain in staging for 72 hours, going live (assuming no objections)
by Saturday, June 17th.

If anyone should have any questions or comments, or if you want to be a
guest, please reach out to me.

For my guest pipeline, I'm working on coordinating with Charna Parkey and
Josh McKenzie.  But I am looking for additional guests.  So if you know
someone who has a great use case, let me know!

Thanks, everyone!

Aaron Ploetz


Re: [VOTE] CEP-8 Datastax Drivers Donation

2023-06-14 Thread Adam Holmberg
+1

(long time coming!)

On Wed, Jun 14, 2023 at 3:51 AM Jorge Bay Gondra 
wrote:

> +1 nb
>
> On Wed, Jun 14, 2023 at 9:13 AM Sam Tunnicliffe  wrote:
>
>> +1
>>
>> On 13 Jun 2023, at 15:14, Jeremy Hanna 
>> wrote:
>>
>> Calling for a vote on CEP-8 [1].
>>
>> To clarify the intent, as Benjamin said in the discussion thread [2], the
>> goal of this vote is simply to ensure that the community is in favor of
>> the donation. Nothing more.
>> The plan is to introduce the drivers, one by one. Each driver donation
>> will need to be accepted first by the PMC members, as it is the case for
>> any donation. Therefore the PMC should have full control on the pace at
>> which new drivers are accepted.
>>
>> If this vote passes, we can start this process for the Java driver under
>> the direction of the PMC.
>>
>> Jeremy
>>
>> 1.
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-8%3A+Datastax+Drivers+Donation
>> 2. https://lists.apache.org/thread/opt630do09phh7hlt28odztxdv6g58dp
>>
>>
>>


Re: CEP 33 - CIDR filtering authorizer

2023-06-14 Thread Nate McCall
Hi Shailaja,
This looks super interesting. I particularly like the MONITOR switch. This
is a huge pain-point for a lot of cluster migrations.

Cheers,
-Nate

On Thu, Jun 15, 2023 at 6:43 AM Shailaja Koppu  wrote:

> Hi Team,
>
> I have created CEP 33 - CIDR filtering authorizer
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-33%3A+CIDR+filtering+authorizer
>
> Purpose of this feature is to add the ability to restrict users accesses
> based on the client’s IP (or region). We can map set of CIDRs to CIDR
> groups (aka, regions), and then enable or disable roles to access from
> certain CIDR groups. CEP page details why are we doing this and how. Please
> go through it, comment here on the discussion thread and vote.
>
> For your reference, code for this feature is at
> https://github.com/apache/cassandra/pull/2414. PR description contains an
> example usage.
>
>
> Thanks,
> Shailaja
>


Re: [VOTE] CEP-30 ANN Vector Search

2023-06-14 Thread Patrick McFadin
Andy,

Good to see you on the ML again! CEP-30 is slated for release with 5.0
later in the year. Until then, you'll need to do a local build or try it
out in a preview in Astra. A few of us have been talking about creating a
preview docker image since there is some interest in having it run in
k8ssandra. In any case, this is very alpha code and should be treated as
such. Reporting errors or unusual results would be greatly appreciated!

Patrick



On Wed, Jun 14, 2023 at 7:10 AM Andrew Cobley (Staff) <
a.e.cob...@dundee.ac.uk> wrote:

> Hi All,
>
>
>
> Great news that this has gone through. I’m wondering if we have a timescale
> for this making it to Beta or release? I’m asking because we have a project
> that would benefit from this approach.
>
>
>
> Andy
>
>
>
>
>
> *From: *Jonathan Ellis 
> *Date: *Tuesday, 30 May 2023 at 14:44
> *To: *dev 
> *Subject: *Re: [VOTE] CEP-30 ANN Vector Search
>
>
>
> Thanks, all.  Closing the vote as accepted with 8 binding +1 (including
> me) and 11 non-binding votes.
>
>
>
> On Thu, May 25, 2023 at 10:45 AM Jonathan Ellis  wrote:
>
> Let's make this official.
>
>
> CEP:
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes
>
>
>
> POC that demonstrates all the big rocks, including distributed queries:
> https://github.com/datastax/cassandra/tree/cep-vsearch
>
>
> --
>
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>
>
>
> --
>
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>


Re: [VOTE] CEP-8 Datastax Drivers Donation

2023-06-14 Thread Patrick McFadin
+1

On Wed, Jun 14, 2023 at 2:39 PM Adam Holmberg 
wrote:

> +1
>
> (long time coming!)
>
> On Wed, Jun 14, 2023 at 3:51 AM Jorge Bay Gondra 
> wrote:
>
>> +1 nb
>>
>> On Wed, Jun 14, 2023 at 9:13 AM Sam Tunnicliffe  wrote:
>>
>>> +1
>>>
>>> On 13 Jun 2023, at 15:14, Jeremy Hanna 
>>> wrote:
>>>
>>> Calling for a vote on CEP-8 [1].
>>>
>>> To clarify the intent, as Benjamin said in the discussion thread [2],
>>> the goal of this vote is simply to ensure that the community is in
>>> favor of the donation. Nothing more.
>>> The plan is to introduce the drivers, one by one. Each driver donation
>>> will need to be accepted first by the PMC members, as it is the case for
>>> any donation. Therefore the PMC should have full control on the pace at
>>> which new drivers are accepted.
>>>
>>> If this vote passes, we can start this process for the Java driver under
>>> the direction of the PMC.
>>>
>>> Jeremy
>>>
>>> 1.
>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-8%3A+Datastax+Drivers+Donation
>>> 2. https://lists.apache.org/thread/opt630do09phh7hlt28odztxdv6g58dp
>>>
>>>
>>>


Re: [VOTE] CEP-8 Datastax Drivers Donation

2023-06-14 Thread Jon Haddad
+1

On 2023/06/13 14:14:35 Jeremy Hanna wrote:
> Calling for a vote on CEP-8 [1].
> 
> To clarify the intent, as Benjamin said in the discussion thread [2], the 
> goal of this vote is simply to ensure that the community is in favor of the 
> donation. Nothing more.
> The plan is to introduce the drivers, one by one. Each driver donation will 
> need to be accepted first by the PMC members, as it is the case for any 
> donation. Therefore the PMC should have full control on the pace at which new 
> drivers are accepted.
> 
> If this vote passes, we can start this process for the Java driver under the 
> direction of the PMC.
> 
> Jeremy
> 
> 1. 
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-8%3A+Datastax+Drivers+Donation
> 2. https://lists.apache.org/thread/opt630do09phh7hlt28odztxdv6g58dp