Re: [EXTERNAL] [Discuss] Generic Purpose Rate Limiter in Cassandra

2024-09-23 Thread Benedict Elliott Smith
If we’re debating the overall approach, I think we need to define what we want 
to achieve before we pursue any specific design.

I think rate limiting is simply a proxy for cluster stability. I think 
implicitly we also all want to achieve client fairness. Rate limiting is one 
proposal for achieving only the first - a poor one IMO, driven I think by a 
lack of guarantees currently provided by the rest of the system.

Given this lens, load balancing is IMO a performance improvement - an important 
one, but orthogonal to achieving the stability itself. My personal view is that 
the best next step for achieving both is improved load shedding, building on 
Alex’s earlier work, which I think is broadly what Jon is saying.

The best improvement IMO is ensuring load shedding is fair. That is, we should 
shed load destined for replicas that already have the most work sent to them, 
and for clients with the most work already queued. To do this, we should 
partition the client work queue by client, and as we get back-pressure signals 
from overloaded replicas we should begin materialising a view of the client 
work queue attributing targets to this replica. We can then expire work based 
on the worst queue depth of any queue. We should also use this to override the 
behaviour of the dynamic snitch; we should not send work to replicas with local 
queues that cannot flush due to back-pressure.
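
To make the shape of this concrete, here is a minimal, illustrative sketch of 
per-client work queues with per-replica backlog tracking. All names 
(ClientWorkQueues, Task, onBackPressure) are hypothetical and not existing 
Cassandra classes; a real implementation would live in the native transport / 
messaging layers and be properly thread-safe.

import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: partition pending work by client and track, per replica,
// how much of each client's work targets it, so shedding hits the clients
// contributing most to the most backlogged replica.
public final class ClientWorkQueues
{
    static final class Task
    {
        final String clientId;
        final String replicaId;
        Task(String clientId, String replicaId) { this.clientId = clientId; this.replicaId = replicaId; }
    }

    // one queue per client rather than a single global queue
    private final Map<String, Queue<Task>> byClient = new ConcurrentHashMap<>();
    // per replica: how many queued tasks each client has targeting it
    private final Map<String, Map<String, Integer>> replicaBacklog = new ConcurrentHashMap<>();

    void submit(Task task)
    {
        byClient.computeIfAbsent(task.clientId, k -> new ArrayDeque<>()).add(task);
        replicaBacklog.computeIfAbsent(task.replicaId, k -> new ConcurrentHashMap<>())
                      .merge(task.clientId, 1, Integer::sum);
    }

    // On a back-pressure signal from an overloaded replica, expire work from the
    // clients with the most work already queued against that replica.
    void onBackPressure(String replicaId, int tasksToShed)
    {
        replicaBacklog.getOrDefault(replicaId, Map.of()).entrySet().stream()
                      .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                      .limit(tasksToShed)
                      .forEach(e -> expireOldestFor(e.getKey(), replicaId));
    }

    private void expireOldestFor(String clientId, String replicaId)
    {
        Queue<Task> q = byClient.get(clientId);
        if (q == null)
            return;
        for (Iterator<Task> it = q.iterator(); it.hasNext(); )
        {
            if (it.next().replicaId.equals(replicaId))
            {
                it.remove();
                replicaBacklog.get(replicaId).merge(clientId, -1, Integer::sum);
                return;
            }
        }
    }
}

The "worst queue depth of any queue" rule above would add a check across all 
clients before accepting new work; it is left out here to keep the sketch small.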

I think, once this is in place, the internal queue depths Jon discusses would 
be useful signals to the internode inbound message handler to propagate back to 
peers’ outbound message queues (and, transitively, the client message queue 
management).

I think there are also improvements to be made regarding how we select work to 
be expired, such as partitioning by workload type, predicting whether there is 
enough time for the system to process a query (e.g. not submitting work that is 
close to its expiration), and considering the size of the client payload, etc. 
Also, choosing when to notify a client of timeout/overload - it may be 
preferable to discard the work but delay notifying the client to avoid the 
client retrying too quickly.
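
As a rough illustration of the "don't submit work that is close to its 
expiration" idea - the estimator and class name below are assumptions, not 
existing code:

import java.util.concurrent.TimeUnit;

// Hypothetical admission check: skip work whose remaining time before its
// timeout is unlikely to cover the predicted processing cost for its workload type.
public final class DeadlineAwareAdmission
{
    private static final long SAFETY_MARGIN_NANOS = TimeUnit.MILLISECONDS.toNanos(5);

    /**
     * @param enqueuedAtNanos    queue arrival timestamp of the request
     * @param timeoutNanos       the configured (or client-supplied) timeout
     * @param predictedCostNanos a moving estimate of processing time for this workload type
     */
    static boolean shouldSubmit(long enqueuedAtNanos, long timeoutNanos, long predictedCostNanos)
    {
        long deadline = enqueuedAtNanos + timeoutNanos;
        long remaining = deadline - System.nanoTime();
        // if the work would almost certainly expire mid-flight, expire it now
        // (and possibly delay telling the client, per the point above)
        return remaining - SAFETY_MARGIN_NANOS > predictedCostNanos;
    }
}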



> On 23 Sep 2024, at 08:39, Alex Petrov  wrote:
> 
> [...]

Re: [EXTERNAL] [Discuss] Generic Purpose Rate Limiter in Cassandra

2024-09-23 Thread Štefan Miklošovič
I know it is probably too soon to discuss the implementation details in depth,
as it is hard to say precisely how it will look, but I want to highlight, for
example, this (1). Would some parts of that work touch this logic?

There is also (2), which tries to solve a different but somewhat related
problem. If there is an attacker who just brute-forces passwords, we do not
only want to limit the password hashing; we want to ban such users from trying
again for some time. I do not want to go into details - it is complicated
enough already - I just want to say that this stuff should not be covered by
this work. Auth requests should never be throttled by this mechanism; instead,
we should implement a specific rate limiter with additional logic which would
handle brute-force attempts and ban users for a while before they can try
again. We would ideally never submit such a query to the respective thread pool
for further processing if we realize that a particular user is banned from
logging in because they tried too often and unsuccessfully, which is rather
suspicious.

(1)
https://github.com/apache/cassandra/commit/09b282d1fdd7d6d62542137003011d144c0227be
(2) https://issues.apache.org/jira/browse/CASSANDRA-19734
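
To sketch the kind of early-rejection check I mean (AuthBanTracker and the
thresholds below are made up for illustration; nothing like this exists in the
codebase today):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of "ban before submitting to the auth thread pool":
// track failed attempts per user and reject early while a temporary ban is active.
public final class AuthBanTracker
{
    private static final int MAX_FAILURES = 5;
    private static final long BAN_NANOS = TimeUnit.MINUTES.toNanos(10);

    private static final class State { int failures; long bannedUntilNanos; }

    private final Map<String, State> states = new ConcurrentHashMap<>();

    /** Called before handing the auth request to the auth thread pool. */
    public boolean isBanned(String user)
    {
        State s = states.get(user);
        return s != null && System.nanoTime() < s.bannedUntilNanos;
    }

    public void recordFailure(String user)
    {
        states.compute(user, (u, s) -> {
            if (s == null) s = new State();
            if (++s.failures >= MAX_FAILURES)
            {
                s.bannedUntilNanos = System.nanoTime() + BAN_NANOS;
                s.failures = 0;
            }
            return s;
        });
    }

    public void recordSuccess(String user)
    {
        states.remove(user);
    }
}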

On Mon, Sep 23, 2024 at 9:41 AM Alex Petrov  wrote:

> [...]

Re: [EXTERNAL] [Discuss] Generic Purpose Rate Limiter in Cassandra

2024-09-23 Thread Alex Petrov
> Are those three sufficient to protect against a client that unexpectedly 
> comes up with 100x a previous provisioned-for workload? Or 100 clients at 
> 100x concurrently? Given that can be 100x in terms of quantity (helped by 
> queueing and shedding), but also 100x in terms of *computational and disk 
> implications*? We don't have control over what users do, and while in an 
> ideal world client applications have fairly predictable workloads over time, 
> in practice things spike around various events for different applications.

To fully answer your question I would probably have to flesh out a CEP (which I 
hope to get to ASAP), and I will elaborate on all of the points at Community 
over Code this year (and will do my best to put it in writing for those who 
will not attend). But to answer briefly, one of the ideas is exactly that one 
thing in a queue is not the same as another thing in a queue.

> I see these 2 things as complementary, not as interdependent. Is there 
> something I'm missing?

I think if we start working on rate limiting before we implement good load 
balancing, we risk shedding load that could otherwise have been handled by the 
cluster. I think you even said it yourself with "over time it would raise the 
ceiling at which rate limiting kicked in".

Besides, in CASSANDRA-19534 
(https://issues.apache.org/jira/browse/CASSANDRA-19534) I was attempting to 
show that we need to find the maximal natural throughput that the cluster can 
handle without tipping over, and maintain it. And the easiest way to handle 
this was, naturally, through the user-set timeouts. We can work in resource 
limits, but as of now I see only marginal improvement over what we can do just 
with timeouts.

One of the risks of a misconfigured rate-limiter is avoidable throttling / 
shedding. For example, the CEP mentions a trigger at 80% CPU utilization. But 
while testing 19534, we have seen that we can easily burst into high CPU 
utilization, let the requests that do not satisfy client timeout boundaries get 
shed, and continue operating without any additional rate-limiting. My guess is 
that the existing rate-limiter sees less usage than we wish it to primarily 
because it is hard to say where to set the limits for throwing 
OverloadedException, and TCP throttling/backoff does not work because we lose 
the queue arrival timestamp and start triggering client timeouts that are 
invisible to the server (i.e. the client retries while the request is still in 
the queue). While the latter problem has a trivial solution (i.e. client-set 
deadlines), the former probably requires some auto-tuning or guidance.
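
For the client-set deadline part, a sketch of what the server-side check could 
look like - the QueuedRequest shape and the clientDeadlineNanos field are 
assumptions about a hypothetical protocol addition, not something drivers send 
today:

// Hypothetical sketch: if the client supplies an absolute deadline (or the server
// keeps the queue arrival timestamp), requests the client has already given up on
// can be dropped instead of processed, so server-side shedding stays visible to
// the client rather than wasting work on invisible timeouts.
public final class DeadlineFilter
{
    static final class QueuedRequest
    {
        final long arrivalNanos;        // recorded when the request enters the queue
        final long clientDeadlineNanos; // assumed client-supplied deadline; 0 if absent
        QueuedRequest(long arrivalNanos, long clientDeadlineNanos)
        {
            this.arrivalNanos = arrivalNanos;
            this.clientDeadlineNanos = clientDeadlineNanos;
        }
    }

    /** True if the request is still worth dequeuing and processing. */
    static boolean stillUseful(QueuedRequest r, long defaultTimeoutNanos)
    {
        long deadline = r.clientDeadlineNanos > 0
                        ? r.clientDeadlineNanos
                        : r.arrivalNanos + defaultTimeoutNanos;
        return System.nanoTime() < deadline;
    }
}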

Another example is "culprit Keyspaces" from the CEP. If we introduce fairness 
into our load balancing, a single keyspace or a replica set (partition) will 
not be able to dominate the cluster workload, causing across-the-board 
timeouts. This means that by simply giving preference to other requests we have 
naturally shed an imbalance without introducing any rate-limiting.
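
One simple way to get that kind of fairness is round-robin scheduling across 
keyspaces. The sketch below is purely illustrative (the class does not exist, 
and a real scheduler would likely weight by request cost rather than treat all 
requests equally):

import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical per-keyspace fair scheduler: serves keyspaces round-robin so a single
// hot keyspace cannot starve the others, bounding a "culprit keyspace" without an
// explicit rate limit.
public final class KeyspaceFairScheduler<T>
{
    private final LinkedHashMap<String, Queue<T>> queues = new LinkedHashMap<>();

    public synchronized void enqueue(String keyspace, T request)
    {
        queues.computeIfAbsent(keyspace, k -> new ArrayDeque<>()).add(request);
    }

    /** Dequeues from the first non-empty keyspace, then rotates it to the back. */
    public synchronized T next()
    {
        String served = null;
        T request = null;
        for (Map.Entry<String, Queue<T>> e : queues.entrySet())
        {
            request = e.getValue().poll();
            if (request != null)
            {
                served = e.getKey();
                break;
            }
        }
        if (served != null)
        {
            // move the served keyspace to the back so others get a turn before it again
            queues.put(served, queues.remove(served));
        }
        return request;
    }
}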

Maybe the problem is in the terminology, but I think we should choose 
"prioritize read/write coordinator workload" over "block any {read/write} 
{coordinator/replication} traffic for a table". Let me give an example. Let's 
say we implement some algorithm for replenishing the request allowance for 
write replication for a table, and this table runs out of tokens. If we decide 
to shed this request before waiting until the last moment at which it could 
potentially have been processed, we risk shedding a request that can be served. 
But if we give such a request a lower priority, we can get to it when we get to 
it, given current resources and queues. If we can still process it, we will, 
even if it is at the very end of our timeout guarantee, and I think it does not 
need to be shed.
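
A sketch of that "demote rather than shed" behaviour - the Priority enum and 
PriorityAdmission class are illustrative names only, and the only point at 
which work is actually dropped is when its deadline has genuinely passed:

import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical sketch: when a table exhausts its replication allowance, its requests
// are demoted rather than dropped, and are still served if capacity remains before
// their deadline; shedding happens only once the deadline has actually passed.
public final class PriorityAdmission
{
    enum Priority { NORMAL, DEMOTED }

    static final class Work
    {
        final Priority priority;
        final long deadlineNanos;
        final Runnable task;
        Work(Priority priority, long deadlineNanos, Runnable task)
        {
            this.priority = priority;
            this.deadlineNanos = deadlineNanos;
            this.task = task;
        }
    }

    // normal work first; within a priority level, earliest deadline first
    private final PriorityBlockingQueue<Work> queue = new PriorityBlockingQueue<>(
        64, Comparator.comparing((Work w) -> w.priority).thenComparingLong(w -> w.deadlineNanos));

    public void submit(Work w)
    {
        queue.add(w);
    }

    /** Worker loop step: run the highest-priority work, dropping only already-expired items. */
    public void drainOnce()
    {
        Work w = queue.poll();
        if (w == null)
            return;
        if (System.nanoTime() < w.deadlineNanos)
            w.task.run();
        // else: it expired while waiting; only now is it actually shed
    }
}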

I think what I have in mind seems to jibe very well with the points you have 
brought up:
  * How can nodes protect themselves from variable and changing user behavior 
in a way that's minimally disruptive to the user and requires as little 
configuration as possible for operators?
  * How do we keep the limits of node performance from leaking into the scope 
of user awareness and responsibility, outside simply pushing various exceptions 
to the client to indicate what's going on?

To summarize, I think good load balancing and workload prioritization, combined 
with just "give up on a request if we know for a fact we cannot process it", 
feels like a simpler way to solve the two problems you mentioned, and it will 
help us maximize cluster utilization while not shedding load that could 
otherwise have been served, _while_ also reducing latencies across the board.


On Sat, Sep 21, 2024, at 1:35 PM, Josh McKenzie wrote:
> [...]

Re: [DISCUSS] Introduce CREATE TABLE LIKE grammer

2024-09-23 Thread Štefan Miklošovič
If we have this table

CREATE TABLE ks.tb2 (
id int PRIMARY KEY,
name text
);

I can either specify the name of an index on my own like this:

CREATE INDEX name_index ON ks.tb2 (name) ;

or I can let Cassandra figure out that name on its own:

CREATE INDEX ON ks.tb2 (name) ;

in that case it will name that index "tb2_name_idx".

Hence, I would expect that when we do

ALTER TABLE ks.to_copy LIKE ks.tb2 WITH INDICES;

then the ks.to_copy table will have an index called "to_copy_name_idx" without
me doing anything.

For types, we do not need to do anything when we deal with the same keyspace.
For simplicity, I suggested that we might handle only the same-keyspace
scenario for now and iterate on that in the future.

On Mon, Sep 23, 2024 at 8:53 AM guo Maxwell  wrote:

> Hello everyone,
>
> The CEP is being written, and I encountered some problems during the process.
> I would like to discuss them with you. If you read the description of
> CASSANDRA-7662 (https://issues.apache.org/jira/browse/CASSANDRA-7662), you
> will find that the original creator of this Jira did not initially intend to
> implement structural copying of indexes, views, and triggers - only the
> columns and their data types.
>
> However, after investigating the related syntax and functionality in some
> other databases, I found that it may be necessary for us to provide richer
> syntax to support copying indexes, views, etc.
>
> In order to support selectively copying the basic structure of the table
> (columns and types), table options, and table-related indexes, views,
> triggers, etc., we need some new syntax. PostgreSQL's syntax seems relatively
> comprehensive: it uses the keywords INCLUDING/EXCLUDING to flexibly control
> which parts (indexes, table information, etc.) are retained or dropped; see
> the PostgreSQL CREATE TABLE ... LIKE documentation. There, a newly created
> index gets a name different from the original table's index name, derived by
> a naming rule. MySQL is comparatively simple and copies columns and indexes
> by default; see the MySQL CREATE TABLE ... LIKE documentation, where the
> newly created index name is the same as the original table's index name.
>
> So for Cassandra, I hope it can also support copying index and even
> view/trigger information. And I also hope to be able to flexibly decide which
> information is copied, as PostgreSQL allows.
>
> Besides, I think the copy can happen between different keyspaces. And UDTs
> need to be taken into account.
>
> But as we know, index/view/trigger names are all scoped at the keyspace
> level, so it seems that the newly created index name (or view/trigger name)
> must be different from the original table's, otherwise the names would clash.
>
> So regarding the above problem, one idea I have is that for newly created
> types, indexes and views - whether in a different keyspace or the same one -
> we first generate random names for them, and then add the ability to modify
> those names (for types/indexes/views/triggers) so that users can change them
> manually.
>
>
> guo Maxwell wrote on Fri, Sep 20, 2024 at 08:06:
>
>> No, I think we still need some discussion on the grammar details after I
>> finish the first version.
>>
>> Patrick McFadin wrote on Fri, Sep 20, 2024 at 2:24 AM:
>>
>>> Is this CEP ready for a VOTE thread?
>>>
>>> On Sat, Aug 24, 2024 at 8:56 PM guo Maxwell wrote:
>>>
 Thank you for your replies, I will prepare a CEP later.

 Patrick McFadin wrote on Tue, Aug 20, 2024 at 02:11:

> +1 This is a CEP
>
> On Mon, Aug 19, 2024 at 10:50 AM Jon Haddad  wrote:
>
>> Given the fairly large surface area for this, i think it should be a
>> CEP.
>>
>> —
>> Jon Haddad
>> Rustyrazorblade Consulting
>> rustyrazorblade.com
>>
>>
>> On Mon, Aug 19, 2024 at 10:44 AM Bernardo Botella <
>> conta...@bernardobotella.com> wrote:
>>
>>> Definitely a nice addition to CQL.
>>>
>>> Looking for inspiration at how Postgres and Mysql do that may also
>>> help with the final design (I like the WITH proposed by Stefan, but I 
>>> would
>>> definitely take a look at the INCLUDING keyword proposed by Postgres).
>>> https://www.postgresql.org/docs/current/sql-createtable.html
>>> https://dev.mysql.com/doc/refman/8.4/en/create-table-like.html
>>>
>>> On top of that, and as part of the interesting questions, I would
>>> like to add the permissions to the mix. Both the question about copying
>>> them over (with a WITH keyword probably), and the need for read 
>>> permissions
>>> on the source table as well.
>>>
>>> Bernardo
>>>
>>> On Aug 19, 2024, at 10:01 AM, Štefan Miklošovič <
>>> smikloso...@apache.org> wrote:
>>>
>>> BTW this would be cool to do as wel

Re: CEP-15: Accord status

2024-09-23 Thread Caleb Rackliffe
There is also a Jira to track pre-merge tasks here: 
https://issues.apache.org/jira/browse/CASSANDRA-18196

> On Sep 20, 2024, at 4:09 PM, Josh McKenzie  wrote:
> 
> 
>> 
>> This presents an opportune moment for those interested to review the code.
>> ...
>> +88,341 −7,341
>> 1003 Files changed
> 
> O.o
> This is... very large. If we use CASSANDRA-8099 as our "banana for scale":
>> 645 files changed, 49381 insertions(+), 42227 deletions(-)
> 
> To be clear - I don't think we collectively should be worried about 
> disruption from this patch since:
> - Each commit (or the vast majority?) has already been reviewed by >= 1 other committer
> - 7.3k deletions is a lot less than 42k
> - We now have fuzzing, property based testing, and the simulator
> - Most of this code is additive
> How would you recommend interested parties engage with reviewing this 
> behemoth? Or perhaps subsections of it, or key areas to familiarize themselves 
> with the structure?
> 
>> On Fri, Sep 20, 2024, at 12:17 PM, David Capwell wrote:
>> Recently, we rebased against the trunk branch, ensuring that the accord 
>> branch is now in sync with the latest trunk version. This presents an 
>> opportune moment for those interested to review the code.
>> 
>> We have a pending pull request 
>> (https://github.com/apache/cassandra/pull/3552) that we do not intend to 
>> merge.
>> 
>> Our current focus is on addressing several bug fixes and ensuring the safety 
>> of topology changes (as evidenced by the number of issues filed against the 
>> trunk). Once we wrap up bug fixes and safety features, we will likely 
>> discuss the merge to trunk, so now is a great time to start engaging.
>> 
>> Thank you everyone for your patience!
> 


[Discuss] CASSANDRA-17666, disable write path for cdc

2024-09-23 Thread Nikolai Loginov
Is it possible to backport the changes from CASSANDRA-17666 to the 4.1
branch?

Regards

Nikolai Loginov