Re: [EXTERNAL] [Discuss] Generic Purpose Rate Limiter in Cassandra

2024-09-23 Thread Benedict Elliott Smith
If we’re debating the overall approach, I think we need to define what we want to achieve before we pursue any specific design. I think rate limiting is simply a proxy for cluster stability. I think implicitly we also all want to achieve client fairness. Rate limiting is one proposal for achiev

Re: [EXTERNAL] [Discuss] Generic Purpose Rate Limiter in Cassandra

2024-09-23 Thread Štefan Miklošovič
I know it is probably too soon to discuss the implementation details in depth, as it is hard to say precisely what it will look like, but I want to highlight, for example, this (1). Would some parts of that work touch this logic? There is also (2), which tries to solve a different but somewhat related pro

Re: [EXTERNAL] [Discuss] Generic Purpose Rate Limiter in Cassandra

2024-09-23 Thread Alex Petrov
> Are those three sufficient to protect against a client that unexpectedly comes up with 100x a previous provisioned-for workload? Or 100 clients at 100x concurrently? Given that can be 100x in terms of quantity (helped by queueing and shedding), but also 100x in terms of *computational an

Re: [EXTERNAL] [Discuss] Generic Purpose Rate Limiter in Cassandra

2024-09-21 Thread Jon Haddad
Oh, one last thing. If the client drivers were to implement a rate limiter based on each node's error rate, and had the ability to back off, paired with CASSANDRA-19534, I think the majority of severe cluster outages that people experience wo

Re: [EXTERNAL] [Discuss] Generic Purpose Rate Limiter in Cassandra

2024-09-21 Thread Jon Haddad
Can you elaborate what “the bad” is here? Maybe a scenario would help. I’m trying to visualize what kind of workload would be running where you wouldn’t have timeouts or a deep queue yet a node is overloaded. What is “the bad” if requests aren’t timing out? How is a node overloaded if there isn’t

Re: [EXTERNAL] [Discuss] Generic Purpose Rate Limiter in Cassandra

2024-09-21 Thread Jordan West
I agree with Josh. We need to be able to protect from a sudden burst of traffic. 19534 went a long way in that regard, at least with respect to minimizing the effects. The challenge with latency and queue depths can be that they trigger when the bad has already occurred. One other thing we are considering

Re: [EXTERNAL] [Discuss] Generic Purpose Rate Limiter in Cassandra

2024-09-21 Thread Josh McKenzie
Are those three sufficient to protect against a client that unexpectedly comes up with 100x a previous provisioned-for workload? Or 100 clients at 100x concurrently? Given that can be 100x in terms of quantity (helped by queueing and shedding), but also 100x in terms of *computational and disk i

Re: [EXTERNAL] [Discuss] Generic Purpose Rate Limiter in Cassandra

2024-09-21 Thread Alex Petrov
> Personally, I’m a bit skeptical that we will come up with a metric-based heuristic that works well in most scenarios and doesn’t require significant knowledge and tuning. I think past implementations of the dynamic snitch are good evidence of that.
I am more optimistic on that front. I t

Re: [EXTERNAL] [Discuss] Generic Purpose Rate Limiter in Cassandra

2024-09-20 Thread Jordan West
+1 to Benedict’s (and others’) comments on pluggability and low overhead when disabled. The latter I think needs little justification. The reason I am big on the former is, in my opinion: decisions on approach need to be settled with numbers not anecdotes or past experience (including my own). So I w

Re: [EXTERNAL] [Discuss] Generic Purpose Rate Limiter in Cassandra

2024-09-20 Thread Jon Haddad
Assuming the intent was to migrate the Google Doc to the CEP, I took another look. I think there are some ambitious ideas here, and I appreciate any effort to improve Cassandra's stability. I think CASSANDRA-19534 was a massive step in the rig

Re: [EXTERNAL] [Discuss] Generic Purpose Rate Limiter in Cassandra

2024-09-19 Thread Benedict Elliott Smith
I just want to flag here that this is a topic I have strong opinions on, but the CEP is not really specific or detailed enough to understand precisely how it will be implemented. So, if a patch is already being produced, most of my feedback is likely to be provided some time after a patch appear