Re: [Discuss] Repair inside C*

2023-07-24 Thread Jaydeep Chovatia
Hi German,

The goal is always to backport our learnings to the community. For
example, I have already backported the following two enhancements/bug
fixes, which are described in the article, to Open Source Cassandra. I am
currently working on open-sourcing a few more of the enhancements
mentioned in the article.

   1. https://issues.apache.org/jira/browse/CASSANDRA-18555
   2. https://issues.apache.org/jira/browse/CASSANDRA-13740

There is definitely heavy interest in having the repair solution inside
Open Source Cassandra itself, very much like Compaction. As I write this
email, we are internally working on a one-pager proposal doc for the
community on having repair inside OSS Apache Cassandra, along with our
private fork - I will share it soon.

Generally, we are ok with any solution getting adopted (Joey's solution,
our repair solution, or any other solution). The primary motivation is to
have repair embedded inside open-source Cassandra itself, so we can
eventually retire all the various privately developed solutions :)

I am also happy to help (drive conversation, discussion, etc.) in any way
to get a repair solution adopted inside Cassandra itself - please let me
know. Happy to help!

Yours Faithfully,
Jaydeep

On Mon, Jul 24, 2023 at 1:44 PM German Eichberger via dev <
dev@cassandra.apache.org> wrote:

> All,
>
> We had a brief discussion in [2] about the Uber article [1] where they
> talk about having integrated repair into Cassandra and how great that is. I
> expressed my disappointment that they didn't work with the community on
> that (Uber, if you are listening time to make amends 🙂) and it turns out
> Joey already had the idea and wrote the code [3] - so I wanted to start a
> discussion to gauge interest and maybe how to revive that effort.
>
> Thanks,
> German
>
> [1]
> https://www.uber.com/blog/how-uber-optimized-cassandra-operations-at-scale/
> [2] https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
> [3] https://issues.apache.org/jira/browse/CASSANDRA-14346
>


Re: [Discuss] Repair inside C*

2023-07-24 Thread Jaydeep Chovatia
To clarify the timing of the repair solution: the one we have listed in
the article is not recently developed. We were hitting some high-priority
production challenges back in early 2018, and to address them we developed
and rolled out the solution to production in just a few months.
Timing-wise, the solution was developed and productized by Q3 2018, and of
course it continued to evolve thereafter. Usually, we explore existing
solutions we can leverage, but when we started our journey in early 2018,
most of the available options were sidecar-based. There is nothing against
a sidecar solution; it was purely a business decision - we wanted to avoid
a sidecar so as not to take a dependency on the control plane. Every
solution developed has its own deep context, merits, and pros and cons;
they are all great solutions!

My appeal to the community members is to think one more time about having
repair in Open Source Cassandra itself. As mentioned in my previous email,
any solution getting adopted is fine; the important thing is to have a
repair solution in OSS Cassandra itself!

Yours Faithfully,
Jaydeep

On Mon, Jul 24, 2023 at 3:46 PM Jaydeep Chovatia 
wrote:

> Hi German,
>
> The goal is always to backport our learnings back to the community. For
> example, I have already successfully backported the following two
> enhancements/bug fixes back to the Open Source Cassandra, which are
> described in the article. I am already currently working on open-source a
> few more enhancements mentioned in the article back to the open-source.
>
>1. https://issues.apache.org/jira/browse/CASSANDRA-18555
>2. https://issues.apache.org/jira/browse/CASSANDRA-13740
>
> There is definitely heavy interest in having the repair solution inside
> the Open Source Cassandra itself, very much like Compaction. As I write
> this email, we are internally working on a one-pager proposal doc to all
> the community members on having a repair inside the OSS Apache Cassandra
> along with our private fork - I will share it soon.
>
> Generally, we are ok with any solution getting adopted (either Joey's
> solution or our repair solution or any other solution). The primary
> motivation is to have the repair embedded inside the open-source Cassandra
> itself, so we can retire all various privately developed solutions
> eventually :)
>
> I am also happy to help (drive conversation, discussion, etc.) in any way
> to have a repair solution adopted inside Cassandra itself, please let me
> know. Happy to help!
>
> Yours Faithfully,
> Jaydeep
>
> On Mon, Jul 24, 2023 at 1:44 PM German Eichberger via dev <
> dev@cassandra.apache.org> wrote:
>
>> All,
>>
>> We had a brief discussion in [2] about the Uber article [1] where they
>> talk about having integrated repair into Cassandra and how great that is. I
>> expressed my disappointment that they didn't work with the community on
>> that (Uber, if you are listening time to make amends 🙂) and it turns out
>> Joey already had the idea and wrote the code [3] - so I wanted to start a
>> discussion to gauge interest and maybe how to revive that effort.
>>
>> Thanks,
>> German
>>
>> [1]
>> https://www.uber.com/blog/how-uber-optimized-cassandra-operations-at-scale/
>> [2] https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
>> [3] https://issues.apache.org/jira/browse/CASSANDRA-14346
>>
>


Re: [Discuss] Repair inside C*

2023-07-25 Thread Jaydeep Chovatia
Sounds good, German. Feel free to let me know if you need my help in
filing the CEP, adding supporting content to it, etc.
As I mentioned previously, I have already been working (it is going through
an internal review) on a one-pager doc, code, etc. for the solution that
has been working for us at immense scale for the last six years, and I will
share it soon on a private fork.

Thanks,
Jaydeep

On Tue, Jul 25, 2023 at 9:48 AM German Eichberger via dev <
dev@cassandra.apache.org> wrote:

> In [2] we suggested that the next step should be a CEP.
>
> I am happy to lend a hand to this effort as well.
>
> Thanks Jaydeep and David - really appreciated.
>
> German
>
> --
> *From:* David Capwell 
> *Sent:* Tuesday, July 25, 2023 8:32 AM
> *To:* dev 
> *Cc:* German Eichberger 
> *Subject:* [EXTERNAL] Re: [Discuss] Repair inside C*
>
> As someone who has done a lot of work trying to make repair stable, I
> approve of this message ^_^
>
> More than glad to help mentor this work
>
> On Jul 24, 2023, at 6:29 PM, Jaydeep Chovatia 
> wrote:
>
> To clarify the repair solution timing, the one we have listed in the
> article is not the recently developed one. We were hitting some
> high-priority production challenges back in early 2018, and to address
> that, we developed and rolled out the solution in production in just a few
> months. The timing-wise, the solution was developed and productized by Q3
> 2018, of course, continued to evolve thereafter. Usually, we explore the
> existing solutions we can leverage, but when we started our journey in
> early 2018, most of the solutions were based on sidecar solutions. There is
> nothing against the sidecar solution; it was just a pure business decision,
> and in that, we wanted to avoid the sidecar to avoid a dependency on the
> control plane. Every solution developed has its deep context, merits, and
> pros and cons; they are all great solutions!
>
> An appeal to the community members is to think one more time about having
> repairs in the Open Source Cassandra itself. As mentioned in my previous
> email, any solution getting adopted is fine; the important aspect is to
> have a repair solution in the OSS Cassandra itself!
>
> Yours Faithfully,
> Jaydeep
>
> On Mon, Jul 24, 2023 at 3:46 PM Jaydeep Chovatia <
> chovatia.jayd...@gmail.com> wrote:
>
> Hi German,
>
> The goal is always to backport our learnings back to the community. For
> example, I have already successfully backported the following two
> enhancements/bug fixes back to the Open Source Cassandra, which are
> described in the article. I am already currently working on open-source a
> few more enhancements mentioned in the article back to the open-source.
>
>1. https://issues.apache.org/jira/browse/CASSANDRA-18555
>2. https://issues.apache.org/jira/browse/CASSANDRA-13740
>
> There is definitely heavy interest in having the repair solution inside
> the Open Source Cassandra itself, very much like Compaction. As I write
> this email, we are internally working on a one-pager proposal doc to all
> the community members on having a repair inside the OSS Apache Cassandra
> along with our private fork - I will share it soon.
>
> Generally, we are ok with any solution getting adopted (either Joey's
> solution or our repair solution or any other solution). The primary
> motivation is to have the repair embedded inside the open-source Cassandra
> itself, so we can retire all various privately developed solutions
> eventually :)
>
> I am also happy to help (drive conversation, discussion, etc.) in any way
> to have a repair solution adopted inside Cassandra itself, please let me
> know. Happy to help!
>
> Yours Faithfully,
> Jaydeep
>
> On Mon, Jul 24, 2023 at 1:44 PM German Eichberger via dev <
> dev@cassandra.apache.org> wrote:
>
> All,
>
> We had a brief discussion in [2] about the Uber article [1] where they
> talk about having integrated repair into Cassandra and how great that is. I
> expressed my disappointment that they didn't work with the community on
> that (Uber, if you are listening time to make amends 🙂) and it turns out
> Joey already had the idea and wrote the code [3] - so I wanted to start a
> discussion to gauge interest and maybe how to revive that effort.
>
> Thanks,
> German
>
> [1]
> https://www.uber.com/blog/how-uber-optimized-cassandra-operations-at-scale/
> [2] https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
> [3] https://issues.apache.org/jira/browse/CASSANDRA-14346
>
>
>


[Discuss] Detecting token-ownership mismatch

2023-08-16 Thread Jaydeep Chovatia
Hi,


As we know, Cassandra exchanges important topology and
token-ownership-related details over Gossip. Cassandra internally maintains
two separate caches that hold the token-ownership information: 1) the
Gossip cache and 2) the Storage Service cache. On a node, the Gossip cache
is updated first, followed by the Storage Service cache. In the hot path,
ownership is calculated from the Storage Service cache. Since two separate
caches maintain the same information, inconsistencies are bound to happen.
It is quite feasible that the Gossip cache has up-to-date ownership of the
Cassandra cluster while the Storage Service cache does not, and in that
scenario, inconsistent data will be served to the user.

Currently, there is no mechanism in Cassandra that detects and fixes
inconsistencies between these two caches.

*Long-term solution*
We are going with the long-term transactional metadata (
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21) to handle
such inconsistencies, and that’s the right thing to do.

*Short-term solution*
But CEP-21 might take some time, and until then, there is a need to
*detect* such
inconsistencies. Once we detect inconsistencies, then we could have two
options: 1) restart the node or 2) Fix the inconsistencies on-the-fly.

I've created the following JIRA for the short-term fix:
https://issues.apache.org/jira/browse/CASSANDRA-18758
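
To make the detection idea a bit more concrete, here is a rough sketch of
what a periodic checker could look like. The accessor names and the
token/endpoint representation are illustrative assumptions, not the actual
Cassandra internals; it only shows the comparison logic:

import java.util.Map;
import java.util.Objects;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical read-only view of a token-ownership cache: token -> owning endpoint.
interface TokenOwnershipView {
    Map<String, String> tokenToEndpoint();
}

public class TokenOwnershipChecker {
    private final TokenOwnershipView gossipCache;          // assumed to wrap the Gossip state
    private final TokenOwnershipView storageServiceCache;  // assumed to wrap the Storage Service state
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public TokenOwnershipChecker(TokenOwnershipView gossipCache, TokenOwnershipView storageServiceCache) {
        this.gossipCache = gossipCache;
        this.storageServiceCache = storageServiceCache;
    }

    // Periodically compare the two caches and report every token whose owner differs.
    public void start(long periodSeconds) {
        scheduler.scheduleAtFixedRate(this::checkOnce, periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }

    void checkOnce() {
        Map<String, String> gossip = gossipCache.tokenToEndpoint();
        Map<String, String> storage = storageServiceCache.tokenToEndpoint();
        for (Map.Entry<String, String> e : gossip.entrySet()) {
            String storageOwner = storage.get(e.getKey());
            if (!Objects.equals(e.getValue(), storageOwner)) {
                // Detection only: the follow-up action (restart the node or fix the
                // cache on the fly) is left to the operator / a later patch.
                System.err.printf("Token %s owned by %s per Gossip cache but %s per Storage Service cache%n",
                                  e.getKey(), e.getValue(), storageOwner);
            }
        }
    }
}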


Does this sound valuable?


Jaydeep


Re: [Discuss] Repair inside C*

2023-08-24 Thread Jaydeep Chovatia
Is anyone going to file an official CEP for this?
As mentioned in this email thread, here is one of the solution's design doc
<https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit#heading=h.r112r46toau0>
and source code on a private Apache Cassandra patch. Could you go through
it and let me know what you think?

Jaydeep

On Wed, Aug 2, 2023 at 3:54 PM Jon Haddad 
wrote:

> > That said I would happily support an effort to bring repair scheduling
> to the sidecar immediately. This has nothing blocking it, and would
> potentially enable the sidecar to provide an official repair scheduling
> solution that is compatible with current or even previous versions of the
> database.
>
> This is something I hadn't thought much about, and is a pretty good
> argument for using the sidecar initially.  There's a lot of deployments out
> there and having an official repair option would be a big win.
>
>
> On 2023/07/26 23:20:07 "C. Scott Andreas" wrote:
> > I agree that it would be ideal for Cassandra to have a repair scheduler
> in-DB.
> >
> > That said I would happily support an effort to bring repair scheduling
> to the sidecar immediately. This has nothing blocking it, and would
> potentially enable the sidecar to provide an official repair scheduling
> solution that is compatible with current or even previous versions of the
> database.
> >
> > Once TCM has landed, we’ll have much stronger primitives for repair
> orchestration in the database itself. But I don’t think that should block
> progress on a repair scheduling solution in the sidecar, and there is
> nothing that would prevent someone from continuing to use a sidecar-based
> solution in perpetuity if they preferred.
> >
> > - Scott
> >
> > > On Jul 26, 2023, at 3:25 PM, Jon Haddad 
> wrote:
> > >
> > > I'm 100% in favor of repair being part of the core DB, not the
> sidecar.  The current (and past) state of things where running the DB
> correctly *requires* running a separate process (either community
> maintained or official C* sidecar) is incredibly painful for folks.  The
> idea that your data integrity needs to be opt-in has never made sense to me
> from the perspective of either the product or the end user.
> > >
> > > I've worked with way too many teams that have either configured this
> incorrectly or not at all.
> > >
> > > Ideally Cassandra would ship with repair built in and on by default.
> Power users can disable if they want to continue to maintain their own
> repair tooling for some reason.
> > >
> > > Jon
> > >
> > >> On 2023/07/24 20:44:14 German Eichberger via dev wrote:
> > >> All,
> > >> We had a brief discussion in [2] about the Uber article [1] where
> they talk about having integrated repair into Cassandra and how great that
> is. I expressed my disappointment that they didn't work with the community
> on that (Uber, if you are listening time to make amends 🙂) and it turns
> out Joey already had the idea and wrote the code [3] - so I wanted to start
> a discussion to gauge interest and maybe how to revive that effort.
> > >> Thanks,
> > >> German
> > >> [1]
> https://www.uber.com/blog/how-uber-optimized-cassandra-operations-at-scale/
> > >> [2] https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
> > >> [3] https://issues.apache.org/jira/browse/CASSANDRA-14346
> >
>


Need Confluence "Create" permission for filing a CEP

2023-10-09 Thread Jaydeep Chovatia
Hi,

I want to create a new CEP request but do not see the "Create" page
permission on Confluence. Could someone grant me permission?
Here is the CEP draft: [DRAFT] CEP - Apache Cassandra Official Repair
Solution - Google Docs


My Confluence user-id is: chovatia.jayd...@gmail.com

Jaydeep


Re: Need Confluence "Create" permission for filing a CEP

2023-10-10 Thread Jaydeep Chovatia
Thank you!

On Tue, Oct 10, 2023 at 2:58 AM Brandon Williams  wrote:

> I've added you, you should have access now.
>
> Kind Regards,
> Brandon
>
> On Tue, Oct 10, 2023 at 1:24 AM Jaydeep Chovatia
>  wrote:
> >
> > Hi,
> >
> > I want to create a new CEP request but do not see the "Create" page
> permission on Confluent. Could someone permit me?
> > Here is the CEP draft: [DRAFT] CEP - Apache Cassandra Official Repair
> Solution - Google Docs
> >
> > My confluent user-id is: chovatia.jayd...@gmail.com
> >
> > Jaydeep
>


[Discuss] Generic Purpose Rate Limiter in Cassandra

2024-01-16 Thread Jaydeep Chovatia
Hi,

Happy New Year!

I would like to discuss the following idea:

Open-source Cassandra (CASSANDRA-15013
<https://issues.apache.org/jira/browse/CASSANDRA-15013>) has an elementary
built-in memory rate limiter based on the incoming payload from user
requests. This rate limiter activates if any incoming user request’s
payload exceeds certain thresholds. However, the existing rate limiter only
solves limited-scope issues. Cassandra's server-side meltdown due to
overload is a known problem, and often we see a couple of busy nodes take
down the entire Cassandra ring due to the ripple effect. The following
document proposes a general-purpose, comprehensive rate limiter that works
off system signals, such as CPU, and internal signals, such as thread-pool
health. The rate limiter will have knobs to filter internal traffic, system
traffic, and replication traffic, and to filter further based on the
types of queries.
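
As a rough illustration of the intended decision flow (the names,
thresholds, and per-traffic knobs below are assumptions made for the
sketch, not the proposed API):

import java.util.EnumMap;
import java.util.Map;

// Traffic categories the limiter could filter on; illustrative only.
enum TrafficType { INTERNAL, SYSTEM, REPLICATION, USER_READ, USER_WRITE }

public class GenericRateLimiterSketch {
    // Per-traffic-type knobs: whether this category may be shed under pressure.
    private final Map<TrafficType, Boolean> shedEnabled = new EnumMap<>(TrafficType.class);
    private volatile double cpuThreshold = 0.85;   // runtime-tunable knob (assumed default)
    private volatile int maxDroppedTasks = 100;    // runtime-tunable knob (assumed default)

    public GenericRateLimiterSketch() {
        for (TrafficType t : TrafficType.values())
            shedEnabled.put(t, true);
    }

    // System signal (CPU) and internal signal (thread-pool drops) are passed in
    // so the sketch stays self-contained; in reality they would be sampled from
    // the host and from the Cassandra thread pools.
    public boolean underPressure(double cpuUsage, int droppedThreadPoolTasks) {
        return cpuUsage > cpuThreshold && droppedThreadPoolTasks > maxDroppedTasks;
    }

    // True if this request should be rejected, e.g. by failing it back to the client.
    public boolean shouldShed(TrafficType type, double cpuUsage, int droppedThreadPoolTasks) {
        return shedEnabled.get(type) && underPressure(cpuUsage, droppedThreadPoolTasks);
    }
}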

More design details in this doc: [OSS] Cassandra Generic Purpose Rate
Limiter - Google Docs
<https://docs.google.com/document/d/1w-A3fnoeBS6tS1ffBda_R0QR90olzFoMqLE7znFEUrQ/edit>

Please let me know your thoughts.

Jaydeep


Re: [Discuss] Generic Purpose Rate Limiter in Cassandra

2024-01-16 Thread Jaydeep Chovatia
Hi Stefan,

Please find my response below:
1) Currently, I am keeping the signals behind an interface, so one can
override them with a different implementation. Point noted that even the
interface APIs could be made dynamic, so one can define both the APIs and
their implementation if they wish to override them (see the sketch at the
end of this mail).
2) I've not looked into that yet, but I will see if it can be easily
integrated into the Guardrails framework.
3) On the server side, when the framework detects that a node is
overloaded, it will throw *OverloadedException* back to the client. If a
busy node continues to serve additional requests, it will slow down other
peer nodes because of dependencies such as meeting QUORUM. With this
approach, we at least prevent server nodes from melting down and give
control back to the client via *OverloadedException*. It is then up to the
client policy: if the client retries immediately on a different server
node, that node might eventually be impacted as well, but if the client
does exponential back-off or throws the exception back to the application,
the other server nodes will not be impacted.
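
To make point 1 concrete, here is a minimal sketch of signals kept behind
an interface. The names are hypothetical and the real APIs may differ; it
only illustrates the pluggability:

import java.util.List;

// Each signal reports whether it considers the node under pressure. Operators can
// swap in their own implementations without touching the decision logic.
interface OverloadSignal {
    boolean underPressure();
}

class CpuSignal implements OverloadSignal {
    public boolean underPressure() {
        // Would sample host CPU usage; hard-coded so the sketch stays self-contained.
        return false;
    }
}

class ThreadPoolSignal implements OverloadSignal {
    public boolean underPressure() {
        // Would check dropped/blocked tasks on the Cassandra thread pools.
        return false;
    }
}

class OverloadDetector {
    private final List<OverloadSignal> signals;

    OverloadDetector(List<OverloadSignal> signals) {
        this.signals = signals;
    }

    // The node is considered overloaded only if every configured signal agrees;
    // the caller would then fail the request back to the client (Cassandra
    // surfaces that to drivers as an OverloadedException).
    boolean overloaded() {
        return signals.stream().allMatch(OverloadSignal::underPressure);
    }
}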


Jaydeep

On Tue, Jan 16, 2024 at 10:03 AM Štefan Miklošovič <
stefan.mikloso...@gmail.com> wrote:

> Hi Jaydeep,
>
> That seems quite interesting. Couple points though:
>
> 1) It would be nice if there is a way to "subscribe" to decisions your
> detection framework comes up with. Integration with e.g. diagnostics
> subsystem would be beneficial. This should be pluggable - just coding up an
> interface to dump / react on the decisions how I want. This might also act
> as a notifier to other systems, e-mail, slack channels ...
>
> 2) Have you tried to incorporate this with the Guardrails framework? I
> think that if something is detected to be throttled or rejected (e.g
> writing to a table), there might be a guardrail which would be triggered
> dynamically in runtime. Guardrails are useful as such but here we might
> reuse them so we do not need to code it twice.
>
> 3) I am curious how complex this detection framework would be, it can be
> complicated pretty fast I guess. What would be desirable is to act on it in
> such a way that you will not put that node under even more pressure. In
> other words, your detection system should work in such a way that there
> will not be any "doom loop" whereby mere throttling of various parts of
> Cassandra you make it even worse for other nodes in the cluster. For
> example, if a particular node starts to be overwhelmed and you detect this
> and requests start to be rejected, is it not possible that Java driver
> would start to see this node as "erroneous" with delayed response time etc
> and it would start to prefer other nodes in the cluster when deciding what
> node to contact for query coordination? So you would put more load on other
> nodes, making them more susceptible to be throttled as well ...
>
> Regards
>
> Stefan Miklosovic
>
> On Tue, Jan 16, 2024 at 6:41 PM Jaydeep Chovatia <
> chovatia.jayd...@gmail.com> wrote:
>
>> Hi,
>>
>> Happy New Year!
>>
>> I would like to discuss the following idea:
>>
>> Open-source Cassandra (CASSANDRA-15013
>> <https://issues.apache.org/jira/browse/CASSANDRA-15013>) has an
>> elementary built-in memory rate limiter based on the incoming payload from
>> user requests. This rate limiter activates if any incoming user request’s
>> payload exceeds certain thresholds. However, the existing rate limiter only
>> solves limited-scope issues. Cassandra's server-side meltdown due to
>> overload is a known problem. Often we see that a couple of busy nodes take
>> down the entire Cassandra ring due to the ripple effect. The following
>> document proposes a generic purpose comprehensive rate limiter that works
>> considering system signals, such as CPU, and internal signals, such as
>> thread pools. The rate limiter will have knobs to filter out internal
>> traffic, system traffic, replication traffic, and furthermore based on the
>> types of queries.
>>
>> More design details to this doc: [OSS] Cassandra Generic Purpose Rate
>> Limiter - Google Docs
>> <https://docs.google.com/document/d/1w-A3fnoeBS6tS1ffBda_R0QR90olzFoMqLE7znFEUrQ/edit>
>>
>> Please let me know your thoughts.
>>
>> Jaydeep
>>
>


Re: [EXTERNAL] Re: [Discuss] Generic Purpose Rate Limiter in Cassandra

2024-01-17 Thread Jaydeep Chovatia
Jon,

The major challenge with latency-based rate limiters is that latency is
subjective from one workload to another. As a result, in the proposal I
have described, the idea is to base the decision on the following
combination of signals:

   1. System parameters (such as CPU usage, etc.)
   2. Cassandra thread-pool health (are they dropping requests, etc.)

If both of these are positive, the server is considered under pressure.
Once it is under pressure, traffic is shed from less aggressive to more
aggressive levels, etc. (see the sketch below). The idea is to prevent the
Cassandra server from melting down, starting with the above two signals
and adding more based on what we learn.
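
A tiny sketch of that "less aggressive to more aggressive" escalation; the
ordering of traffic categories below is only an assumption for
illustration, not the actual policy:

import java.util.List;

// Escalate shedding one level at a time while the node stays under pressure,
// and step back down once it recovers. The ordering (background work first,
// user reads last) is assumed for the sketch.
public class SheddingEscalator {
    private static final List<String> LEVELS =
            List.of("NONE", "SHED_INTERNAL", "SHED_REPLICATION", "SHED_USER_WRITES", "SHED_USER_READS");
    private int level = 0;

    public String onTick(boolean underPressure) {
        if (underPressure && level < LEVELS.size() - 1)
            level++;
        else if (!underPressure && level > 0)
            level--;
        return LEVELS.get(level);
    }
}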

Scott,

Yes, I did look at some of the existing implementations; they are all
great systems and help quite a lot. But they still do not rely on system
health, etc., and they do not sit in the generic coordinator/replication
read/write path. The idea here is along similar lines to the existing
implementations, but a bit more generic and trying to cover as many paths
as possible.

German,

Sure, let's continue the discussion here first. If it turns out that there
is no widespread interest in the idea, then we can connect 1:1 and see how
we can help each other on a private fork, etc.

Jaydeep

On Wed, Jan 17, 2024 at 7:57 AM German Eichberger via dev <
dev@cassandra.apache.org> wrote:

> Jaydeep,
>
> I concur with Stefan that extensibility of this  should be a design goal:
>
>- It should be easy to add additional metrics (e.g. write queue depth)
>and decision logic
>- There should be a way to interact with other systems to signal a
>resource need  which then could kick off things like scaling
>
>
> Super interested in this and we have been thinking about siimilar things
> internally 😉
>
> Thanks,
> German
> --
> *From:* Jaydeep Chovatia 
> *Sent:* Tuesday, January 16, 2024 1:16 PM
> *To:* dev@cassandra.apache.org 
> *Subject:* [EXTERNAL] Re: [Discuss] Generic Purpose Rate Limiter in
> Cassandra
>
> Hi Stefan,
>
> Please find my response below:
> 1) Currently, I am keeping the signals as interface, so one can override
> with a different implementation, but a point noted that even the interface
> APIs could be also made dynamic so one can define APIs and its
> implementation, if they wish to override.
> 2) I've not looked into that yet, but I will look into it and see if it
> can be easily integrated into the Guardrails framework.
> 3) On the server side, when the framework detects that a node is
> overloaded, then it will throw *OverloadedException* back to the client.
> Because if the node while busy continues to serve additional requests, then
> it will slow down other peer nodes due to dependencies on meeting the
> QUORUM, etc. In this, we are at least preventing server nodes from melting
> down, and giving the control to the client via *OverloadedException.*
> Now, it will be up to the client policy, if client wishes to retry
> immediately on a different server node then eventually that server node
> might be impacted, but if client wishes to do exponential back off or throw
> exception back to the application then that server node will not be
> impacted.
>
>
> Jaydeep
>
> On Tue, Jan 16, 2024 at 10:03 AM Štefan Miklošovič <
> stefan.mikloso...@gmail.com> wrote:
>
> Hi Jaydeep,
>
> That seems quite interesting. Couple points though:
>
> 1) It would be nice if there is a way to "subscribe" to decisions your
> detection framework comes up with. Integration with e.g. diagnostics
> subsystem would be beneficial. This should be pluggable - just coding up an
> interface to dump / react on the decisions how I want. This might also act
> as a notifier to other systems, e-mail, slack channels ...
>
> 2) Have you tried to incorporate this with the Guardrails framework? I
> think that if something is detected to be throttled or rejected (e.g
> writing to a table), there might be a guardrail which would be triggered
> dynamically in runtime. Guardrails are useful as such but here we might
> reuse them so we do not need to code it twice.
>
> 3) I am curious how complex this detection framework would be, it can be
> complicated pretty fast I guess. What would be desirable is to act on it in
> such a way that you will not put that node under even more pressure. In
> other words, your detection system should work in such a way that there
> will not be any "doom loop" whereby mere throttling of various parts of
> Cassandra you make it even worse for other nodes in the cluster. For
> example, if a particular node sta

Re: [EXTERNAL] Re: [Discuss] Generic Purpose Rate Limiter in Cassandra

2024-01-22 Thread Jaydeep Chovatia
>> resources to ensure the DB stays
>> online and healthy would be a big win.
>>
>> > The major challenge with latency based rate limiters is that the
>> latency is subjective from one workload to another.
>>
>> You're absolutely right.  This goes to my other suggestion that
>> client-side rate limiting would be a higher priority (on my list at least)
>> as it is perfectly suited for multiple varying workloads.  Of course, if
>> you're not interested in working on the drivers and only on C* itself, this
>> is a moot point.  You're free to work on whatever you want - I just think
>> there's a ton more value in the drivers being able to throttle requests to
>> deal than server side.
>>
>> > And if these two are +ve then consider the server under pressure. And
>> once it is under the pressure, then shed the traffic from less aggressive
>> to more aggressive, etc. The idea is to prevent Cassandra server from
>> melting (by considering the above two signals to begin with and add any
>> more based on the learnings)
>>
>> Yes, I agree using dropped metrics (errors) is useful, as well as queue
>> length.  I can't remember offhand all the details of the request queue and
>> how load shedding works there, I need to go back and look.  If we don't
>> already have load shedding based on queue depth that seems like an easy
>> thing to do immediately, and is a high quality signal.  Maybe someone can
>> remind me if we have that already?
>>
>> My issue with using CPU to rate limit clients is that I think it's a very
>> low quality signal, and I suspect it'll trigger a ton of false positives.
>> For example, there's a big difference from performance being impacted by
>> repair vs large reads vs backing up a snapshot to an object store, but they
>> have similar effects on the CPU - high I/O, high CPU usage, both sustained
>> over time.  Imo it would be a pretty bad decision to throttle clients when
>> we should be throttling repair instead, and we should only do so if it's
>> actually causing an issue for the client, something CPU usage can't tell
>> us, only the response time and error rates can.
>>
>> In the case of a backup, throttling might make sense, or might not, it
>> really depends on the environment and if backups are happening
>> concurrently.  If a backup's configured with nice +19 (as it should be),
>> I'd consider throttling user requests to be a false positive, potentially
>> one that does more harm than good to the cluster, since the OS should be
>> deprioritizing the backup for us rather than us deprioritizing C*.
>>
>> In my ideal world, if C* detected problematic response times (possibly
>> violating a per-table, target latency time) or query timeouts, it would
>> start by throttling back compactions, repairs, and streaming to ensure
>> client requests can be serviced.  I think we'd need to define the latency
>> targets in order for this to work optimally, b/c you might not want to wait
>> for query timeouts before you throttle.  I think there's a lot of value in
>> dynamically adaptive compaction, repair, and streaming since it would
>> prioritize user requests, but again, if you're not willing to work on that,
>> it's your call.
>>
>> Anyways - I like the idea of putting more safeguards in the database
>> itself, we're fundamentally in agreement there.  I see a ton of value in
>> having flexible rate limiters, whether it be per-table, keyspace, or
>> user+table combination.  I'd also like to ensure the feature doesn't cause
>> more disruptions than it solves, which I think would be the case from using
>> CPU usage as a signal.
>>
>> Jon
>>
>>
>> On Wed, Jan 17, 2024 at 10:26 AM Jaydeep Chovatia <
>> chovatia.jayd...@gmail.com> wrote:
>>
>>> Jon,
>>>
>>> The major challenge with latency based rate limiters is that the latency
>>> is subjective from one workload to another. As a result, in the proposal I
>>> have described, the idea is to make decision on the following combinations:
>>>
>>>1. System parameters (such as CPU usage, etc.)
>>>2. Cassandra thread pools health (are they dropping requests, etc.)
>>>
>>> And if these two are +ve then consider the server under pressure. And
>>> once it is under the pressure, then shed the traffic from less aggressive
>>> to more aggressive, etc. The idea is to prevent Cassandra server from
>>> melting (by considering the above two signals to begin

Re: [EXTERNAL] Re: [Discuss] Generic Purpose Rate Limiter in Cassandra

2024-02-07 Thread Jaydeep Chovatia
I see a lot of great ideas that have been discussed or proposed in the past
to cover the most common rate-limiter use cases. Do folks think we should
file an official CEP and take it there?

Jaydeep

On Fri, Feb 2, 2024 at 8:30 AM Caleb Rackliffe 
wrote:

> I just remembered the other day that I had done a quick writeup on the
> state of compaction stress-related throttling in the project:
>
>
> https://docs.google.com/document/d/1dfTEcKVidRKC1EWu3SO1kE1iVLMdaJ9uY1WMpS3P_hs/edit?usp=sharing
>
> I'm sure most of it is old news to the people on this thread, but I
> figured I'd post it just in case :)
>
> On Tue, Jan 30, 2024 at 11:58 AM Josh McKenzie 
> wrote:
>
>> 2.) We should make sure the links between the "known" root causes of
>> cascading failures and the mechanisms we introduce to avoid them remain
>> very strong.
>>
>> Seems to me that our historical strategy was to address individual known
>> cases one-by-one rather than looking for a more holistic load-balancing and
>> load-shedding solution. While the engineer in me likes the elegance of a
>> broad, more-inclusive *actual SEDA-like* approach, the pragmatist in me
>> wonders how far we think we are today from a stable set-point.
>>
>> i.e. are we facing a handful of cases where nodes can still get pushed
>> over and then cascade that we can surgically address, or are we facing a
>> broader lack of back-pressure that rears its head in different domains
>> (client -> coordinator, coordinator -> replica, internode with other
>> operations, etc) at surprising times and should be considered more
>> holistically?
>>
>> On Tue, Jan 30, 2024, at 12:31 AM, Caleb Rackliffe wrote:
>>
>> I almost forgot CASSANDRA-15817, which introduced
>> reject_repair_compaction_threshold, which provides a mechanism to stop
>> repairs while compaction is underwater.
>>
>> On Jan 26, 2024, at 6:22 PM, Caleb Rackliffe 
>> wrote:
>>
>> 
>> Hey all,
>>
>> I'm a bit late to the discussion. I see that we've already discussed
>> CASSANDRA-15013 
>>  and CASSANDRA-16663
>>  at least in
>> passing. Having written the latter, I'd be the first to admit it's a crude
>> tool, although it's been useful here and there, and provides a couple
>> primitives that may be useful for future work. As Scott mentions, while it
>> is configurable at runtime, it is not adaptive, although we did
>> make configuration easier in CASSANDRA-17423
>> . It also is
>> global to the node, although we've lightly discussed some ideas around
>> making it more granular. (For example, keyspace-based limiting, or limiting
>> "domains" tagged by the client in requests, could be interesting.) It also
>> does not deal with inter-node traffic, of course.
>>
>> Something we've not yet mentioned (that does address internode traffic)
>> is CASSANDRA-17324
>> , which I
>> proposed shortly after working on the native request limiter (and have just
>> not had much time to return to). The basic idea is this:
>>
>> When a node is struggling under the weight of a compaction backlog and
>> becomes a cause of increased read latency for clients, we have two safety
>> valves:
>>
>> 1.) Disabling the native protocol server, which stops the node from
>> coordinating reads and writes.
>> 2.) Jacking up the severity on the node, which tells the dynamic snitch
>> to avoid the node for reads from other coordinators.
>>
>> These are useful, but we don’t appear to have any mechanism that would
>> allow us to temporarily reject internode hint, batch, and mutation messages
>> that could further delay resolution of the compaction backlog.
>>
>>
>> Whether it's done as part of a larger framework or on its own, it still
>> feels like a good idea.
>>
>> Thinking in terms of opportunity costs here (i.e. where we spend our
>> finite engineering time to holistically improve the experience of operating
>> this database) is healthy, but we probably haven't reached the point of
>> diminishing returns on nodes being able to protect themselves from clients
>> and from other nodes. I would just keep in mind two things:
>>
>> 1.) The effectiveness of rate-limiting in the system (which includes the
>> database and all clients) as a whole necessarily decreases as we move from
>> the application to the lowest-level database internals. Limiting correctly
>> at the client will save more resources than limiting at the native protocol
>> server, and limiting correctly at the native protocol server will save more
>> resources than limiting after we've dispatched requests to some thread pool
>> for processing.
>> 2.) We should make sure the links between the "known" root causes of
>> cascading failures and the mechanisms we introduce to avoid them remain
>> very strong.
>>
>> In any case, I'd be happy to help out in any way I can as this moves
>>

Re: [EXTERNAL] Re: [Discuss] Generic Purpose Rate Limiter in Cassandra

2024-02-22 Thread Jaydeep Chovatia
Thanks, Josh. I will file an official CEP with all the details in a few
days and update this thread with that CEP number.
Thanks a lot everyone for providing valuable insights!

Jaydeep

On Thu, Feb 22, 2024 at 9:24 AM Josh McKenzie  wrote:

> Do folks think we should file an official CEP and take it there?
>
> +1 here.
>
> Synthesizing your gdoc, Caleb's work, and the feedback from this thread
> into a draft seems like a solid next step.
>
> On Wed, Feb 7, 2024, at 12:31 PM, Jaydeep Chovatia wrote:
>
> I see a lot of great ideas being discussed or proposed in the past to
> cover the most common rate limiter candidate use cases. Do folks think we
> should file an official CEP and take it there?
>
> Jaydeep
>
> On Fri, Feb 2, 2024 at 8:30 AM Caleb Rackliffe 
> wrote:
>
> I just remembered the other day that I had done a quick writeup on the
> state of compaction stress-related throttling in the project:
>
>
> https://docs.google.com/document/d/1dfTEcKVidRKC1EWu3SO1kE1iVLMdaJ9uY1WMpS3P_hs/edit?usp=sharing
>
> I'm sure most of it is old news to the people on this thread, but I
> figured I'd post it just in case :)
>
> On Tue, Jan 30, 2024 at 11:58 AM Josh McKenzie 
> wrote:
>
>
> 2.) We should make sure the links between the "known" root causes of
> cascading failures and the mechanisms we introduce to avoid them remain
> very strong.
>
> Seems to me that our historical strategy was to address individual known
> cases one-by-one rather than looking for a more holistic load-balancing and
> load-shedding solution. While the engineer in me likes the elegance of a
> broad, more-inclusive *actual SEDA-like* approach, the pragmatist in me
> wonders how far we think we are today from a stable set-point.
>
> i.e. are we facing a handful of cases where nodes can still get pushed
> over and then cascade that we can surgically address, or are we facing a
> broader lack of back-pressure that rears its head in different domains
> (client -> coordinator, coordinator -> replica, internode with other
> operations, etc) at surprising times and should be considered more
> holistically?
>
> On Tue, Jan 30, 2024, at 12:31 AM, Caleb Rackliffe wrote:
>
> I almost forgot CASSANDRA-15817, which introduced
> reject_repair_compaction_threshold, which provides a mechanism to stop
> repairs while compaction is underwater.
>
> On Jan 26, 2024, at 6:22 PM, Caleb Rackliffe 
> wrote:
>
> 
> Hey all,
>
> I'm a bit late to the discussion. I see that we've already discussed
> CASSANDRA-15013 <https://issues.apache.org/jira/browse/CASSANDRA-15013>
>  and CASSANDRA-16663
> <https://issues.apache.org/jira/browse/CASSANDRA-16663> at least in
> passing. Having written the latter, I'd be the first to admit it's a crude
> tool, although it's been useful here and there, and provides a couple
> primitives that may be useful for future work. As Scott mentions, while it
> is configurable at runtime, it is not adaptive, although we did
> make configuration easier in CASSANDRA-17423
> <https://issues.apache.org/jira/browse/CASSANDRA-17423>. It also is
> global to the node, although we've lightly discussed some ideas around
> making it more granular. (For example, keyspace-based limiting, or limiting
> "domains" tagged by the client in requests, could be interesting.) It also
> does not deal with inter-node traffic, of course.
>
> Something we've not yet mentioned (that does address internode traffic) is
> CASSANDRA-17324 <https://issues.apache.org/jira/browse/CASSANDRA-17324>,
> which I proposed shortly after working on the native request limiter (and
> have just not had much time to return to). The basic idea is this:
>
> When a node is struggling under the weight of a compaction backlog and
> becomes a cause of increased read latency for clients, we have two safety
> valves:
>
>
> 1.) Disabling the native protocol server, which stops the node from
> coordinating reads and writes.
> 2.) Jacking up the severity on the node, which tells the dynamic snitch to
> avoid the node for reads from other coordinators.
>
>
> These are useful, but we don’t appear to have any mechanism that would
> allow us to temporarily reject internode hint, batch, and mutation messages
> that could further delay resolution of the compaction backlog.
>
>
> Whether it's done as part of a larger framework or on its own, it still
> feels like a good idea.
>
> Thinking in terms of opportunity costs here (i.e. where we spend our
> finite engineering time to holistically improve the experience of operating
> this database) is healthy, but we probably haven't reached the point of

Re: [Discuss] Repair inside C*

2024-02-25 Thread Jaydeep Chovatia
Thanks, Josh. I've just updated the CEP
<https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+%28DRAFT%29+Apache+Cassandra+Official+Repair+Solution>
and included all the solutions you mentioned below.

Jaydeep

On Thu, Feb 22, 2024 at 9:33 AM Josh McKenzie  wrote:

> Very late response from me here (basically necro'ing this thread).
>
> I think it'd be useful to get this condensed into a CEP that we can then
> discuss in that format. It's clearly something we all agree we need and
> having an implementation that works, even if it's not in your preferred
> execution domain, is vastly better than nothing IMO.
>
> I don't have cycles (nor background ;) ) to do that, but it sounds like
> you do Jaydeep given the implementation you have on a private fork + design.
>
> A non-exhaustive list of things that might be useful incorporating into or
> referencing from a CEP:
> Slack thread:
> https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
> Joey's old C* ticket:
> https://issues.apache.org/jira/browse/CASSANDRA-14346
> Even older automatic repair scheduling:
> https://issues.apache.org/jira/browse/CASSANDRA-10070
> Your design gdoc:
> https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit#heading=h.r112r46toau0
> PR with automated repair:
> https://github.com/jaydeepkumar1984/cassandra/commit/ef6456d652c0d07cf29d88dfea03b73704814c2c
>
> My intuition is that we're all basically in agreement that this is
> something the DB needs, we're all willing to bikeshed for our personal
> preference on where it lives and how it's implemented, and at the end of
> the day, code talks. I don't think anyone's said they'll die on the hill of
> implementation details, so that feels like CEP time to me.
>
> If you were willing and able to get a CEP together for automated repair
> based on the above material, given you've done the work and have the proof
> points it's working at scale, I think this would be a *huge contribution*
> to the community.
>
> On Thu, Aug 24, 2023, at 7:26 PM, Jaydeep Chovatia wrote:
>
> Is anyone going to file an official CEP for this?
> As mentioned in this email thread, here is one of the solution's design
> doc
> <https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit#heading=h.r112r46toau0>
> and source code on a private Apache Cassandra patch. Could you go through
> it and let me know what you think?
>
> Jaydeep
>
> On Wed, Aug 2, 2023 at 3:54 PM Jon Haddad 
> wrote:
>
> > That said I would happily support an effort to bring repair scheduling
> to the sidecar immediately. This has nothing blocking it, and would
> potentially enable the sidecar to provide an official repair scheduling
> solution that is compatible with current or even previous versions of the
> database.
>
> This is something I hadn't thought much about, and is a pretty good
> argument for using the sidecar initially.  There's a lot of deployments out
> there and having an official repair option would be a big win.
>
>
> On 2023/07/26 23:20:07 "C. Scott Andreas" wrote:
> > I agree that it would be ideal for Cassandra to have a repair scheduler
> in-DB.
> >
> > That said I would happily support an effort to bring repair scheduling
> to the sidecar immediately. This has nothing blocking it, and would
> potentially enable the sidecar to provide an official repair scheduling
> solution that is compatible with current or even previous versions of the
> database.
> >
> > Once TCM has landed, we’ll have much stronger primitives for repair
> orchestration in the database itself. But I don’t think that should block
> progress on a repair scheduling solution in the sidecar, and there is
> nothing that would prevent someone from continuing to use a sidecar-based
> solution in perpetuity if they preferred.
> >
> > - Scott
> >
> > > On Jul 26, 2023, at 3:25 PM, Jon Haddad 
> wrote:
> > >
> > > I'm 100% in favor of repair being part of the core DB, not the
> sidecar.  The current (and past) state of things where running the DB
> correctly *requires* running a separate process (either community
> maintained or official C* sidecar) is incredibly painful for folks.  The
> idea that your data integrity needs to be opt-in has never made sense to me
> from the perspective of either the product or the end user.
> > >
> > > I've worked with way too many teams that have either configured this
> incorrectly or not at all.
> > >
> > > Ideally Cassandra would ship with repair built in and on by default.
> Power users can dis

Re: [EXTERNAL] Re: [Discuss] Generic Purpose Rate Limiter in Cassandra

2024-04-10 Thread Jaydeep Chovatia
Just created an official CEP-41
<https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-41+%28DRAFT%29+Apache+Cassandra+Unified+Rate+Limiter>
incorporating the feedback from this discussion. Feel free to let me know
if I may have missed some important feedback in this thread that is not
captured in the CEP-41.

Jaydeep

On Thu, Feb 22, 2024 at 11:36 AM Jaydeep Chovatia <
chovatia.jayd...@gmail.com> wrote:

> Thanks, Josh. I will file an official CEP with all the details in a few
> days and update this thread with that CEP number.
> Thanks a lot everyone for providing valuable insights!
>
> Jaydeep
>
> On Thu, Feb 22, 2024 at 9:24 AM Josh McKenzie 
> wrote:
>
>> Do folks think we should file an official CEP and take it there?
>>
>> +1 here.
>>
>> Synthesizing your gdoc, Caleb's work, and the feedback from this thread
>> into a draft seems like a solid next step.
>>
>> On Wed, Feb 7, 2024, at 12:31 PM, Jaydeep Chovatia wrote:
>>
>> I see a lot of great ideas being discussed or proposed in the past to
>> cover the most common rate limiter candidate use cases. Do folks think we
>> should file an official CEP and take it there?
>>
>> Jaydeep
>>
>> On Fri, Feb 2, 2024 at 8:30 AM Caleb Rackliffe 
>> wrote:
>>
>> I just remembered the other day that I had done a quick writeup on the
>> state of compaction stress-related throttling in the project:
>>
>>
>> https://docs.google.com/document/d/1dfTEcKVidRKC1EWu3SO1kE1iVLMdaJ9uY1WMpS3P_hs/edit?usp=sharing
>>
>> I'm sure most of it is old news to the people on this thread, but I
>> figured I'd post it just in case :)
>>
>> On Tue, Jan 30, 2024 at 11:58 AM Josh McKenzie 
>> wrote:
>>
>>
>> 2.) We should make sure the links between the "known" root causes of
>> cascading failures and the mechanisms we introduce to avoid them remain
>> very strong.
>>
>> Seems to me that our historical strategy was to address individual known
>> cases one-by-one rather than looking for a more holistic load-balancing and
>> load-shedding solution. While the engineer in me likes the elegance of a
>> broad, more-inclusive *actual SEDA-like* approach, the pragmatist in me
>> wonders how far we think we are today from a stable set-point.
>>
>> i.e. are we facing a handful of cases where nodes can still get pushed
>> over and then cascade that we can surgically address, or are we facing a
>> broader lack of back-pressure that rears its head in different domains
>> (client -> coordinator, coordinator -> replica, internode with other
>> operations, etc) at surprising times and should be considered more
>> holistically?
>>
>> On Tue, Jan 30, 2024, at 12:31 AM, Caleb Rackliffe wrote:
>>
>> I almost forgot CASSANDRA-15817, which introduced
>> reject_repair_compaction_threshold, which provides a mechanism to stop
>> repairs while compaction is underwater.
>>
>> On Jan 26, 2024, at 6:22 PM, Caleb Rackliffe 
>> wrote:
>>
>> 
>> Hey all,
>>
>> I'm a bit late to the discussion. I see that we've already discussed
>> CASSANDRA-15013 <https://issues.apache.org/jira/browse/CASSANDRA-15013>
>>  and CASSANDRA-16663
>> <https://issues.apache.org/jira/browse/CASSANDRA-16663> at least in
>> passing. Having written the latter, I'd be the first to admit it's a crude
>> tool, although it's been useful here and there, and provides a couple
>> primitives that may be useful for future work. As Scott mentions, while it
>> is configurable at runtime, it is not adaptive, although we did
>> make configuration easier in CASSANDRA-17423
>> <https://issues.apache.org/jira/browse/CASSANDRA-17423>. It also is
>> global to the node, although we've lightly discussed some ideas around
>> making it more granular. (For example, keyspace-based limiting, or limiting
>> "domains" tagged by the client in requests, could be interesting.) It also
>> does not deal with inter-node traffic, of course.
>>
>> Something we've not yet mentioned (that does address internode traffic)
>> is CASSANDRA-17324
>> <https://issues.apache.org/jira/browse/CASSANDRA-17324>, which I
>> proposed shortly after working on the native request limiter (and have just
>> not had much time to return to). The basic idea is this:
>>
>> When a node is struggling under the weight of a compaction backlog and
>> becomes a cause of increased read latency for clients, we have two safety
>> valves:
>>
>>
>> 1.) Disab

Re: [EXTERNAL] Re: [Discuss] Generic Purpose Rate Limiter in Cassandra

2024-05-06 Thread Jaydeep Chovatia
Sure, Caleb. I will include the work as part of CASSANDRA-19534
<https://issues.apache.org/jira/browse/CASSANDRA-19534> in the CEP-41.

Jaydeep

On Fri, May 3, 2024 at 7:48 AM Caleb Rackliffe 
wrote:

> FYI, there is some ongoing sort-of-related work going on in
> CASSANDRA-19534 <https://issues.apache.org/jira/browse/CASSANDRA-19534>
>
> On Wed, Apr 10, 2024 at 6:35 PM Jaydeep Chovatia <
> chovatia.jayd...@gmail.com> wrote:
>
>> Just created an official CEP-41
>> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-41+%28DRAFT%29+Apache+Cassandra+Unified+Rate+Limiter>
>> incorporating the feedback from this discussion. Feel free to let me know
>> if I may have missed some important feedback in this thread that is not
>> captured in the CEP-41.
>>
>> Jaydeep
>>
>> On Thu, Feb 22, 2024 at 11:36 AM Jaydeep Chovatia <
>> chovatia.jayd...@gmail.com> wrote:
>>
>>> Thanks, Josh. I will file an official CEP with all the details in a few
>>> days and update this thread with that CEP number.
>>> Thanks a lot everyone for providing valuable insights!
>>>
>>> Jaydeep
>>>
>>> On Thu, Feb 22, 2024 at 9:24 AM Josh McKenzie 
>>> wrote:
>>>
>>>> Do folks think we should file an official CEP and take it there?
>>>>
>>>> +1 here.
>>>>
>>>> Synthesizing your gdoc, Caleb's work, and the feedback from this thread
>>>> into a draft seems like a solid next step.
>>>>
>>>> On Wed, Feb 7, 2024, at 12:31 PM, Jaydeep Chovatia wrote:
>>>>
>>>> I see a lot of great ideas being discussed or proposed in the past to
>>>> cover the most common rate limiter candidate use cases. Do folks think we
>>>> should file an official CEP and take it there?
>>>>
>>>> Jaydeep
>>>>
>>>> On Fri, Feb 2, 2024 at 8:30 AM Caleb Rackliffe <
>>>> calebrackli...@gmail.com> wrote:
>>>>
>>>> I just remembered the other day that I had done a quick writeup on the
>>>> state of compaction stress-related throttling in the project:
>>>>
>>>>
>>>> https://docs.google.com/document/d/1dfTEcKVidRKC1EWu3SO1kE1iVLMdaJ9uY1WMpS3P_hs/edit?usp=sharing
>>>>
>>>> I'm sure most of it is old news to the people on this thread, but I
>>>> figured I'd post it just in case :)
>>>>
>>>> On Tue, Jan 30, 2024 at 11:58 AM Josh McKenzie 
>>>> wrote:
>>>>
>>>>
>>>> 2.) We should make sure the links between the "known" root causes of
>>>> cascading failures and the mechanisms we introduce to avoid them remain
>>>> very strong.
>>>>
>>>> Seems to me that our historical strategy was to address individual
>>>> known cases one-by-one rather than looking for a more holistic
>>>> load-balancing and load-shedding solution. While the engineer in me likes
>>>> the elegance of a broad, more-inclusive *actual SEDA-like* approach,
>>>> the pragmatist in me wonders how far we think we are today from a stable
>>>> set-point.
>>>>
>>>> i.e. are we facing a handful of cases where nodes can still get pushed
>>>> over and then cascade that we can surgically address, or are we facing a
>>>> broader lack of back-pressure that rears its head in different domains
>>>> (client -> coordinator, coordinator -> replica, internode with other
>>>> operations, etc) at surprising times and should be considered more
>>>> holistically?
>>>>
>>>> On Tue, Jan 30, 2024, at 12:31 AM, Caleb Rackliffe wrote:
>>>>
>>>> I almost forgot CASSANDRA-15817, which introduced
>>>> reject_repair_compaction_threshold, which provides a mechanism to stop
>>>> repairs while compaction is underwater.
>>>>
>>>> On Jan 26, 2024, at 6:22 PM, Caleb Rackliffe 
>>>> wrote:
>>>>
>>>> 
>>>> Hey all,
>>>>
>>>> I'm a bit late to the discussion. I see that we've already discussed
>>>> CASSANDRA-15013 <https://issues.apache.org/jira/browse/CASSANDRA-15013>
>>>>  and CASSANDRA-16663
>>>> <https://issues.apache.org/jira/browse/CASSANDRA-16663> at least in
>>>> passing. Having written the latter, I'd be the first to admit it's a crude
>>>> tool, although it's been useful here and there, and

Real time bad query logging framework in C*

2018-06-19 Thread Jaydeep Chovatia
Hi,

We have worked on developing a common framework to detect/log
anti-patterns/bad queries in Cassandra. The target for this effort is to
reduce the burden on ops running Cassandra at large scale, as well as to
help beginners quickly identify performance problems with Cassandra.
Initially we wanted to try it out to make sure it really works and provides
value. We've opened a JIRA with all the details. Would you please review it
and provide your feedback on this effort?
https://issues.apache.org/jira/browse/CASSANDRA-14527
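
To give a flavor of what such a framework could look like, here is a
minimal sketch with a common reporting interface, runtime-adjustable
thresholds, and pluggable consumers. All names are illustrative, not the
actual patch:

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// One place to report anti-patterns (too many tombstones, large partitions,
// slow queries, ...) so every check shares the same thresholds, message shape,
// and consumer hooks.
public class BadQueryReporter {
    public interface Consumer { void onBadQuery(String category, String details); }

    private final List<Consumer> consumers = new CopyOnWriteArrayList<>();
    private volatile long tombstoneThreshold = 1000;        // adjustable at runtime, no restart
    private volatile long slowQueryMillisThreshold = 500;   // adjustable at runtime, no restart

    public void register(Consumer c) { consumers.add(c); }
    public void setTombstoneThreshold(long t) { tombstoneThreshold = t; }
    public void setSlowQueryMillisThreshold(long t) { slowQueryMillisThreshold = t; }

    public void reportTombstones(String query, long tombstonesRead) {
        if (tombstonesRead > tombstoneThreshold)
            publish("TOO_MANY_TOMBSTONES", query + " read " + tombstonesRead + " tombstones");
    }

    public void reportLatency(String query, long elapsedMillis) {
        if (elapsedMillis > slowQueryMillisThreshold)
            publish("SLOW_QUERY", query + " took " + elapsedMillis + " ms");
    }

    private void publish(String category, String details) {
        // Consumers decide what to do: write to the regular log, emit a metric,
        // push to an external system, etc.
        for (Consumer c : consumers)
            c.onBadQuery(category, details);
    }
}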


Thank You!!!


Jaydeep


Re: Real time bad query logging framework in C*

2018-06-20 Thread Jaydeep Chovatia
Thanks, Stefan, for reviewing this. Please find my comments inline:


>We already provide tons of metrics and provide some useful logging (e.g.
when reading too many tombstones), but I think we should still be able to
implement further >checks in-code that highlight potentially issues. Maybe
we could >really use a framework for that, I don't know.


I agree, Cassandra already exposes many details through metrics, logging
(like tombstones), etc.

The current log messages (tombstone warnings, large-partition messages,
slow-query messages, etc.) are very useful, but one important aspect is
missing: all of them try to solve the same problem yet were implemented
independently (at different times). As a result there is duplicate code,
and they lack important things like changing thresholds without a restart,
commonality among log messages, and a common interface so that users can
consume them in different ways. This new effort simply unifies them so we
have a common way of doing these things in Cassandra, with more features
such as changing thresholds at runtime, common log-message formats, and
the ability for users to consume them differently.


>If you followed the discussions a while ago, we also talked about moving some
of the code out of Cassandra into side-car processes. Although this will
likely not >manifest for 4.0, most of the devs seem to be fond of the idea
and so am I.


I agree that the sidecar is a very useful project, but in my opinion it
will be difficult to get internal details out in real time without
modifying Cassandra.


>Not wanting to derail this discussion (about your proposed solution), but
let me just briefly mention that I've been working on some related approach
(diagnostic events, >CASSANDRA-12944), which would allow to expose internal
events to external processes that would be able to analyze these events,
alert users, or event act on them. >It's a different approach from what
you're suggesting, but just wanted to mention this and maybe you'd agree
that having external processes for monitoring Cassandra >has some
advantages.


Thanks for sharing this; this is a really useful feature and will make the
operational aspect even easier.

My proposal is just picking low-hanging fruit: in other words, it
rearchitects the existing log messages (tombstone warnings, large-partition
messages, slow-query messages, etc.) and adds a few more in a generic way,
with more features (thresholds changeable at runtime, commonality in log
messages, the ability for users to consume them differently, etc.). The
idea is to make it a framework for reporting these types of messages so
that all the messages (existing + new ones) are consistent with one
another.



On Wed, Jun 20, 2018 at 1:35 AM Stefan Podkowinski  wrote:

> Jaydeep, thanks for taking this discussion to the dev list. I think it's
> the best place to introduce new idea, discuss them in general and how
> they potentially fit in. As already mention in the ticket, I do share
> your assessment that we should try to improve making operational issue
> more visible to users. We already provide tons of metrics and provide
> some useful logging (e.g. when reading too many tombstones), but I think
> we should still be able to implement further checks in-code that
> highlight potentially issues. Maybe we could really use a framework for
> that, I don't know.
>
> If you followed the discussions a while ago, we also talked about moving
> some of the code out of Cassandra into side-car processes. Although this
> will likely not manifest for 4.0, most of the devs seem to be fond of
> the idea and so am I. Not wanting to derail this discussion (about your
> proposed solution), but let me just briefly mention that I've been
> working on some related approach (diagnostic events, CASSANDRA-12944),
> which would allow to expose internal events to external processes that
> would be able to analyze these events, alert users, or event act on
> them. It's a different approach from what you're suggesting, but just
> wanted to mention this and maybe you'd agree that having external
> processes for monitoring Cassandra has some advantages.
>
>
>
> On 20.06.2018 06:33, Jaydeep Chovatia wrote:
> > Hi,
> >
> > We have worked on developing some common framework to detect/log
> > anti-patterns/bad queries in Cassandra. Target for this effort would be
> > to reduce burden on ops to handle Cassandra at large scale, as well as
> > help beginners to quickly identify performance problems with the
> Cassandra.
> > Initially we wanted to try out to make sure it really works and provides
> > value. we've opened JIRA with all the details. Would you please review
> and
> > provide your feedback on this effort?
> > https://issues.apache.org/jira/browse/CASSANDRA-14

Re: [VOTE] Branching Change for 4.0 Freeze

2018-07-13 Thread Jaydeep Chovatia
+1

On Wed, Jul 11, 2018 at 2:46 PM sankalp kohli 
wrote:

> Hi,
> As discussed in the thread[1], we are proposing that we will not branch
> on 1st September but will only allow following merges into trunk.
>
> a. Bug and Perf fixes to 4.0.
> b. Critical bugs in any version of C*.
> c. Testing changes to help test 4.0
>
> If someone has a change which does not fall under these three, we can
> always discuss it and have an exception.
>
> Vote will be open for 72 hours.
>
> Thanks,
> Sankalp
>
> [1]
>
> https://lists.apache.org/thread.html/494c3ced9e83ceeb53fa127e44eec6e2588a01b769896b25867fd59f@%3Cdev.cassandra.apache.org%3E
>


Re: [VOTE] Accept GoCQL driver donation and begin incubation process

2018-09-12 Thread Jaydeep Chovatia
+1

On Wed, Sep 12, 2018 at 10:00 AM Roopa Tangirala
 wrote:

> +1
>
>
> *Regards,*
>
> *Roopa Tangirala*
>
> Engineering Manager CDE
>
> *(408) 438-3156 - mobile*
>
>
>
>
>
>
> On Wed, Sep 12, 2018 at 8:51 AM Sylvain Lebresne 
> wrote:
>
> > -0
> >
> > The project seems to have a hard time getting on top of reviewing his
> > backlog
> > of 'patch available' issues, so that I'm skeptical adopting more code to
> > maintain is the thing the project needs the most right now. Besides, I'm
> > also
> > generally skeptical that augmenting the scope of a project makes it
> better:
> > I feel
> > keeping this project focused on the core server is better. I see risks
> > here, but
> > the upsides haven't been made very clear for me, even for end users: yes,
> > it
> > may provide a tiny bit more clarity around which Golang driver to choose
> by
> > default, but I'm not sure users are that lost, and I think there is other
> > ways to
> > solve that if we really want.
> >
> > Anyway, I reckon I may be overly pessimistic here and it's not that
> strong
> > of
> > an objection if a large majority is on-board, so giving my opinion but
> not
> > opposing.
> >
> > --
> > Sylvain
> >
> >
> > On Wed, Sep 12, 2018 at 5:36 PM Jeremiah D Jordan <
> > jeremiah.jor...@gmail.com>
> > wrote:
> >
> > > +1
> > >
> > > But I also think getting this through incubation might take a while/be
> > > impossible given how large the contributor list looks…
> > >
> > > > On Sep 12, 2018, at 10:22 AM, Jeff Jirsa  wrote:
> > > >
> > > > +1
> > > >
> > > > (Incubation looks like it may be challenging to get acceptance from
> all
> > > existing contributors, though)
> > > >
> > > > --
> > > > Jeff Jirsa
> > > >
> > > >
> > > >> On Sep 12, 2018, at 8:12 AM, Nate McCall 
> wrote:
> > > >>
> > > >> This will be the same process used for dtest. We will need to walk
> > > >> this through the incubator per the process outlined here:
> > > >>
> > > >>
> > > >> https://incubator.apache.org/guides/ip_clearance.html
> > > >>
> > > >> Pending the outcome of this vote, we will create the JIRA issues for
> > > >> tracking and after we go through the process, and discuss adding
> > > >> committers in a separate thread (we need to do this atomically
> anyway
> > > >> per general ASF committer adding processes).
> > > >>
> > > >> Thanks,
> > > >> -Nate
> > > >>
> >
>


Re: Flakey Dtests

2017-11-27 Thread Jaydeep Chovatia
Is there a way to check which tests are failing in trunk currently?
Previously this URL <http://cassci.datastax.com/> was giving such results
but is no longer working.

Jaydeep

On Wed, Nov 15, 2017 at 5:44 PM, Jeff Jirsa  wrote:

> In lieu of a weekly wrap-up, here's a pre-Thanksgiving call for help.
>
> If you haven't been paying attention to JIRA, you likely didn't notice that
> Josh went through and triage/categorized a bunch of issues by adding
> components, and Michael took the time to open a bunch of JIRAs for failing
> tests.
>
> How many is a bunch? Something like 35 or so just for tests currently
> failing on trunk.  If you're a regular contributor, you already know that
> dtests are flakey - it'd be great if a few of us can go through and fix a
> few. Even incremental improvements are improvements. Here's an easy search
> to find them:
>
> https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+CASSANDRA+AND+component+%3D+Testing+ORDER+BY+updated+DESC%2C+priority+DESC%2C+created+ASC&mode=hide
>
> If you're a new contributor, fixing tests is often a good way to learn a
> new part of the codebase. Many of these are dtests, which live in a
> different repo ( https://github.com/apache/cassandra-dtest ) and are in
> python, but have no fear, the repo has instructions for setting up and
> running dtests(
> https://github.com/apache/cassandra-dtest/blob/master/INSTALL.md )
>
> Normal contribution workflow applies: self-assign the ticket if you want to
> work on it, click on 'start progress' to indicate that you're working on
> it, mark it 'patch available' when you've uploaded code to be reviewed (in
> a github branch, or as a standalone patch file attached to the JIRA). If
> you have questions, feel free to email the dev list (that's what it's here
> for).
>
> Many thanks will be given,
> - Jeff
>


Re: Flakey Dtests

2017-11-27 Thread Jaydeep Chovatia
This is useful info, Thanks!

Jaydeep

On Mon, Nov 27, 2017 at 2:43 PM, Michael Kjellman <
mkjell...@internalcircle.com> wrote:

> Complicated question unfortunately — and something we’re actively working
> on improving:
>
> Cassci is no longer being offered/run by Datastax and so we've needed to
> come up with a new solution, and what that ultimately is is still a WIP —
> its loss was very huge obviously and a testament to the awesome resource
> and effort that was put into providing it to the community for all those
> years.
>
>  - Short Term/Current: Tests (both dtests and unit tests) are being run
> via the ASF Jenkins (https://builds.apache.org) - but that solution isn’t
> hugely helpful as it’s resource constrained.
>  - Short-Medium Term: we hope to get a fully baked CircleCI solution to
> get reliable fast test runs.
>  - Long Term: Actively being discussed but I’m optimistic that we can get
> something awesome for the project with some stable combination of CircleCI
> + ASF Jenkins, and once we do I’m sure this will change any long term plans.
>
> For Unit Tests (a.k.a the Java ones in tree - https://github.com/apache/cassandra/tree/trunk/test/unit/org/apache/cassandra):
> Take a look at https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-trunk-test/… looks like the last successful job to finish was
> #389. (https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-trunk-test/389/testReport/). There are currently a total of 6
> tests  (all from CompressedInputStreamTest) failing on trunk via ASF
> Jenkins. These specific test failures are environmental. The only *unit*
> test on trunk that I currently know to be flaky is
> org.apache.cassandra.cql3.ViewTest.testRegularColumnTimestampUpdates
> (tracked as https://issues.apache.org/jira/browse/CASSANDRA-14054)
>
> For Distributed Tests (DTests) (a.k.a the Python ones -
> https://github.com/apache/cassandra-dtest):
> The situation is a great deal more complicated due to the length of time
> and number of resources executing all of the dtests take (and executing the
> tests across the various configurations)...
>
> There are 4 dtest jobs on ASF Jenkins for trunk:
> https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-trunk-dtest/
> https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-trunk-dtest-large/
> https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-trunk-dtest-novnode/
> https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-trunk-dtest-offheap/
>
> It looks like you’ll need to go back to run #353
> (https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-trunk-dtest/353/testReport/) to see the test results as the
> last 2 jobs that were triggered failed to execute. Depending on the
> environment variables set tests are executed or skipped — so you’ll see
> different tests being run on the no-vnode job/off-heap job/regular dtest
> job (or some tests might be run multiple times)
>
>
> More recently we’ve been working on getting CircleCI running. Some sample
> runs from my personal fork can be seen at
> https://circleci.com/gh/mkjellman/cassandra/tree/trunk_circle. I’m personally using a paid
> account to get more CircleCI resources (with 100 containers we can actually
> build the project, run all of the unit tests, and run all of the dtests in
> roughly 28 minutes!). I’m actively working to determine exactly what can
> (and cannot) be executed reliably, routinely, and easily by anyone with
> just a simple free CircleCI account.
>
> I’m also working on getting scheduled CircleCI daily runs setup against
> trunk/3.0 — more on both of those when we’ve got that story fully baked..
> Hope this answers your question! There are quite a few dtests currently
> failing and as Jeff mentioned I’ve created JIRAs for a lot of them already
> so any help (no matter how trivial or annoying it might be or seem) to get
> everything green again.
>
> best,
> kjellman
>
>
> On Nov 27, 2017, at 1:54 PM, Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:
>
> Is there a way to check which tests are failing in trunk currently?
> Previously this URL <http://cassci.datastax.com/> was giving such results
> but is no longer working.
>
> Jaydeep
>
> On Wed, Nov 15, 2017 at 5:44 PM, Jeff Jirsa <jjirsa...@gmail.com> wrote:
>
> In lieu of a weekly wrap-up, here's a pre-Thanksgiving call for help.
>
> If you haven't been paying attention to JIRA, you likely didn't notice that
> Josh went through and triage/categorized a bunch of issues by adding
> components, and Michael took the time to open a bunch of JIRAs for failing
> tests.
>
>

Re: Proposing an Apache Cassandra Management process

2018-04-12 Thread Jaydeep Chovatia
In my opinion this will be a great addition to Cassandra and will take the
overall Cassandra project to the next level. It will also improve the user
experience, especially for new users.

Jaydeep

On Thu, Apr 12, 2018 at 2:42 PM Dinesh Joshi 
wrote:

> Hey all -
> With the uptick in discussion around Cassandra operability and after
> discussing potential solutions with various members of the community, we
> would like to propose the addition of a management process/sub-project into
> Apache Cassandra. The process would be responsible for common operational
> tasks like bulk execution of nodetool commands, backup/restore, and health
> checks, among others. We feel we have a proposal that will garner some
> discussion and debate but is likely to reach consensus.
> While the community, in large part, agrees that these features should
> exist “in the database”, there is debate on how they should be implemented.
> Primarily, whether or not to use an external process or build on
> CassandraDaemon. This is an important architectural decision but we feel
> the most critical aspect is not where the code runs but that the operator
> still interacts with the notion of a single database. Multi-process
> databases are as old as Postgres and continue to be common in newer systems
> like Druid. As such, we propose a separate management process for the
> following reasons:
>
>- Resource isolation & Safety: Features in the management process will
> not affect C*'s read/write path which is critical for stability. An
> isolated process has several technical advantages including preventing use
> of unnecessary dependencies in CassandraDaemon, separation of JVM resources
> like thread pools and heap, and preventing bugs from adversely affecting
> the main process. In particular, GC tuning can be done separately for the
> two processes, hopefully helping to improve, or at least not adversely
> affect, tail latencies of the main process.
>
>- Health Checks & Recovery: Currently users implement health checks in
> their own sidecar process. Implementing them in the serving process does
> not make sense because if the JVM running the CassandraDaemon goes south,
> the healthchecks and potentially any recovery code may not be able to run.
> Having a management process running in isolation opens up the possibility
> to not only report the health of the C* process such as long GC pauses or
> stuck JVM but also to recover from it. Having a list of basic health checks
> that are tested with every C* release and officially supported will help
> boost confidence in C* quality and make it easier to operate.
>
>- Reduced Risk: By having a separate Daemon we open the possibility to
> contribute features that otherwise would not have been considered before
> eg. a UI. A library that started many background threads and is operated
> completely differently would likely be considered too risky for
> CassandraDaemon but is a good candidate for the management process.
>
>
> What can go into the management process?
>- Features that are non-essential for serving reads & writes for eg.
> Backup/Restore or Running Health Checks against the CassandraDaemon, etc.
>
>- Features that do not make the management process critical for
> functioning of the serving process. In other words, if someone does not
> wish to use this management process, they are free to disable it.
>
> We would like to initially build minimal set of features such as health
> checks and bulk commands into the first iteration of the management
> process. We would use the same software stack that is used to build the
> current CassandraDaemon binary. This would be critical for sharing code
> between CassandraDaemon & management processes. The code should live
> in-tree to make this easy.
> With regards to more in-depth features like repair scheduling and
> discussions around compaction in or out of CassandraDaemon, while the
> management process may be a suitable host, it is not our goal to decide
> that at this time. The management process could be used in these cases, as
> they meet the criteria above, but other technical/architectural reasons may
> exists for why it should not be.
> We are looking forward to your comments on our proposal,
> Dinesh Joshi and Jordan West
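
To illustrate the health-check idea from the proposal above, here is a minimal
sketch of an out-of-process probe. It is only an assumption-laden example: the
probe used (a TCP connect to the default native transport port 9042), the polling
interval, and the class name are illustrative and are not part of the proposal.

// Illustrative sketch of a separate management-process health probe.
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public final class CassandraHealthProbe {
    private static final String HOST = "127.0.0.1";
    private static final int NATIVE_PORT = 9042;   // assumed default native transport port
    private static final int TIMEOUT_MS = 2_000;

    static boolean nativePortReachable() {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(HOST, NATIVE_PORT), TIMEOUT_MS);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Poll every 10 seconds; a real management process would also export the
        // result as a metric and trigger recovery actions on repeated failures.
        scheduler.scheduleAtFixedRate(() -> {
            boolean up = nativePortReachable();
            System.out.println((up ? "UP" : "DOWN") + " native transport on " + HOST + ":" + NATIVE_PORT);
        }, 0, 10, TimeUnit.SECONDS);
    }
}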


Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-13 Thread Jaydeep Chovatia
Rejecting/logging the traffic is a significant step forward, but that does
not solve the real problem. It still degrades the workload and requires
manual operator involvement.

How about we also enhance Cassandra to automatically detect and fix the
token ownership mismatch between the StorageService and Gossip caches? More
details in this ticket:
https://issues.apache.org/jira/browse/CASSANDRA-18758

Jaydeep

On Thu, Sep 12, 2024 at 9:07 AM Caleb Rackliffe 
wrote:

> Until we release TCM, it will continue to be possible for nodes to have a
> divergent view of the ring, and this means operations can still be sent to
> the wrong nodes. For example, writes may be sent to nodes that do not and
> never will own that data, and this opens us up to rather devious silent
> data loss problems.
>
> As some of you may have seen, there is a patch available for 4.0, 4.1, and
> 5.0 in CASSANDRA-13704
>  that provides a
> set of guardrails in the meantime for out-of-range operations. Essentially,
> there are two new YAML options that control whether or not to log warnings
> and/or reject operations that shouldn't have arrived at a receiving node.
>
> Given that simply logging and recording metrics isn't that invasive, the
> question we need to answer here is whether we should reject out-of-range
> operations by default, even in these patch releases. (5.0 has just barely
> been released, so I'm not sure if that really qualifies, but I digress.)
> The position I'd like to take is that this is essentially a matter of
> correctness, and we should *enable rejection by default*. (Keep in mind
> that both new options are settable at runtime via JMX.) There is precedent
> for doing something similar to this in CASSANDRA-12126
> .
>
> The one consequence of that we might discuss here is that if gossip is
> behind in notifying a node with a pending range, local rejection as it
> receives writes for that range may cause a small issue of availability.
> However, this shouldn't happen in a healthy cluster, and even if it does,
> we're simply translating a silent potential data loss bug into a transient
> but necessary availability gap with reasonable logging and visibility.
>


[Discuss] Detect+Sync Gossip&StorageService cache to improve the out-of-range tokens issue

2024-09-15 Thread Jaydeep Chovatia
Each Cassandra node keeps token ownership in two separate caches: 1) the
Gossip cache and 2) the StorageService cache. The Gossip cache is updated
first on a node, followed by the StorageService cache. In the hot path,
ownership is calculated from the StorageService cache. Since two separate
caches maintain the same information, inconsistencies are bound to happen.
It is feasible that the Gossip cache has up-to-date ownership of the
Cassandra cluster while the StorageService cache does not, and in that
scenario inconsistent data will be served to the user.

No mechanism exists in Cassandra to detect and reconcile divergence between
these two caches. The following ticket/PR attempts to reduce out-of-range
token issues.

*Long-term solution:* The long-term solution is TCM (CEP-21), but it might
take some time as it will be in 5.1, and folks currently on 4.x could
have to wait a while before taking full advantage of TCM. In the
interim, this PR provides some relief!

*Ticket:* https://issues.apache.org/jira/browse/CASSANDRA-18758
*PR (on 4.1):* https://github.com/apache/cassandra/pull/3548

Jaydeep
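
For illustration only, here is a deliberately simplified model of the
detect-and-sync idea. The real patch works against Cassandra's internal
Gossiper and TokenMetadata structures; this sketch models both caches as plain
maps just to show the reconciliation loop, and every name in it is an
assumption rather than code from CASSANDRA-18758.

// Simplified model: detect divergent endpoints, then overwrite the
// StorageService view with the Gossip view (treated as source of truth here).
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public final class TokenOwnershipReconciler {
    // endpoint -> set of tokens, as seen by each cache
    private final Map<String, Set<Long>> gossipView = new ConcurrentHashMap<>();
    private final Map<String, Set<Long>> storageServiceView = new ConcurrentHashMap<>();

    public TokenOwnershipReconciler(Map<String, Set<Long>> gossip, Map<String, Set<Long>> storage) {
        gossipView.putAll(gossip);
        storageServiceView.putAll(storage);
    }

    /** Detect endpoints whose ownership differs between the two caches and sync them. */
    public void reconcile() {
        for (Map.Entry<String, Set<Long>> e : gossipView.entrySet()) {
            String endpoint = e.getKey();
            Set<Long> gossipTokens = e.getValue();
            Set<Long> storageTokens = storageServiceView.get(endpoint);
            if (!gossipTokens.equals(storageTokens)) {
                System.out.printf("Mismatch for %s: gossip=%s storage=%s -> syncing%n",
                                  endpoint, gossipTokens, storageTokens);
                storageServiceView.put(endpoint, gossipTokens);
            }
        }
    }

    public static void main(String[] args) {
        TokenOwnershipReconciler r = new TokenOwnershipReconciler(
            Map.of("n3", Set.of(10L, 20L), "n5", Set.of(30L)),   // gossip view
            Map.of("n3", Set.of(10L, 20L), "n5", Set.of(99L)));  // stale storage-service view
        r.reconcile();  // logs and repairs the n5 entry
    }
}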


Cassandra 3.0 - A new node is not able to join the cluster

2022-05-02 Thread Jaydeep Chovatia
Hi,

I have a production Cassandra cluster on the 3.0.14 branch. Each node holds
roughly 1.5 TB of data, with a ring size of 70+70. I need to add more
capacity to meet production demand, but when I add the 71st node, it streams
data from other nodes as expected, and then after some time it spends an
enormous amount of time doing GCs and freezes.
Snippet from the log file...

{"@timestamp":"2022-04-23T02:21:38.030+00:00","@version":1,"message":"G1
Old Generation GC in 18288ms.  G1 Old Gen: 34206148032 -> 33725019032;
","logger_name":"o.a.c.service.GCInspector","thread_name":"Service
Thread","level":"WARN","level_value":3}


The new node has around 1.5M SSTables (from multiple tables). If I reduce
the SSTable count to below 500K, then it joins fine.
I am using LCS compaction with default settings. I've tried changing to
STCS, but no luck :(

Any help would be highly appreciated. Thanks a lot!

Jaydeep


Cassandra Token ownership split-brain (3.0.14)

2022-09-02 Thread Jaydeep Chovatia
Hi,

We are running a production Cassandra version (3.0.14) with a 256-token
v-node configuration. Occasionally, we see that different nodes show
different ownership for the same key. Only a node restart corrects it;
otherwise, the node continues to behave in a split-brain manner.

Say, for example,

*NodeA*
nodetool getendpoints ks1 table1 10
- n1
- n2
- n3

*NodeB*
nodetool getendpoints ks1 table1 10
- n1
- n2
*- n5*

If I restart NodeB, then it shows the correct ownership {n1,n2,n3}. The
majority of the nodes in the ring show correct ownership {n1,n2,n3}, only a
few show this issue, and restarting them solves the problem.

To me, it seems that Cassandra's Gossip cache and StorageService cache
(TokenMetadata) are having some sort of cache-coherence issue.

Has anyone observed this behavior?
Any help would be highly appreciated.

Jaydeep


Re: Cassandra Token ownership split-brain (3.0.14)

2022-09-06 Thread Jaydeep Chovatia
If anyone has seen this issue and knows a fix, it would be a great help!
Thanks in advance.

Jaydeep

On Fri, Sep 2, 2022 at 1:56 PM Jaydeep Chovatia 
wrote:

> Hi,
>
> We are running a production Cassandra version (3.0.14) with 256 tokens
> v-node configuration. Occasionally, we see that different nodes show
> different ownership for the same key. Only a node restart corrects;
> otherwise, it continues to behave in a split-brain.
>
> Say, for example,
>
> *NodeA*
> nodetool getendpoints ks1 table1 10
> - n1
> - n2
> - n3
>
> *NodeB*
> nodetool getendpoints ks1 table1 10
> - n1
> - n2
> *- n5*
>
> If I restart NodeB, then it shows the correct ownership {n1,n2,n3}. The
> majority of the nodes in the ring show correct ownership {n1,n2,n3}, only a
> few show this issue, and restarting them solves the problem.
>
> To me, it seems I think Cassandra's Gossip cache and StorageService cache
> (TokenMetadata) are having some sort of cache coherence.
>
> Anyone has observed this behavior?
> Any help would be highly appreciated.
>
> Jaydeep
>


Re: Cassandra Token ownership split-brain (3.0.14)

2022-09-06 Thread Jaydeep Chovatia
Thanks Scott. I will prioritize upgrading to 3.0.27 and will circle back if
this issue persists.

Jaydeep


On Tue, Sep 6, 2022 at 3:45 PM C. Scott Andreas 
wrote:

> Hi Jaydeep,
>
> Thanks for reaching out and for bumping this thread.
>
> This is probably not the answer you’re after, but mentioning as it may
> address the issue.
>
> C* 3.0.14 was released over five years ago, with many hundreds of
> important bug fixes landing since July 2017. These include fixes for issues
> that have affected gossip in the past which may be related to this issue.
> Note that 3.0.14 also is susceptible to several critical data loss bugs
> including C-14513 and C-14515.
>
> I’d strongly recommend upgrading to Cassandra 3.0.27 as a starting point.
> If this doesn’t resolve your issue, members of the community may be in a
> better position to help triage a bug report against a current release of
> the database.
>
> - Scott
>
> On Sep 6, 2022, at 5:13 PM, Jaydeep Chovatia 
> wrote:
>
> 
> If anyone has seen this issue and knows a fix, it would be a great help!
> Thanks in advance.
>
> Jaydeep
>
> On Fri, Sep 2, 2022 at 1:56 PM Jaydeep Chovatia <
> chovatia.jayd...@gmail.com> wrote:
>
>> Hi,
>>
>> We are running a production Cassandra version (3.0.14) with 256 tokens
>> v-node configuration. Occasionally, we see that different nodes show
>> different ownership for the same key. Only a node restart corrects;
>> otherwise, it continues to behave in a split-brain.
>>
>> Say, for example,
>>
>> *NodeA*
>> nodetool getendpoints ks1 table1 10
>> - n1
>> - n2
>> - n3
>>
>> *NodeB*
>> nodetool getendpoints ks1 table1 10
>> - n1
>> - n2
>> *- n5*
>>
>> If I restart NodeB, then it shows the correct ownership {n1,n2,n3}. The
>> majority of the nodes in the ring show correct ownership {n1,n2,n3}, only a
>> few show this issue, and restarting them solves the problem.
>>
>> To me, it seems I think Cassandra's Gossip cache and StorageService cache
>> (TokenMetadata) are having some sort of cache coherence.
>>
>> Anyone has observed this behavior?
>> Any help would be highly appreciated.
>>
>> Jaydeep
>>
>


Cassandra 3.0.27 - Tombstone disappeared during node replacement

2022-11-14 Thread Jaydeep Chovatia
Hi,

I am running Cassandra 3.0.27 in production. In some corner-case
scenarios, we see tombstones disappearing during bootstrap/decommission.
I've outlined a possible theory with the root cause in this ticket:
https://issues.apache.org/jira/browse/CASSANDRA-17991

Could someone please help validate this?

Jaydeep


Re: [Discuss] Repair inside C*

2024-10-18 Thread Jaydeep Chovatia
Indeed, Mike - two additional weeks is not an issue at all.
Thanks!

Jaydeep


Re: [Discuss] Repair inside C*

2024-10-18 Thread Jaydeep Chovatia
Mick, I am very sorry for misspelling your name.

Indeed, Mick  - two additional weeks is not an issue at all.

Jaydeep

On Fri, Oct 18, 2024 at 1:41 PM Jaydeep Chovatia 
wrote:

> Indeed, Mike - two additional weeks is not an issue at all.
> Thanks!
>
> Jaydeep
>


Re: [Discuss] Repair inside C*

2024-10-21 Thread Jaydeep Chovatia
>Jaydeep, do you have any metrics on your clusters comparing them before
and after introducing repair scheduling into the Cassandra process?

Yes, I had made some comparisons when I started rolling this feature out to
our production five years ago :)  Here are the details:
*The Scheduling*
The scheduling itself is exceptionally lightweight, as only one additional
thread monitors the repair activity, updating the status to a system table
once every few minutes or so. So, it does not appear anywhere in the CPU
charts, etc. Unfortunately, I do not have those graphs now, but I can do a
quick comparison if it helps!
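
As a rough illustration of this "one extra thread" model, the sketch below
shows a single scheduled task that periodically decides whether to trigger a
repair and records its status. The class and method names (AutoRepairScheduler,
RepairStatusStore, shouldRepairNow) are assumptions for illustration and are
not the actual CEP-37 classes.

// Illustrative single-thread scheduler loop; the heavy lifting (Merkle trees,
// streaming) only happens when a repair is actually triggered.
import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public final class AutoRepairScheduler {
    /** Stand-in for persisting status to a system table (e.g. system_distributed). */
    interface RepairStatusStore { void record(String nodeId, String state, Instant at); }

    private final ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    private final RepairStatusStore statusStore;
    private final String nodeId;

    AutoRepairScheduler(String nodeId, RepairStatusStore statusStore) {
        this.nodeId = nodeId;
        this.statusStore = statusStore;
    }

    void start() {
        // One lightweight task every few minutes.
        executor.scheduleWithFixedDelay(this::tick, 0, 5, TimeUnit.MINUTES);
    }

    private void tick() {
        if (shouldRepairNow()) {
            statusStore.record(nodeId, "REPAIR_STARTED", Instant.now());
            // ... trigger the actual repair asynchronously ...
        } else {
            statusStore.record(nodeId, "IDLE", Instant.now());
        }
    }

    private boolean shouldRepairNow() {
        // Placeholder decision: e.g. "is it this node's turn and has the min repair interval elapsed?"
        return false;
    }

    public static void main(String[] args) {
        AutoRepairScheduler s = new AutoRepairScheduler("node-1",
            (node, state, at) -> System.out.printf("%s %s %s%n", at, node, state));
        s.start();
    }
}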

*The Repair Itself*
As we all know, the Cassandra repair algorithm is a heavy-weight process
due to Merkle tree/streaming, etc., no matter how we schedule it. But it is
an orthogonal topic and folks are already discussing creating a new CEP.

Jaydeep


On Mon, Oct 21, 2024 at 10:02 AM Francisco Guerrero 
wrote:

> Jaydeep, do you have any metrics on your clusters comparing them before
> and after introducing repair scheduling into the Cassandra process?
>
> On 2024/10/21 16:57:57 "J. D. Jordan" wrote:
> > Sounds good. Just wanted to bring it up. I agree that the scheduling bit
> is
> > pretty light weight and the ideal would be to bring the whole of the
> repair
> > external, which is a much bigger can of worms to open.
> >
> >
> >
> > -Jeremiah
> >
> >
> >
> > > On Oct 21, 2024, at 11:21 AM, Chris Lohfink 
> wrote:
> > >
> > >
> >
> > > 
> > >
> > > > I actually think we should be looking at how we can move things out
> of the
> > > database process.
> > >
> > >
> > >
> > >
> > >
> > > While worth pursuing, I think we would need a different CEP just to
> figure
> > > out how to do that. Not only is there a lot of infrastructure
> difficulty in
> > > running multi process, the inter app communication needs to be figured
> out
> > > better then JMX. Even the sidecar we dont have a solid story on how to
> > > ensure both are running or anything yet. It's up to each app owner to
> figure
> > > it out. Once we have a good thing in place I think we can start moving
> > > compactions, repairs, etc out of the database. Even then it's the
> _repairs_
> > > that is expensive, not the scheduling.
> > >
> > >
> > >
> > >
> > > On Mon, Oct 21, 2024 at 9:45 AM Jeremiah Jordan
> > > <[jeremiah.jor...@gmail.com](mailto:jeremiah.jor...@gmail.com)>
> wrote:
> > >
> > >
> >
> > >> I love the idea of a repair service being there by default for an
> install
> > of C*.  My main concern here is that it is putting more services into
> the main
> > database process.  I actually think we should be looking at how we can
> move
> > things out of the database process.  The C* process being a giant
> monolith has
> > always been a pain point.  Is there anyway it makes sense for this to be
> an
> > external process rather than a new thread pool inside the C* process?
> >
> > >>
> >
> > >>
> > >
> > >>
> >
> > >> -Jeremiah Jordan
> >
> > >>
> >
> > >>
> > >
> > >>
> >
> > >> On Oct 18, 2024 at 2:58:15 PM, Mick Semb Wever
> > <[m...@apache.org](mailto:m...@apache.org)> wrote:
> > >
> > >>
> >
> > >>>
> > >
> > >>>
> >
> > >>> This is looking strong, thanks Jaydeep.
> >
> > >>>
> >
> > >>>
> > >
> > >>>
> >
> > >>> I would suggest folk take a look at the design doc and the PR in the
> CEP.
> > A lot is there (that I have completely missed).
> >
> > >>>
> >
> > >>>
> > >
> > >>>
> >
> > >>> I would especially ask all authors of prior art (Reaper, DSE
> nodesync,
> > ecchronos)  to take a final review of the proposal
> > >
> > >>>
> >
> > >>>
> > >
> > >>>
> >
> > >>> Jaydeep, can we ask for a two week window while we reach out to these
> > people ?  There's a lot of prior art in this space, and it feels like
> we're in
> > a good place now where it's clear this has legs and we can use that to
> bring
> > folk in and make sure there's no remaining blindspots.
> >
> > >>>
> >
> &g

Re: [Discuss] Repair inside C*

2024-10-17 Thread Jaydeep Chovatia
Sorry, there is a typo in the CEP-37 link; here is the correct link
<https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+Apache+Cassandra+Unified+Repair+Solution>


On Thu, Oct 17, 2024 at 4:36 PM Jaydeep Chovatia 
wrote:

> First, thank you for your patience while we strengthened the CEP-37.
>
>
> Over the last eight months, Chris Lohfink, Andy Tolbert, Josh McKenzie,
> Dinesh Joshi, Kristijonas Zalys, and I have done tons of work (online
> discussions/a dedicated Slack channel #cassandra-repair-scheduling-cep37)
> to come up with the best possible design that not only significantly
> simplifies repair operations but also includes the most common features
> that everyone will benefit from running at Scale.
>
> For example,
>
>-
>
>Apache Cassandra must be capable of running multiple repair types,
>such as Full, Incremental, Paxos, and Preview - so the framework should be
>easily extendable with no additional overhead from the operator’s point of
>view.
>-
>
>An easy way to extend the token-split calculation algorithm with a
>default implementation should exist.
>-
>
>Running incremental repair reliably at Scale is pretty challenging, so
>we need to place safeguards, such as migration/rollback w/o restart and
>stopping incremental repair automatically if the disk is about to get full.
>
> We are glad to inform you that CEP-37 (i.e., Repair inside Cassandra) is
> now officially ready for review after multiple rounds of design, testing,
> code reviews, documentation reviews, and, more importantly, validation that
> it runs at Scale!
>
>
> Some facts about CEP-37.
>
>-
>
>Multiple members have verified all aspects of CEP-37 numerous times.
>-
>
>The design proposed in CEP-37 has been thoroughly tried and tested on
>an immense scale (hundreds of unique Cassandra clusters, tens of thousands
>of Cassandra nodes, with tens of millions of QPS) on top of 4.1 open-source
>for more than five years; please see more details here
>
> <https://www.uber.com/en-US/blog/how-uber-optimized-cassandra-operations-at-scale/>
>.
>-
>
>The following presentation
>
> <https://docs.google.com/presentation/d/1Zilww9c7LihHULk_ckErI2s4XbObxjWknKqRtbvHyZc/edit#slide=id.g30a4fd4fcf7_0_13>
    highlights the rigor applied to CEP-37, which was given during last
>week’s Apache Cassandra Bay Area Meetup
><https://www.meetup.com/apache-cassandra-bay-area/events/303469006/>,
>
>
> Since things are massively overhauled, we believe it is almost ready for a
> final pass pre-VOTE. We would like you to please review the CEP-37
> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+Apache+Cassandra+Unified+Repair+Solution>
> and the associated detailed design doc
> <https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit#heading=h.r112r46toau0>
> .
>
> Thank you everyone!
>
> Chris, Andy, Josh, Dinesh, Kristijonas, and Jaydeep
>
>
>
> On Thu, Sep 19, 2024 at 11:26 AM Josh McKenzie 
> wrote:
>
>> Not quite; finishing touches on the CEP and design doc are in flight (as
>> of last / this week).
>>
>> Soon(tm).
>>
>> On Thu, Sep 19, 2024, at 2:07 PM, Patrick McFadin wrote:
>>
>> Is this CEP ready for a VOTE thread?
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+%28DRAFT%29+Apache+Cassandra+Unified+Repair+Solution
>>
>> On Sun, Feb 25, 2024 at 12:25 PM Jaydeep Chovatia <
>> chovatia.jayd...@gmail.com> wrote:
>>
>> Thanks, Josh. I've just updated the CEP
>> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+%28DRAFT%29+Apache+Cassandra+Official+Repair+Solution>
>> and included all the solutions you mentioned below.
>>
>> Jaydeep
>>
>> On Thu, Feb 22, 2024 at 9:33 AM Josh McKenzie 
>> wrote:
>>
>>
>> Very late response from me here (basically necro'ing this thread).
>>
>> I think it'd be useful to get this condensed into a CEP that we can then
>> discuss in that format. It's clearly something we all agree we need and
>> having an implementation that works, even if it's not in your preferred
>> execution domain, is vastly better than nothing IMO.
>>
>> I don't have cycles (nor background ;) ) to do that, but it sounds like
>> you do Jaydeep given the implementation you have on a private fork + design.
>>
>> A non-exhaustive list of things that might be useful incorporating into
>> or referencing from a CEP:
>> Slack thread:
>> https://the-asf.slack.com/arch

Re: [Discuss] Repair inside C*

2024-10-17 Thread Jaydeep Chovatia
First, thank you for your patience while we strengthened the CEP-37.


Over the last eight months, Chris Lohfink, Andy Tolbert, Josh McKenzie,
Dinesh Joshi, Kristijonas Zalys, and I have done tons of work (online
discussions/a dedicated Slack channel #cassandra-repair-scheduling-cep37)
to come up with the best possible design that not only significantly
simplifies repair operations but also includes the most common features
that everyone will benefit from running at Scale.

For example,

   -

   Apache Cassandra must be capable of running multiple repair types, such
   as Full, Incremental, Paxos, and Preview - so the framework should be
   easily extendable with no additional overhead from the operator’s point of
   view.
   -

   An easy way to extend the token-split calculation algorithm with a
   default implementation should exist.
   -

   Running incremental repair reliably at Scale is pretty challenging, so
   we need to place safeguards, such as migration/rollback w/o restart and
   stopping incremental repair automatically if the disk is about to get full.

We are glad to inform you that CEP-37 (i.e., Repair inside Cassandra) is
now officially ready for review after multiple rounds of design, testing,
code reviews, documentation reviews, and, more importantly, validation that
it runs at Scale!


Some facts about CEP-37.

   -

   Multiple members have verified all aspects of CEP-37 numerous times.
   -

   The design proposed in CEP-37 has been thoroughly tried and tested on an
   immense scale (hundreds of unique Cassandra clusters, tens of thousands of
   Cassandra nodes, with tens of millions of QPS) on top of 4.1 open-source
   for more than five years; please see more details here
   
<https://www.uber.com/en-US/blog/how-uber-optimized-cassandra-operations-at-scale/>
   .
   -

   The following presentation
   
<https://docs.google.com/presentation/d/1Zilww9c7LihHULk_ckErI2s4XbObxjWknKqRtbvHyZc/edit#slide=id.g30a4fd4fcf7_0_13>
    highlights the rigor applied to CEP-37, which was given during last
   week’s Apache Cassandra Bay Area Meetup
   <https://www.meetup.com/apache-cassandra-bay-area/events/303469006/>,


Since things are massively overhauled, we believe it is almost ready for a
final pass pre-VOTE. We would like you to please review the CEP-37
<https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+Apache+Cassandra+Unified+Repair+Solution>
and the associated detailed design doc
<https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit#heading=h.r112r46toau0>
.

Thank you everyone!

Chris, Andy, Josh, Dinesh, Kristijonas, and Jaydeep



On Thu, Sep 19, 2024 at 11:26 AM Josh McKenzie  wrote:

> Not quite; finishing touches on the CEP and design doc are in flight (as
> of last / this week).
>
> Soon(tm).
>
> On Thu, Sep 19, 2024, at 2:07 PM, Patrick McFadin wrote:
>
> Is this CEP ready for a VOTE thread?
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+%28DRAFT%29+Apache+Cassandra+Unified+Repair+Solution
>
> On Sun, Feb 25, 2024 at 12:25 PM Jaydeep Chovatia <
> chovatia.jayd...@gmail.com> wrote:
>
> Thanks, Josh. I've just updated the CEP
> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+%28DRAFT%29+Apache+Cassandra+Official+Repair+Solution>
> and included all the solutions you mentioned below.
>
> Jaydeep
>
> On Thu, Feb 22, 2024 at 9:33 AM Josh McKenzie 
> wrote:
>
>
> Very late response from me here (basically necro'ing this thread).
>
> I think it'd be useful to get this condensed into a CEP that we can then
> discuss in that format. It's clearly something we all agree we need and
> having an implementation that works, even if it's not in your preferred
> execution domain, is vastly better than nothing IMO.
>
> I don't have cycles (nor background ;) ) to do that, but it sounds like
> you do Jaydeep given the implementation you have on a private fork + design.
>
> A non-exhaustive list of things that might be useful incorporating into or
> referencing from a CEP:
> Slack thread:
> https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
> Joey's old C* ticket:
> https://issues.apache.org/jira/browse/CASSANDRA-14346
> Even older automatic repair scheduling:
> https://issues.apache.org/jira/browse/CASSANDRA-10070
> Your design gdoc:
> https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit#heading=h.r112r46toau0
> PR with automated repair:
> https://github.com/jaydeepkumar1984/cassandra/commit/ef6456d652c0d07cf29d88dfea03b73704814c2c
>
> My intuition is that we're all basically in agreement that this is
> something the DB needs, we're all willing to bikeshed for our personal
> preference on where it lives and how it's implemented, and at the end

Re: [Discuss] Repair inside C*

2024-10-28 Thread Jaydeep Chovatia
Thanks, Mick, for the comment, please find my response below.

>(1)

I think I covered most of the points in my response to Alexander (except
one, which I am responding to below separately). Tl;dr is the MVP that can
be easily extended to do a table-level schedule; it is just going to be
another CQL table property as opposed to a yaml config (currently in MVP).
I had already added this as a near-term feature here and added that when we
add repair priority table-wise, we need to ensure the table-level
scheduling is also taken care of. Please visit my latest few comments to
the ticket https://issues.apache.org/jira/browse/CASSANDRA-20013

>You may also want to do repairs in different DCs differently.

Currently, the MVP allows one to skip one or more DCs if they wish to do so;
by default all DCs are repaired. This again points to the same theme of
allowing a schedule (or priority) at the table level followed by the DC
level. The MVP can be easily extended to whatever granularity we want
scheduling at without many architectural changes; we all just have to
finalize the granularity we want. I've also added to the ticket above that
scheduling support at a table level followed by DC-level granularity is planned.

>I'm curious as to how crashed repairs are handled and resumed

The MVP has a max allowed quota at a keyspace level and at a table level.
So, if a repair and/or keyspace takes much longer than the timeout due to
failures/more data it needs to repair, etc., then it will skip to the next
table/keyspace.
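
A rough sketch of this quota/skip behaviour, under the assumption that each
table's repair runs under a time budget and is cancelled and skipped when the
budget is exceeded (names and timings below are illustrative, not CEP-37 values):

// Illustrative per-table time budget with skip-to-next semantics.
import java.util.List;
import java.util.concurrent.*;

public final class RepairWithQuota {
    private static final ExecutorService REPAIR_POOL = Executors.newSingleThreadExecutor();

    static void repairTables(List<String> tables, long quotaMinutes) {
        for (String table : tables) {
            Future<?> repair = REPAIR_POOL.submit(() -> runRepair(table));
            try {
                repair.get(quotaMinutes, TimeUnit.MINUTES);      // wait up to the quota
            } catch (TimeoutException e) {
                repair.cancel(true);                             // give up on this table...
                System.out.println("Quota exceeded, skipping table " + table);
            } catch (InterruptedException | ExecutionException e) {
                System.out.println("Repair failed for " + table + ": " + e + ", skipping");
            }
        }
    }

    private static void runRepair(String table) {
        // Placeholder for the actual (expensive) repair of one table/range.
    }

    public static void main(String[] args) {
        repairTables(List.of("ks1.t1", "ks1.t2"), 180);  // e.g. a 3-hour per-table budget
        REPAIR_POOL.shutdown();
    }
}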

>Without any per-table scheduling and history (IIUC)  a node would have to
restart the repairs for all keyspaces and tables.

The above-mentioned quota should work fine and will make sure the bad
tables/keyspaces are skipped, allowing the good keyspaces/tables to proceed
on a node, unless the Cassandra JVM itself keeps crashing. If the JVM
keeps crashing, then repair will start all over again, but then fixing the
JVM crashing might be a more significant issue and does not happen
regularly, IMO.

>And without such per-table tracking, I'm also kinda curious as to how we
interact with manual repair invocations the user makes.  There are
operational requirements to do manual repairs, e.g. node replacement or if
a node has been down for too long, and consistency breakages until such
repair is complete.  Leaving such operational requirements to this CEP's
in-built scheduler is a limited approach, it may be many days before it
gets to doing it, and even with node priority will it appropriately switch
from primary-range to all-replica-ranges?

To alleviate some of this, the MVP has two options one can configure
dynamically through *nodetool*: 1) Setting priority for nodes, 2) Telling
the scheduler to repair one or more nodes immediately
If an admin puts some nodes on the priority queue, those nodes will be
repaired ahead of the scheduler's own list. If an admin tags some nodes on the
emergency list, then those nodes will be repaired immediately. Basically, an
admin tells the scheduler, "*Just do what I say instead of using your list
of nodes*".
Even with this, if an admin decides to trigger repair manually directly
through *nodetool repair*, then the scheduler should not interfere with
that manually triggered operation - they can progress independently. The
MVP has options to disable the scheduler's repair dynamically without any
cluster restart, etc., so the admin can use some of the combinations and
decide what to do when they invoke any manual repair operation.
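
The following small sketch illustrates (under assumed names, not the actual MVP
code) how the scheduler could merge its normal rotation with the
operator-supplied priority and emergency lists: emergency nodes first, then
priority nodes, then the scheduler's own list.

// Illustrative merge of emergency, priority, and scheduler-rotation node lists.
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public final class NodeOrdering {
    static List<String> nextNodesToRepair(List<String> schedulerRotation,
                                          List<String> priorityList,
                                          List<String> emergencyList) {
        // LinkedHashSet keeps insertion order and drops duplicates,
        // so a node named in several lists is only repaired once per cycle.
        Set<String> ordered = new LinkedHashSet<>();
        ordered.addAll(emergencyList);     // "repair these immediately"
        ordered.addAll(priorityList);      // "repair these before your own list"
        ordered.addAll(schedulerRotation); // the scheduler's normal rotation
        return new ArrayList<>(ordered);
    }

    public static void main(String[] args) {
        System.out.println(nextNodesToRepair(
            List.of("n1", "n2", "n3", "n4"),
            List.of("n3"),
            List.of("n4")));   // -> [n4, n3, n1, n2]
    }
}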

>What if the user accidentally invokes an incremental repair when the
in-built scheduler is expecting only to ever perform full repairs? Does it
know how to detect/remedy that?

The user invocation and the scheduler invocations go through two different
Repair sessions. If the MVP scheduler has been configured only to perform
FR, then the scheduler will never fire IR, but it does not prohibit the
user from firing IR through *nodetool repair*. As an enhancement to the
MVP, in the future, we must warn the user that it might not be safe to run
IR as the in-built scheduler has been configured not to do IR, etc., so be
careful, etc.

>Having read the design doc and PR, I am impressed how lightweight the
design of the tables are.

Thanks. To reiterate, the number of records in the system_distributed
keyspace will be equivalent to the number of nodes in the cluster.

>But I do still think we deserve some numbers, and a further line of
questioning:  what consistency guarantees do we need, how does this work
cross-dc, during topology changes, does an event that introduces
data-at-rest inconsistencies in the cluster then become
confused/inefficient when the mechanism to repair it also now has its
metadata inconsistent.  For the most part this is a problem not unique to
any table in system_distributed and otherwise handled, but how does the
system_distributed keyspace handling of such failures impact repairs.

Keeping practicality in mind, the record count to the table should be as
small as three rows and 

Re: [Discuss] Repair inside C*

2024-10-29 Thread Jaydeep Chovatia
>Repairs causing a node to OOM is not unusual.   I've been working with a
customer in this situation the past few weeks.  Getting fixes out, or
mitigating the problem, is not always as quick as one hopes (see my
previous comment about how the repair_session_size setting gets easily
clobbered today).  This situation would be much improved with table
priority and tracking is added to the system_distributed table(s).
I agree we would need to tackle this OOM / JVM-crashing scenario
eventually. On the other hand, adding table-level tracking looks easy,
but perfecting it would take some effort: we would have to handle
all corner-case scenarios, such as cleaning up the state metadata and what
happens if there is a race condition where a table was dropped but its
metadata could not be cleaned up. The architecture extension is simple, but
making it bug-free and robust is a bit complex.

>Does this emergency list imply then not doing --partitioner-range ?
The emergency list is to prioritize a few nodes over others, but those
nodes will continue to honor the same repair configuration that has been
provided. The default configuration is to repair primary token ranges only.

>For per-table custom-priorities and tracking it sounds like adding a
clustering column.  So the number of records would go from ~number of nodes
in the cluster, to ~number of nodes multiplied by up to the number of
tables in the cluster.  We do see clusters too often with up to a thousand
tables, despite strong recommendations not to go over two hundred.  Do you
see here any concern ?
My initial thought is to add this as a CQL table property, something like
"repair_priority=0.0", with all tables having the same priority by default.
The user can then change the priority through ALTER, say, "ALTER TABLE T1
WITH repair_priority=0.1", and T1 will be prioritized over other tables.
Again, I need to give it more thought and hold a small discussion, either
in a bi-weekly meeting or on a ticket, to ensure all folks are on the same
page. If we go with this approach, we do not need to add any additional
columns to the repair metadata tables, so the design continues to
remain lightweight.
For a moment, let's just assume we add a new clustering column to track
tables. After that, the number of rows will be = <number of nodes> *
<number of tables>, which is still not an issue. As I mentioned above, the
bigger problem for table tracking is not the architecture extension, but
perfecting it against all race conditions is a bit complex.

>Also, in what versions will we be able to introduce such improvements ? We
will be waiting until the next major release ?  Playing around with the
schema of system tables in release branches is not much fun.
There are a few priority items beyond the CEP-37 MVP scope that some
of us are working on:
1. Extend the disk-capacity check to full repair - Jaydeep
2. Make incremental repair more reliable by having an unrepaired-size-based
token splitter - Andy T, Chris L
3. Add support for Preview Repair - Kristijonas
4. Start a new ML discussion to gauge consensus on whether repairs should be
backward/forward compatible between major versions in the future - Andy T

On top of the above list, here is my recommendation (this is just a pure
thought only and subject to change depending on how all the community
members see it):

   - Nov-Dec: We can definitely prioritize the table-level priority
   feature, which would address many concerns - Jaydeep (I can take the lead
   for a small discussion followed by implementation)
   - Nov-Feb: For table-level tracking, we can divide it into two parts:
  - (Part-1) Nov-Dec: A video meeting discussion among a few of us and
  see how we want to design, etc. -  Jaydeep
  - (Part-2) Dec-Feb: Based on the above design, implement accordingly
  - *TODO*


Jaydeep


On Tue, Oct 29, 2024 at 12:06 PM Mick Semb Wever  wrote:

>
> Jaydeep,
>   your replies address my main concerns, there's a few questions of
> curiosity as replies inline below…
>
>
>
>
>
>> >Without any per-table scheduling and history (IIUC)  a node would have
>> to restart the repairs for all keyspaces and tables.
>>
>> The above-mentioned quote should work fine and will make sure the bad
>> tables/keyspaces are skipped, allowing the good keyspaces/tables to proceed
>> on a node as long as the Cassandra JVM itself keeps crashing. If a JVM
>> keeps crashing, then it will restart all over again, but then fixing the
>> JVM crashing might be a more significant issue and does not happen
>> regularly, IMO.
>>
>
>
> Repairs causing a node to OOM is not unusual.   I've been working with a
> customer in this situation the past few weeks.  Getting fixes out, or
> mitigating the problem, is not always as quick as one hopes (see my
> previous comment about how the repair_session_size setting gets easily
> clobbered today).  This situation would be much improved with table
> priority and tracking is added to the system_distributed table(s).
>
>
>
>> If an admin sets some no

Re: [Discuss] Repair inside C*

2024-10-29 Thread Jaydeep Chovatia
>Since the auto repair is running from within Cassandra, we might have more
control over this and implement a proper cleanup of such snapshots.
Rightly said, Alexander. Having internal knowledge of Cassandra, we can do
a lot more. For example, for better incremental repair reliability,
Andy T and Chris L have developed a new token-split algorithm on top of the
MVP based on unrepaired data in SSTables (soon it will be added to the MVP
as they are working on writing test cases, etc.), and that
requires internal SSTable data-structure access, etc.

Jaydeep

On Mon, Oct 28, 2024 at 10:51 PM Jaydeep Chovatia <
chovatia.jayd...@gmail.com> wrote:

>
>> > That's inaccurate, we can check the replica set for the subrange we're
> about to run and see if it overlaps with the replica set of other ranges
> which are being processed already.
> We can definitely check the replicas for the subrange we plan to run and
> see if they overlap with the ongoing one. I am saying that for a smaller
> cluster if we want to repair multiple token ranges in parallel, it is tough
> to guarantee that replica sets won't overlap.
>
> >Jira to auto-delete snapshots at X% disk full ?
> Sure, just created a new JIRA
> https://issues.apache.org/jira/browse/CASSANDRA-20035
>
> Jaydeep
>


Re: [Discuss] Repair inside C*

2024-10-28 Thread Jaydeep Chovatia
>
>
> > That's inaccurate, we can check the replica set for the subrange we're
about to run and see if it overlaps with the replica set of other ranges
which are being processed already.
We can definitely check the replicas for the subrange we plan to run and
see if they overlap with the ongoing one. I am saying that for a smaller
cluster if we want to repair multiple token ranges in parallel, it is tough
to guarantee that replica sets won't overlap.
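
For illustration, a tiny sketch of the overlap check being discussed: before
starting repair on a candidate subrange, verify that its replica set does not
intersect the replica sets of subranges already being repaired. Replica sets
are modeled here as plain sets of node names; this is an assumption-based
example, not MVP code.

// Illustrative replica-set overlap check for concurrent subrange repairs.
import java.util.Collection;
import java.util.Set;

public final class ReplicaOverlapCheck {
    static boolean canRunConcurrently(Set<String> candidateReplicas,
                                      Collection<Set<String>> inFlightReplicaSets) {
        return inFlightReplicaSets.stream()
                                  .noneMatch(inFlight -> inFlight.stream().anyMatch(candidateReplicas::contains));
    }

    public static void main(String[] args) {
        Set<String> candidate = Set.of("n1", "n2", "n3");
        // n3 is already busy in another range's repair -> must wait
        System.out.println(canRunConcurrently(candidate, Set.of(Set.of("n3", "n4", "n5"))));  // false
        System.out.println(canRunConcurrently(candidate, Set.of(Set.of("n4", "n5", "n6"))));  // true
    }
}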

>Jira to auto-delete snapshots at X% disk full ?
Sure, just created a new JIRA
https://issues.apache.org/jira/browse/CASSANDRA-20035

Jaydeep


[VOTE] CEP-37: Repair scheduling inside C*

2024-11-05 Thread Jaydeep Chovatia
Hi Everyone,

I would like to start the voting for CEP-37 as all the feedback in the
discussion thread seems to be addressed.

Proposal: CEP-37 Repair Scheduling Inside Cassandra

Discussion thread:
https://lists.apache.org/thread/nl8rmsyxxovryl3nnlt4mzrj9t0x66ln

As per the CEP process documentation, this vote will be open for 72 hours
(longer if needed).

Thanks,
Jaydeep


Re: [Discuss] Repair inside C*

2024-10-30 Thread Jaydeep Chovatia
Thanks for the kind words, Mick!

Jaydeep

On Wed, Oct 30, 2024 at 1:35 AM Mick Semb Wever  wrote:

> Thanks Jaydeep.  I've exhausted my lines on enquiry and am happy that
> thought has gone into them.
>
>
>
>> On top of the above list, here is my recommendation (this is just a pure
>> thought only and subject to change depending on how all the community
>> members see it):
>>
>>- Nov-Dec: We can definitely prioritize the table-level priority
>>feature, which would address many concerns - Jaydeep (I can take the lead
>>for a small discussion followed by implementation)
>>- Nov-Feb: For table-level tracking, we can divide it into two parts:
>>   - (Part-1) Nov-Dec: A video meeting discussion among a few of us
>>   and see how we want to design, etc. -  Jaydeep
>>   - (Part-2) Dec-Feb: Based on the above design, implement
>>   accordingly - *TODO*
>>
>>
>
> It's the discussion that's important to me, that everyone involved has
> taken onboard the input and has a PoV on it.
>
>


Re: [Discuss] Repair inside C*

2024-11-01 Thread Jaydeep Chovatia
FYI, I've updated the CEP-37 content to include all the improvements we
have discussed on this ML, the ongoing work, and the future work. I've also
tagged JIRAs for each ongoing/future work item and assigned them to
individuals with tentative timelines.
Thanks, Mick, for suggesting capturing the ML discussion in CEP-37!

Jaydeep


Re: [Discuss] Repair inside C*

2024-10-28 Thread Jaydeep Chovatia
posed design and have a few comments/questions.
> As one of the maintainers of Reaper, I'm looking this through the lens of
> how Reaper does things.
>
>
> *The approach taken in the CEP-37 design is "node-centric" vs a "range
> centric" approach (which is the one Reaper takes).*I'm worried that this
> will not allow spreading the repair load evenly across the cluster, since
> nodes are the concurrency unit. You could allow running repair on 3 nodes
> concurrently for example, but these 3 nodes could all involve the same
> replicas, making these replicas process 3 concurrent repairs while others
> could be left uninvolved in any repair at all.
> Taking a range centric approach (we're not repairing nodes, we're
> repairing the token ranges) allows to spread the load evenly without
> overlap in the replica sets.
> I'm more worried even with incremental repair here, because you might end
> up with some conflicts around sstables which would be in the pending repair
> pool but would be needed by a competing repair job.
> I don't know if in the latest versions such sstables would be totally
> ignored or if the competing repair job would fail.
>
> *Each repair command will repair all keyspaces (with the ability to fully
> exclude some tables) and **I haven't seen a notion of schedule which
> seems to suggest repairs are running continuously (unless I missed
> something?).*
> There are many cases where one might have differentiated gc_grace_seconds
> settings to optimize reclaiming tombstones when applicable. That requires
> having some fine control over the repair cycle for a given keyspace/set of
> tables.
> Here, nodes will be processed sequentially and each node will process the
> keyspaces sequentially, tying the repair cycle of all keyspaces together.
> If one of the ranges for a specific keyspace cannot be repaired within the
> 3 hours timeout, it could block all the other keyspaces repairs.
> Continuous repair might create a lot of overhead for full repairs which
> often don't require more than 1 run per week.
> It also will not allow running a mix of scheduled full/incremental repairs
> (I'm unsure if that is still a recommendation, but it was still recommended
> not so long ago)
>
> *The timeout base duration is large*
> I think the 3 hours timeout might be quite large and probably means a lot
> of data is being repaired for each split. That usually involves some level
> of overstreaming. I don't have numbers to support this, it's more about my
> own experience on sizing splits in production with Reaper to reduce the
> impact as much as possible on cluster performance.
> We use 30 minutes as default in Reaper with subsequent attempts growing
> the timeout dynamically for challenging splits.
>
> Finally thanks for picking this up, I'm eager to see Reaper not being
> needed anymore and having the database manage its own repairs!
>
>
> Le mar. 22 oct. 2024 à 21:10, Benedict  a écrit :
>
>> I realise it’s out of scope, but to counterbalance all of the
>> pro-decomposition messages I wanted to chime in with a strong -1. But we
>> can debate that in a suitable context later.
>>
>> On 22 Oct 2024, at 16:36, Jordan West  wrote:
>>
>> 
>> Agreed with the sentiment that decomposition is a good target but out of
>> scope here. I’m personally excited to see an in-tree repair scheduler and
>> am supportive of the approach shared here.
>>
>> Jordan
>>
>> On Tue, Oct 22, 2024 at 08:12 Dinesh Joshi  wrote:
>>
>>> Decomposing Cassandra may be architecturally desirable but that is not
>>> the goal of this CEP. This CEP brings value to operators today so it should
>>> be considered on that merit. We definitely need to have a separate
>>> conversation on Cassandra's architectural direction.
>>>
>>> On Tue, Oct 22, 2024 at 7:51 AM Joseph Lynch 
>>> wrote:
>>>
>>>> Definitely like this in C* itself. We only changed our proposal to
>>>> putting repair scheduling in the sidecar before because trunk was frozen
>>>> for the foreseeable future at that time. With trunk unfrozen and
>>>> development on the main process going at a fast pace I think it makes way
>>>> more sense to integrate natively as table properties as this CEP proposes.
>>>> Completely agree the scheduling overhead should be minimal.
>>>>
>>>> Moving the actual repair operation (comparing data and streaming
>>>> mismatches) along with compaction operations to a separate process long
>>>> term makes a lot of sense but imo only once we both 

Re: [VOTE] CEP-37: Repair scheduling inside C*

2024-11-12 Thread Jaydeep Chovatia
Voting passes with 16 +1s (5 nb) and no -1. Closing this thread now.
Thank you everyone!

Jaydeep

On Sun, Nov 10, 2024 at 7:11 AM J. D. Jordan 
wrote:

> +1 nb
>
> On Nov 9, 2024, at 9:57 PM, Vinay Chella  wrote:
>
> +1
>
>
>
> Thanks,
> Vinay Chella
>
>
> On Sat, Nov 9, 2024 at 1:31 PM Ekaterina Dimitrova 
> wrote:
>
>> +1
>>
>> On Sat, 9 Nov 2024 at 15:56, Sumanth Pasupuleti <
>> sumanth.pasupuleti...@gmail.com> wrote:
>>
>>> +1 (nb)
>>>
>>> On Sat, Nov 9, 2024 at 12:43 PM Joseph Lynch 
>>> wrote:
>>>
>>>> +1
>>>>
>>>> On Fri, Nov 8, 2024 at 10:42 PM Jordan West  wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Wed, Nov 6, 2024 at 11:15 Chris Lohfink 
>>>>> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Wed, Nov 6, 2024 at 11:10 AM Francisco Guerrero <
>>>>>> fran...@apache.org> wrote:
>>>>>>
>>>>>>> +1 (nb)
>>>>>>>
>>>>>>> On 2024/11/06 14:07:47 "Tolbert, Andy" wrote:
>>>>>>> > +1 (nb)
>>>>>>> >
>>>>>>> > On Tue, Nov 5, 2024 at 9:51 PM Josh McKenzie 
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > > +1
>>>>>>> > >
>>>>>>> > > On Tue, Nov 5, 2024, at 4:28 PM, Jaydeep Chovatia wrote:
>>>>>>> > >
>>>>>>> > > Hi Everyone,
>>>>>>> > >
>>>>>>> > > I would like to start the voting for CEP-37 as all the feedback
>>>>>>> in the
>>>>>>> > > discussion thread seems to be addressed.
>>>>>>> > >
>>>>>>> > > Proposal: CEP-37 Repair Scheduling Inside Cassandra
>>>>>>> > > <
>>>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+Apache+Cassandra+Unified+Repair+Solution
>>>>>>> >
>>>>>>> > > Discussion thread:
>>>>>>> > > https://lists.apache.org/thread/nl8rmsyxxovryl3nnlt4mzrj9t0x66ln
>>>>>>> > >
>>>>>>> > > As per the CEP process documentation, this vote will be open for
>>>>>>> 72 hours
>>>>>>> > > (longer if needed).
>>>>>>> > >
>>>>>>> > > Thanks,
>>>>>>> > > Jaydeep
>>>>>>> > >
>>>>>>> > >
>>>>>>> >
>>>>>>>
>>>>>>


Re: [DISCUSS] Tooling to repair MV through a Spark job

2024-12-06 Thread Jaydeep Chovatia
There are two approaches I have been thinking about for MV.

*1. Short Term (Status Quo)*
Here, we do not change the Cassandra MV architecture to drastically reduce
data inconsistencies; thus, we continue to mark MV as an experimental
feature.

In this case, we have two sub-options to make the data eventually
consistent:

   1. Use external tools (such as Spark)
   2. Do it internally within Cassandra

*2. Long Term (Fix MV Architecturally)*
@curly...@gmail.com  and I have been discussing a few
strategies to solve the fundamental issues with the current MV
architecture, such as reducing the possibility of inconsistency by an order
of magnitude. We are considering solving this within the DB itself.

tl;dr

For #1, using an external framework like Spark makes it much easier,
because we need to load all the data for the base table and the MV into
memory to do the {A}-{B} record comparison. It is not infeasible within
Cassandra, but it is challenging, and we might end up building a mini-Spark
type of application inside Cassandra.
So the current thinking for #1 is to rely on external frameworks to make
quick progress, so that something is available in emergencies, and to divert
the energy towards #2.
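
To make the {A}-{B} step concrete, here is a minimal sketch using Spark's
Java API. This is an illustration only, not the actual job code: the
keyspace, table names, and output path are hypothetical, and it assumes the
Spark Cassandra Connector is on the classpath.

// Illustration only: a minimal sketch of the {A}-{B} step; names are hypothetical.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MvDiffSketch
{
    public static void main(String[] args)
    {
        SparkSession spark = SparkSession.builder().appName("mv-diff-sketch").getOrCreate();

        // Load the base table (A) and the MV (B) into Spark.
        Dataset<Row> base = spark.read().format("org.apache.spark.sql.cassandra")
                                 .option("keyspace", "ks").option("table", "base_table").load();
        Dataset<Row> view = spark.read().format("org.apache.spark.sql.cassandra")
                                 .option("keyspace", "ks").option("table", "mv_table").load();

        // {A}-{B}: rows present in the base table but missing or different in the MV.
        // exceptAll requires both sides to have the same columns in the same order.
        Dataset<Row> missingOrStaleInMv = base.exceptAll(view.selectExpr(base.columns()));

        missingOrStaleInMv.write().json("/tmp/mv-diff-output");
        spark.stop();
    }
}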

Jaydeep

On Fri, Dec 6, 2024 at 10:28 AM Jeff Jirsa  wrote:

> It feels uncomfortable asking users to rely on a third party that’s as
> heavy-weight as Spark to use a built-in feature.
>
> Can we really not do this internally? I get that the obvious way with
> merkle trees is hard because the range fanout of the MV using a different
> partitioner, but have we tried to think up a way to do this (somewhat
> efficiently) within the db?
>
>
> On Dec 6, 2024, at 9:08 AM, James Berragan  wrote:
>
> I think this would be useful and - having never really used Materialized
> Views - I didn't know it was an issue for some users. I would say the
> Cassandra Analytics library (http://github.com/apache/cassandra-analytics/)
> could be utilized for much of this, with a specialized Spark job for this
> purpose.
>
> On Fri, 6 Dec 2024 at 08:26, Jaydeep Chovatia 
> wrote:
>
>> Hi,
>>
>> *NOTE:* This email does not promote using Cassandra's Materialized View
>> (MV) but assists those stuck with it for various reasons.
>>
>> The primary issue with MV is that once it goes out of sync with the base
>> table, no tooling is available to remediate it. This Spark job aims to fill
>> this gap by logically comparing the MV with the base table and identifying
>> inconsistencies. The job primarily does the following:
>>
>>    - Scans the base table (A) and the MV (B), and does an {A}-{B} analysis
>>    - Categorizes each record into one of four categories: a) Consistent,
>>    b) Inconsistent, c) MissingInMV, d) MissingInBaseTable
>>    - Provides a detailed view of mismatches, such as the primary key, all
>>    the non-primary key fields, and the mismatched columns
>>    - Dumps the detailed information to an output folder path provided to
>>    the job (one can extend the interface to dump the records to some object
>>    store as well)
>>    - Optionally, the job fixes the MV inconsistencies
>>    - Rich configuration (throttling, actionable output, capability to
>>    specify the time range for the records, etc.) to run the job at scale in
>>    a production environment
>>
>> Design doc: link
>> <https://docs.google.com/document/d/14mo_3TlKmaL3mC_Vs69k1n923CoJmVFvEFvuPAAHk4I/edit?usp=sharing>
>> The Git Repository: link
>> <https://github.com/jaydeepkumar1984/cassandra-mv-repair-spark-job>
>>
>> *Motivation*
>>
>>1. This email's primary objective is to share with the community that
>>something like this is available for MV (in a private repository), which
>>may be helpful in emergencies to folks stuck with MV in production.
>>2. If we, as a community, want to officially foster tooling using
>>Spark because it can be helpful to do many things beyond the MV work, such
>>as counting rows, etc., then I am happy to drive the efforts.
>>
>> Please let me know what you think.
>>
>> Jaydeep
>>
>
>


[DISCUSS] Tooling to repair MV through a Spark job

2024-12-06 Thread Jaydeep Chovatia
Hi,

*NOTE:* This email does not promote using Cassandra's Materialized View
(MV) but assists those stuck with it for various reasons.

The primary issue with MV is that once it goes out of sync with the base
table, no tooling is available to remediate it. This Spark job aims to fill
this gap by logically comparing the MV with the base table and identifying
inconsistencies. The job primarily does the following:

   - Scans the base table (A) and the MV (B), and does an {A}-{B} analysis
   - Categorizes each record into one of four categories (sketched below): a)
   Consistent, b) Inconsistent, c) MissingInMV, d) MissingInBaseTable
   - Provides a detailed view of mismatches, such as the primary key, all
   the non-primary key fields, and the mismatched columns
   - Dumps the detailed information to an output folder path provided to
   the job (one can extend the interface to dump the records to some object
   store as well)
   - Optionally, the job fixes the MV inconsistencies
   - Rich configuration (throttling, actionable output, capability to
   specify the time range for the records, etc.) to run the job at scale in
   a production environment
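
To make the categorization step above concrete, here is a rough sketch (not
the actual job code; the single key column "pk" and value column "v" are
hypothetical, and real tables have composite keys and many columns) of how
each record can be bucketed with a full outer join in Spark's Java API:

// Rough sketch only; not the actual job code. Assumes both datasets expose a
// key column "pk" and a value column "v". Real code must also distinguish a
// genuinely null value from a missing row (e.g., by checking the key columns).
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.when;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public final class MvCategorizer
{
    public static Dataset<Row> categorize(Dataset<Row> base, Dataset<Row> mv)
    {
        Dataset<Row> a = base.select(col("pk"), col("v").as("base_v"));
        Dataset<Row> b = mv.select(col("pk"), col("v").as("mv_v"));

        // Full outer join on the key, then bucket each record into one of the
        // four categories described above.
        Column status = when(col("base_v").isNull(), lit("MissingInBaseTable"))
                        .when(col("mv_v").isNull(), lit("MissingInMV"))
                        .when(col("base_v").equalTo(col("mv_v")), lit("Consistent"))
                        .otherwise(lit("Inconsistent"));

        return a.join(b, a.col("pk").equalTo(b.col("pk")), "full_outer")
                .withColumn("status", status);
    }
}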

Design doc:
https://docs.google.com/document/d/14mo_3TlKmaL3mC_Vs69k1n923CoJmVFvEFvuPAAHk4I/edit?usp=sharing
The Git Repository:
https://github.com/jaydeepkumar1984/cassandra-mv-repair-spark-job

*Motivation*

   1. This email's primary objective is to share with the community that
   something like this is available for MV (in a private repository), which
   may be helpful in emergencies to folks stuck with MV in production.
   2. If we, as a community, want to officially foster tooling using Spark
   because it can be helpful to do many things beyond the MV work, such as
   counting rows, etc., then I am happy to drive the efforts.

Please let me know what you think.

Jaydeep


Re: [DISCUSS] Tooling to repair MV through a Spark job

2024-12-06 Thread Jaydeep Chovatia
Thanks for the information, Yifan and James! Given that, we can scope this
email discussion only for this specific MV repair. Two points:

1. Can this MV repair job provide some value addition?
2. If yes, does it even make sense to merge this MV repair tooling, which
uses Spark as its underlying technology, with Cassandra Analytics?

Jaydeep

On Dec 6, 2024, at 3:58 PM, Yifan Cai  wrote:

> Oh, I just noticed that James already mentioned it.
>
> On Fri, Dec 6, 2024 at 3:51 PM Yifan Cai <yc25c...@gmail.com> wrote:
>
>> I would like to highlight an existing tooling for "many things beyond the
>> MV work, such as counting rows, etc." The Apache Cassandra Analytics
>> project (http://github.com/apache/cassandra-analytics/) could be a great
>> resource for this type of task. It reads directly from the SSTables in
>> the Spark executors, which avoids sending CQL queries that could stress
>> the cluster or interfere with the production traffic.
>>
>> - Yifan
>>
>> On Fri, Dec 6, 2024 at 8:27 AM Jaydeep Chovatia <chovatia.jayd...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> *NOTE:* This email does not promote using Cassandra's Materialized View
>>> (MV) but assists those stuck with it for various reasons.
>>>
>>> The primary issue with MV is that once it goes out of sync with the base
>>> table, no tooling is available to remediate it. This Spark job aims to
>>> fill this gap by logically comparing the MV with the base table and
>>> identifying inconsistencies. The job primarily does the following:
>>>
>>>    - Scans the base table (A) and the MV (B), and does an {A}-{B} analysis
>>>    - Categorizes each record into one of four categories: a) Consistent,
>>>    b) Inconsistent, c) MissingInMV, d) MissingInBaseTable
>>>    - Provides a detailed view of mismatches, such as the primary key,
>>>    all the non-primary key fields, and the mismatched columns
>>>    - Dumps the detailed information to an output folder path provided to
>>>    the job (one can extend the interface to dump the records to some
>>>    object store as well)
>>>    - Optionally, the job fixes the MV inconsistencies
>>>    - Rich configuration (throttling, actionable output, capability to
>>>    specify the time range for the records, etc.) to run the job at scale
>>>    in a production environment
>>>
>>> Design doc: link
>>> The Git Repository: link
>>>
>>> *Motivation*
>>>
>>>    1. This email's primary objective is to share with the community that
>>>    something like this is available for MV (in a private repository),
>>>    which may be helpful in emergencies to folks stuck with MV in
>>>    production.
>>>    2. If we, as a community, want to officially foster tooling using
>>>    Spark because it can be helpful to do many things beyond the MV work,
>>>    such as counting rows, etc., then I am happy to drive the efforts.
>>>
>>> Please let me know what you think.
>>>
>>> Jaydeep


[UPDATE] CEP-37

2025-03-15 Thread Jaydeep Chovatia
Hello Everyone,

I wanted to update you on CEP-37
<https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+Apache+Cassandra+Unified+Repair+Solution>
(Jira: CASSANDRA-19918
<https://issues.apache.org/jira/browse/CASSANDRA-19918>) work.
Over the last year, some of us (Andy Tolbert, Chris Lohfink, Francisco
Guerrero, and Kristijonas Zalys) have been working closely on making
CEP-37 rock solid, with support from Josh McKenzie, Dinesh Joshi, and David
Capwell.
First and foremost, a huge thank you to everyone, including the
broader Apache Cassandra community, for their invaluable contributions in
making CEP-37 robust and solid!

Here is the current status:

*Feature stability*

   - *Voted feature:* All the features mentioned in CEP-37 have worked as
   expected.
   - *Post-voted feature:* A few new minor improvements
   <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=272927365#CEP37ApacheCassandraUnifiedRepairSolution-Post-VoteUpdates>
   have been added post-voting, and they are also working as expected.
   - The functionality has been tested by multiple people over a period of time.
   - Some other facts: it has already been validated at scale
   <https://www.youtube.com/watch?v=xFicEj6Nhq8>. Another big Cassandra use
   case is in the process of validating/adopting it in their environment.

*Source Code*

   - It is an opt-in feature; nobody notices anything unless someone opts
   in.
   - By default, this feature is pretty isolated (in a separate package)
   from the source code point of view (94% of the source code lines are in the
   new files)
   - Thorough documentation has been added:
  - overview.doc
  - metrics.doc
  - cassandra.yaml doc
  - NEWS.txt overview
   - Five people (Andy Tolbert, Chris Lohfink, Francisco Guerrero,
   Kristijonas Zalys, and Jaydeep Chovatia) have contributed.
   - The source code has been reviewed multiple times by the same five
   people.

*Test Coverage*

   - Comprehensive test coverage has been added to cover all aspects.
   - The entire test suite has been passing.


We are in the final review phase and nearly ready to merge. If anyone has
any last-minute feedback, this is the final opportunity for review.

Thank you!
Andy Tolbert, Chris Lohfink, Francisco Guerrero, Kristijonas Zalys, and
Jaydeep


Re: Welcome Jaydeepkumar Chovatia as Cassandra committer

2025-05-01 Thread Jaydeep Chovatia
Thank you, everyone!

Jaydeep

On Thu, May 1, 2025 at 5:27 AM Marouan REJEB 
wrote:

> Congratulations !
>
> On Thu, May 1, 2025 at 1:22 PM Aaron Ploetz  wrote:
>
>> Congratulations Jaydeep!
>>
>>
>> > On Apr 30, 2025, at 8:18 AM, Paulo Motta  wrote:
>> >
>> > Congratulations Jaydeep!
>>
>


Re: [UPDATE] CEP-37

2025-04-23 Thread Jaydeep Chovatia
The CEP-37 work has been successfully merged into the trunk today! Please
let me know if you have any issues.

This merge is a massive win for Apache Cassandra — a significant step
forward. But we're not stopping here. There's more to come, and we are
committed to pushing repair automation even further and closing the gaps in
the remaining flows. A few examples:

   1. Automatically running repair as part of node replacement: Design
   <https://docs.google.com/document/d/1SZIQPbIWNDsbWnIk5N5tyQCQzJ4ypwuhH-t5dO5WeZs/edit?tab=t.0>
   & POC <https://github.com/jaydeepkumar1984/cassandra/pull/54> are already
   out [CASSANDRA-20281
   <https://issues.apache.org/jira/browse/CASSANDRA-20281>]
   2. Stopping repair automatically between Cassandra major version
   upgrades [CASSANDRA-20048
   <https://issues.apache.org/jira/browse/CASSANDRA-20048>]
   3. Repairing automatically when Keyspace replication changes [
   CASSANDRA-20582 <https://issues.apache.org/jira/browse/CASSANDRA-20582>]


Thanks for all the help and support from the Apache Cassandra community!

Yours sincerely,
Andy Tolbert, Chris Lohfink, Francisco Guerrero, Kristijonas Zalys, and
Jaydeep

On Sun, Mar 9, 2025 at 8:53 PM Jaydeep Chovatia 
wrote:

> Thanks a lot, Jon!
> This has truly been a team effort, with Andy Tolbert, Chris Lohfink,
> Francisco Guerrero, and Kristijonas Zalys all contributing over the past
> year. The credit belongs to everyone!
>
> Jaydeep
>
> On Sun, Mar 9, 2025 at 2:35 PM Jon Haddad  wrote:
>
>> This is all really exciting.  Getting a built in, orchestrated repair is
>> a massive achievement.  Thank you for your work on this, it's incredibly
>> valuable to the community!!
>>
>> Jon
>>
>> On Sun, Mar 9, 2025 at 2:25 PM Jaydeep Chovatia <
>> chovatia.jayd...@gmail.com> wrote:
>>
>>> No problem, Dave! Thank you.
>>>
>>> Jaydeep
>>>
>>> On Sun, Mar 9, 2025 at 10:46 AM Dave Herrington 
>>> wrote:
>>>
>>>> Jaydeep,
>>>>
>>>> Thank you for taking time to answer my questions and for the links to
>>>> the design and overview docs, which are excellent and answer all of my
>>>> remaining questions.  Sorry I missed those links in the CEP page.
>>>>
>>>> Great work and I will continue to follow your progress on this powerful
>>>> new feature.
>>>>
>>>> Thanks!
>>>> -Dave
>>>>
>>>> On Sat, Mar 8, 2025 at 9:36 AM Jaydeep Chovatia <
>>>> chovatia.jayd...@gmail.com> wrote:
>>>>
>>>>> Hi David,
>>>>>
>>>>> Thanks for the kind words!
>>>>>
>>>>> >Is there a goal in this CEP to make automated repair work during
>>>>> rolling upgrades, when multiple versions exist in the cluster?
>>>>> We debated a lot on this over ASF Slack
>>>>> (#cassandra-repair-scheduling-cep37). The summary is that, ideally, we 
>>>>> want
>>>>> to have a repair function during the mixed version, but the reality is 
>>>>> that
>>>>> currently, there is no test suite available inside Apache Cassandra to
>>>>> verify the streaming behavior during the mixed version, so the confidence
>>>>> is low.
>>>>> We agreed on the following: 1) Keeping safety in mind, we should by
>>>>> default disable the repair during mixed version 2) Add a comprehensive 
>>>>> test
>>>>> suite 3) Allow repair during mixed version. Currently, we are at #1
>>>>>
>>>>> >Would automated repair be smart enough to automatically stop, if it
>>>>> sees incompatible versions?
>>>>> That's the plan, and we already have PR (CASSANDRA-20048
>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-20048>) out from
>>>>> Chris Lohfink. The thing we are debating is whether to stop only during
>>>>> major version mismatch or also during the minor version, and we are 
>>>>> leaning
>>>>> towards only disabling for the major version mismatch. Regardless, this
>>>>> should be available soon.
>>>>> We are also extending this further as per feedback from David
>>>>> Capwell that we should automatically stop repair if we detect a new DC or
>>>>> keyspace RF is changed. That will be covered later as part of
>>>>> CASSANDRA-20414
>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-20414>
>>>>>
&g

Re: [Committer/reviewer needed] Request for Review of Cassandra PRs

2025-04-21 Thread Jaydeep Chovatia
One more ticket has been reviewed by one committer and needs another
committer to review.


   - Local Hints are stepping on local mutations [CASSANDRA-19958
   ]


Jaydeep

On Mon, Apr 21, 2025 at 9:39 AM Runtian Liu  wrote:

> Hi,
> A gentle reminder on the following two changes:
>
>    1. *Inbound stream throttler* [CASSANDRA-11303]
>    2. *Clear existing data when reset_bootstrap_progress is true*
>    [CASSANDRA-20097]
>
> I could use some help reviewing the inbound stream throttler patch. This
> change is particularly useful for deployments using vnodes, where node
> replacements often involve streaming from multiple sources.
> The second change has already received a +1. Could we please get another
> committer to review it?
>
> Thanks,
> Runtian
>
>
> On Thu, Jan 16, 2025 at 10:14 PM guo Maxwell  wrote:
>
>> I have added you as a reviewer of CASSANDRA-11303; you can do that for
>> CASSANDRA-20097 by yourself.
>>
>> guo Maxwell  于2025年1月17日周五 14:10写道:
>>
>>> Hi Wang,
>>> I think Runtian should change the ticket's status, submit a PR, and then
>>> you can add yourself as a reviewer.
>>>
>>>
>>> Cheng Wang via dev  于2025年1月17日周五 14:05写道:
>>>
 I can review CASSANDRA-19248 first since it looks straightforward.
 For CASSANDRA-11303 and CASSANDRA-20097 I will try, but I
 can't promise anything yet. How do I add myself as a reviewer?

 On Thu, Jan 16, 2025 at 7:33 PM Paulo Motta  wrote:

> Thanks for this message Runtian.
>
> I realized I was previously assigned as reviewer for CASSANDRA-11303
> but I will not be able to review it at this time.
>
> There's also CASSANDRA-19248 that is waiting for review in case
> someone has cycles to review it.
>
> Reviewing community patches is a great way to learn more about the
> codebase and get involved with the project, even if you are not a
> committer. Any contributor can help by doing a first-pass review to triage
> the issue, run tests and check suggested items of the review checklist 
> [1].
>
> Paulo
>
> [1] - https://cassandra.apache.org/_/development/how_to_review.html
>
> On Thu, Jan 16, 2025 at 7:34 PM Runtian Liu 
> wrote:
>
>> Hi all,
>>
>> I hope you are doing well. I would like to request your review of two
>> pull requests I've submitted earlier:
>>
>>    1. *Inbound stream throttler* [CASSANDRA-11303]
>>    2. *Clear existing data when reset_bootstrap_progress is true*
>>    [CASSANDRA-20097]
>>
>> I'd really appreciate any thoughts you have on these changes. If you
>> see anything that needs tweaking, just let me know, and I'll jump right 
>> on
>> it!
>>
>> Thanks,
>>
>> Runtian
>>
>


Re: [UPDATE] CEP-37

2025-03-08 Thread Jaydeep Chovatia
ed to limit their size?
>
> Thanks,
> -Dave
>
> David A. Herrington II
> President and Chief Engineer
> RhinoSource, Inc.
>
> *Data Lake Architecture, Cloud Computing and Advanced Analytics.*
>
> www.rhinosource.com
>
>
> On Fri, Mar 7, 2025 at 11:48 AM Jaydeep Chovatia <
> chovatia.jayd...@gmail.com> wrote:
>
>> Hello Everyone,
>>
>> I wanted to update you on CEP-37
>> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+Apache+Cassandra+Unified+Repair+Solution>
>>  (Jira:
>> CASSANDRA-19918 <https://issues.apache.org/jira/browse/CASSANDRA-19918>)
>> work.
>> Over the last year, some of us (Andy Tolbert, Chris Lohfink, Francisco
>> Guerrero, and Kristijonas Zalys) have been working closely on making
>> CEP-37 rock solid, with support from Josh McKenzie, Dinesh Joshi, and David
>> Capwell.
>> First and foremost, a huge thank you to everyone, including the
>> broader Apache Cassandra community, for their invaluable contributions in
>> making CEP-37 robust and solid!
>>
>> Here is the current status:
>>
>> *Feature stability*
>>
>>- *Voted feature:* All the features mentioned in CEP-37 have worked
>>as expected.
>>- *Post-voted feature:* A few new minor improvements
>>
>> <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=272927365#CEP37ApacheCassandraUnifiedRepairSolution-Post-VoteUpdates>
>>have been added to post-voting, and they are also working as expected.
>>- Tested the functionality by multiple people over the period of time.
>>- Some other facts: it has already been validated at scale
>><https://www.youtube.com/watch?v=xFicEj6Nhq8>. Another big Cassandra
>>use case is in the process of validating/adopting it in their environment.
>>
>> *Source Code*
>>
>>- It is an opt-in feature; nobody notices anything unless someone
>>opts in.
>>- By default, this feature is pretty isolated (in a separate package)
>>from the source code point of view (94% of the source code lines are in 
>> the
>>new files)
>>- A thorough documentation has been added:
>>   - overview.doc
>>   - metrics.doc
>>   - cassandra.yaml doc
>>   - NEWS.txt overview
>>- Five people (Andy Tolbert, Chris Lohfink, Francisco Guerrero, and
>>Kristijonas Zalys) have contributed.
>>- The source code has been reviewed multiple times by the same five
>>people.
>>
>> *Test Coverage*
>>
>>- A comprehensive test coverage has been added to cover all aspects.
>>- The entire test suite has been passing
>>
>>
>> We are in the final review phase and nearly ready to merge. If anyone has
>> any last-minute feedback, this is the final opportunity for review.
>>
>> Thank you!
>> Andy Tolbert, Chris Lohfink, Francisco Guerrero, Kristijonas Zalys, and
>> Jaydeep
>>
>


Re: [UPDATE] CEP-37

2025-03-09 Thread Jaydeep Chovatia
No problem, Dave! Thank you.

Jaydeep

On Sun, Mar 9, 2025 at 10:46 AM Dave Herrington 
wrote:

> Jaydeep,
>
> Thank you for taking time to answer my questions and for the links to the
> design and overview docs, which are excellent and answer all of my
> remaining questions.  Sorry I missed those links in the CEP page.
>
> Great work and I will continue to follow your progress on this powerful
> new feature.
>
> Thanks!
> -Dave
>
> On Sat, Mar 8, 2025 at 9:36 AM Jaydeep Chovatia <
> chovatia.jayd...@gmail.com> wrote:
>
>> Hi David,
>>
>> Thanks for the kind words!
>>
>> >Is there a goal in this CEP to make automated repair work during rolling
>> upgrades, when multiple versions exist in the cluster?
>> We debated a lot on this over ASF Slack
>> (#cassandra-repair-scheduling-cep37). The summary is that, ideally, we want
>> to have a repair function during the mixed version, but the reality is that
>> currently, there is no test suite available inside Apache Cassandra to
>> verify the streaming behavior during the mixed version, so the confidence
>> is low.
>> We agreed on the following: 1) Keeping safety in mind, we should by
>> default disable the repair during mixed version 2) Add a comprehensive test
>> suite 3) Allow repair during mixed version. Currently, we are at #1
>>
>> >Would automated repair be smart enough to automatically stop, if it sees
>> incompatible versions?
>> That's the plan, and we already have PR (CASSANDRA-20048
>> <https://issues.apache.org/jira/browse/CASSANDRA-20048>) out from Chris
>> Lohfink. The thing we are debating is whether to stop only during major
>> version mismatch or also during the minor version, and we are leaning
>> towards only disabling for the major version mismatch. Regardless, this
>> should be available soon.
>> We are also extending this further as per feedback from David
>> Capwell that we should automatically stop repair if we detect a new DC or
>> keyspace RF is changed. That will be covered later as part of
>> CASSANDRA-20414 <https://issues.apache.org/jira/browse/CASSANDRA-20414>
>>
>> >If automated repair must be disabled for the entire cluster, will this
>> be a single nodetool command, or must automated repair be disabled on each
>> node individually?
>> Yes, it is a nodetool command and does not require any restarts! All the
>> *nodetool* command details are currently covered in the design doc
>> <https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit?tab=t.0#heading=h.89fmsespiosd>,
>> and the same details will also be available in the Cassandra
>> overview.adoc
>> <https://github.com/apache/cassandra/pull/3598/files?short_path=e901018#diff-e90101885c1188844bb4188d1301277bfdc4a9e1e705c4ab8a6cc5a4b44460c0>
>> .
>>
>> >Would it make sense for automated repair to upgrade sstables, if it
>> finds old formats? (Maybe this could be a feature that could be optionally
>> enabled?)
>> My opinion is that it should not be part of the repair. It is best suited
>> as part of the Cassandra upgrade framework; I guess Paulo M is looking at
>> it.
>>
>> >W.R.T. the repair logging tables in the system_distributed keyspace,
>> will these tables have a configurable TTL, or must they be periodically
>> truncated to limit their size?
>> The number of entries will equal the number of Cassandra nodes in a
>> cluster. There is no TTL because each row represents the repair status of
>> that particular node. The entries would be automatically added/removed as
>> nodes are added/removed from the Cassandra cluster.
>>
>> Jaydeep
>>
>> On Sat, Mar 8, 2025 at 7:46 AM Dave Herrington 
>> wrote:
>>
>>> Jaydeep,
>>>
>>> Thank you for your excellent efforts on this mission-critical feature.
>>> The stated goals of CEP-37 are noble and stand to make valuable
>>> improvements for cluster operations.  I look forward to testing these new
>>> capabilities.
>>>
>>> My apologies up-front if you’ve already answered these questions.  I did
>>> read the CEP a number of times and the linked JIRAs, but these are my
>>> questions that I couldn’t answer myself.
>>>
>>> I’m interested to understand the goals of CEP-37 W.R.T. to rolling
>>> upgrades of large clusters, as I am responsible for maintaining the cluster
>>> operations runbooks for a number of customers.
>>>
>>> Operators have to navigate the upgrade gauntlet with automated repairs
>>> disabled and get al

Re: [UPDATE] CEP-37

2025-03-09 Thread Jaydeep Chovatia
Thanks a lot, Jon!
This has truly been a team effort, with Andy Tolbert, Chris Lohfink,
Francisco Guerrero, and Kristijonas Zalys all contributing over the past
year. The credit belongs to everyone!

Jaydeep

On Sun, Mar 9, 2025 at 2:35 PM Jon Haddad  wrote:

> This is all really exciting.  Getting a built in, orchestrated repair is a
> massive achievement.  Thank you for your work on this, it's incredibly
> valuable to the community!!
>
> Jon
>
> On Sun, Mar 9, 2025 at 2:25 PM Jaydeep Chovatia <
> chovatia.jayd...@gmail.com> wrote:
>
>> No problem, Dave! Thank you.
>>
>> Jaydeep
>>
>> On Sun, Mar 9, 2025 at 10:46 AM Dave Herrington 
>> wrote:
>>
>>> Jaydeep,
>>>
>>> Thank you for taking time to answer my questions and for the links to
>>> the design and overview docs, which are excellent and answer all of my
>>> remaining questions.  Sorry I missed those links in the CEP page.
>>>
>>> Great work and I will continue to follow your progress on this powerful
>>> new feature.
>>>
>>> Thanks!
>>> -Dave
>>>
>>> On Sat, Mar 8, 2025 at 9:36 AM Jaydeep Chovatia <
>>> chovatia.jayd...@gmail.com> wrote:
>>>
>>>> Hi David,
>>>>
>>>> Thanks for the kind words!
>>>>
>>>> >Is there a goal in this CEP to make automated repair work during
>>>> rolling upgrades, when multiple versions exist in the cluster?
>>>> We debated a lot on this over ASF Slack
>>>> (#cassandra-repair-scheduling-cep37). The summary is that, ideally, we want
>>>> to have a repair function during the mixed version, but the reality is that
>>>> currently, there is no test suite available inside Apache Cassandra to
>>>> verify the streaming behavior during the mixed version, so the confidence
>>>> is low.
>>>> We agreed on the following: 1) Keeping safety in mind, we should by
>>>> default disable the repair during mixed version 2) Add a comprehensive test
>>>> suite 3) Allow repair during mixed version. Currently, we are at #1
>>>>
>>>> >Would automated repair be smart enough to automatically stop, if it
>>>> sees incompatible versions?
>>>> That's the plan, and we already have PR (CASSANDRA-20048
>>>> <https://issues.apache.org/jira/browse/CASSANDRA-20048>) out from
>>>> Chris Lohfink. The thing we are debating is whether to stop only during
>>>> major version mismatch or also during the minor version, and we are leaning
>>>> towards only disabling for the major version mismatch. Regardless, this
>>>> should be available soon.
>>>> We are also extending this further as per feedback from David
>>>> Capwell that we should automatically stop repair if we detect a new DC or
>>>> keyspace RF is changed. That will be covered later as part of
>>>> CASSANDRA-20414 <https://issues.apache.org/jira/browse/CASSANDRA-20414>
>>>>
>>>> >If automated repair must be disabled for the entire cluster, will this
>>>> be a single nodetool command, or must automated repair be disabled on each
>>>> node individually?
>>>> Yes, it is a nodetool command and does not require any restarts! All
>>>> the *nodetool* command details are currently covered in the design doc
>>>> <https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit?tab=t.0#heading=h.89fmsespiosd>,
>>>> and the same details will also be available in the Cassandra
>>>> overview.adoc
>>>> <https://github.com/apache/cassandra/pull/3598/files?short_path=e901018#diff-e90101885c1188844bb4188d1301277bfdc4a9e1e705c4ab8a6cc5a4b44460c0>
>>>> .
>>>>
>>>> >Would it make sense for automated repair to upgrade sstables, if it
>>>> finds old formats? (Maybe this could be a feature that could be optionally
>>>> enabled?)
>>>> My opinion is that it should not be part of the repair. It is best
>>>> suited as part of the Cassandra upgrade framework; I guess Paulo M is
>>>> looking at it.
>>>>
>>>> >W.R.T. the repair logging tables in the system_distributed keyspace,
>>>> will these tables have a configurable TTL, or must they be periodically
>>>> truncated to limit their size?
>>>> The number of entries will equal the number of Cassandra nodes in a
>>>> cluster. There is no TTL because each row represents the repair status of
&

Re: Welcome Abe Ratnofsky as Cassandra committer!

2025-05-12 Thread Jaydeep Chovatia
Congrats Abe!

On Mon, May 12, 2025 at 11:03 AM Doug Rohrer  wrote:

> Congrats Abe!
>
> On May 12, 2025, at 12:45 PM, Alex Petrov  wrote:
>
> Hello folks of the dev list,
>
> The Apache Cassandra PMC is very glad to announce that Abe Ratnofsky has
> accepted our invitation to become a committer!
>
> Abe has been actively contributing to Cassandra itself, made outstanding
> contributions to the Cassandra drivers, played a key role in the recently
> accepted CEP-45 [1], and has been active in the community — including on
> this mailing list and on Cassandra conferences and meetups.
>
> Please join us in congratulating and welcoming Abe!
>
> Alex Petrov
> on behalf of the Apache Cassandra PMC
> [1]
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45:+Mutation+Tracking
>
>
>
>


Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-12 Thread Jaydeep Chovatia
>Like something doesn't add up here because if it always includes the base
table's primary key columns that means

The requirement for materialized views (MVs) to include the base table's
primary key appears to be primarily a syntactic constraint specific to
Apache Cassandra. For instance, in DynamoDB, the DDL for defining a Global
Secondary Index does not mandate inclusion of the base table's primary key.
This suggests that the syntax requirement in Cassandra could potentially be
relaxed in the future (outside the scope of this CEP). As Benedict noted,
the base table's primary key is optional when querying a materialized view.
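
As a concrete illustration of that syntactic constraint, here is a minimal
sketch using the DataStax Java driver. The keyspace, table, and column names
are made up, and it assumes the keyspace already exists and materialized
views are enabled on the cluster: the view's PRIMARY KEY must repeat every
primary key column of the base table and may add at most one other column.

// Sketch only: illustrates the current MV primary-key constraint in CQL.
// Names are hypothetical; assumes keyspace "ks" exists and MVs are enabled.
import com.datastax.oss.driver.api.core.CqlSession;

public class MvPrimaryKeyExample
{
    public static void main(String[] args)
    {
        try (CqlSession session = CqlSession.builder().build())
        {
            session.execute(
                "CREATE TABLE IF NOT EXISTS ks.users (" +
                "  user_id uuid, login text, email text," +
                "  PRIMARY KEY (user_id, login))");

            // The view's PRIMARY KEY includes every base primary key column
            // (user_id, login) plus at most one non-primary-key column (email).
            session.execute(
                "CREATE MATERIALIZED VIEW IF NOT EXISTS ks.users_by_email AS" +
                "  SELECT * FROM ks.users" +
                "  WHERE email IS NOT NULL AND user_id IS NOT NULL AND login IS NOT NULL" +
                "  PRIMARY KEY (email, user_id, login)");
        }
    }
}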

Jaydeep

On Mon, May 12, 2025 at 10:45 AM Jon Haddad  wrote:

>
> > Or compaction hasn’t made a mistake, or cell merge reconciliation
> hasn’t made a mistake, or volume bitrot hasn’t caused you to lose a file.
> > Repair isn’t just about “have all transaction commits landed”. It’s “is
> the data correct N days after it’s written”.
>
> Don't forget about restoring from a backup.
>
> Is there a way we could do some sort of hybrid compaction + incremental
> repair?  Maybe have the MV verify it's view while it's compacting, and when
> it's done, mark the view's SSTable as repaired?  Then the repair process
> would only need to do a MV to MV repair.
>
> Jon
>
>
> On Mon, May 12, 2025 at 9:37 AM Benedict Elliott Smith <
> bened...@apache.org> wrote:
>
>> Like something doesn't add up here because if it always includes the base
>> table's primary key columns that means they could be storage attached by
>> just forbidding additional columns and there doesn't seem to be much
>> utility in including additional columns in the primary key?
>>
>>
>> You can re-order the keys, and they only need to be a part of the primary
>> key not the partition key. I think you can specify an arbitrary order to
>> the keys also, so you can change the effective sort order. So, the basic
>> idea is you stipulate something like PRIMARY KEY ((v1),(ck1,pk1)).
>>
>> This is basically a global index, with the restriction on single columns
>> as keys only because we cannot cheaply read-before-write for eventually
>> consistent operations. This restriction can easily be relaxed for Paxos and
>> Accord based implementations, which can also safely include additional keys.
>>
>> That said, I am not at all sure why they are called materialised views if
>> we don’t support including any other data besides the lookup column and the
>> primary key. We should really rename them once they work, both to make some
>> sense and to break with the historical baggage.
>>
>> I think this can be represented as a tombstone which can always be
>> fetched from the base table on read or maybe some other arrangement? I
>> agree it can't feasibly be represented as an enumeration of the deletions
>> at least not synchronously and doing it async has its own problems.
>>
>>
>> If the base table must be read on read of an index/view, then I think
>> this proposal is approximately linearizable for the view as well (though, I
>> do not at all warrant this statement). You still need to propagate this
>> eventually so that the views can cleanup. This also makes reads 2RT on
>> read, which is rather costly.
>>
>> On 12 May 2025, at 16:10, Ariel Weisberg  wrote:
>>
>> Hi,
>>
>> I think it's worth taking a step back and looking at the current MV
>> restrictions which are pretty onerous.
>>
>> A view must have a primary key and that primary key must conform to the
>> following restrictions:
>>
>>- it must contain all the primary key columns of the base table. This
>>ensures that every row of the view correspond to exactly one row of the
>>base table.
>>- it can only contain a single column that is not a primary key
>>column in the base table.
>>
>> At that point what exactly is the value in including anything except the
>> original primary key in the MV's primary key columns unless you are using
>> an ordered partitioner so you can iterate based on the leading primary key
>> columns?
>>
>> Like something doesn't add up here because if it always includes the base
>> table's primary key columns that means they could be storage attached by
>> just forbidding additional columns and there doesn't seem to be much
>> utility in including additional columns in the primary key?
>>
>> I'm not that clear on how much better it is to look something up in the
>> MV vs just looking at the base table or some non-materialized view of it.
>> How exactly are these MVs supposed to be used and what value do they
>> provide?
>>
>> Jeff Jirsa wrote:
>>
>> There’s 2 things in this proposal that give me a lot of pause.
>>
>>
>> Runtian Liu pointed out that the CEP is sort of divided into two parts.
>> The first is the online part which is making reads/writes to MVs safer and
>> more reliable using a transaction system. The second is offline which is
>> repair.
>>
>> The story for the online portion I think is quite strong and worth
>> considering on its own

Re: CASSANDRA-20490 Question about broken forced ephemeral snapshots during repair

2025-05-13 Thread Jaydeep Chovatia
>So if we moved to determanisticId for both branches of if / else and
remove force, we should be good? RepairJobDesc#deterministicId is basically
unique?
The challenge with using RepairJobDesc#deterministicId in the SNAPSHOT_MSG is
that it will create N snapshots for the N token-splits of a table. However,
CLEANUP_MSG is only invoked once per table, complicating the cleanup of
the snapshots.
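
To illustrate the mismatch (purely a sketch, not the actual RepairJobDesc
code): a snapshot name derived from the per-split repair description is
unique per (keyspace, table, ranges) tuple, so a table repaired in N
token-splits ends up with N distinct snapshot names, while the single
table-level CLEANUP_MSG only knows the parent repair session id.

// Sketch only; not the actual Cassandra implementation. It just shows why a
// per-split deterministic id yields N snapshot names for one table, while a
// single table-level cleanup message only carries the parent session id.
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.UUID;

public final class SnapshotNamingSketch
{
    static UUID perSplitSnapshotId(UUID parent, UUID session, String keyspace,
                                   String table, List<String> ranges)
    {
        String key = parent + "/" + session + "/" + keyspace + "/" + table + "/" + ranges;
        return UUID.nameUUIDFromBytes(key.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args)
    {
        UUID parent = UUID.randomUUID();
        UUID session = UUID.randomUUID();
        // Two token-splits of the same table produce two different snapshot
        // names, but a cleanup keyed only on "parent" cannot enumerate them.
        System.out.println(perSplitSnapshotId(parent, session, "ks", "tbl", List.of("(0,100]")));
        System.out.println(perSplitSnapshotId(parent, session, "ks", "tbl", List.of("(100,200]")));
    }
}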

Jaydeep



On Tue, May 13, 2025 at 8:56 AM Štefan Miklošovič 
wrote:

> That true / false for "force" argument seems to be purely derived from
> whether it is global repair (force set to false) or not (set to true).
>
>
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/repair/RepairMessageVerbHandler.java#L185-L192
>
> On Tue, May 13, 2025 at 5:25 PM David Capwell  wrote:
>
>> That method can be backported even if it’s just for this single case. For
>> the vtable I needed something more friendly so made this for that. It’s
>> also used for dedup logic in repair retry works.
>>
>> I didn’t look closely, what sets force to be true or false? I assumed
>> this was the force flag in repair (as these are repair snapshots) but
>> sounds like that isn’t the case?
>>
>> Sent from my iPhone
>>
>> On May 12, 2025, at 11:58 PM, Štefan Miklošovič 
>> wrote:
>>
>> 
>> David,
>>
>> that "force" you copied from nodetool, that has nothing to do with
>> "force" in these snapshots but I think you already know that.
>>
>> org.apache.cassandra.repair.RepairJobDesc#determanisticId is only in 4.1+
>> included, not in 4.0. That's quite a bummer but we could limit the patch
>> for 4.1+ only and leave 4.0 be.
>>
>> When we do that, then "force" will become redundant and we can perhaps
>> remove it?
>>
>> How I look at it is that the introduction of "force" flag in
>> TableRepairManager is the result of a developer trying to consolidate the
>> name of snapshots from parent id / session id to both parent id / parent id
>> but there was the realisation of that being problematic because else branch
>> might be invoked multiple times so it would start to clash. So they said
>> "ok and here we just override and take no matter what" but the
>> implementation of that was not finished as it clearly fails.
>>
>> So if we moved to determanisticId for both branches of if / else and
>> remove force, we should be good? RepairJobDesc#deterministicId is basically
>> unique? I see it is created from parent id, session id, keyspace, table and
>> ranges. I do not think there would ever be two cases of snapshots like that
>> with exact same values.
>>
>> WDYT?
>>
>> Regards
>>
>> On Mon, May 12, 2025 at 6:55 PM David Capwell  wrote:
>>
>>> > "force" can be true / false. When true, then it will not check if a
>>> snapshot exists
>>>
>>>
>>> My mental node for “force” was only to deal with down nodes, and nothing
>>> to do with snapshots… so this feels like a weird behavior
>>>
>>> @Option(title = "force", name = {"-force", "--force"}, description =
>>> "Use -force to filter out down endpoints”)
>>>
>>> Our documentation even only calls out dealing with downed enodes…
>>>
>>>
>>> >   a) we will remove existing snapshot and take a new one
>>>
>>> So this is racy and can make repairs brittle.  The snapshot is at the
>>> RepairSession (global) or RepairJob level (a subset of ranges for a single
>>> table, different RepairJobs can work on the same table, just different
>>> ranges). With this code path you also have 1 more variable at play that
>>> makes things complex: isGlobal
>>>
>>> public boolean isGlobal() {return dataCenters.isEmpty() &&
>>> hosts.isEmpty();}
>>>
>>> Global does the full range, where as non-global covers the partial range
>>>
>>> So when you do host or dc this snapshot is isolated to the RepairJob and
>>> not the session logically, but physically its isolated to the session;
>>> which doesn’t make any sense.
>>>
>>> > 2) if we do not want to fix "force", then it is most probably just
>>> redundant and we would remove it but it order to not fail, we would need to
>>> go back to naming snapshots not parent id / parent id but parent id /
>>> session id for global and non-global repair respectively (basically we
>>> would return to pre-14116 behavior).
>>>
>>> org.apache.cassandra.repair.RepairJobDesc#determanisticId
>>>
>>> This is scoped to a single RepairJob, so wouldn’t have the issues with
>>> concurrency.  So if we were going to alter what the snapshot name is, I
>>> would strongly prefer this (its also present in the repair vtable, so not
>>> some hidden UUID)
>>>
>>> I personally feel the lowest risk is to switch to the job name and away
>>> from the session name we had… I do wonder about removing this “force”
>>> semantic as its not documented to users as far as I can tell, so is there
>>> any place that defines this behavior?
>>>
>>> > we might just fix the trunk and keep the rest broken, not ideal but
>>>
>>> With some patches its hard to tell if something is a bug or a feature…
>>> so altering the semantic is