Re: [Discuss] Repair inside C*

2024-11-01 Thread Jaydeep Chovatia
FYI, I've updated the CEP-37 content to include all the improvements we have discussed over this ML, ongoing work, and future work. I've also tagged JIRAs for each ongoing/future work item and assigned them to some individuals with a tentative timeline. Thanks, Mick, for suggesting capturing the ML discussio

Re: [Discuss] Repair inside C*

2024-10-30 Thread Jaydeep Chovatia
Thanks for the kind words, Mick! Jaydeep On Wed, Oct 30, 2024 at 1:35 AM Mick Semb Wever wrote: > Thanks Jaydeep. I've exhausted my lines of enquiry and am happy that > thought has gone into them. > > > >> On top of the above list, here is my recommendation (this is just a pure >> thought only

Re: [Discuss] Repair inside C*

2024-10-30 Thread Mick Semb Wever
Thanks Jaydeep. I've exhausted my lines of enquiry and am happy that thought has gone into them. > On top of the above list, here is my recommendation (this is just a pure > thought only and subject to change depending on how all the community > members see it): > >- Nov-Dec: We can definit

Re: [Discuss] Repair inside C*

2024-10-29 Thread Jaydeep Chovatia
>Repairs causing a node to OOM is not unusual. I've been working with a customer in this situation the past few weeks. Getting fixes out, or mitigating the problem, is not always as quick as one hopes (see my previous comment about how the repair_session_size setting gets easily clobbered today)

Re: [Discuss] Repair inside C*

2024-10-29 Thread Mick Semb Wever
Jaydeep, your replies address my main concerns; there are a few questions of curiosity as replies inline below… > >Without any per-table scheduling and history (IIUC) a node would have > to restart the repairs for all keyspaces and tables. > > The above-mentioned quote should work fine and wi

Re: [Discuss] Repair inside C*

2024-10-29 Thread Jaydeep Chovatia
>Since the auto repair is running from within Cassandra, we might have more control over this and implement a proper cleanup of such snapshots. Rightly said, Alexander. Having internal knowledge of Cassandra, we can do a lot more. For example, for better Incremental Repair reliability, Andy T

Re: [Discuss] Repair inside C*

2024-10-28 Thread Jaydeep Chovatia
> > > > That's inaccurate, we can check the replica set for the subrange we're about to run and see if it overlaps with the replica set of other ranges which are being processed already. We can definitely check the replicas for the subrange we plan to run and see if they overlap with the ongoing on
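
The overlap check described in this message can be sketched roughly as follows. This is a minimal illustration, not the CEP-37 or Reaper implementation: the function name `overlaps_running_repairs` and the node names are hypothetical, and replica sets are assumed to be already resolved for each subrange.

```python
from typing import FrozenSet, Iterable


def overlaps_running_repairs(
    candidate_replicas: FrozenSet[str],
    running_replica_sets: Iterable[FrozenSet[str]],
) -> bool:
    """Return True if any replica of the candidate subrange is already
    taking part in a repair that is currently running."""
    return any(candidate_replicas & running for running in running_replica_sets)


# Hypothetical node names; a real scheduler would resolve these from the ring.
running = [frozenset({"n1", "n2", "n3"})]
busy = overlaps_running_repairs(frozenset({"n3", "n4", "n5"}), running)
free = overlaps_running_repairs(frozenset({"n4", "n5", "n6"}), running)
print(busy, free)  # True False
```

A scheduler using a check like this would defer the first subrange (n3 is already busy) and could start the second one right away.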

Re: [Discuss] Repair inside C*

2024-10-28 Thread Jeff Jirsa
> On Oct 28, 2024, at 9:52 PM, Alexander Dejanovski > wrote: > >  > >> If a repair session finishes gracefully, then this timeout is not >> applicable. Anyway, I do not have any strong opinion on the value. I am open >> to lowering it to 1h or something. > True, it will only delay killing

Re: [Discuss] Repair inside C*

2024-10-28 Thread Alexander Dejanovski
> > The scheduler repairs, by default, the primary ranges for all the nodes > going through the repair. Since it uses the primary ranges, all the nodes > repairing in parallel would not overlap in any form for the primary ranges. > However, the replica set for the nodes going through repair may or m
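
To make the quoted point concrete, here is a small sketch of how two nodes with disjoint primary ranges can still share replicas. It assumes a deliberately simplified ring (one token per node, no vnodes, no rack awareness, replicas placed on the next nodes clockwise); the helper `replicas_for_primary_range` and the node names are hypothetical.

```python
from typing import List, Set


def replicas_for_primary_range(ring: List[str], owner: str, rf: int) -> Set[str]:
    """Replicas of the owner's primary range under the simplified placement:
    the owner itself plus the next rf - 1 nodes clockwise on the ring."""
    start = ring.index(owner)
    return {ring[(start + i) % len(ring)] for i in range(rf)}


ring = ["n1", "n2", "n3", "n4", "n5", "n6"]  # hypothetical six-node ring
a = replicas_for_primary_range(ring, "n1", rf=3)
b = replicas_for_primary_range(ring, "n2", rf=3)
shared = sorted(a & b)
print(shared)  # ['n2', 'n3']
```

In this toy ring, n1 and n2 repair different primary ranges, yet n2 and n3 would each take part in both repairs at the same time.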

Re: [Discuss] Repair inside C*

2024-10-28 Thread Jaydeep Chovatia
Thanks, Mick, for the comment, please find my response below. >(1) I think I covered most of the points in my response to Alexander (except one, which I am responding to below separately). Tl;dr is the MVP that can be easily extended to do a table-level schedule; it is just going to be another CQ

Re: [Discuss] Repair inside C*

2024-10-28 Thread Jaydeep Chovatia
Thanks a lot, Alexander, for the review! Please find my response below: > making these replicas process 3 concurrent repairs while others could be left uninvolved in any repair at all...Taking a range centric approach (we're not repairing nodes, we're repairing the token ranges) allows to spread

Re: [Discuss] Repair inside C*

2024-10-28 Thread Mick Semb Wever
any name works for me, Jaydeep :-) I've taken a run through of the CEP, design doc, and current PR. Below are my four (rough categories of) questions. I am keen to see an MVP land, so I'm more looking at what the CEP's design might not be able to do, rather than what may or may not land in an init

Re: [Discuss] Repair inside C*

2024-10-28 Thread Alexander DEJANOVSKI
Hi Jaydeep, I've taken a look at the proposed design and have a few comments/questions. As one of the maintainers of Reaper, I'm looking at this through the lens of how Reaper does things. *The approach taken in the CEP-37 design is "node-centric" vs a "range centric" approach (which is the one Rea

Re: [Discuss] Repair inside C*

2024-10-22 Thread Benedict
I realise it’s out of scope, but to counterbalance all of the pro-decomposition messages I wanted to chime in with a strong -1. But we can debate that in a suitable context later. On 22 Oct 2024, at 16:36, Jordan West wrote: Agreed with the sentiment that decomposition is a good target but out of s

Re: [Discuss] Repair inside C*

2024-10-22 Thread Jordan West
Agreed with the sentiment that decomposition is a good target but out of scope here. I’m personally excited to see an in-tree repair scheduler and am supportive of the approach shared here. Jordan On Tue, Oct 22, 2024 at 08:12 Dinesh Joshi wrote: > Decomposing Cassandra may be architecturally d

Re: [Discuss] Repair inside C*

2024-10-22 Thread Dinesh Joshi
On Mon, Oct 21, 2024 at 9:18 AM David Capwell wrote: > One thing to keep in mind is that larger clusters require you “smartly” > split the ranges else you nuke your cluster… knowing how to split requires > internal knowledge from the database which we could expose, but then we > need to expose a

Re: [Discuss] Repair inside C*

2024-10-22 Thread Dinesh Joshi
Decomposing Cassandra may be architecturally desirable but that is not the goal of this CEP. This CEP brings value to operators today so it should be considered on that merit. We definitely need to have a separate conversation on Cassandra's architectural direction. On Tue, Oct 22, 2024 at 7:51 AM

Re: [Discuss] Repair inside C*

2024-10-22 Thread Joseph Lynch
Definitely like this in C* itself. We only changed our proposal to putting repair scheduling in the sidecar before because trunk was frozen for the foreseeable future at that time. With trunk unfrozen and development on the main process going at a fast pace I think it makes way more sense to integr

Re: [Discuss] Repair inside C*

2024-10-21 Thread Francisco Guerrero
Like others have said, I was expecting the scheduling portion of repair to be negligible. I was mostly curious if you had something handy that you can quickly share. On 2024/10/21 18:59:41 Jaydeep Chovatia wrote: > >Jaydeep, do you have any metrics on your clusters comparing them before > and after i

Re: [Discuss] Repair inside C*

2024-10-21 Thread Jaydeep Chovatia
>Jaydeep, do you have any metrics on your clusters comparing them before and after introducing repair scheduling into the Cassandra process? Yes, I had made some comparisons when I started rolling this feature out to our production five years ago :) Here are the details: *The Scheduling* The sche

Re: [Discuss] Repair inside C*

2024-10-21 Thread Patrick McFadin
> While worth pursuing, I think we would need a different CEP just to figure > out how to do that. Not only is there a lot of infrastructure difficulty in > running multi process, the inter app communication needs to be figured out > better than JMX. I strongly agree and this is a good time to

Re: [Discuss] Repair inside C*

2024-10-21 Thread Francisco Guerrero
Jaydeep, do you have any metrics on your clusters comparing them before and after introducing repair scheduling into the Cassandra process? On 2024/10/21 16:57:57 "J. D. Jordan" wrote: > Sounds good. Just wanted to bring it up. I agree that the scheduling bit is > pretty light weight and the ideal

Re: [Discuss] Repair inside C*

2024-10-21 Thread J. D. Jordan
Sounds good. Just wanted to bring it up. I agree that the scheduling bit is pretty light weight and the ideal would be to bring the whole of the repair external, which is a much bigger can of worms to open. -Jeremiah On Oct 21, 2024, at 11:21 AM, Chris Lohfink wrote: > I actually think we should be

Re: [Discuss] Repair inside C*

2024-10-21 Thread Chris Lohfink
> I actually think we should be looking at how we can move things out of the database process. While worth pursuing, I think we would need a different CEP just to figure out how to do that. Not only is there a lot of infrastructure difficulty in running multi process, the inter app communication n

Re: [Discuss] Repair inside C*

2024-10-21 Thread David Capwell
> Is there any way it makes sense for this to be an external process rather than > a new thread pool inside the C* process? One thing to keep in mind is that larger clusters require you “smartly” split the ranges else you nuke your cluster… knowing how to split requires internal knowledge from t

Re: [Discuss] Repair inside C*

2024-10-21 Thread Josh McKenzie
> Is there any way it makes sense for this to be an external process rather than > a new thread pool inside the C* process? I'm personally more irked by the merkle tree building / streaming / merging / etc resource utilization being in the primary C* process. My intuition is that the *scheduling*

Re: [Discuss] Repair inside C*

2024-10-21 Thread Jeremiah Jordan
I love the idea of a repair service being there by default for an install of C*. My main concern here is that it is putting more services into the main database process. I actually think we should be looking at how we can move things out of the database process. The C* process being a giant mon

Re: [Discuss] Repair inside C*

2024-10-18 Thread Jaydeep Chovatia
Mick, I am very sorry for getting your name wrong. Indeed, Mick - two additional weeks is not an issue at all. Jaydeep On Fri, Oct 18, 2024 at 1:41 PM Jaydeep Chovatia wrote: > Indeed, Mike - two additional weeks is not an issue at all. > Thanks! > > Jaydeep >

Re: [Discuss] Repair inside C*

2024-10-18 Thread Jaydeep Chovatia
Indeed, Mike - two additional weeks is not an issue at all. Thanks! Jaydeep

Re: [Discuss] Repair inside C*

2024-10-18 Thread Mick Semb Wever
This is looking strong, thanks Jaydeep. I would suggest folk take a look at the design doc and the PR in the CEP. A lot is there (that I have completely missed). I would especially ask all authors of prior art (Reaper, DSE nodesync, ecchronos) to take a final review of the proposal Jaydeep, can

Re: [Discuss] Repair inside C*

2024-10-17 Thread Jaydeep Chovatia
Sorry, there is a typo in the CEP-37 link; here is the correct link On Thu, Oct 17, 2024 at 4:36 PM Jaydeep Chovatia wrote: > First, thank you for your patience while we strengthened the CEP-

Re: [Discuss] Repair inside C*

2024-10-17 Thread Jaydeep Chovatia
First, thank you for your patience while we strengthened the CEP-37. Over the last eight months, Chris Lohfink, Andy Tolbert, Josh McKenzie, Dinesh Joshi, Kristijonas Zalys, and I have done tons of work (online discussions/a dedicated Slack channel #cassandra-repair-scheduling-cep37) to come up w

Re: [Discuss] Repair inside C*

2024-09-19 Thread Josh McKenzie
Not quite; finishing touches on the CEP and design doc are in flight (as of last / this week). Soon(tm). On Thu, Sep 19, 2024, at 2:07 PM, Patrick McFadin wrote: > Is this CEP ready for a VOTE thread? > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+%28DRAFT%29+Apache+Cassandra+Un

Re: [Discuss] Repair inside C*

2024-09-19 Thread Patrick McFadin
Is this CEP ready for a VOTE thread? https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+%28DRAFT%29+Apache+Cassandra+Unified+Repair+Solution On Sun, Feb 25, 2024 at 12:25 PM Jaydeep Chovatia < chovatia.jayd...@gmail.com> wrote: > Thanks, Josh. I've just updated the CEP >

Re: [Discuss] Repair inside C*

2024-02-25 Thread Jaydeep Chovatia
Thanks, Josh. I've just updated the CEP and included all the solutions you mentioned below. Jaydeep On Thu, Feb 22, 2024 at 9:33 AM Josh McKenzie wrote: > Very late response from

Re: [Discuss] Repair inside C*

2024-02-23 Thread Štefan Miklošovič
There are already some community solutions to scheduled repairs like this (1), it runs alongside the Cassandra node though ... anyway. I would like to see that we pick what is the best already out there and try to integrate it rather than trying to figure it all out again. That seems like a waste of time a

Re: [Discuss] Repair inside C*

2024-02-23 Thread Josh McKenzie
> we're all willing to bikeshed for our personal preference on where it lives > and how it's implemented, and at the end of the day, code talks. I don't > think anyone's said they'll die on the hill of implementation details :D I don't think we're going to be able to reach a consensus on an ema

Re: [Discuss] Repair inside C*

2024-02-22 Thread Paulo Motta
Apologies, I just read the previous message and missed the previous discussion on sidecar vs main process on this thread. :-) It does not look like a final agreement was reached about this and there are lots of good arguments for both sides, but perhaps it would be nice to agree on this before a C

Re: [Discuss] Repair inside C*

2024-02-22 Thread Paulo Motta
+1 to Josh's points. The project has considered native repair scheduling for a long time but it was never made a reality due to the complex considerations involved and availability of custom implementations/tools like cassandra-reaper, which is a popular way of scheduling repairs in Cassandra. Un

Re: [Discuss] Repair inside C*

2024-02-22 Thread Josh McKenzie
Very late response from me here (basically necro'ing this thread). I think it'd be useful to get this condensed into a CEP that we can then discuss in that format. It's clearly something we all agree we need and having an implementation that works, even if it's not in your preferred execution d

Re: [Discuss] Repair inside C*

2023-08-24 Thread Jaydeep Chovatia
Is anyone going to file an official CEP for this? As mentioned in this email thread, here is one of the solution's design doc and source code on a private Apache Cassandra patch. Could you

Re: [Discuss] Repair inside C*

2023-08-02 Thread Jon Haddad
> That said I would happily support an effort to bring repair scheduling to the > sidecar immediately. This has nothing blocking it, and would potentially > enable the sidecar to provide an official repair scheduling solution that is > compatible with current or even previous versions of the dat

Re: [Discuss] Repair inside C*

2023-07-27 Thread Josh McKenzie
> The idea that your data integrity needs to be opt-in has never made sense to > me from the perspective of either the product or the end user. I could not agree with this more. 100%. > The current (and past) state of things where running the DB correctly > *requires* running a separate proces

Re: [Discuss] Repair inside C*

2023-07-26 Thread C. Scott Andreas
I agree that it would be ideal for Cassandra to have a repair scheduler in-DB. That said I would happily support an effort to bring repair scheduling to the sidecar immediately. This has nothing blocking it, and would potentially enable the sidecar to provide an official repair scheduling soluti

Re: [Discuss] Repair inside C*

2023-07-26 Thread Dinesh Joshi
I concur, repair is an intrinsic part of the database and belongs inside it. We can certainly expose a REST control plane API via the sidecar for triggering it on demand, scheduling, etc. That said, there are various implementations of repair scheduling and orchestration that a lot of organizati

Re: [Discuss] Repair inside C*

2023-07-26 Thread Jon Haddad
I'm 100% in favor of repair being part of the core DB, not the sidecar. The current (and past) state of things where running the DB correctly *requires* running a separate process (either community maintained or official C* sidecar) is incredibly painful for folks. The idea that your data inte

Re: [Discuss] Repair inside C*

2023-07-26 Thread David Capwell
ev >>> mailto:dev@cassandra.apache.org>> wrote: >>>> In [2] we suggested that the next step should be a CEP. >>>> >>>> I am happy to lend a hand to this effort as well. >>>> >>>> Thanks Jaydeep and David - really appreciated.

Re: [Discuss] Repair inside C*

2023-07-25 Thread Jeremiah Jordan
appreciated. >>> >>> German >>> >>> -- >>> *From:* David Capwell >>> *Sent:* Tuesday, July 25, 2023 8:32 AM >>> *To:* dev >>> *Cc:* German Eichberger >>> *Subject:* [EXTERNAL] Re: [Di

Re: [Discuss] Repair inside C*

2023-07-25 Thread Chris Lohfink
> >> German >> >> -- >> *From:* David Capwell >> *Sent:* Tuesday, July 25, 2023 8:32 AM >> *To:* dev >> *Cc:* German Eichberger >> *Subject:* [EXTERNAL] Re: [Discuss] Repair inside C* >> >> As someone who has don

Re: [Discuss] Repair inside C*

2023-07-25 Thread Jaydeep Chovatia
effort as well. > > Thanks Jaydeep and David - really appreciated. > > German > > -- > *From:* David Capwell > *Sent:* Tuesday, July 25, 2023 8:32 AM > *To:* dev > *Cc:* German Eichberger > *Subject:* [EXTERNAL] Re: [Discuss] Repair ins

Re: [Discuss] Repair inside C*

2023-07-25 Thread German Eichberger via dev
: [EXTERNAL] Re: [Discuss] Repair inside C* As someone who has done a lot of work trying to make repair stable, I approve of this message ^_^ More than glad to help mentor this work On Jul 24, 2023, at 6:29 PM, Jaydeep Chovatia wrote: To clarify the repair solution timing, the one we have listed

Re: [Discuss] Repair inside C*

2023-07-25 Thread David Capwell
As someone who has done a lot of work trying to make repair stable, I approve of this message ^_^ More than glad to help mentor this work > On Jul 24, 2023, at 6:29 PM, Jaydeep Chovatia > wrote: > > To clarify the repair solution timing, the one we have listed in the article > is not the rec

Re: [Discuss] Repair inside C*

2023-07-24 Thread Jaydeep Chovatia
To clarify the repair solution timing, the one we have listed in the article is not the recently developed one. We were hitting some high-priority production challenges back in early 2018, and to address that, we developed and rolled out the solution in production in just a few months. The timing-w

Re: [Discuss] Repair inside C*

2023-07-24 Thread Jaydeep Chovatia
Hi German, The goal is always to backport our learnings to the community. For example, I have already successfully backported the following two enhancements/bug fixes to Open Source Cassandra, which are described in the article. I am currently working on open-sourcing a few mor

[Discuss] Repair inside C*

2023-07-24 Thread German Eichberger via dev
All, We had a brief discussion in [2] about the Uber article [1] where they talk about having integrated repair into Cassandra and how great that is. I expressed my disappointment that they didn't work with the community on that (Uber, if you are listening, time to make amends 🙂) and it turns ou