I have created a sub-task - CASSANDRA-14783. Could we get some feedback before we begin implementing anything?
Dinesh

On Thursday, September 20, 2018, 11:22:33 PM PDT, Dinesh Joshi <dinesh.jo...@yahoo.com.INVALID> wrote:

I have updated the doc with a short paragraph providing the clarification. Sankalp's suggestion is already part of the doc. If there aren't further objections, could we move this discussion over to the Jira (CASSANDRA-14395)?

Dinesh

> On Sep 18, 2018, at 10:31 AM, sankalp kohli <kohlisank...@gmail.com> wrote:
>
> How about we start with a few basic features in the sidecar? How about starting with this:
>
> 1. Bulk nodetool commands: the user can curl any sidecar and run a nodetool command in bulk across the cluster:
> <sidecar>:<port>/bulk/nodetool/tablestats?arg0=keyspace_name.table_name&arg1=<if required>
>
> And later:
>
> 2. Health checks.
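To make the bulk-nodetool idea above a bit more concrete, here is a minimal sketch (Java 11, using the JDK HttpClient) of how the receiving sidecar could fan a command out to every peer sidecar and aggregate the per-node output. The class name, the per-node /nodetool/<command> path, the example port, and how the peer list is obtained are illustrative assumptions, not part of any agreed design.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Illustrative sketch only: one sidecar fans a nodetool-style command out to
// every peer sidecar and aggregates the responses. Peer discovery, auth and
// error handling are deliberately left out.
public class BulkNodetoolClient
{
    private final HttpClient http = HttpClient.newHttpClient();

    // Calls /nodetool/<command>?<args> on each peer sidecar and returns peer -> output.
    public Map<String, String> runOnAllPeers(List<String> peerSidecars, // e.g. "10.0.0.1:9043" (port is an assumption)
                                             String command,            // e.g. "tablestats"
                                             String queryArgs)          // e.g. "arg0=keyspace_name.table_name"
    {
        Map<String, CompletableFuture<String>> pending = new LinkedHashMap<>();
        for (String peer : peerSidecars)
        {
            URI uri = URI.create("http://" + peer + "/nodetool/" + command + "?" + queryArgs);
            HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
            // Issue the per-node requests concurrently; each peer runs the command locally.
            pending.put(peer, http.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                                  .thenApply(HttpResponse::body));
        }
        Map<String, String> results = new LinkedHashMap<>();
        pending.forEach((peer, future) -> results.put(peer, future.join()));
        return results;
    }
}

A user would then curl any single sidecar, e.g. curl 'http://<sidecar>:<port>/bulk/nodetool/tablestats?arg0=keyspace_name.table_name', and get every node's output back in one aggregated response.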
> On Thu, Sep 13, 2018 at 11:34 AM dinesh.jo...@yahoo.com.INVALID <dinesh.jo...@yahoo.com.invalid> wrote:
>
> I will update the document to add that point. The document was not meant to serve as a design or architectural document, but rather as something that would spark a discussion on the idea.
>
> Dinesh
>
> On Thursday, September 13, 2018, 10:59:34 AM PDT, Jonathan Haddad <j...@jonhaddad.com> wrote:
>
> Most of the discussion and work was done off the mailing list - there's a big risk involved when folks disappear for months at a time and resurface with a big pile of code plus an agenda that you failed to loop everyone in on. In addition, by your own words, the design document didn't accurately describe what was being built. I don't write this to try to argue about it; I just want to add some perspective for those of us who weren't part of this discussion on a weekly basis over the last several months. Going forward, let's keep things on the ML so we can avoid confusion and frustration for all parties.
>
> With that said - I think Blake made a really good point here, and it's helped me understand the scope of what's being built better. Looking at it from a different perspective, it doesn't seem like there's as much overlap as I had initially thought. There's the machinery that runs certain tasks (what Joey has been working on) and the user-facing side of exposing that information in a management tool.
>
> I do appreciate (and like) the idea of not trying to boil the ocean and working on things incrementally. Putting a thin layer on top of Cassandra that can perform cluster-wide tasks does give us an opportunity to move in the direction of a general-purpose, user-facing admin tool without committing to writing the full stack all at once (or even making decisions on it now). We do need a sensible way of doing rolling restarts / scrubs / scheduling, and Reaper wasn't built for that; even though we could add it, I'm not sure it's the best mechanism for the long term.
>
> So if your goal is to add maturity to the project by making cluster-wide tasks easier through a framework to build on top of, I'm in favor of that, and I don't see it as antithetical to what I had in mind with Reaper. Rather, the two are more complementary than I had originally realized.
>
> Jon
>
> On Thu, Sep 13, 2018 at 10:39 AM dinesh.jo...@yahoo.com.INVALID <dinesh.jo...@yahoo.com.invalid> wrote:
>
> > I have a few clarifications -
> >
> > The scope of the management process is not to simply run repair scheduling. Repair scheduling is one of the many features we could implement or adopt from existing sources. So could we please split the management process discussion from the repair scheduling discussion?
> >
> > After re-reading the management process proposal, I see we failed to communicate a basic idea in the document. We wanted to take a pluggable approach to the various activities that the management process could perform. This could accommodate different implementations of common activities such as repair. The management process would provide the basic framework, and it would ship default implementations for some of the basic activities. This would allow for speedier iteration cycles and keep things extensible.
> >
> > Turning to some questions that Jon and others have raised: when I +1, my intention is to fully contribute and stay with this community. That said, things feel rushed to some, but to me it feels like analysis paralysis. We're looking for actionable feedback and want to discuss the management process, _not_ repair scheduling solutions.
> >
> > Thanks,
> > Dinesh
> >
> > On Sep 12, 2018, at 6:24 PM, sankalp kohli <kohlisank...@gmail.com> wrote:
> >
> > Here is a list of open discussion points from the voting thread. I think some are already answered, but I will still gather these questions here.
> >
> > From several people:
> > 1. The vote is rushed and we need more time for discussion.
> >
> > From Sylvain:
> > 2. About the voting process... I think that was addressed by Jeff Jirsa and deserves a separate thread, as it is not directly related to this thread.
> > 3. Does the project need a sidecar?
> >
> > From Jonathan Haddad:
> > 4. Are the people voting +1 willing to contribute?
> >
> > From Jonathan Ellis:
> > 5. A list of the feature set, maturity, and maintainer availability of Reaper or any other project being donated.
> >
> > From Mick Semb Wever:
> > 6. We should not vote on these things and instead build consensus.
> >
> > Open questions from this thread:
> > 7. What technical debt are we talking about in Reaper? Can someone give concrete examples?
> > 8. What is the timeline for donating Reaper to Apache Cassandra?
> >
> > On Wed, Sep 12, 2018 at 3:49 PM sankalp kohli <kohlisank...@gmail.com> wrote:
> >
> > (Using this thread and not the vote thread intentionally.)
> > For folks saying the vote is rushed: I would use the email from Joseph to show this is not rushed. There was no email on this thread for 4 months until I pinged.
> >
> > Dec 2016: Vinay worked with Jon and Alex to try to collaborate on Reaper to come up with design goals for a repair scheduler that could work at Netflix scale.
> >
> > ~Feb 2017: Netflix believes that the fundamental design gaps prevented us from using Reaper, as it relies heavily on remote JMX connections and central coordination.
> >
> > Sep 2017: Vinay gives a lightning talk at NGCC about a highly available and distributed repair scheduling sidecar/tool. He is encouraged by multiple committers to build repair scheduling into the daemon itself and not as a sidecar, so that the database is truly eventually consistent.
> >
> > ~Jun 2017 - Feb 2018: Based on internal need and the positive feedback at NGCC, Vinay and I prototype the distributed repair scheduler within Priam and roll it out at Netflix scale.
> > Mar 2018: I open a Jira (CASSANDRA-14346) along with a detailed 20-page design document for adding repair scheduling to the daemon itself and open the design up for feedback from the community. We get feedback from Alex, Blake, Nate, Stefan, and Mick. As far as I know, there were zero proposals to contribute Reaper at this point. We hear the consensus that the community would prefer repair scheduling in a separate distributed sidecar rather than in the daemon itself, and we rework the design to match this consensus, re-aligning with our original proposal at NGCC.
> >
> > Apr 2018: Blake brings the discussion of repair scheduling to the dev list (https://lists.apache.org/thread.html/760fbef677f27aa5c2ab4c375c7efeb81304fea428deff986ba1c2eb@%3Cdev.cassandra.apache.org%3E). Many community members give positive feedback that we should solve it as part of Cassandra, and there is still no mention of contributing Reaper at this point. The last message is my attempted summary giving context on how we want to take the best of all the sidecars (OpsCenter, Priam, Reaper) and ship them with Cassandra.
> >
> > Apr 2018: Dinesh opens CASSANDRA-14395 along with a public design document for gathering feedback on a general management sidecar. Sankalp and Dinesh encourage Vinay and me to kickstart that sidecar using the repair scheduler patch.
> >
> > Apr 2018: Dinesh reaches out to the dev list (https://lists.apache.org/thread.html/a098341efd8f344494bcd2761dba5125e971b59b1dd54f282ffda253@%3Cdev.cassandra.apache.org%3E) about the general management process to gain further feedback. All feedback remains positive, as it is a potential place for multiple community members to contribute their various sidecar functionality.
> >
> > May-Jul 2018: Vinay and I work on creating a basic sidecar for running the repair scheduler based on the feedback from the community in CASSANDRA-14346 and CASSANDRA-14395.
> >
> > Jun 2018: I bump CASSANDRA-14346 indicating we're still working on this; nobody objects.
> >
> > Jul 2018: Sankalp asks on the dev list if anyone has feature Jiras they need review for before 4.0. I mention again that we've nearly got the basic sidecar and repair scheduling work done and will need help with review. No one responds.
> >
> > Aug 2018: We submit a patch that brings a basic distributed sidecar and robust distributed repair to Cassandra itself. Dinesh mentions that he will try to review. Now folks appear concerned about it being in-tree, and instead maybe it should go into a different repo altogether. I don't think we have consensus on the repo choice yet.
> >
> > On Sun, Sep 9, 2018 at 9:13 AM sankalp kohli <kohlisank...@gmail.com> wrote:
> >
> > I agree with Jon, and I think folks who are talking about tech debt in Reaper should elaborate with concrete examples of that debt. Can we be more precise and list them down? I see it spread out over this long email thread!
> > On Sun, Sep 9, 2018 at 6:29 AM Elliott Sims <elli...@backblaze.com> wrote:
> >
> > A big one to add to your list there, IMO as a user:
> > * An API for determining detailed repair state (and history?). Essentially, something beyond just "Is some sort of repair running?" so that tools like Reaper can parallelize better.
> >
> > On Sun, Sep 9, 2018 at 8:30 AM, Stefan Podkowinski <s...@apache.org> wrote:
> >
> > Does it have to be a single project with functionality provided by multiple plugins? Designing a plugin API at this point seems to be a bit early and comes with additional complexity around managing plugins in general.
> >
> > I was thinking more in the direction of: "what can we do to enable people to create any kind of sidecar or tooling solution?". Things like:
> >
> > Common cluster discovery and management API
> > * Detect local Cassandra processes
> > * Discover and receive events on cluster topology
> > * Get assigned tokens for nodes
> > * Read node configuration
> > * Health checks (as already proposed)
> >
> > Any sidecar should be easy to install on nodes that already run Cassandra
> > * Scripts for packaging (tar, deb, rpm)
> > * Templates for systemd support, optionally with an auto-startup dependency on the Cassandra main process
> >
> > Integration testing
> > * Provide a basic testing framework for mocking cluster state and messages
> >
> > Support for other languages / avoid having to use JMX
> > * JMX bridge (HTTP? gRPC? already implemented in #14346?)
> >
> > Obviously the whole sidecar discussion is not moving in a direction everyone's happy with. Would it be an option to take a step back and start implementing such a tooling framework, with scripts and libraries for the features described above, as a small GitHub project, instead of putting an existing sidecar solution up for vote? If that works and we get people collaborating on code shared between existing sidecars, then we could take the next step and either revisit the "official Cassandra sidecar" topic or add the resulting client tooling framework as an official sub-project to the Cassandra project (maybe via the Apache Incubator).
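As a rough illustration of the "health checks" and "JMX bridge (HTTP? gRPC?)" items above, here is a minimal sketch of a sidecar-style HTTP health endpoint that reads the node's operation mode over local JMX. It assumes Cassandra's default JMX port (7199) and the standard StorageService MBean; the sidecar port and the JSON response shape are made up for illustration.

import com.sun.net.httpserver.HttpServer;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: expose GET /health from a sidecar process by
// bridging to the co-located Cassandra node's JMX interface.
public class HealthCheckBridge
{
    public static void main(String[] args) throws Exception
    {
        // Connect to the local Cassandra process over JMX (7199 is Cassandra's default JMX port).
        JMXServiceURL jmxUrl = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(jmxUrl);
        MBeanServerConnection mbeans = connector.getMBeanServerConnection();
        ObjectName storageService = new ObjectName("org.apache.cassandra.db:type=StorageService");

        // The sidecar's own HTTP port (9043 here) is an arbitrary choice for this sketch.
        HttpServer server = HttpServer.create(new InetSocketAddress(9043), 0);
        server.createContext("/health", exchange -> {
            int status;
            byte[] body;
            try
            {
                // "NORMAL" means the node is up and serving; anything else is reported as degraded.
                String mode = (String) mbeans.getAttribute(storageService, "OperationMode");
                status = "NORMAL".equals(mode) ? 200 : 503;
                body = ("{\"operationMode\": \"" + mode + "\"}").getBytes(StandardCharsets.UTF_8);
            }
            catch (Exception e)
            {
                status = 503;
                body = "{\"error\": \"JMX unreachable\"}".getBytes(StandardCharsets.UTF_8);
            }
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(status, body.length);
            try (OutputStream out = exchange.getResponseBody())
            {
                out.write(body);
            }
        });
        server.start();
    }
}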
> > On 08.09.18 02:49, Joseph Lynch wrote:
> >
> > > On Fri, Sep 7, 2018 at 5:03 PM Jonathan Haddad <j...@jonhaddad.com> wrote:
> > >
> > > We haven’t even defined any requirements for an admin tool. It’s hard to make a case for anything without agreement on what we’re trying to build.
> >
> > We were/are trying to sketch out scope/requirements in the #14395 and #14346 tickets as well as their associated design documents. I think the general proposed direction is a distributed 1:1 management sidecar process, similar in architecture to Netflix's Priam except explicitly built to be general and pluggable by anyone rather than tightly coupled to AWS.
> >
> > Dinesh, Vinay and I were aiming for a small scope at first, taking things in an iterative approach with just enough upfront design, but not so much that we are unable to make any progress at all. For example, maybe something like:
> >
> > 1. Get a super simple and non-controversial sidecar process that ships with Cassandra and exposes a lightweight HTTP interface to e.g. some basic JMX endpoints
> > 2a. Add a pluggable execution engine for cron/oneshot/scheduled jobs, with the basic interfaces, state store and such
> > 2b. Start scoping and implementing the full HTTP interface, e.g. backup status, cluster health status, etc.
> > 3a. Start integrating implementations of the jobs from 2a, such as snapshot, backup, cluster restart, daemon + sstable upgrade, repair, etc.
> > 3b. Start integrating UI components that pair with the HTTP interface from 2b
> > 4. ?? Perhaps start unlocking next-generation operations, like moving "background" activities such as compaction, streaming, and repair into one or more sidecar-contained processes to ensure the main daemon only handles read+write requests
> >
> > There are going to be a lot of questions to answer, and I think trying to answer them all up front will mean that we get nowhere or make unfortunate compromises that cripple the project from the start. If people think we need to do more design and discussion than we have been doing, then we can spend more time on the design, but personally I'd rather start iterating on code and prove value incrementally. If it doesn't work out, we won't release it GA to the community ...
> >
> > -Joey
>
> --
> Jon Haddad
> http://www.rustyrazorblade.com
> twitter: rustyrazorblade
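Item 2a in Joseph's outline above mentions a pluggable execution engine for cron/oneshot/scheduled jobs. Purely as an illustration of the shape such a plugin interface could take (none of these names come from the actual proposal or patch), a minimal version might look like:

import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only: jobs (repair, backup, upgrade, ...) plug in by
// implementing one small interface, and the sidecar schedules them.
interface MaintenanceJob
{
    String name();
    Duration interval();        // how often to re-run the job (must be positive in this sketch)
    void run() throws Exception;
}

class JobEngine
{
    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);
    // Stand-in for the "state store" mentioned in 2a; a real engine would persist this.
    private final Map<String, String> lastStatus = new ConcurrentHashMap<>();

    // Register every plugged-in job and schedule it at its own interval.
    void start(List<MaintenanceJob> jobs)
    {
        for (MaintenanceJob job : jobs)
        {
            scheduler.scheduleWithFixedDelay(() -> {
                try
                {
                    job.run();
                    lastStatus.put(job.name(), "SUCCEEDED");
                }
                catch (Exception e)
                {
                    // A failed run is recorded but does not cancel future runs.
                    lastStatus.put(job.name(), "FAILED: " + e.getMessage());
                }
            }, 0, job.interval().toMillis(), TimeUnit.MILLISECONDS);
        }
    }

    // What an HTTP layer could expose, e.g. GET /jobs/status.
    Map<String, String> status()
    {
        return Map.copyOf(lastStatus);
    }
}

A repair, backup, or upgrade task would then just implement MaintenanceJob and register with the engine, and the HTTP interface from item 2b could expose the engine's status map.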