I have created a sub-task - CASSANDRA-14783. Could we get some feedback before we begin implementing anything?
Dinesh

On Thursday, September 20, 2018, 11:22:33 PM PDT, Dinesh Joshi <dinesh.jo...@yahoo.com.INVALID> wrote:

I have updated the doc with a short paragraph providing the clarification. Sankalp's suggestion is already part of the doc. If there aren't further objections, could we move this discussion over to the Jira (CASSANDRA-14395)?

Dinesh

> On Sep 18, 2018, at 10:31 AM, sankalp kohli <kohlisank...@gmail.com> wrote:
>
> How about we start with a few basic features in the sidecar? How about starting with this:
>
> 1. Bulk nodetool commands: the user can curl any sidecar and run a nodetool command in bulk across the cluster:
> <sidecar>:<port>/bulk/nodetool/tablestats?arg0=keyspace_name.table_name&arg1=<if required>
>
> And later:
>
> 2. Health checks.
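To make the bulk-nodetool idea above a bit more concrete, here is a minimal sketch (Java 11, using the JDK HttpClient) of how the receiving sidecar could fan a command out to every peer sidecar and aggregate the per-node output. The class name, the per-node /nodetool/<command> path, the example port, and how the peer list is obtained are illustrative assumptions, not part of any agreed design.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Illustrative sketch only: one sidecar fans a nodetool-style command out to
// every peer sidecar and aggregates the responses. Peer discovery, auth and
// error handling are deliberately left out.
public class BulkNodetoolClient
{
    private final HttpClient http = HttpClient.newHttpClient();

    // Calls /nodetool/<command>?<args> on each peer sidecar and returns peer -> output.
    public Map<String, String> runOnAllPeers(List<String> peerSidecars, // e.g. "10.0.0.1:9043" (port is an assumption)
                                             String command,            // e.g. "tablestats"
                                             String queryArgs)          // e.g. "arg0=keyspace_name.table_name"
    {
        Map<String, CompletableFuture<String>> pending = new LinkedHashMap<>();
        for (String peer : peerSidecars)
        {
            URI uri = URI.create("http://" + peer + "/nodetool/" + command + "?" + queryArgs);
            HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
            // Issue the per-node requests concurrently; each peer runs the command locally.
            pending.put(peer, http.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                                  .thenApply(HttpResponse::body));
        }
        Map<String, String> results = new LinkedHashMap<>();
        pending.forEach((peer, future) -> results.put(peer, future.join()));
        return results;
    }
}

A user would then curl any single sidecar, e.g. curl 'http://<sidecar>:<port>/bulk/nodetool/tablestats?arg0=keyspace_name.table_name', and get every node's output back in one aggregated response.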
> On Thu, Sep 13, 2018 at 11:34 AM dinesh.jo...@yahoo.com.INVALID <dinesh.jo...@yahoo.com.invalid> wrote:
>
> I will update the document to add that point. The document was not meant to serve as a design or architectural document, but rather as something that would spark a discussion on the idea.
>
> Dinesh
>
> On Thursday, September 13, 2018, 10:59:34 AM PDT, Jonathan Haddad <j...@jonhaddad.com> wrote:
>
> Most of the discussion and work was done off the mailing list - there's a big risk involved when folks disappear for months at a time and resurface with a big pile of code plus an agenda that you failed to loop everyone in on. In addition, by your own words, the design document didn't accurately describe what was being built. I don't write this to try to argue about it; I just want to add some perspective for those of us who weren't part of this discussion on a weekly basis over the last several months. Going forward, let's keep things on the ML so we can avoid confusion and frustration for all parties.
>
> With that said - I think Blake made a really good point here, and it's helped me understand the scope of what's being built better. Looking at it from a different perspective, it doesn't seem like there's as much overlap as I had initially thought. There's the machinery that runs certain tasks (what Joey has been working on) and the user-facing side of exposing that information in a management tool.
>
> I do appreciate (and like) the idea of not trying to boil the ocean and working on things incrementally. Putting a thin layer on top of Cassandra that can perform cluster-wide tasks does give us an opportunity to move in the direction of a general-purpose, user-facing admin tool without committing to writing the full stack all at once (or even making decisions on it now). We do need a sensible way of doing rolling restarts / scrubs / scheduling, and Reaper wasn't built for that; even though we could add it, I'm not sure it's the best mechanism for the long term.
>
> So if your goal is to add maturity to the project by making cluster-wide tasks easier through a framework to build on top of, I'm in favor of that, and I don't see it as antithetical to what I had in mind with Reaper. Rather, the two are more complementary than I had originally realized.
>
> Jon
>
> On Thu, Sep 13, 2018 at 10:39 AM dinesh.jo...@yahoo.com.INVALID <dinesh.jo...@yahoo.com.invalid> wrote:
>
> > I have a few clarifications -
> >
> > The scope of the management process is not to simply run repair scheduling. Repair scheduling is one of the many features we could implement or adopt from existing sources. So could we please split the management process discussion from the repair scheduling discussion?
> >
> > After re-reading the management process proposal, I see we failed to communicate a basic idea in the document. We wanted to take a pluggable approach to the various activities that the management process could perform. This could accommodate different implementations of common activities such as repair. The management process would provide the basic framework, and it would ship default implementations for some of the basic activities. This would allow for speedier iteration cycles and keep things extensible.
> >
> > Turning to some questions that Jon and others have raised: when I +1, my intention is to fully contribute and stay with this community. That said, things feel rushed to some, but to me it feels like analysis paralysis. We're looking for actionable feedback and want to discuss the management process, _not_ repair scheduling solutions.
> >
> > Thanks,
> > Dinesh
> >
> > On Sep 12, 2018, at 6:24 PM, sankalp kohli <kohlisank...@gmail.com> wrote:
> >
> > Here is a list of open discussion points from the voting thread. I think some are already answered, but I will still gather these questions here.
> >
> > From several people:
> > 1. The vote is rushed and we need more time for discussion.
> >
> > From Sylvain:
> > 2. About the voting process... I think that was addressed by Jeff Jirsa and deserves a separate thread, as it is not directly related to this thread.
> > 3. Does the project need a sidecar?
> >
> > From Jonathan Haddad:
> > 4. Are the people voting +1 willing to contribute?
> >
> > From Jonathan Ellis:
> > 5. A list of the feature set, maturity, and maintainer availability of Reaper or any other project being donated.
> >
> > From Mick Semb Wever:
> > 6. We should not vote on these things and instead build consensus.
> >
> > Open questions from this thread:
> > 7. What technical debt are we talking about in Reaper? Can someone give concrete examples?
> > 8. What is the timeline for donating Reaper to Apache Cassandra?
> >
> > On Wed, Sep 12, 2018 at 3:49 PM sankalp kohli <kohlisank...@gmail.com> wrote:
> >
> > (Using this thread and not the vote thread intentionally.)
> > For folks saying the vote is rushed: I would use the email from Joseph to show this is not rushed. There was no email on this thread for 4 months until I pinged.
> >
> > Dec 2016: Vinay worked with Jon and Alex to try to collaborate on Reaper to come up with design goals for a repair scheduler that could work at Netflix scale.
> >
> > ~Feb 2017: Netflix believes that the fundamental design gaps prevented us from using Reaper, as it relies heavily on remote JMX connections and central coordination.
> >
> > Sep 2017: Vinay gives a lightning talk at NGCC about a highly available and distributed repair scheduling sidecar/tool. He is encouraged by multiple committers to build repair scheduling into the daemon itself and not as a sidecar, so that the database is truly eventually consistent.
> >
> > ~Jun 2017 - Feb 2018: Based on internal need and the positive feedback at NGCC, Vinay and I prototype the distributed repair scheduler within Priam and roll it out at Netflix scale.
> > Mar 2018: I open a Jira (CASSANDRA-14346) along with a detailed 20-page design document for adding repair scheduling to the daemon itself and open the design up for feedback from the community. We get feedback from Alex, Blake, Nate, Stefan, and Mick. As far as I know, there were zero proposals to contribute Reaper at this point. We hear the consensus that the community would prefer repair scheduling in a separate distributed sidecar rather than in the daemon itself, and we rework the design to match this consensus, re-aligning with our original proposal at NGCC.
> >
> > Apr 2018: Blake brings the discussion of repair scheduling to the dev list (https://lists.apache.org/thread.html/760fbef677f27aa5c2ab4c375c7efeb81304fea428deff986ba1c2eb@%3Cdev.cassandra.apache.org%3E). Many community members give positive feedback that we should solve it as part of Cassandra, and there is still no mention of contributing Reaper at this point. The last message is my attempted summary giving context on how we want to take the best of all the sidecars (OpsCenter, Priam, Reaper) and ship them with Cassandra.
> >
> > Apr 2018: Dinesh opens CASSANDRA-14395 along with a public design document for gathering feedback on a general management sidecar. Sankalp and Dinesh encourage Vinay and me to kickstart that sidecar using the repair scheduler patch.
> >
> > Apr 2018: Dinesh reaches out to the dev list (https://lists.apache.org/thread.html/a098341efd8f344494bcd2761dba5125e971b59b1dd54f282ffda253@%3Cdev.cassandra.apache.org%3E) about the general management process to gain further feedback. All feedback remains positive, as it is a potential place for multiple community members to contribute their various sidecar functionality.
> >
> > May-Jul 2018: Vinay and I work on creating a basic sidecar for running the repair scheduler based on the feedback from the community in CASSANDRA-14346 and CASSANDRA-14395.
> >
> > Jun 2018: I bump CASSANDRA-14346 indicating we're still working on this; nobody objects.
> >
> > Jul 2018: Sankalp asks on the dev list if anyone has feature Jiras they need review for before 4.0. I mention again that we've nearly got the basic sidecar and repair scheduling work done and will need help with review. No one responds.
> >
> > Aug 2018: We submit a patch that brings a basic distributed sidecar and robust distributed repair to Cassandra itself. Dinesh mentions that he will try to review. Now folks appear concerned about it being in-tree, and instead maybe it should go into a different repo altogether. I don't think we have consensus on the repo choice yet.
> >
> > On Sun, Sep 9, 2018 at 9:13 AM sankalp kohli <kohlisank...@gmail.com> wrote:
> >
> > I agree with Jon, and I think folks who are talking about tech debt in Reaper should elaborate with concrete examples of that debt. Can we be more precise and list them down? I see it spread out over this long email thread!
> > On Sun, Sep 9, 2018 at 6:29 AM Elliott Sims <elli...@backblaze.com> wrote:
> >
> > A big one to add to your list there, IMO as a user:
> > * An API for determining detailed repair state (and history?). Essentially, something beyond just "Is some sort of repair running?" so that tools like Reaper can parallelize better.
> >
> > On Sun, Sep 9, 2018 at 8:30 AM, Stefan Podkowinski <s...@apache.org> wrote:
> >
> > Does it have to be a single project with functionality provided by multiple plugins? Designing a plugin API at this point seems to be a bit early and comes with additional complexity around managing plugins in general.
> >
> > I was thinking more in the direction of: "what can we do to enable people to create any kind of sidecar or tooling solution?". Things like:
> >
> > Common cluster discovery and management API
> > * Detect local Cassandra processes
> > * Discover and receive events on cluster topology
> > * Get assigned tokens for nodes
> > * Read node configuration
> > * Health checks (as already proposed)
> >
> > Any sidecar should be easy to install on nodes that already run Cassandra
> > * Scripts for packaging (tar, deb, rpm)
> > * Templates for systemd support, optionally with an auto-startup dependency on the Cassandra main process
> >
> > Integration testing
> > * Provide a basic testing framework for mocking cluster state and messages
> >
> > Support for other languages / avoid having to use JMX
> > * JMX bridge (HTTP? gRPC? already implemented in #14346?)
> >
> > Obviously the whole sidecar discussion is not moving in a direction everyone's happy with. Would it be an option to take a step back and start implementing such a tooling framework, with scripts and libraries for the features described above, as a small GitHub project, instead of putting an existing sidecar solution up for vote? If that works and we get people collaborating on code shared between existing sidecars, then we could take the next step and either revisit the "official Cassandra sidecar" topic or add the resulting client tooling framework as an official sub-project to the Cassandra project (maybe via the Apache Incubator).
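As a rough illustration of the "health checks" and "JMX bridge (HTTP? gRPC?)" items above, here is a minimal sketch of a sidecar-style HTTP health endpoint that reads the node's operation mode over local JMX. It assumes Cassandra's default JMX port (7199) and the standard StorageService MBean; the sidecar port and the JSON response shape are made up for illustration.

import com.sun.net.httpserver.HttpServer;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: expose GET /health from a sidecar process by
// bridging to the co-located Cassandra node's JMX interface.
public class HealthCheckBridge
{
    public static void main(String[] args) throws Exception
    {
        // Connect to the local Cassandra process over JMX (7199 is Cassandra's default JMX port).
        JMXServiceURL jmxUrl = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(jmxUrl);
        MBeanServerConnection mbeans = connector.getMBeanServerConnection();
        ObjectName storageService = new ObjectName("org.apache.cassandra.db:type=StorageService");

        // The sidecar's own HTTP port (9043 here) is an arbitrary choice for this sketch.
        HttpServer server = HttpServer.create(new InetSocketAddress(9043), 0);
        server.createContext("/health", exchange -> {
            int status;
            byte[] body;
            try
            {
                // "NORMAL" means the node is up and serving; anything else is reported as degraded.
                String mode = (String) mbeans.getAttribute(storageService, "OperationMode");
                status = "NORMAL".equals(mode) ? 200 : 503;
                body = ("{\"operationMode\": \"" + mode + "\"}").getBytes(StandardCharsets.UTF_8);
            }
            catch (Exception e)
            {
                status = 503;
                body = "{\"error\": \"JMX unreachable\"}".getBytes(StandardCharsets.UTF_8);
            }
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(status, body.length);
            try (OutputStream out = exchange.getResponseBody())
            {
                out.write(body);
            }
        });
        server.start();
    }
}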
> > On 08.09.18 02:49, Joseph Lynch wrote:
> >
> > > On Fri, Sep 7, 2018 at 5:03 PM Jonathan Haddad <j...@jonhaddad.com> wrote:
> > >
> > > We haven’t even defined any requirements for an admin tool. It’s hard to make a case for anything without agreement on what we’re trying to build.
> >
> > We were/are trying to sketch out scope/requirements in the #14395 and #14346 tickets as well as their associated design documents. I think the general proposed direction is a distributed 1:1 management sidecar process, similar in architecture to Netflix's Priam except explicitly built to be general and pluggable by anyone rather than tightly coupled to AWS.
> >
> > Dinesh, Vinay and I were aiming for a small scope at first, taking things in an iterative approach with just enough upfront design, but not so much that we are unable to make any progress at all. For example, maybe something like:
> >
> > 1. Get a super simple and non-controversial sidecar process that ships with Cassandra and exposes a lightweight HTTP interface to e.g. some basic JMX endpoints
> > 2a. Add a pluggable execution engine for cron/oneshot/scheduled jobs, with the basic interfaces, state store and such
> > 2b. Start scoping and implementing the full HTTP interface, e.g. backup status, cluster health status, etc.
> > 3a. Start integrating implementations of the jobs from 2a, such as snapshot, backup, cluster restart, daemon + sstable upgrade, repair, etc.
> > 3b. Start integrating UI components that pair with the HTTP interface from 2b
> > 4. ?? Perhaps start unlocking next-generation operations, like moving "background" activities such as compaction, streaming, and repair into one or more sidecar-contained processes to ensure the main daemon only handles read+write requests
> >
> > There are going to be a lot of questions to answer, and I think trying to answer them all up front will mean that we get nowhere or make unfortunate compromises that cripple the project from the start. If people think we need to do more design and discussion than we have been doing, then we can spend more time on the design, but personally I'd rather start iterating on code and prove value incrementally. If it doesn't work out, we won't release it GA to the community ...
> >
> > -Joey
>
> --
> Jon Haddad
> http://www.rustyrazorblade.com
> twitter: rustyrazorblade
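Item 2a in Joseph's outline above mentions a pluggable execution engine for cron/oneshot/scheduled jobs. Purely as an illustration of the shape such a plugin interface could take (none of these names come from the actual proposal or patch), a minimal version might look like:

import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only: jobs (repair, backup, upgrade, ...) plug in by
// implementing one small interface, and the sidecar schedules them.
interface MaintenanceJob
{
    String name();
    Duration interval();        // how often to re-run the job (must be positive in this sketch)
    void run() throws Exception;
}

class JobEngine
{
    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);
    // Stand-in for the "state store" mentioned in 2a; a real engine would persist this.
    private final Map<String, String> lastStatus = new ConcurrentHashMap<>();

    // Register every plugged-in job and schedule it at its own interval.
    void start(List<MaintenanceJob> jobs)
    {
        for (MaintenanceJob job : jobs)
        {
            scheduler.scheduleWithFixedDelay(() -> {
                try
                {
                    job.run();
                    lastStatus.put(job.name(), "SUCCEEDED");
                }
                catch (Exception e)
                {
                    // A failed run is recorded but does not cancel future runs.
                    lastStatus.put(job.name(), "FAILED: " + e.getMessage());
                }
            }, 0, job.interval().toMillis(), TimeUnit.MILLISECONDS);
        }
    }

    // What an HTTP layer could expose, e.g. GET /jobs/status.
    Map<String, String> status()
    {
        return Map.copyOf(lastStatus);
    }
}

A repair, backup, or upgrade task would then just implement MaintenanceJob and register with the engine, and the HTTP interface from item 2b could expose the engine's status map.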