Re: Proposing an Apache Cassandra Management process

Jeff Jirsa Fri, 07 Sep 2018 16:18:02 -0700

How can we continue moving this forward?

Mick/Jon/TLP folks, is there a path here where we commit the
Netflix-provided management process, and you augment Reaper to work with it?
Is there a way we can make a larger umbrella that's modular that can
support either/both?
Does anyone believe there's a clear, objective argument that one is
strictly better than the other? I haven't seen one.




On Mon, Aug 20, 2018 at 4:14 PM Roopa Tangirala
<rtangir...@netflix.com.invalid> wrote:

> +1 to everything that Joey articulated with emphasis on the fact that
> contributions should be evaluated based on the merit of code and their
> value add to the whole offering. I hope it does not matter whether that
> contribution comes from PMC member or a person who is not a committer. I
> would like the process to be such that it encourages the new members to be
> a part of the community and not shy away from contributing to the code
> assuming their contributions are valued differently than committers or PMC
> members. It would be sad to see the contributions decrease if we go down
> that path.
>
> *Regards,*
>
> *Roopa Tangirala*
>
> Engineering Manager CDE
>
> *(408) 438-3156 - mobile*
>
>
>
>
>
>
> On Mon, Aug 20, 2018 at 2:58 PM Joseph Lynch <joe.e.ly...@gmail.com>
> wrote:
>
> > > We are looking to contribute Reaper to the Cassandra project.
> > >
> > Just to clarify are you proposing contributing Reaper as a project via
> > donation or you are planning on contributing the features of Reaper as
> > patches to Cassandra? If the former how far along are you on the donation
> > process? If the latter, when do you think you would have patches ready
> for
> > consideration / review?
> >
> >
> > > Looking at the patch it's very similar in its base design already, but
> > > > Reaper does have a lot more to offer. We have all been working hard to
> > move
> > > it to also being a side-car so it can be contributed. This raises a
> > number
> > > of relevant questions to this thread: would we then accept both works
> in
> > > the Cassandra project, and what burden would it put on the current PMC
> to
> > > maintain both works.
> > >
> > I would hope that we would collaborate on merging the best parts of all
> > into the official Cassandra sidecar, taking the always on, shared
> nothing,
> > highly available system that we've contributed a patchset for and adding
> in
> > many of the repair features (e.g. schedules, a nice web UI) that Reaper
> > has.
> >
> >
> > > I share Stefan's concern that consensus had not been met around a
> > > side-car, and that it was somehow default accepted before a patch
> landed.
> >
> >
> > I feel this is not correct or fair. The sidecar and repair discussions
> have
> > been anything _but_ "default accepted". The timeline of consensus
> building
> > involving the management sidecar and repair scheduling plans:
> >
> > Dec 2016: Vinay worked with Jon and Alex to try to collaborate on Reaper
> to
> > come up with design goals for a repair scheduler that could work at
> Netflix
> > scale.
> >
> > ~Feb 2017: Netflix believes that the fundamental design gaps prevented us
> > from using Reaper as it relies heavily on remote JMX connections and
> > central coordination.
> >
> > Sep. 2017: Vinay gives a lightning talk at NGCC about a highly available
> > and distributed repair scheduling sidecar/tool. He is encouraged by
> > multiple committers to build repair scheduling into the daemon itself and
> > not as a sidecar so the database is truly eventually consistent.
> >
> > ~Jun. 2017 - Feb. 2018: Based on internal need and the positive feedback
> at
> > NGCC, Vinay and myself prototype the distributed repair scheduler within
> > Priam and roll it out at Netflix scale.
> >
> > Mar. 2018: I open a Jira (CASSANDRA-14346) along with a detailed 20 page
> > design document for adding repair scheduling to the daemon itself and
> open
> > the design up for feedback from the community. We get feedback from Alex,
> > Blake, Nate, Stefan, and Mick. As far as I know there were zero proposals
> > to contribute Reaper at this point. We hear the consensus that the
> > community would prefer repair scheduling in a separate distributed
> sidecar
> > rather than in the daemon itself and we re-work the design to match this
> > consensus, re-aligning with our original proposal at NGCC.
> >
> > Apr 2018: Blake brings the discussion of repair scheduling to the dev
> list
> > (
> >
> >
> https://lists.apache.org/thread.html/760fbef677f27aa5c2ab4c375c7efeb81304fea428deff986ba1c2eb@%3Cdev.cassandra.apache.org%3E
> > ).
> > Many community members give positive feedback that we should solve it as
> > part of Cassandra and there is still no mention of contributing Reaper at
> > this point. The last message is my attempted summary giving context on
> how
> > we want to take the best of all the sidecars (OpsCenter, Priam, Reaper)
> and
> > ship them with Cassandra.
> >
> > Apr. 2018: Dinesh opens CASSANDRA-14395 along with a public design
> document
> > for gathering feedback on a general management sidecar. Sankalp and
> Dinesh
> > encourage Vinay and myself to kickstart that sidecar using the repair
> > scheduler patch
> >
> > Apr 2018: Dinesh reaches out to the dev list (
> >
> >
> https://lists.apache.org/thread.html/a098341efd8f344494bcd2761dba5125e971b59b1dd54f282ffda253@%3Cdev.cassandra.apache.org%3E
> > )
> > about the general management process to gain further feedback. All
> feedback
> > remains positive as it is a potential place for multiple community
> members
> > to contribute their various sidecar functionality.
> >
> > May-Jul 2018: Vinay and I work on creating a basic sidecar for running
> the
> > repair scheduler based on the feedback from the community in
> > CASSANDRA-14346 and CASSANDRA-14395
> >
> > Jun 2018: I bump CASSANDRA-14346 indicating we're still working on this,
> > nobody objects
> >
> > Jul 2018: Sankalp asks on the dev list if anyone has feature Jiras that
> > need review before 4.0, I mention again that we've nearly got the
> > basic sidecar and repair scheduling work done and will need help with
> > review. No one responds.
> >
> > Aug 2018: We submit a patch that brings a basic distributed sidecar and
> > robust distributed repair to Cassandra itself. Dinesh mentions that he
> will
> > try to review. Now folks appear concerned about it being in tree and
> > instead maybe it should go in a different repo altogether. I don't
> think
> > we have consensus on the repo choice yet.
> >
> > > This seems at odds when we're already struggling to keep up with the
> > > incoming patches/contributions, and there could be other git repos in
> the
> > > project we will need to support in the future too. But I'm also curious
> > > about the whole "Community over Code" angle to this, how do we
> encourage
> > > multiple external works to collaborate together building value in both
> > the
> > > technical and community.
> > >
> >
> > I viewed this management sidecar as a way for us to stop, as a community,
> > building the same thing over and over again. Netflix maintains Priam, The
> > Last Pickle maintains Reaper, DataStax maintains OpsCenter. Why can't we take
> > the best of Reaper (e.g. schedules, diagnostic events, UI) and leave the
> > worst (e.g. centralized design with lots of locking) and combine it with
> > the best of Priam (robust shared nothing sidecar that makes Cassandra
> > management easy) and leave the worst (a bunch of technical debt), and
> > iterate towards one sidecar that allows Cassandra users to just run their
> > database?
> >
> >
> > > The Reaper project has worked hard in building both its user and
> > > contributor base. And I would have thought these, including having the
> > > contributor base overlap with the C* PMC, were prerequisites before
> > moving
> > > a larger body of work into the project (separate git repo or not). I
> > guess
> > > this isn't so much "Community over Code", but it illustrates a concern
> > > regarding abandoned code when there's no existing track record of
> > > maintaining it as OSS, as opposed to expecting an existing "show, don't
> > > tell" culture. Reaper for example has stronger indicators for ongoing
> > > support and an existing OSS user base: today C* committers having
> > > contributed to Reaper are Jon, Stefan, Nate, and myself, amongst the 40
> > > contributors in total. And we've been making steps to involve it more
> > into
> > > the C* community (eg users ML), without being too presumptuous.
> >
> > I worry about this logic to be frank. Why do significant contributions
> need
> > to come only from established C* PMC members? Shouldn't we strive to
> > consider relative merits of code that has actually been submitted to the
> > project on the basis of the code and not who sent the patches?
> >
> >
> > > On the technical side: Reaper supports (or can easily) all the concerns
> > > that the proposal here raises: distributed nodetool commands,
> > centralising
> > > jmx interfacing, scheduling ops (repairs, snapshots, compactions,
> > cleanups,
> > > etc), monitoring and diagnostics, etc etc. It's designed so that it can
> > be
> > > a single instance, instance-per-datacenter, or side-car (per process).
> > When
> > > there are multiple instances in a datacenter you get HA. You have a
> > choice
> > > of different storage backends (memory, postgres, c*). You can ofc use a
> > > separate C* cluster as a backend so to separate infrastructure data
> from
> > > > production data. And it's got a UI for C* Diagnostics already (which
> > > imposes a different jmx interface of polling for events rather than
> > > subscribing to jmx notifications which we know is problematic, thanks
> to
> > > Stefan). Anyway, that's my plug for Reaper :-)
> > >
> > Could we get some of these suggestions into the
> > CASSANDRA-14346/CASSANDRA-14395 jiras and we can debate the technical
> > merits there?
> >
> > > There's been little effort in evaluating these two bodies of work, one
> > > which is largely unknown to us, and my concern is how we would fairly
> > > support both going into the future?
> > >
> >
> > > Another option would be that this side-car patch first exists as a
> github
> > > project for a period of time, on par to how Reaper has been. This will
> > help
> > > evaluate its use and to first build up its contributors. This makes it
> > > easier for the C* PMC to choose which projects it would want to
> formally
> > > maintain, and to do so based on factors beyond merits of the technical.
> > We
> > > may even see it converge (or collaborate more) with Reaper, a win for
> > > everyone.
> > >
> > We could have put our distributed repair scheduler as part of Priam ages
> > ago which would have been much easier for us and also has an existing
> > community, but we don't want to because that will encourage the community
> > to remain fractured on the most important management processes. Instead
> we
> > seek to work with the community to take the lessons learned from all the
> > various available sidecars owned by different organizations (Datastax,
> > Netflix, TLP) and fix this once for the whole community. Can we work
> > together to make Cassandra just work for our users out of the box?
> >
> > -Joey
> >
>
