I'd also like to see the end state you describe: the Reaper UI wrapping the Netflix management process with pluggable scheduling (either as Reaper does it now, or using the Netflix scheduler). But I don't think that means we need to start with Reaper - I'd personally prefer the opposite direction: starting with something small and isolated and layering on top.
-- 
Jeff Jirsa

On Sep 7, 2018, at 5:42 PM, Blake Eggleston <beggles...@apple.com> wrote:

I think we should accept the Reaper project as-is and make that the Cassandra management process 1.0, then integrate the Netflix scheduler (and other new features) into that.

The ultimate goal would be for the Netflix scheduler to become the default repair scheduler, but I think using Reaper as the starting point makes it easier to get there.

Reaper would bring a production user base that would realistically take 2-3 years to build up with a new project. As an operator, switching to a Cassandra management process that's basically a re-brand of an existing and commonly used management process isn't super risky. Asking operators to switch to a new process is a much harder sell.

On September 7, 2018 at 4:17:10 PM, Jeff Jirsa (jji...@gmail.com) wrote:

How can we continue moving this forward?

Mick/Jon/TLP folks, is there a path here where we commit the Netflix-provided management process, and you augment Reaper to work with it? Is there a way we can make a larger umbrella that's modular and can support either/both? Does anyone believe there's a clear, objective argument that one is strictly better than the other? I haven't seen one.

On Mon, Aug 20, 2018 at 4:14 PM Roopa Tangirala <rtangir...@netflix.com.invalid> wrote:

+1 to everything that Joey articulated, with emphasis on the fact that contributions should be evaluated based on the merit of the code and the value it adds to the whole offering. I hope it does not matter whether that contribution comes from a PMC member or from someone who is not a committer. I would like the process to be such that it encourages new members to be part of the community and not shy away from contributing, rather than assuming their contributions are valued differently from those of committers or PMC members. It would be sad to see contributions decrease if we go down that path.

Regards,
Roopa Tangirala
Engineering Manager CDE
(408) 438-3156 - mobile

On Mon, Aug 20, 2018 at 2:58 PM Joseph Lynch <joe.e.ly...@gmail.com> wrote:

> We are looking to contribute Reaper to the Cassandra project.

Just to clarify: are you proposing contributing Reaper as a project via donation, or are you planning on contributing the features of Reaper as patches to Cassandra? If the former, how far along are you in the donation process? If the latter, when do you think you would have patches ready for consideration / review?

> Looking at the patch it's very similar in its base design already, but
> Reaper does have a lot more to offer. We have all been working hard to
> move it to also being a side-car so it can be contributed. This raises a
> number of questions relevant to this thread: would we then accept both
> works into the Cassandra project, and what burden would that put on the
> current PMC to maintain both?

I would hope that we would collaborate on merging the best parts of all of them into the official Cassandra sidecar: taking the always-on, shared-nothing, highly available system that we've contributed a patchset for, and adding in many of the repair features (e.g. schedules, a nice web UI) that Reaper has.
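To make "shared nothing" concrete: instead of a central coordinator holding locks, every sidecar instance can compete for a short-lived lease stored in Cassandra itself, using lightweight transactions. Whoever wins the lease runs the next unit of repair work; if that node dies, the lease expires and any surviving sidecar picks the work back up. A rough sketch of the idea (the keyspace, table, and owner names here are made up for illustration; this is not the actual patch's code):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Session;

    public class RepairLeaseSketch {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1")
                    .build();
                 Session session = cluster.connect("repair_admin")) {

                // IF NOT EXISTS makes this a lightweight transaction
                // (Paxos), so at most one sidecar in the cluster acquires
                // the lease. The TTL bounds how long a dead node holds it.
                ResultSet rs = session.execute(
                    "INSERT INTO repair_lease (lease_name, owner) " +
                    "VALUES ('ring-repair', 'sidecar-a') " +
                    "IF NOT EXISTS USING TTL 300");

                if (rs.one().getBool("[applied]")) {
                    // We own the lease: run the next repair split,
                    // periodically re-writing the row to renew the TTL.
                } else {
                    // Another sidecar owns it: back off and retry later.
                }
            }
        }
    }

No locks to break and no single scheduler to fail over: the database the sidecars already manage doubles as the coordination substrate.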
> I share Stefan's concern that consensus had not been met around a
> side-car, and that it was somehow default accepted before a patch landed.

I feel this is not correct or fair. The sidecar and repair discussions have been anything _but_ "default accepted". The timeline of consensus building involving the management sidecar and repair scheduling plans:

Dec 2016: Vinay worked with Jon and Alex to try to collaborate on Reaper to come up with design goals for a repair scheduler that could work at Netflix scale.

~Feb 2017: Netflix concludes that fundamental design gaps prevent us from using Reaper, as it relies heavily on remote JMX connections and central coordination.

Sep 2017: Vinay gives a lightning talk at NGCC about a highly available and distributed repair scheduling sidecar/tool. He is encouraged by multiple committers to build repair scheduling into the daemon itself, not as a sidecar, so the database is truly eventually consistent.

~Jun 2017 - Feb 2018: Based on internal need and, later, the positive feedback at NGCC, Vinay and I prototype the distributed repair scheduler within Priam and roll it out at Netflix scale.

Mar 2018: I open a Jira (CASSANDRA-14346) along with a detailed 20-page design document for adding repair scheduling to the daemon itself, and open the design up for feedback from the community. We get feedback from Alex, Blake, Nate, Stefan, and Mick. As far as I know there were zero proposals to contribute Reaper at this point. We hear the consensus that the community would prefer repair scheduling in a separate distributed sidecar rather than in the daemon itself, and we re-work the design to match this consensus, re-aligning with our original proposal at NGCC.

Apr 2018: Blake brings the discussion of repair scheduling to the dev list (https://lists.apache.org/thread.html/760fbef677f27aa5c2ab4c375c7efeb81304fea428deff986ba1c2eb@%3Cdev.cassandra.apache.org%3E). Many community members give positive feedback that we should solve it as part of Cassandra, and there is still no mention of contributing Reaper at this point. The last message is my attempted summary giving context on how we want to take the best of all the sidecars (OpsCenter, Priam, Reaper) and ship them with Cassandra.

Apr 2018: Dinesh opens CASSANDRA-14395 along with a public design document for gathering feedback on a general management sidecar. Sankalp and Dinesh encourage Vinay and me to kickstart that sidecar using the repair scheduler patch.

Apr 2018: Dinesh reaches out to the dev list (https://lists.apache.org/thread.html/a098341efd8f344494bcd2761dba5125e971b59b1dd54f282ffda253@%3Cdev.cassandra.apache.org%3E) about the general management process to gain further feedback. All feedback remains positive, as it is a potential place for multiple community members to contribute their various sidecar functionality.

May-Jul 2018: Vinay and I work on creating a basic sidecar for running the repair scheduler, based on the feedback from the community in CASSANDRA-14346 and CASSANDRA-14395.

Jun 2018: I bump CASSANDRA-14346 indicating we're still working on this; nobody objects.

Jul 2018: Sankalp asks on the dev list if anyone has feature Jiras that need review before 4.0. I mention again that we've nearly got the basic sidecar and repair scheduling work done and will need help with review. No one responds.
Aug 2018: We submit a patch that brings a basic distributed sidecar and robust distributed repair to Cassandra itself. Dinesh mentions that he will try to review. Now folks appear concerned about it being in-tree, suggesting instead that maybe it should go in a different repo altogether. I don't think we have consensus on the repo choice yet.

> This seems at odds when we're already struggling to keep up with the
> incoming patches/contributions, and there could be other git repos in the
> project we will need to support in the future too. But I'm also curious
> about the whole "Community over Code" angle to this: how do we encourage
> multiple external works to collaborate, building value in both the
> technical and the community senses?

I viewed this management sidecar as a way for us to stop, as a community, building the same thing over and over again. Netflix maintains Priam, The Last Pickle maintains Reaper, DataStax maintains OpsCenter. Why can't we take the best of Reaper (e.g. schedules, diagnostic events, UI) and leave the worst (e.g. centralized design with lots of locking), combine it with the best of Priam (a robust shared-nothing sidecar that makes Cassandra management easy) and leave the worst (a bunch of technical debt), and iterate towards one sidecar that allows Cassandra users to just run their database?

> The Reaper project has worked hard to build both its user and contributor
> base. And I would have thought these, including having the contributor
> base overlap with the C* PMC, were prerequisites before moving a larger
> body of work into the project (separate git repo or not). I guess this
> isn't so much "Community over Code", but it illustrates a concern
> regarding abandoned code when there's no existing track record of
> maintaining it as OSS, as opposed to an existing "show, don't tell"
> culture. Reaper, for example, has stronger indicators of ongoing support
> and an existing OSS user base: today the C* committers having contributed
> to Reaper are Jon, Stefan, Nate, and myself, amongst 40 contributors in
> total. And we've been making steps to involve it more in the C* community
> (e.g. the users ML), without being too presumptuous.

I worry about this logic, to be frank. Why do significant contributions need to come only from established C* PMC members? Shouldn't we strive to consider the relative merits of code that has actually been submitted to the project on the basis of the code, and not who sent the patches?

> On the technical side: Reaper supports (or easily can) all the concerns
> that the proposal here raises: distributed nodetool commands,
> centralising the JMX interfacing, scheduling ops (repairs, snapshots,
> compactions, cleanups, etc), monitoring and diagnostics, and so on. It's
> designed so that it can be a single instance, an instance per datacenter,
> or a side-car (per process). When there are multiple instances in a
> datacenter you get HA. You have a choice of different storage backends
> (memory, postgres, C*). You can of course use a separate C* cluster as a
> backend so as to separate infrastructure data from production data. And
> it's got a UI for C* diagnostics already (which imposes a different JMX
> interface of polling for events, rather than subscribing to JMX
> notifications, which we know is problematic, thanks to Stefan). Anyway,
> that's my plug for Reaper :-)
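For anyone who hasn't run into the polling-vs-notifications distinction: the two JMX styles look roughly like this. This is only a minimal sketch against Cassandra's StorageService MBean; the polled attribute and the five-second interval are placeholders picked for illustration, not what Reaper or our patch actually does:

    import javax.management.MBeanServerConnection;
    import javax.management.Notification;
    import javax.management.NotificationListener;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class JmxStylesSketch {
        public static void main(String[] args) throws Exception {
            // 7199 is Cassandra's default JMX port.
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs =
                    connector.getMBeanServerConnection();
                ObjectName ss = new ObjectName(
                    "org.apache.cassandra.db:type=StorageService");

                // Style 1: subscribe. Push-based and low latency, but any
                // notification emitted while the JMX connection is down is
                // lost for good (the problem Stefan has documented).
                NotificationListener listener =
                    (Notification n, Object handback) ->
                        System.out.println(n.getType() + ": " + n.getMessage());
                mbs.addNotificationListener(ss, listener, null, null);

                // Style 2: poll. Higher latency and more chatter, but a
                // poll after a reconnect still observes the current state,
                // so nothing is silently dropped.
                while (true) {
                    Object mode = mbs.getAttribute(ss, "OperationMode");
                    System.out.println("operation mode: " + mode);
                    Thread.sleep(5_000);
                }
            }
        }
    }

Neither style is free: subscriptions lose events across reconnects, and polling trades latency for robustness, which is why the interface choice matters for a diagnostics UI.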
Could we get some of these suggestions into the CASSANDRA-14346/CASSANDRA-14395 Jiras so we can debate the technical merits there?

> There's been little effort in evaluating these two bodies of work, one of
> which is largely unknown to us, and my concern is how we would fairly
> support both going into the future.

> Another option would be that this side-car patch first exists as a github
> project for a period of time, on par with how Reaper has been. This would
> help evaluate its use and first build up its contributors. It would make
> it easier for the C* PMC to choose which projects it wants to formally
> maintain, and to do so based on factors beyond the technical merits. We
> may even see it converge (or collaborate more) with Reaper - a win for
> everyone.

We could have put our distributed repair scheduler into Priam ages ago, which would have been much easier for us, and Priam also has an existing community. But we don't want to, because that would encourage the community to remain fractured on the most important management processes. Instead we seek to work with the community, taking the lessons learned from all the various available sidecars owned by different organizations (DataStax, Netflix, TLP), and fix this once for the whole community. Can we work together to make Cassandra just work for our users out of the box?

-Joey