I have a few clarifications -
The scope of the management process is not simply to run repair scheduling.
Repair scheduling is one of the many features we could implement or adopt from
existing sources. So could we please split the Management Process discussion
from the repair scheduling discussion?
After re-reading the management process proposal, I see we failed to
communicate a basic idea in the document. We wanted to take a pluggable
approach to the various activities that the management process could perform.
This would accommodate different implementations of common activities such as
repair. The management process would provide the basic framework and would
ship default implementations for some of the basic activities. This would allow
for speedier iteration cycles and keep things extensible.
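To make the pluggable idea a bit more concrete, here is a very rough sketch of
what such a plug-in point might look like (all names are hypothetical and
purely illustrative, not from the proposal):

    import java.util.Map;
    import java.util.Optional;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch only: illustrates the plug-in idea, not an actual API.
    interface ManagementActivity {
        String name();                      // e.g. "repair", "backup"
        void execute(ActivityContext ctx);  // one unit of work, invoked by the framework
    }

    interface ActivityContext { }           // placeholder for node/cluster handles

    // The framework would register default implementations and let operators
    // swap in their own, e.g. a different repair implementation.
    final class ActivityRegistry {
        private final Map<String, ManagementActivity> activities = new ConcurrentHashMap<>();

        void register(ManagementActivity activity) {
            activities.put(activity.name(), activity);
        }

        Optional<ManagementActivity> lookup(String name) {
            return Optional.ofNullable(activities.get(name));
        }
    }

Under that model, the default repair implementation would just be one
registered activity among others.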
Turning to some questions that Jon and others have raised: when I +1, my
intention is to contribute fully and stay with this community. That said,
while things feel rushed to some, to me it feels like analysis paralysis. We're
looking for actionable feedback and to discuss the management process, _not_
repair scheduling solutions.
Thanks,
Dinesh



On Sep 12, 2018, at 6:24 PM, sankalp kohli <kohlisank...@gmail.com> wrote:
Here is a list of open discussion points from the voting thread. I think
some are already answered but I will still gather these questions here.

From several people:
1. Vote is rushed and we need more time for discussion.

From Sylvain
2. About the voting process...I think that was addressed by Jeff Jirsa and
deserves a separate thread as it is not directly related to this thread.
3. Does the project need a sidecar?

From Jonathan Haddad
4. Are the people voting +1 willing to contribute?

From Jonathan Ellis
5. List of feature set, maturity, maintainer availability from Reaper or
any other project being donated.

From Mick Semb Wever
6. We should not vote on these things and instead build consensus.

Open Questions from this thread
7. What technical debt are we talking about in Reaper? Can someone give
concrete examples?
8. What is the timeline for donating Reaper to Apache Cassandra?

On Wed, Sep 12, 2018 at 3:49 PM sankalp kohli <kohlisank...@gmail.com>
wrote:


(Using this thread and not the vote thread intentionally)
For folks saying the vote is being rushed: I would use the email from
Joseph below to show this is not rushed. There was no email on this thread
for four months until I pinged.


Dec 2016: Vinay worked with Jon and Alex to try to collaborate on Reaper to
come up with design goals for a repair scheduler that could work at Netflix
scale.

~Feb 2017: Netflix concludes that fundamental design gaps prevent us
from using Reaper, as it relies heavily on remote JMX connections and
central coordination.

Sep. 2017: Vinay gives a lightning talk at NGCC about a highly available
and distributed repair scheduling sidecar/tool. He is encouraged by
multiple committers to build repair scheduling into the daemon itself and
not as a sidecar so the database is truly eventually consistent.

~Jun. 2017 - Feb. 2018: Based on internal need and the positive feedback at
NGCC, Vinay and I prototype the distributed repair scheduler within
Priam and roll it out at Netflix scale.

Mar. 2018: I open a Jira (CASSANDRA-14346) along with a detailed 20 page
design document for adding repair scheduling to the daemon itself and open
the design up for feedback from the community. We get feedback from Alex,
Blake, Nate, Stefan, and Mick. As far as I know there were zero proposals
to contribute Reaper at this point. We hear the consensus that the
community would prefer repair scheduling in a separate distributed sidecar
rather than in the daemon itself and we re-work the design to match this
consensus, re-aligning with our original proposal at NGCC.

Apr 2018: Blake brings the discussion of repair scheduling to the dev list
(

https://lists.apache.org/thread.html/760fbef677f27aa5c2ab4c375c7efeb81304fea428deff986ba1c2eb@%3Cdev.cassandra.apache.org%3E
).
Many community members give positive feedback that we should solve it as
part of Cassandra and there is still no mention of contributing Reaper at
this point. The last message is my attempted summary giving context on how
we want to take the best of all the sidecars (OpsCenter, Priam, Reaper) and
ship them with Cassandra.

Apr. 2018: Dinesh opens CASSANDRA-14395 along with a public design document
for gathering feedback on a general management sidecar. Sankalp and Dinesh
encourage Vinay and me to kickstart that sidecar using the repair
scheduler patch

Apr 2018: Dinesh reaches out to the dev list (

https://lists.apache.org/thread.html/a098341efd8f344494bcd2761dba5125e971b59b1dd54f282ffda253@%3Cdev.cassandra.apache.org%3E
)
about the general management process to gain further feedback. All feedback
remains positive as it is a potential place for multiple community members
to contribute their various sidecar functionality.

May-Jul 2018: Vinay and I work on creating a basic sidecar for running the
repair scheduler based on the feedback from the community in
CASSANDRA-14346 and CASSANDRA-14395

Jun 2018: I bump CASSANDRA-14346 indicating we're still working on this,
nobody objects

Jul 2018: Sankalp asks on the dev list if anyone has feature Jiras they
need reviewed before 4.0. I mention again that we've nearly got the
basic sidecar and repair scheduling work done and will need help with
review. No one responds.

Aug 2018: We submit a patch that brings a basic distributed sidecar and
robust distributed repair to Cassandra itself. Dinesh mentions that he will
try to review. Now folks appear concerned about it being in-tree and
suggest it should instead go in a different repo altogether. I don't think
we have consensus on the repo choice yet.

On Sun, Sep 9, 2018 at 9:13 AM sankalp kohli <kohlisank...@gmail.com>
wrote:


I agree with Jon, and I think folks who are talking about tech debt in
Reaper should elaborate with concrete examples of that debt. Can we be
more precise and list it out? Right now I see it spread out over this
long email thread!

On Sun, Sep 9, 2018 at 6:29 AM Elliott Sims <elli...@backblaze.com>
wrote:


A big one to add to your list there, IMO as a user:
* API for determining detailed repair state (and history?). Essentially,
something beyond just "Is some sort of repair running?" so that tools like
Reaper can parallelize better.
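To illustrate what I mean (purely hypothetical names, not an existing API),
the kind of per-range detail a sidecar could expose might look like:

    import java.time.Instant;
    import java.util.List;

    // Hypothetical sketch of a detailed repair-state payload, so tools can see
    // more than "a repair is running somewhere".
    interface RepairRangeStatus {
        String keyspace();
        String tokenRangeStart();
        String tokenRangeEnd();
        State state();                  // current state of this range's repair
        Instant lastCompletedAt();      // history: when this range last repaired cleanly
        List<String> participatingNodes();

        enum State { PENDING, RUNNING, SUCCEEDED, FAILED }
    }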

On Sun, Sep 9, 2018 at 8:30 AM, Stefan Podkowinski <s...@apache.org>
wrote:


Does it have to be a single project with functionality provided by
multiple plugins? Designing a plugin API at this point seems to be a bit
early and comes with additional complexity around managing plugins in
general.

I was thinking more in the direction of: "what can we do to enable
people to create any kind of side car or tooling solution?". Things
like:


Common cluster discovery and management API (a rough interface sketch
follows at the end of this list):
* Detect local Cassandra processes
* Discover and receive events on cluster topology
* Get assigned tokens for nodes
* Read node configuration
* Health checks (as already proposed)

Any side cars should be easy to install on nodes that already run Cassandra:
* Scripts for packaging (tar, deb, rpm)
* Templates for systemd support, optionally with an auto-startup dependency
on the Cassandra main process

Integration testing:
* Provide a basic testing framework for mocking cluster state and messages

Support for other languages / avoid having to use JMX:
* JMX bridge (HTTP? gRPC?, already implemented in #14346?)
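To make the first bullet group concrete, here is a very rough sketch of the
kind of shared library API I have in mind (all names purely illustrative,
nothing here exists today):

    import java.util.List;
    import java.util.function.Consumer;

    // Illustrative sketch only: placeholders for whatever a shared tooling
    // library would actually expose.
    interface ClusterDiscovery {
        boolean isLocalCassandraRunning();                  // detect local Cassandra processes
        void onTopologyChange(Consumer<TopologyEvent> cb);  // receive topology events
        List<String> assignedTokens(String nodeAddress);    // tokens assigned to a node
        NodeConfig readNodeConfig();                        // parsed node configuration
        HealthStatus healthCheck();                         // health check, as already proposed
    }

    interface TopologyEvent { }
    interface NodeConfig { }
    enum HealthStatus { UP, DEGRADED, DOWN }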

Obviously the whole side car discussion is not moving in a direction
everyone's happy with. Would it be an option to take a step back and
start implementing such a tooling framework, with scripts and libraries
for the features described above, as a small GitHub project, instead of
putting an existing side-car solution up for vote? If that works and
we get people collaborating on code shared between existing side-cars,
then we could take the next step and think about either revisiting the
"official Cassandra side-car" topic, or adding the created client
tooling framework as an official sub-project of the Cassandra project
(maybe via the Apache Incubator).


On 08.09.18 02:49, Joseph Lynch wrote:

On Fri, Sep 7, 2018 at 5:03 PM Jonathan Haddad <j...@jonhaddad.com> wrote:

We haven’t even defined any requirements for an admin tool. It’s hard to
make a case for anything without agreement on what we’re trying to build.

We were/are trying to sketch out scope/requirements in the #14395 and
#14346 tickets as well as their associated design documents. I think
the general proposed direction is a distributed 1:1 management sidecar
process similar in architecture to Netflix's Priam, except explicitly
built to be general and pluggable by anyone rather than tightly
coupled to AWS.

Dinesh, Vinay and I were aiming for a small amount of scope at first,
taking an iterative approach with just enough upfront design but not
so much that we are unable to make any progress at all. For example,
maybe something like:

1. Get a super simple and non-controversial sidecar process that ships
with Cassandra and exposes a lightweight HTTP interface to e.g. some
basic JMX endpoints (a rough sketch of this step follows after this list)
2a. Add a pluggable execution engine for cron/oneshot/scheduled jobs
with the basic interfaces and state store and such
2b. Start scoping and implementing the full HTTP interface, e.g.
backup status, cluster health status, etc ...
3a. Start integrating implementations of the jobs from 2a such as
snapshot, backup, cluster restart, daemon + sstable upgrade, repair,
etc
3b. Start integrating UI components that pair with the HTTP interface
from 2b
4. ?? Perhaps start unlocking next generation operations like moving
"background" activities like compaction, streaming, repair etc into
one or more sidecar-contained processes to ensure the main daemon only
handles read+write requests
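
To give a feel for how small step 1 could be, here is a rough, purely
illustrative sketch of a sidecar endpoint proxying a single JMX attribute
(port, path, and error handling are placeholders, not a proposed design):

    import com.sun.net.httpserver.HttpServer;

    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;

    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Illustrative only: a tiny HTTP endpoint that proxies one JMX attribute.
    public class MinimalSidecar {
        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
            server.createContext("/operation-mode", exchange -> {
                String mode;
                // Default Cassandra JMX port is 7199.
                JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
                try (JMXConnector jmx = JMXConnectorFactory.connect(url)) {
                    // StorageService reports e.g. NORMAL / JOINING / LEAVING.
                    mode = (String) jmx.getMBeanServerConnection().getAttribute(
                        new ObjectName("org.apache.cassandra.db:type=StorageService"),
                        "OperationMode");
                } catch (Exception e) {
                    mode = "UNKNOWN: " + e.getMessage();
                }
                byte[] body = mode.getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                exchange.getResponseBody().write(body);
                exchange.close();
            });
            server.start();
        }
    }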

There are going to be a lot of questions to answer, and I think trying
to answer them all up front will mean that we get nowhere or make
unfortunate compromises that cripple the project from the start. If
people think we need to do more design and discussion than we have
been doing then we can spend more time on the design, but personally
I'd rather start iterating on code and prove value incrementally. If
it doesn't work out we won't release it GA to the community ...

-Joey
