Re: [DISCUSS] CEP 53: Cassandra Rolling Restarts via Sidecar

Jaydeep Chovatia Sun, 07 Sep 2025 08:15:52 -0700

Sorry if this was already discussed and I missed it — I had a quick
question on the health check scope.


The design seems focused on cluster/availability signals (ring stable,
peers up), which is a great start, but doesn’t mention pluggable workload
signals like:  1) compaction load (nodetool compactionstats)  2) netstats
activity (nodetool netstats) 3) hints backlog / streaming  pending flushes
or memtable pressure.
Since restarting during heavy compaction/hints can add risk, are these
kinds of workload-aware checks in scope for the MVP, or considered future
work?

In some cases, even if the CQL port is up, operators may want to add an
additional delay (e.g. 5 minutes) before proceeding to the next batch of
nodes. Would it make sense to support this as a configurable option, or via
some hook mechanism, so that operators can insert a pause between batches
if desired?

Jaydeep

On Thu, Sep 4, 2025 at 9:44 AM Jindal, Himanshu <[email protected]> wrote:

> Hi Andres,
>
>
>
> This looks like a great CEP. Having official, source-controlled code
> within Cassandra (or a sidecar in this case) to handle common operator
> actions would centralize best practices and make the operator experience
> smoother—especially for users who may not have deep Cassandra expertise.
>
>
>
> A couple of questions:
>
>    1. Have we considered introducing the concept of a *datacenter*
>    alongside *cluster*? I imagine there will be cases where a user wants
>    to perform a rolling restart on a single datacenter rather than across the
>    entire cluster.
>    2. Do we see this framework extending to other cluster- or
>    datacenter-wide operations, such as scale-up/scale-down operations, or
>    backups/restores, or nodetool rebuilds run as part of adding a new
>    datacenter?
>
>
>
> Best,
> Himanshu
>
>
>
>
>
> *From: *Andrés Beck-Ruiz <[email protected]>
> *Reply-To: *"[email protected]" <[email protected]>
> *Date: *Tuesday, September 2, 2025 at 11:58 AM
> *To: *"[email protected]" <[email protected]>
> *Subject: *RE: [EXTERNAL] [DISCUSS] CEP 53: Cassandra Rolling Restarts
> via Sidecar
>
>
>
> *CAUTION*: This email originated from outside of the organization. Do not
> click links or open attachments unless you can confirm the sender and know
> the content is safe.
>
>
>
> Thanks everyone for the feedback. +1 to using the term 'cluster-wide
> operations'.
>
> > The only suggestion I have is to keep in mind the pluggability aspect of
> > Sidecar. For example, for the Distributed Restart portion of the work, we
> > should consider making interfaces that would allow us to potentially move
> > the responsibility of keeping the state outside of Cassandra.
>
> Are you referring to tracking the state of a restart job (and cluster-wide
> operations in general) outside of sidecar_internal Cassandra tables?
>
> > What do you think about broadening the scope of the CEP to propose a way
> (API) to perform bulk operations, and propose the current Rolling restarts
> as the first implementation for that bulk operations API? I’m proposing
> this as I see value to reuse this proposal for other bulk operations such
> as enabling CDC (it requires enabling cdc on cassandra.yml and some other
> > operations) for better supporting CEP-44.
>
> We propose a way to persist and monitor cluster-wide operations in the new
> sidecar_internal system tables. (
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar#CEP53:CassandraRollingRestartsviaSidecar-CassandraSidecarSystemTables).
> I think it would make sense to also generalize the API to apply to
> cluster-wide operations. I'm curious about any feedback on whether this
> should be a separate API from the current operational job framework and
> live under the /cluster resource. We've discussed why we didn't propose to
> use the existing API and how the current framework would need to be
> extended here (
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar#CEP53:CassandraRollingRestartsviaSidecar-OperationalJobFramework
> ).
>
> > I’m not quite sold on using a PATCH to move from pending state to
> running state. Quick question, what is the goal of the pending state? I see
> a PATCH operation as modifying part of an object data. In this case,
> modifying the state looks like a change on the operation state, not on its
> metadata. I’d love to hear your thoughts on this one.
>
> The "PENDING" state allows for an operator to double check a submitted
> cluster-wide operation, which could have unintended consequences, before
> starting it. For example, performing a rolling restart could prevent other
> operations on the cluster that might be scheduled or needed, such as
> replacing a Cassandra instance. While an operator should be able to abort a
> restart job, I see value in having this guard against operator error.
>
> Given that we are applying a partial update to the resource, which in this
> context would be the restart job, we chose PATCH for this API.
>
> Best,
> Andrés
>
>
>
> On Tue, Sep 2, 2025 at 12:33 PM Dinesh Joshi <[email protected]> wrote:
>
> I would like to chime in and say that we need to refine our vocabulary.
> The term 'bulk commands' was used originally in CEP-1. This is my fault
> totally as I originally wrote that down. But over time it has caused
> confusion. I believe 'cluster-wide operations' is a better term to describe
> those operations. We have also used 'Bulk' in the context of CEP-28 which
> means something rather different which leads to confusion. So I propose
> using the term 'cluster-wide operations' for operations that have to be run
> across all nodes in the cluster.
>
>
>
> Thanks,
>
>
>
> Dinesh
>
>
>
>
>
> On Tue, Sep 2, 2025 at 9:21 AM Bernardo Botella <
> [email protected]> wrote:
>
> This is an incredible contribution. Thanks a lot!
>
> Now, let me throw some thoughts :-)
>
> Rolling restarts is a great example of a broader feature that could be
> seen as bulk operations on a cluster via Sidecar.
>
> What do you think about broadening the scope of the CEP to propose a way
> (API) to perform bulk operations, and propose the current Rolling restarts
> as the first implementation for that bulk operations API? I’m proposing
> this as I see value to reuse this proposal for other bulk operations such
> as enabling CDC (it requires enabling cdc on cassandra.yml and some other
> operations) for better supporting CEP-44.
>
> I’m not quite sold on using a PATCH to move from pending state to running
> state. Quick question, what is the goal of the pending state? I see a PATCH
> operation as modifying part of an object data. In this case, modifying the
> state looks like a change on the operation state, not on its metadata. I’d
> love to hear your thoughts on this one.
>
> Again, thanks a lot for the contribution!
> Bernardo
>
>
> > On Aug 30, 2025, at 7:02 AM, Francisco Guerrero <[email protected]>
> wrote:
> >
> > Thanks Andrés for the CEP. This is a great contribution to the project
> and
> > aligns with the original intent of the Sidecar stated in CEP-1. I've gone
> > over the CEP details and it is consistent with the internals of Sidecar.
> >
> > The only suggestion I have is to keep in mind the pluggability aspect of
> > Sidecar. For example, for the Distributed Restart portion of the work, we
> > should consider making interfaces that would allow us to potentially move
> > the responsibility of keeping the state outside of Cassandra.
> >
> > Best,
> > - Francisco
> >
> > On 2025/08/29 19:56:08 Andrés Beck-Ruiz wrote:
> >> Hello everyone,
> >>
> >> We would like to propose CEP 53: Cassandra Rolling Restarts via Sidecar
> (
> >>
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar
> >> )
> >>
> >> This CEP builds off of CEP-1
> >> <
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-1%3A+Apache+Cassandra+Management+Process%28es%29+-+Deprecated
> >
> >> and proposes a design for safe, efficient, and operator friendly rolling
> >> restarts on Cassandra clusters, as well as an extensible approach for
> >> persisting future cluster-wide operations in Cassandra Sidecar. We hope
> to
> >> leverage this infrastructure in the future to implement upgrade
> automation.
> >>
> >> We welcome all feedback and discussion. Thank you in advance for your
> time
> >> and consideration of this proposal!
> >>
> >> Best,
> >> Andrés and Paulo
> >>
>
>

Re: [DISCUSS] CEP 53: Cassandra Rolling Restarts via Sidecar

Reply via email to