Re: [DISCUSS] CEP 53: Cassandra Rolling Restarts via Sidecar

Andrés Beck-Ruiz Sun, 07 Sep 2025 18:37:51 -0700

Hello all,

Thanks for the feedback. I agree with the suggestions that operation state
storage should be pluggable, with an initial implementation leveraging
Cassandra as we have proposed. I have made edits to the Distributed Restart
Design
<https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar#CEP53:CassandraRollingRestartsviaSidecar-DistributedRestartDesign>
section in the CEP to reflect this.
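
To illustrate what "pluggable" could mean here, below is a minimal sketch in
Python. It is not taken from the CEP or from Sidecar: the interface name
OperationStateStore, its two methods, the in-memory backend, and the state
string "RESTARTING" are all assumptions made for the example. A
Cassandra-backed implementation would satisfy the same interface.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Tuple


class OperationStateStore(ABC):
    """Hypothetical storage interface (an assumption, not the CEP's API)."""

    @abstractmethod
    def put_node_state(self, job_id: str, node: str, state: str) -> None:
        """Record the state of one node within one job."""

    @abstractmethod
    def get_node_state(self, job_id: str, node: str) -> Optional[str]:
        """Return the recorded state, or None if nothing was recorded."""


class InMemoryStateStore(OperationStateStore):
    """Stand-in backend for illustration only. A Cassandra-backed store
    would implement the same two methods against a table."""

    def __init__(self) -> None:
        self._states: Dict[Tuple[str, str], str] = {}

    def put_node_state(self, job_id: str, node: str, state: str) -> None:
        self._states[(job_id, node)] = state

    def get_node_state(self, job_id: str, node: str) -> Optional[str]:
        return self._states.get((job_id, node))


# Callers depend only on the interface, so the backend can be swapped.
store: OperationStateStore = InMemoryStateStore()
store.put_node_state("job-1", "10.0.0.1", "RESTARTING")
```

The point of the sketch is only that job logic talks to the interface, so the
initial Cassandra implementation could later be replaced without touching it.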

> As for the API, I think the question that needs to be answered is if it is
> worthwhile to have a distinction between single-node operations and
> cluster-wide operations. For example, if I wanted to restart a single node
> using the API proposed in CEP-53, I could submit a restart job with a
> single node in the “nodes” list. This provides API simplicity at the cost
> of ergonomics. It also means that all inter-sidecar communication would go
> through the proposed cluster_ops_node_state table. Personally, I think
> these are acceptable tradeoffs to provide a unified API for operations that
> is simpler for a user or operator to use and learn.

I agree that we should provide a unified API that does not distinguish
between single-node and cluster-wide operations. I think the benefit of API
simplicity from a development and client perspective outweighs the cost of
ergonomics.

> A small question from my side: I see that the underlying assumption is that
> Sidecar is able to query Cassandra instances before bouncing/recognizing
> the bounce. What if it could not communicate with the Cassandra instance
> (e.g., binary protocol disabled, C* process experiencing issues, or C*
> process starting as part of a new DC)?

This would fall under scenario #2 in the Error Handling
<https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar#CEP53:CassandraRollingRestartsviaSidecar-ErrorHandling>
section of the CEP. If a Sidecar instance still can’t communicate with
Cassandra after a configurable timeout and number of retries, the Sidecar
instance should mark the job as failed.
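
As a rough sketch of that behavior in Python (not Sidecar code): the function
name cassandra_reachable, its parameters, and the job status strings are
hypothetical, and I am assuming the timeout is an overall deadline that
applies alongside the retry limit. The probe stands in for whatever check
Sidecar would use to reach the Cassandra instance.

```python
import time
from typing import Callable


def cassandra_reachable(probe: Callable[[], bool],
                        max_retries: int,
                        timeout_seconds: float,
                        retry_delay_seconds: float = 0.0) -> bool:
    """Call probe() until it succeeds, retries run out, or the deadline passes."""
    deadline = time.monotonic() + timeout_seconds
    for _ in range(max_retries):
        if time.monotonic() >= deadline:
            break
        try:
            if probe():
                return True
        except ConnectionError:
            pass  # an unreachable instance counts as a failed attempt
        time.sleep(retry_delay_seconds)
    return False


def always_down() -> bool:
    """Simulates an instance that cannot be contacted at all."""
    raise ConnectionError("cannot reach Cassandra")


attempts = []


def up_on_second_try() -> bool:
    """Simulates an instance that answers on the second attempt."""
    attempts.append(1)
    return len(attempts) >= 2


# Hypothetical status names: the job fails only once the limits are exhausted.
status_down = "RUNNING" if cassandra_reachable(always_down, 3, 1.0) else "FAILED"
status_flaky = "RUNNING" if cassandra_reachable(up_on_second_try, 3, 1.0) else "FAILED"
```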

> 1. Have we considered introducing the concept of a datacenter alongside
> cluster? I imagine there will be cases where a user wants to perform a
> rolling restart on a single datacenter rather than across the entire
> cluster.

I think this could be added in the future, but for this initial
implementation an operator would restart a datacenter by submitting the
list of nodes that belong to it. I prefer providing a unified API that can
handle single-node and cluster-wide (or datacenter-wide) operations over
separate APIs, which might be easier to use in isolation but would
complicate development and discoverability.
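
A minimal Python sketch of how a client could build such a request. Only the
"nodes" list comes from the discussion above; the "operation" field, the
helper name, and the topology mapping are hypothetical and used purely for
illustration.

```python
from typing import Dict


def restart_job_for_datacenter(node_to_dc: Dict[str, str], datacenter: str) -> dict:
    """Build a restart job covering every node in one datacenter.

    node_to_dc maps a node address to its datacenter name. The payload shape
    is an assumption, apart from the "nodes" list.
    """
    nodes = sorted(node for node, dc in node_to_dc.items() if dc == datacenter)
    if not nodes:
        raise ValueError(f"no nodes found in datacenter {datacenter!r}")
    return {"operation": "restart", "nodes": nodes}


# Example topology (made up for the sketch).
topology = {"10.0.0.1": "dc1", "10.0.0.2": "dc1", "10.0.1.1": "dc2"}

# A datacenter-wide restart is a job that lists that datacenter's nodes.
dc1_job = restart_job_for_datacenter(topology, "dc1")

# A single-node restart uses the same shape with a one-element list.
single_node_job = {"operation": "restart", "nodes": ["10.0.1.1"]}
```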

> 2. Do we see this framework extending to other cluster- or datacenter-wide
> operations, such as scale-up/scale-down operations, or backups/restores,
> or nodetool rebuilds run as part of adding a new datacenter?

Yes, our goal is for this design to be extensible to future operations, as
well as to operations that Sidecar already supports (such as node
decommissions). In the initial Cassandra storage implementation, all
inter-sidecar communication and operation tracking could occur in the
proposed cluster_ops_node_state table.

> The design seems focused on cluster/availability signals (ring stable,
> peers up), which is a great start, but doesn’t mention pluggable workload
> signals like: 1) compaction load (nodetool compactionstats) 2) netstats
> activity (nodetool netstats) 3) hints backlog / streaming pending flushes
> or memtable pressure.
> Since restarting during heavy compaction/hints can add risk, are these
> kinds of workload-aware checks in scope for the MVP, or considered future
> work?

I agree that the health check should be pluggable as well; this was also
proposed in CEP-1
<https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95652224#CEP1:ApacheCassandraManagementProcess(es)Deprecated-ProposedScope.2>.
For the first iteration of rolling restarts, we are thinking of providing a
health check implementation that verifies that all other Cassandra peers
are up; future work can add more robust health checks.
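
A minimal Python sketch of what a pluggable health check with a "peers up"
implementation could look like. The HealthCheck interface, the
is_safe_to_restart method, and the peer_status mapping are assumptions for
illustration, not the CEP's or Sidecar's actual API.

```python
from abc import ABC, abstractmethod
from typing import Dict


class HealthCheck(ABC):
    """Hypothetical plug-in point for deciding whether a node may restart."""

    @abstractmethod
    def is_safe_to_restart(self, node: str, peer_status: Dict[str, bool]) -> bool:
        """peer_status maps each node address to True if that node is up."""


class AllPeersUpCheck(HealthCheck):
    """First-iteration style check: every node other than the target is up."""

    def is_safe_to_restart(self, node: str, peer_status: Dict[str, bool]) -> bool:
        return all(up for peer, up in peer_status.items() if peer != node)


check: HealthCheck = AllPeersUpCheck()

# Made-up cluster views for the sketch.
healthy = {"10.0.0.1": True, "10.0.0.2": True, "10.0.0.3": True}
degraded = {"10.0.0.1": True, "10.0.0.2": False, "10.0.0.3": True}

ok_when_healthy = check.is_safe_to_restart("10.0.0.1", healthy)
ok_when_degraded = check.is_safe_to_restart("10.0.0.1", degraded)
```

A workload-aware check (compaction load, hints backlog, and so on) would be
another implementation of the same interface.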

Best,
Andrés

On Fri, Aug 29, 2025 at 3:56 PM Andrés Beck-Ruiz <[email protected]>
wrote:

> Hello everyone,
>
> We would like to propose CEP 53: Cassandra Rolling Restarts via Sidecar (
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar
> )
>
> This CEP builds off of CEP-1
> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-1%3A+Apache+Cassandra+Management+Process%28es%29+-+Deprecated>
> and proposes a design for safe, efficient, and operator friendly rolling
> restarts on Cassandra clusters, as well as an extensible approach for
> persisting future cluster-wide operations in Cassandra Sidecar. We hope to
> leverage this infrastructure in the future to implement upgrade automation.
>
> We welcome all feedback and discussion. Thank you in advance for your time
> and consideration of this proposal!
>
> Best,
> Andrés and Paulo
>
