Sorry if this was already discussed and I missed it — I had a quick question on the health check scope.
The design seems focused on cluster/availability signals (ring stable, peers up), which is a great start, but doesn’t mention pluggable workload signals like: 1) compaction load (nodetool compactionstats) 2) netstats activity (nodetool netstats) 3) hints backlog / streaming pending flushes or memtable pressure. Since restarting during heavy compaction/hints can add risk, are these kinds of workload-aware checks in scope for the MVP, or considered future work? In some cases, even if the CQL port is up, operators may want to add an additional delay (e.g. 5 minutes) before proceeding to the next batch of nodes. Would it make sense to support this as a configurable option, or via some hook mechanism, so that operators can insert a pause between batches if desired? Jaydeep On Thu, Sep 4, 2025 at 9:44 AM Jindal, Himanshu <[email protected]> wrote: > Hi Andres, > > > > This looks like a great CEP. Having official, source-controlled code > within Cassandra (or a sidecar in this case) to handle common operator > actions would centralize best practices and make the operator experience > smoother—especially for users who may not have deep Cassandra expertise. > > > > A couple of questions: > > 1. Have we considered introducing the concept of a *datacenter* > alongside *cluster*? I imagine there will be cases where a user wants > to perform a rolling restart on a single datacenter rather than across the > entire cluster. > 2. Do we see this framework extending to other cluster- or > datacenter-wide operations, such as scale-up/scale-down operations, or > backups/restores, or nodetool rebuilds run as part of adding a new > datacenter? > > > > Best, > Himanshu > > > > > > *From: *Andrés Beck-Ruiz <[email protected]> > *Reply-To: *"[email protected]" <[email protected]> > *Date: *Tuesday, September 2, 2025 at 11:58 AM > *To: *"[email protected]" <[email protected]> > *Subject: *RE: [EXTERNAL] [DISCUSS] CEP 53: Cassandra Rolling Restarts > via Sidecar > > > > *CAUTION*: This email originated from outside of the organization. Do not > click links or open attachments unless you can confirm the sender and know > the content is safe. > > > > Thanks everyone for the feedback. +1 to using the term 'cluster-wide > operations'. > > > The only suggestion I have is to keep in mind the pluggability aspect of > > Sidecar. For example, for the Distributed Restart portion of the work, we > > should consider making interfaces that would allow us to potentially move > > the responsibility of keeping the state outside of Cassandra. > > Are you referring to tracking the state of a restart job (and cluster-wide > operations in general) outside of sidecar_internal Cassandra tables? > > > What do you think about broadening the scope of the CEP to propose a way > (API) to perform bulk operations, and propose the current Rolling restarts > as the first implementation for that bulk operations API? I’m proposing > this as I see value to reuse this proposal for other bulk operations such > as enabling CDC (it requires enabling cdc on cassandra.yml and some other > > operations) for better supporting CEP-44. > > We propose a way to persist and monitor cluster-wide operations in the new > sidecar_internal system tables. ( > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar#CEP53:CassandraRollingRestartsviaSidecar-CassandraSidecarSystemTables). > I think it would make sense to also generalize the API to apply to > cluster-wide operations. I'm curious about any feedback on whether this > should be a separate API from the current operational job framework and > live under the /cluster resource. We've discussed why we didn't propose to > use the existing API and how the current framework would need to be > extended here ( > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar#CEP53:CassandraRollingRestartsviaSidecar-OperationalJobFramework > ). > > > I’m not quite sold on using a PATCH to move from pending state to > running state. Quick question, what is the goal of the pending state? I see > a PATCH operation as modifying part of an object data. In this case, > modifying the state looks like a change on the operation state, not on its > metadata. I’d love to hear your thoughts on this one. > > The "PENDING" state allows for an operator to double check a submitted > cluster-wide operation, which could have unintended consequences, before > starting it. For example, performing a rolling restart could prevent other > operations on the cluster that might be scheduled or needed, such as > replacing a Cassandra instance. While an operator should be able to abort a > restart job, I see value in having this guard against operator error. > > Given that we are applying a partial update to the resource, which in this > context would be the restart job, we chose PATCH for this API. > > Best, > Andrés > > > > On Tue, Sep 2, 2025 at 12:33 PM Dinesh Joshi <[email protected]> wrote: > > I would like to chime in and say that we need to refine our vocabulary. > The term 'bulk commands' was used originally in CEP-1. This is my fault > totally as I originally wrote that down. But over time it has caused > confusion. I believe 'cluster-wide operations' is a better term to describe > those operations. We have also used 'Bulk' in the context of CEP-28 which > means something rather different which leads to confusion. So I propose > using the term 'cluster-wide operations' for operations that have to be run > across all nodes in the cluster. > > > > Thanks, > > > > Dinesh > > > > > > On Tue, Sep 2, 2025 at 9:21 AM Bernardo Botella < > [email protected]> wrote: > > This is an incredible contribution. Thanks a lot! > > Now, let me throw some thoughts :-) > > Rolling restarts is a great example of a broader feature that could be > seen as bulk operations on a cluster via Sidecar. > > What do you think about broadening the scope of the CEP to propose a way > (API) to perform bulk operations, and propose the current Rolling restarts > as the first implementation for that bulk operations API? I’m proposing > this as I see value to reuse this proposal for other bulk operations such > as enabling CDC (it requires enabling cdc on cassandra.yml and some other > operations) for better supporting CEP-44. > > I’m not quite sold on using a PATCH to move from pending state to running > state. Quick question, what is the goal of the pending state? I see a PATCH > operation as modifying part of an object data. In this case, modifying the > state looks like a change on the operation state, not on its metadata. I’d > love to hear your thoughts on this one. > > Again, thanks a lot for the contribution! > Bernardo > > > > On Aug 30, 2025, at 7:02 AM, Francisco Guerrero <[email protected]> > wrote: > > > > Thanks Andrés for the CEP. This is a great contribution to the project > and > > aligns with the original intent of the Sidecar stated in CEP-1. I've gone > > over the CEP details and it is consistent with the internals of Sidecar. > > > > The only suggestion I have is to keep in mind the pluggability aspect of > > Sidecar. For example, for the Distributed Restart portion of the work, we > > should consider making interfaces that would allow us to potentially move > > the responsibility of keeping the state outside of Cassandra. > > > > Best, > > - Francisco > > > > On 2025/08/29 19:56:08 Andrés Beck-Ruiz wrote: > >> Hello everyone, > >> > >> We would like to propose CEP 53: Cassandra Rolling Restarts via Sidecar > ( > >> > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar > >> ) > >> > >> This CEP builds off of CEP-1 > >> < > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-1%3A+Apache+Cassandra+Management+Process%28es%29+-+Deprecated > > > >> and proposes a design for safe, efficient, and operator friendly rolling > >> restarts on Cassandra clusters, as well as an extensible approach for > >> persisting future cluster-wide operations in Cassandra Sidecar. We hope > to > >> leverage this infrastructure in the future to implement upgrade > automation. > >> > >> We welcome all feedback and discussion. Thank you in advance for your > time > >> and consideration of this proposal! > >> > >> Best, > >> Andrés and Paulo > >> > >
