Thank you for the CEP and for getting this discussion started, Andrés! I also think it makes sense to use the term "cluster-wide operations" to refer to these types of orchestrated operations. Another option could be “multi-node operations”, in contrast with “single-node operations”, since an operation might run on a subset of nodes rather than the entire cluster. In addition to CEP-44 and rolling restarts, there are other cases such as config changes (https://issues.apache.org/jira/browse/CASSSIDECAR-275) and version upgrades (https://issues.apache.org/jira/browse/CASSSIDECAR-276) which could also build off of these APIs.
On the question of reusing the OperationalJob interface: from an implementation perspective, it seems generic enough to reuse for cluster-wide operations. However, it would need to be extended with the fields required for cluster-wide jobs (for example, the nodes involved in the operation, parallelism across racks, etc.).

As for the API, I think the question that needs to be answered is whether it is worthwhile to distinguish between single-node operations and cluster-wide operations. For example, if I wanted to restart a single node using the API proposed in CEP-53, I could submit a restart job with a single node in the “nodes” list. This provides API simplicity at the cost of ergonomics. It also means that all inter-sidecar communication would go through the proposed cluster_ops_node_state table. Personally, I think these are acceptable tradeoffs to provide a unified API for operations that is simpler for a user or operator to use and learn.

Isaac

On Tue, Sep 2, 2025 at 2:58 PM Andrés Beck-Ruiz <[email protected]> wrote:

> Thanks everyone for the feedback. +1 to using the term 'cluster-wide
> operations'.
>
> > The only suggestion I have is to keep in mind the pluggability aspect of
> > Sidecar. For example, for the Distributed Restart portion of the work, we
> > should consider making interfaces that would allow us to potentially move
> > the responsibility of keeping the state outside of Cassandra.
>
> Are you referring to tracking the state of a restart job (and cluster-wide
> operations in general) outside of sidecar_internal Cassandra tables?
>
> > What do you think about broadening the scope of the CEP to propose a way
> > (API) to perform bulk operations, and propose the current Rolling restarts
> > as the first implementation for that bulk operations API? I’m proposing
> > this as I see value to reuse this proposal for other bulk operations such
> > as enabling CDC (it requires enabling cdc on cassandra.yml and some other
> > operations) for better supporting CEP-44.
>
> We propose a way to persist and monitor cluster-wide operations in the new
> sidecar_internal system tables. (
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar#CEP53:CassandraRollingRestartsviaSidecar-CassandraSidecarSystemTables).
> I think it would make sense to also generalize the API to apply to
> cluster-wide operations. I'm curious about any feedback on whether this
> should be a separate API from the current operational job framework and
> live under the /cluster resource. We've discussed why we didn't propose to
> use the existing API and how the current framework would need to be
> extended here (
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar#CEP53:CassandraRollingRestartsviaSidecar-OperationalJobFramework
> ).
>
> > I’m not quite sold on using a PATCH to move from pending state to
> > running state. Quick question, what is the goal of the pending state? I see
> > a PATCH operation as modifying part of an object data. In this case,
> > modifying the state looks like a change on the operation state, not on its
> > metadata. I’d love to hear your thoughts on this one.
>
> The "PENDING" state allows for an operator to double check a submitted
> cluster-wide operation, which could have unintended consequences, before
> starting it. For example, performing a rolling restart could prevent other
> operations on the cluster that might be scheduled or needed, such as
> replacing a Cassandra instance. While an operator should be able to abort a
> restart job, I see value in having this guard against operator error.
>
> Given that we are applying a partial update to the resource, which in this
> context would be the restart job, we chose PATCH for this API.
>
> Best,
> Andrés
>
>
> On Tue, Sep 2, 2025 at 12:33 PM Dinesh Joshi <[email protected]> wrote:
>
>> I would like to chime in and say that we need to refine our vocabulary.
>> The term 'bulk commands' was used originally in CEP-1. This is my fault
>> totally as I originally wrote that down. But over time it has caused
>> confusion. I believe 'cluster-wide operations' is a better term to describe
>> those operations. We have also used 'Bulk' in the context of CEP-28 which
>> means something rather different which leads to confusion. So I propose
>> using the term 'cluster-wide operations' for operations that have to be run
>> across all nodes in the cluster.
>>
>> Thanks,
>>
>> Dinesh
>>
>>
>> On Tue, Sep 2, 2025 at 9:21 AM Bernardo Botella <
>> [email protected]> wrote:
>>
>>> This is an incredible contribution. Thanks a lot!
>>>
>>> Now, let me throw some thoughts :-)
>>>
>>> Rolling restarts is a great example of a broader feature that could be
>>> seen as bulk operations on a cluster via Sidecar.
>>>
>>> What do you think about broadening the scope of the CEP to propose a way
>>> (API) to perform bulk operations, and propose the current Rolling restarts
>>> as the first implementation for that bulk operations API? I’m proposing
>>> this as I see value to reuse this proposal for other bulk operations such
>>> as enabling CDC (it requires enabling cdc on cassandra.yml and some other
>>> operations) for better supporting CEP-44.
>>>
>>> I’m not quite sold on using a PATCH to move from pending state to
>>> running state. Quick question, what is the goal of the pending state? I see
>>> a PATCH operation as modifying part of an object data. In this case,
>>> modifying the state looks like a change on the operation state, not on its
>>> metadata. I’d love to hear your thoughts on this one.
>>>
>>> Again, thanks a lot for the contribution!
>>> Bernardo
>>>
>>>
>>> > On Aug 30, 2025, at 7:02 AM, Francisco Guerrero <[email protected]> wrote:
>>> >
>>> > Thanks Andrés for the CEP. This is a great contribution to the project and
>>> > aligns with the original intent of the Sidecar stated in CEP-1. I've gone
>>> > over the CEP details and it is consistent with the internals of Sidecar.
>>> >
>>> > The only suggestion I have is to keep in mind the pluggability aspect of
>>> > Sidecar. For example, for the Distributed Restart portion of the work, we
>>> > should consider making interfaces that would allow us to potentially move
>>> > the responsibility of keeping the state outside of Cassandra.
>>> >
>>> > Best,
>>> > - Francisco
>>> >
>>> > On 2025/08/29 19:56:08 Andrés Beck-Ruiz wrote:
>>> >> Hello everyone,
>>> >>
>>> >> We would like to propose CEP 53: Cassandra Rolling Restarts via Sidecar (
>>> >> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar
>>> >> )
>>> >>
>>> >> This CEP builds off of CEP-1
>>> >> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-1%3A+Apache+Cassandra+Management+Process%28es%29+-+Deprecated>
>>> >> and proposes a design for safe, efficient, and operator friendly rolling
>>> >> restarts on Cassandra clusters, as well as an extensible approach for
>>> >> persisting future cluster-wide operations in Cassandra Sidecar. We hope to
>>> >> leverage this infrastructure in the future to implement upgrade automation.
>>> >>
>>> >> We welcome all feedback and discussion. Thank you in advance for your time
>>> >> and consideration of this proposal!
>>> >>
>>> >> Best,
>>> >> Andrés and Paulo
