Yes, I agree with you, Isaac. We can look at the restart operation (or, really, any other operation that can be done cluster-wide) as an operation performed on a subset of nodes. A subset containing one node is still a subset, so the number of nodes involved should not matter. Restarting a whole cluster is then just the special case where the specified subset happens to equal the set of all nodes.
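To make the subset view concrete, here is a minimal sketch in Java. Everything in it is an assumption for illustration only: the ClusterOperation class, the restart helper, and the node names are hypothetical and are not taken from CEP-53 or from the Sidecar codebase. It only shows that a single-node restart and a whole-cluster restart can share one shape.

```java
import java.util.Set;

public class SubsetRestartSketch {

    /** Hypothetical job description: an operation name plus the nodes it targets. */
    static final class ClusterOperation {
        final String name;
        final Set<String> nodes;

        ClusterOperation(String name, Set<String> nodes) {
            if (nodes.isEmpty()) {
                throw new IllegalArgumentException("an operation needs at least one node");
            }
            this.name = name;
            this.nodes = Set.copyOf(nodes); // defensive, immutable copy
        }

        /** True only when the targeted subset equals the full cluster membership. */
        boolean coversWholeCluster(Set<String> allNodes) {
            return nodes.equals(allNodes);
        }
    }

    /** One entry point for every restart, regardless of how many nodes are involved. */
    static ClusterOperation restart(Set<String> nodes) {
        return new ClusterOperation("restart", nodes);
    }

    public static void main(String[] args) {
        Set<String> cluster = Set.of("node1", "node2", "node3"); // made-up node names

        ClusterOperation single = restart(Set.of("node2")); // subset of size one
        ClusterOperation full = restart(cluster);           // subset equal to the whole set

        System.out.println(single.nodes.size() + " node(s), whole cluster: "
                + single.coversWholeCluster(cluster));
        System.out.println(full.nodes.size() + " node(s), whole cluster: "
                + full.coversWholeCluster(cluster));
    }
}
```

The point of the sketch is that nothing in the job's shape distinguishes the two cases; "cluster-wide" becomes a property that can be derived from the node list rather than a separate kind of request.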
On Wed, Sep 3, 2025 at 6:08 PM Isaac Reath <[email protected]> wrote:

> Thank you for the CEP and getting this discussion started Andrés!
>
> I also think it makes sense to use the term "cluster-wide operations" to
> refer to these types of orchestrated operations. Another option could be
> “multi-node operations” to contrast with “single-node operations” since an
> operation might operate on a subset of nodes in the cluster rather than the
> entire cluster. In addition to CEP-44 and rolling restarts, there are other
> cases such as config changes (
> https://issues.apache.org/jira/browse/CASSSIDECAR-275) and version
> upgrades (https://issues.apache.org/jira/browse/CASSSIDECAR-276) which
> could also build off of these APIs.
>
> On the question of reusing the OperationalJob interface: from an
> implementation perspective, it seems generic enough to reuse for
> cluster-wide operations. However, it needs to be extended to add support
> for the fields we need to support cluster-wide jobs (for example, the nodes
> involved in the operation, parallelism across racks, etc.)
>
> As for the API, I think the question that needs to be answered is if it is
> worthwhile to have a distinction between single-node operations and
> cluster-wide operations. For example, if I wanted to restart a single node
> using the API proposed in CEP-53, I could submit a restart job with a
> single node in the “nodes” list. This provides API simplicity at the cost
> of ergonomics. It also means that all inter-sidecar communication would go
> through the proposed cluster_ops_node_state table. Personally, I think
> these are acceptable tradeoffs to provide a unified API for operations that
> is simpler for a user or operator to use and learn.
> Isaac
>
> On Tue, Sep 2, 2025 at 2:58 PM Andrés Beck-Ruiz <[email protected]>
> wrote:
>
>> Thanks everyone for the feedback. +1 to using the term 'cluster-wide
>> operations'.
>>
>> > The only suggestion I have is to keep in mind the pluggability aspect of
>> > Sidecar. For example, for the Distributed Restart portion of the work, we
>> > should consider making interfaces that would allow us to potentially move
>> > the responsibility of keeping the state outside of Cassandra.
>>
>> Are you referring to tracking the state of a restart job (and
>> cluster-wide operations in general) outside of sidecar_internal Cassandra
>> tables?
>>
>> > What do you think about broadening the scope of the CEP to propose a
>> > way (API) to perform bulk operations, and propose the current Rolling
>> > restarts as the first implementation for that bulk operations API? I’m
>> > proposing this as I see value to reuse this proposal for other bulk
>> > operations such as enabling CDC (it requires enabling cdc on cassandra.yml
>> > and some other operations) for better supporting CEP-44.
>>
>> We propose a way to persist and monitor cluster-wide operations in the
>> new sidecar_internal system tables. (
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar#CEP53:CassandraRollingRestartsviaSidecar-CassandraSidecarSystemTables).
>> I think it would make sense to also generalize the API to apply to
>> cluster-wide operations. I'm curious about any feedback on whether this
>> should be a separate API from the current operational job framework and
>> live under the /cluster resource. We've discussed why we didn't propose to
>> use the existing API and how the current framework would need to be
>> extended here (
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar#CEP53:CassandraRollingRestartsviaSidecar-OperationalJobFramework
>> ).
>>
>> > I’m not quite sold on using a PATCH to move from pending state to
>> > running state. Quick question, what is the goal of the pending state? I see
>> > a PATCH operation as modifying part of an object data. In this case,
>> > modifying the state looks like a change on the operation state, not on its
>> > metadata. I’d love to hear your thoughts on this one.
>>
>> The "PENDING" state allows for an operator to double check a submitted
>> cluster-wide operation, which could have unintended consequences, before
>> starting it. For example, performing a rolling restart could prevent other
>> operations on the cluster that might be scheduled or needed, such as
>> replacing a Cassandra instance. While an operator should be able to abort a
>> restart job, I see value in having this guard against operator error.
>>
>> Given that we are applying a partial update to the resource, which in
>> this context would be the restart job, we chose PATCH for this API.
>>
>> Best,
>> Andrés
>>
>>
>> On Tue, Sep 2, 2025 at 12:33 PM Dinesh Joshi <[email protected]> wrote:
>>
>>> I would like to chime in and say that we need to refine our vocabulary.
>>> The term 'bulk commands' was used originally in CEP-1. This is my fault
>>> totally as I originally wrote that down. But over time it has caused
>>> confusion. I believe 'cluster-wide operations' is a better term to describe
>>> those operations. We have also used 'Bulk' in the context of CEP-28 which
>>> means something rather different which leads to confusion. So I propose
>>> using the term 'cluster-wide operations' for operations that have to be run
>>> across all nodes in the cluster.
>>>
>>> Thanks,
>>>
>>> Dinesh
>>>
>>>
>>> On Tue, Sep 2, 2025 at 9:21 AM Bernardo Botella <[email protected]> wrote:
>>>
>>>> This is an incredible contribution. Thanks a lot!
>>>>
>>>> Now, let me throw some thoughts :-)
>>>>
>>>> Rolling restarts is a great example of a broader feature that could be
>>>> seen as bulk operations on a cluster via Sidecar.
>>>>
>>>> What do you think about broadening the scope of the CEP to propose a
>>>> way (API) to perform bulk operations, and propose the current Rolling
>>>> restarts as the first implementation for that bulk operations API? I’m
>>>> proposing this as I see value to reuse this proposal for other bulk
>>>> operations such as enabling CDC (it requires enabling cdc on cassandra.yml
>>>> and some other operations) for better supporting CEP-44.
>>>>
>>>> I’m not quite sold on using a PATCH to move from pending state to
>>>> running state. Quick question, what is the goal of the pending state? I see
>>>> a PATCH operation as modifying part of an object data. In this case,
>>>> modifying the state looks like a change on the operation state, not on its
>>>> metadata. I’d love to hear your thoughts on this one.
>>>>
>>>> Again, thanks a lot for the contribution!
>>>> Bernardo
>>>>
>>>>
>>>> > On Aug 30, 2025, at 7:02 AM, Francisco Guerrero <[email protected]> wrote:
>>>> >
>>>> > Thanks Andrés for the CEP. This is a great contribution to the project and
>>>> > aligns with the original intent of the Sidecar stated in CEP-1. I've gone
>>>> > over the CEP details and it is consistent with the internals of Sidecar.
>>>> >
>>>> > The only suggestion I have is to keep in mind the pluggability aspect of
>>>> > Sidecar. For example, for the Distributed Restart portion of the work, we
>>>> > should consider making interfaces that would allow us to potentially move
>>>> > the responsibility of keeping the state outside of Cassandra.
>>>> >
>>>> > Best,
>>>> > - Francisco
>>>> >
>>>> > On 2025/08/29 19:56:08 Andrés Beck-Ruiz wrote:
>>>> >> Hello everyone,
>>>> >>
>>>> >> We would like to propose CEP 53: Cassandra Rolling Restarts via Sidecar (
>>>> >> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar
>>>> >> )
>>>> >>
>>>> >> This CEP builds off of CEP-1
>>>> >> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-1%3A+Apache+Cassandra+Management+Process%28es%29+-+Deprecated>
>>>> >> and proposes a design for safe, efficient, and operator friendly rolling
>>>> >> restarts on Cassandra clusters, as well as an extensible approach for
>>>> >> persisting future cluster-wide operations in Cassandra Sidecar. We hope to
>>>> >> leverage this infrastructure in the future to implement upgrade automation.
>>>> >>
>>>> >> We welcome all feedback and discussion. Thank you in advance for your time
>>>> >> and consideration of this proposal!
>>>> >>
>>>> >> Best,
>>>> >> Andrés and Paulo
