Thank you for the CEP and getting this discussion started Andrés!

I also think it makes sense to use the term "cluster-wide operations" to
refer to these types of orchestrated operations. Another option could be
“multi-node operations” to contrast with “single-node operations” since an
operation might operate on a subset of nodes in the cluster rather than the
entire cluster. In addition to CEP-44 and rolling restarts, there are other
cases such as config changes (
https://issues.apache.org/jira/browse/CASSSIDECAR-275) and version upgrades
(https://issues.apache.org/jira/browse/CASSSIDECAR-276) which could also
build off of these APIs.

On the question of reusing the OperationalJob interface: from an
implementation perspective, it seems generic enough to reuse for
cluster-wide operations. However, it would need to be extended with the
fields that cluster-wide jobs require (for example, the nodes involved in
the operation, parallelism across racks, etc.).
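
To make the extension idea concrete, here is a rough sketch. None of these
names come from the Sidecar codebase or from CEP-53: JobSketch is only a
stand-in for the existing job abstraction (not the real OperationalJob
signature), and ClusterWideJobSketch, RestartJobSketch and maxParallelRacks
are hypothetical. Reading "parallelism across racks" as "how many racks may
be processed at once" is also my assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.UUID;

// Hypothetical stand-in for the existing single-node job abstraction.
// This is NOT the real Sidecar OperationalJob signature.
interface JobSketch
{
    UUID jobId();

    String name();
}

// Hypothetical extension carrying the extra fields a cluster-wide job needs.
interface ClusterWideJobSketch extends JobSketch
{
    // Nodes involved in the operation (may be a subset of the cluster).
    List<String> nodes();

    // Assumed meaning: how many racks may be processed concurrently.
    int maxParallelRacks();
}

// Illustrative restart job; it only holds the job description, it does not
// restart anything.
final class RestartJobSketch implements ClusterWideJobSketch
{
    private final UUID jobId = UUID.randomUUID();
    private final List<String> nodes;
    private final int maxParallelRacks;

    RestartJobSketch(List<String> nodes, int maxParallelRacks)
    {
        if (nodes == null || nodes.isEmpty())
            throw new IllegalArgumentException("at least one node is required");
        if (maxParallelRacks < 1)
            throw new IllegalArgumentException("maxParallelRacks must be >= 1");
        // Defensive copy so the job description cannot change after submission.
        this.nodes = Collections.unmodifiableList(new ArrayList<>(nodes));
        this.maxParallelRacks = maxParallelRacks;
    }

    @Override
    public UUID jobId()
    {
        return jobId;
    }

    @Override
    public String name()
    {
        return "restart";
    }

    @Override
    public List<String> nodes()
    {
        return nodes;
    }

    @Override
    public int maxParallelRacks()
    {
        return maxParallelRacks;
    }
}
```

Code that only knows about JobSketch keeps working unchanged, while an
orchestrator can depend on ClusterWideJobSketch to read the node list and
the rack parallelism.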

As for the API, I think the question that needs to be answered is whether it
is worthwhile to have a distinction between single-node operations and
cluster-wide operations. For example, if I wanted to restart a single node
using the API proposed in CEP-53, I could submit a restart job with a
single node in the “nodes” list. This provides API simplicity at the cost
of ergonomics. It also means that all inter-sidecar communication would go
through the proposed cluster_ops_node_state table. Personally, I think
these are acceptable tradeoffs to provide a unified API for operations that
is simpler for a user or operator to use and learn.
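
As a small illustration of the unified shape, a single-node restart can be
treated as the one-element case of the same request. The class and method
names below are hypothetical and exist only for this sketch; the only thing
taken from the discussion is the idea of a "nodes" list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical request body for a unified restart API. The names are
// illustrative and do not come from CEP-53 or the Sidecar codebase.
final class RestartRequestSketch
{
    private final List<String> nodes;

    private RestartRequestSketch(List<String> nodes)
    {
        if (nodes.isEmpty())
            throw new IllegalArgumentException("nodes must not be empty");
        this.nodes = Collections.unmodifiableList(new ArrayList<>(nodes));
    }

    // General case: restart any subset of the cluster.
    static RestartRequestSketch forNodes(List<String> nodes)
    {
        return new RestartRequestSketch(nodes);
    }

    // Single-node case: the same request with exactly one entry in the list.
    static RestartRequestSketch forSingleNode(String node)
    {
        return new RestartRequestSketch(Collections.singletonList(node));
    }

    List<String> nodes()
    {
        return nodes;
    }

    boolean isSingleNode()
    {
        return nodes.size() == 1;
    }
}
```

With this shape there is one code path to validate, persist and monitor,
and the single-node case needs no separate endpoint.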

Isaac

On Tue, Sep 2, 2025 at 2:58 PM Andrés Beck-Ruiz <[email protected]>
wrote:

> Thanks everyone for the feedback. +1 to using the term 'cluster-wide
> operations'.
>
> > The only suggestion I have is to keep in mind the pluggability aspect of
> > Sidecar. For example, for the Distributed Restart portion of the work, we
> > should consider making interfaces that would allow us to potentially move
> > the responsibility of keeping the state outside of Cassandra.
>
> Are you referring to tracking the state of a restart job (and cluster-wide
> operations in general) outside of sidecar_internal Cassandra tables?
>
> > What do you think about broadening the scope of the CEP to propose a way
> > (API) to perform bulk operations, and propose the current Rolling restarts
> > as the first implementation for that bulk operations API? I’m proposing
> > this as I see value to reuse this proposal for other bulk operations such
> > as enabling CDC (it requires enabling cdc on cassandra.yml and some other
> > operations) for better supporting CEP-44.
>
> We propose a way to persist and monitor cluster-wide operations in the new
> sidecar_internal system tables (
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar#CEP53:CassandraRollingRestartsviaSidecar-CassandraSidecarSystemTables
> ).
> I think it would make sense to also generalize the API to apply to
> cluster-wide operations. I'm curious about any feedback on whether this
> should be a separate API from the current operational job framework and
> live under the /cluster resource. We've discussed why we didn't propose to
> use the existing API and how the current framework would need to be
> extended here (
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar#CEP53:CassandraRollingRestartsviaSidecar-OperationalJobFramework
> ).
>
> > I’m not quite sold on using a PATCH to move from pending state to
> > running state. Quick question, what is the goal of the pending state? I see
> > a PATCH operation as modifying part of an object data. In this case,
> > modifying the state looks like a change on the operation state, not on its
> > metadata. I’d love to hear your thoughts on this one.
>
> The "PENDING" state allows for an operator to double check a submitted
> cluster-wide operation, which could have unintended consequences, before
> starting it. For example, performing a rolling restart could prevent other
> operations on the cluster that might be scheduled or needed, such as
> replacing a Cassandra instance. While an operator should be able to abort a
> restart job, I see value in having this guard against operator error.
>
> Given that we are applying a partial update to the resource, which in this
> context would be the restart job, we chose PATCH for this API.
>
> Best,
> Andrés
>
>
> On Tue, Sep 2, 2025 at 12:33 PM Dinesh Joshi <[email protected]> wrote:
>
>> I would like to chime in and say that we need to refine our vocabulary.
>> The term 'bulk commands' was used originally in CEP-1. This is my fault
>> totally as I originally wrote that down. But over time it has caused
>> confusion. I believe 'cluster-wide operations' is a better term to describe
>> those operations. We have also used 'Bulk' in the context of CEP-28 which
>> means something rather different which leads to confusion. So I propose
>> using the term 'cluster-wide operations' for operations that have to be run
>> across all nodes in the cluster.
>>
>> Thanks,
>>
>> Dinesh
>>
>>
>> On Tue, Sep 2, 2025 at 9:21 AM Bernardo Botella <
>> [email protected]> wrote:
>>
>>> This is an incredible contribution. Thanks a lot!
>>>
>>> Now, let me throw some thoughts :-)
>>>
>>> Rolling restarts is a great example of a broader feature that could be
>>> seen as bulk operations on a cluster via Sidecar.
>>>
>>> What do you think about broadening the scope of the CEP to propose a way
>>> (API) to perform bulk operations, and propose the current Rolling restarts
>>> as the first implementation for that bulk operations API? I’m proposing
>>> this as I see value to reuse this proposal for other bulk operations such
>>> as enabling CDC (it requires enabling cdc on cassandra.yml and some other
>>> operations) for better supporting CEP-44.
>>>
>>> I’m not quite sold on using a PATCH to move from pending state to
>>> running state. Quick question, what is the goal of the pending state? I see
>>> a PATCH operation as modifying part of an object data. In this case,
>>> modifying the state looks like a change on the operation state, not on its
>>> metadata. I’d love to hear your thoughts on this one.
>>>
>>> Again, thanks a lot for the contribution!
>>> Bernardo
>>>
>>>
>>> > On Aug 30, 2025, at 7:02 AM, Francisco Guerrero <[email protected]> wrote:
>>> >
>>> > Thanks Andrés for the CEP. This is a great contribution to the project and
>>> > aligns with the original intent of the Sidecar stated in CEP-1. I've gone
>>> > over the CEP details and it is consistent with the internals of Sidecar.
>>> >
>>> > The only suggestion I have is to keep in mind the pluggability aspect of
>>> > Sidecar. For example, for the Distributed Restart portion of the work, we
>>> > should consider making interfaces that would allow us to potentially move
>>> > the responsibility of keeping the state outside of Cassandra.
>>> >
>>> > Best,
>>> > - Francisco
>>> >
>>> > On 2025/08/29 19:56:08 Andrés Beck-Ruiz wrote:
>>> >> Hello everyone,
>>> >>
>>> >> We would like to propose CEP 53: Cassandra Rolling Restarts via Sidecar (
>>> >> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar
>>> >> )
>>> >>
>>> >> This CEP builds off of CEP-1
>>> >> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-1%3A+Apache+Cassandra+Management+Process%28es%29+-+Deprecated>
>>> >> and proposes a design for safe, efficient, and operator friendly rolling
>>> >> restarts on Cassandra clusters, as well as an extensible approach for
>>> >> persisting future cluster-wide operations in Cassandra Sidecar. We hope to
>>> >> leverage this infrastructure in the future to implement upgrade automation.
>>> >>
>>> >> We welcome all feedback and discussion. Thank you in advance for your time
>>> >> and consideration of this proposal!
>>> >>
>>> >> Best,
>>> >> Andrés and Paulo
>>> >>
>>>
>>>
