Re: [DISCUSS] CEP 53: Cassandra Rolling Restarts via Sidecar

Štefan Miklošovič Wed, 03 Sep 2025 09:18:09 -0700

Yes, I agree with you, Isaac. We might look at the restart operation (or
any other operation which can be done cluster-wide, really) as an
operation performed on a subset of nodes. A subset containing one node is
still a subset, so the number of nodes involved should not matter. It is
just incidental that restarting a whole cluster requires specifying a
subset equal to the set of all nodes.
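
To illustrate the point above, here is a minimal sketch in Python. The names
NodeSetOperation and plan_operation are hypothetical and are not part of the
Sidecar API or of CEP-53; the sketch only assumes that an operation carries
the set of nodes it targets.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class NodeSetOperation:
    """Hypothetical model: an operation applied to a subset of cluster nodes."""
    name: str
    nodes: FrozenSet[str]


def plan_operation(name: str, cluster: Iterable[str],
                   targets: Iterable[str]) -> NodeSetOperation:
    """Build an operation for any non-empty subset of the cluster.

    A single node and the whole cluster are handled by the same code path;
    only the size of the target set differs.
    """
    cluster_set = frozenset(cluster)
    target_set = frozenset(targets)
    if not target_set:
        raise ValueError("at least one target node is required")
    unknown = target_set - cluster_set
    if unknown:
        raise ValueError(f"unknown nodes: {sorted(unknown)}")
    return NodeSetOperation(name, target_set)


cluster = ["node1", "node2", "node3"]
single = plan_operation("restart", cluster, ["node2"])  # one-node subset
whole = plan_operation("restart", cluster, cluster)     # subset == all nodes
```

In this sketch the caller never has to choose between a "single-node" and a
"cluster-wide" entry point; the target set alone decides the scope.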


On Wed, Sep 3, 2025 at 6:08 PM Isaac Reath <[email protected]> wrote:

> Thank you for the CEP and getting this discussion started Andrés!
>
> I also think it makes sense to use the term "cluster-wide operations" to
> refer to these types of orchestrated operations. Another option could be
> “multi-node operations” to contrast with “single-node operations” since an
> operation might operate on a subset of nodes in the cluster rather than the
> entire cluster. In addition to CEP-44 and rolling restarts, there are other
> cases such as config changes (
> https://issues.apache.org/jira/browse/CASSSIDECAR-275) and version
> upgrades (https://issues.apache.org/jira/browse/CASSSIDECAR-276) which
> could also build off of these APIs.
>
> On the question of reusing the OperationalJob interface: from an
> implementation perspective, it seems generic enough to reuse for
> cluster-wide operations. However, it needs to be extended to add support
> for the fields we need to support cluster-wide jobs (for example, the nodes
> involved in the operation, parallelism across racks, etc.)
>
> As for the API, I think the question that needs to be answered is if it is
> worthwhile to have a distinction between single-node operations and
> cluster-wide operations. For example, if I wanted to restart a single node
> using the API proposed in CEP-53, I could submit a restart job with a
> single node in the “nodes” list. This provides API simplicity at the cost
> of ergonomics. It also means that all inter-sidecar communication would go
> through the proposed cluster_ops_node_state table. Personally, I think
> these are acceptable tradeoffs to provide a unified API for operations that
> is simpler for a user or operator to use and learn.
>
> Isaac
>
> On Tue, Sep 2, 2025 at 2:58 PM Andrés Beck-Ruiz <[email protected]>
> wrote:
>
>> Thanks everyone for the feedback. +1 to using the term 'cluster-wide
>> operations'.
>>
>> > The only suggestion I have is to keep in mind the pluggability aspect of
>> > Sidecar. For example, for the Distributed Restart portion of the work, we
>> > should consider making interfaces that would allow us to potentially move
>> > the responsibility of keeping the state outside of Cassandra.
>>
>> Are you referring to tracking the state of a restart job (and
>> cluster-wide operations in general) outside of sidecar_internal Cassandra
>> tables?
>>
>> > What do you think about broadening the scope of the CEP to propose a
>> > way (API) to perform bulk operations, and propose the current Rolling
>> > restarts as the first implementation for that bulk operations API? I’m
>> > proposing this as I see value to reuse this proposal for other bulk
>> > operations such as enabling CDC (it requires enabling cdc on
>> > cassandra.yaml and some other operations) for better supporting CEP-44.
>>
>> We propose a way to persist and monitor cluster-wide operations in the
>> new sidecar_internal system tables. (
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar#CEP53:CassandraRollingRestartsviaSidecar-CassandraSidecarSystemTables).
>> I think it would make sense to also generalize the API to apply to
>> cluster-wide operations. I'm curious about any feedback on whether this
>> should be a separate API from the current operational job framework and
>> live under the /cluster resource. We've discussed why we didn't propose to
>> use the existing API and how the current framework would need to be
>> extended here (
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar#CEP53:CassandraRollingRestartsviaSidecar-OperationalJobFramework
>> ).
>>
>> > I’m not quite sold on using a PATCH to move from pending state to
>> > running state. Quick question, what is the goal of the pending state? I see
>> > a PATCH operation as modifying part of an object data. In this case,
>> > modifying the state looks like a change on the operation state, not on its
>> > metadata. I’d love to hear your thoughts on this one.
>>
>> The "PENDING" state allows for an operator to double check a submitted
>> cluster-wide operation, which could have unintended consequences, before
>> starting it. For example, performing a rolling restart could prevent other
>> operations on the cluster that might be scheduled or needed, such as
>> replacing a Cassandra instance. While an operator should be able to abort a
>> restart job, I see value in having this guard against operator error.
>>
>> Given that we are applying a partial update to the resource, which in
>> this context would be the restart job, we chose PATCH for this API.
>>
>> Best,
>> Andrés
>>
>>
>> On Tue, Sep 2, 2025 at 12:33 PM Dinesh Joshi <[email protected]> wrote:
>>
>>> I would like to chime in and say that we need to refine our vocabulary.
>>> The term 'bulk commands' was used originally in CEP-1. This is my fault
>>> totally as I originally wrote that down. But over time it has caused
>>> confusion. I believe 'cluster-wide operations' is a better term to describe
>>> those operations. We have also used 'Bulk' in the context of CEP-28, where
>>> it means something rather different, which adds to the confusion. So I propose
>>> using the term 'cluster-wide operations' for operations that have to be run
>>> across all nodes in the cluster.
>>>
>>> Thanks,
>>>
>>> Dinesh
>>>
>>>
>>> On Tue, Sep 2, 2025 at 9:21 AM Bernardo Botella <
>>> [email protected]> wrote:
>>>
>>>> This is an incredible contribution. Thanks a lot!
>>>>
>>>> Now, let me throw some thoughts :-)
>>>>
>>>> Rolling restarts is a great example of a broader feature that could be
>>>> seen as bulk operations on a cluster via Sidecar.
>>>>
>>>> What do you think about broadening the scope of the CEP to propose a
>>>> way (API) to perform bulk operations, and propose the current Rolling
>>>> restarts as the first implementation for that bulk operations API? I’m
>>>> proposing this as I see value to reuse this proposal for other bulk
>>>> operations such as enabling CDC (it requires enabling cdc on cassandra.yaml
>>>> and some other operations) for better supporting CEP-44.
>>>>
>>>> I’m not quite sold on using a PATCH to move from pending state to
>>>> running state. Quick question, what is the goal of the pending state? I see
>>>> a PATCH operation as modifying part of an object data. In this case,
>>>> modifying the state looks like a change on the operation state, not on its
>>>> metadata. I’d love to hear your thoughts on this one.
>>>>
>>>> Again, thanks a lot for the contribution!
>>>> Bernardo
>>>>
>>>>
>>>> > On Aug 30, 2025, at 7:02 AM, Francisco Guerrero <[email protected]> wrote:
>>>> >
>>>> > Thanks Andrés for the CEP. This is a great contribution to the project and
>>>> > aligns with the original intent of the Sidecar stated in CEP-1. I've gone
>>>> > over the CEP details and it is consistent with the internals of Sidecar.
>>>> >
>>>> > The only suggestion I have is to keep in mind the pluggability aspect of
>>>> > Sidecar. For example, for the Distributed Restart portion of the work, we
>>>> > should consider making interfaces that would allow us to potentially move
>>>> > the responsibility of keeping the state outside of Cassandra.
>>>> >
>>>> > Best,
>>>> > - Francisco
>>>> >
>>>> > On 2025/08/29 19:56:08 Andrés Beck-Ruiz wrote:
>>>> >> Hello everyone,
>>>> >>
>>>> >> We would like to propose CEP 53: Cassandra Rolling Restarts via Sidecar
>>>> >> (https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar)
>>>> >>
>>>> >> This CEP builds off of CEP-1
>>>> >> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-1%3A+Apache+Cassandra+Management+Process%28es%29+-+Deprecated>
>>>> >> and proposes a design for safe, efficient, and operator friendly rolling
>>>> >> restarts on Cassandra clusters, as well as an extensible approach for
>>>> >> persisting future cluster-wide operations in Cassandra Sidecar. We hope to
>>>> >> leverage this infrastructure in the future to implement upgrade automation.
>>>> >>
>>>> >> We welcome all feedback and discussion. Thank you in advance for your time
>>>> >> and consideration of this proposal!
>>>> >>
>>>> >> Best,
>>>> >> Andrés and Paulo
>>>> >>
>>>>
>>>>
