Re: [DISCUSS] CEP 53: Cassandra Rolling Restarts via Sidecar

Dinesh Joshi Wed, 03 Sep 2025 10:50:21 -0700

The persistent state store for long-running operations is an open question.
Although we have a Jira ticket (CASSSIDECAR-341), it does not articulate a
specific plan.


It is generally accepted practice to store the status of operations outside
of the database / cluster that is being operated on. Storing it in that same
cluster creates a circular dependency, which is not good.

Consider a scenario where a rolling restart is in progress and part of the
cluster goes down due to unexpected hardware or network failure. In this
case, if the state store is the cluster itself, you will require manual
intervention to recover from it.

Therefore the state storage should be made pluggable and its default
implementation could leverage Cassandra. It doesn't need to be the same
Cassandra cluster that is being managed by the Cassandra Sidecar.
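
As a rough sketch of what "pluggable" could look like, here is a minimal
Java interface with a trivial in-memory backend. Every name below
(OperationStateStore, OperationStatus, InMemoryOperationStateStore) is
hypothetical and not taken from the CEP or from the Sidecar codebase. The
PENDING and RUNNING states come from this thread; COMPLETED is an assumption
added only to round out the example.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical status values; PENDING and RUNNING are discussed in the
// thread, COMPLETED is assumed for illustration.
enum OperationStatus { PENDING, RUNNING, COMPLETED }

// Hypothetical contract for persisting the status of a cluster-wide
// operation. Callers depend only on this interface, so the backend can
// live outside the cluster that is being operated on.
interface OperationStateStore {
    void save(String operationId, OperationStatus status);

    Optional<OperationStatus> load(String operationId);
}

// Stand-in backend that keeps state in memory. A default implementation
// could instead write to a Cassandra cluster other than the managed one.
final class InMemoryOperationStateStore implements OperationStateStore {
    private final Map<String, OperationStatus> states = new ConcurrentHashMap<>();

    @Override
    public void save(String operationId, OperationStatus status) {
        states.put(operationId, status);
    }

    @Override
    public Optional<OperationStatus> load(String operationId) {
        return Optional.ofNullable(states.get(operationId));
    }
}
```

With a contract along these lines, the restart logic would not need to know
where the state lives, and swapping the backend would not touch it.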

Thanks,

Dinesh

On Tue, Sep 2, 2025 at 11:58 AM Andrés Beck-Ruiz <[email protected]>
wrote:

> Thanks everyone for the feedback. +1 to using the term 'cluster-wide
> operations'.
>
> > The only suggestion I have is to keep in mind the pluggability aspect of
> > Sidecar. For example, for the Distributed Restart portion of the work, we
> > should consider making interfaces that would allow us to potentially move
> > the responsibility of keeping the state outside of Cassandra.
>
> Are you referring to tracking the state of a restart job (and cluster-wide
> operations in general) outside of sidecar_internal Cassandra tables?
>
> > What do you think about broadening the scope of the CEP to propose a way
> > (API) to perform bulk operations, and propose the current Rolling restarts
> > as the first implementation for that bulk operations API? I’m proposing
> > this as I see value to reuse this proposal for other bulk operations such
> > as enabling CDC (it requires enabling cdc on cassandra.yml and some other
> > operations) for better supporting CEP-44.
>
> We propose a way to persist and monitor cluster-wide operations in the new
> sidecar_internal system tables. (
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar#CEP53:CassandraRollingRestartsviaSidecar-CassandraSidecarSystemTables).
> I think it would make sense to also generalize the API to apply to
> cluster-wide operations. I'm curious about any feedback on whether this
> should be a separate API from the current operational job framework and
> live under the /cluster resource. We've discussed why we didn't propose to
> use the existing API and how the current framework would need to be
> extended here (
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar#CEP53:CassandraRollingRestartsviaSidecar-OperationalJobFramework
> ).
>
> > I’m not quite sold on using a PATCH to move from pending state to
> > running state. Quick question, what is the goal of the pending state? I see
> > a PATCH operation as modifying part of an object data. In this case,
> > modifying the state looks like a change on the operation state, not on its
> > metadata. I’d love to hear your thoughts on this one.
>
> The "PENDING" state allows an operator to double-check a submitted
> cluster-wide operation, which could have unintended consequences, before
> starting it. For example, performing a rolling restart could prevent other
> operations on the cluster that might be scheduled or needed, such as
> replacing a Cassandra instance. While an operator should be able to abort a
> restart job, I see value in having this guard against operator error.
>
> Given that we are applying a partial update to the resource, which in this
> context would be the restart job, we chose PATCH for this API.
>
> Best,
> Andrés
>
>
> On Tue, Sep 2, 2025 at 12:33 PM Dinesh Joshi <[email protected]> wrote:
>
>> I would like to chime in and say that we need to refine our vocabulary.
>> The term 'bulk commands' was used originally in CEP-1. This is my fault
>> totally as I originally wrote that down. But over time it has caused
>> confusion. I believe 'cluster-wide operations' is a better term to describe
>> those operations. We have also used 'Bulk' in the context of CEP-28 which
>> means something rather different which leads to confusion. So I propose
>> using the term 'cluster-wide operations' for operations that have to be run
>> across all nodes in the cluster.
>>
>> Thanks,
>>
>> Dinesh
>>
>>
>> On Tue, Sep 2, 2025 at 9:21 AM Bernardo Botella <
>> [email protected]> wrote:
>>
>>> This is an incredible contribution. Thanks a lot!
>>>
>>> Now, let me throw out some thoughts :-)
>>>
>>> Rolling restarts is a great example of a broader feature that could be
>>> seen as bulk operations on a cluster via Sidecar.
>>>
>>> What do you think about broadening the scope of the CEP to propose a way
>>> (API) to perform bulk operations, and propose the current Rolling restarts
>>> as the first implementation for that bulk operations API? I’m proposing
>>> this as I see value to reuse this proposal for other bulk operations such
>>> as enabling CDC (it requires enabling cdc on cassandra.yml and some other
>>> operations) for better supporting CEP-44.
>>>
>>> I’m not quite sold on using a PATCH to move from pending state to
>>> running state. Quick question, what is the goal of the pending state? I see
>>> a PATCH operation as modifying part of an object data. In this case,
>>> modifying the state looks like a change on the operation state, not on its
>>> metadata. I’d love to hear your thoughts on this one.
>>>
>>> Again, thanks a lot for the contribution!
>>> Bernardo
>>>
>>>
>>> > On Aug 30, 2025, at 7:02 AM, Francisco Guerrero <[email protected]> wrote:
>>> >
>>> > Thanks Andrés for the CEP. This is a great contribution to the project and
>>> > aligns with the original intent of the Sidecar stated in CEP-1. I've gone
>>> > over the CEP details and it is consistent with the internals of Sidecar.
>>> >
>>> > The only suggestion I have is to keep in mind the pluggability aspect of
>>> > Sidecar. For example, for the Distributed Restart portion of the work, we
>>> > should consider making interfaces that would allow us to potentially move
>>> > the responsibility of keeping the state outside of Cassandra.
>>> >
>>> > Best,
>>> > - Francisco
>>> >
>>> > On 2025/08/29 19:56:08 Andrés Beck-Ruiz wrote:
>>> >> Hello everyone,
>>> >>
>>> >> We would like to propose CEP 53: Cassandra Rolling Restarts via Sidecar (
>>> >> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar
>>> >> )
>>> >>
>>> >> This CEP builds off of CEP-1
>>> >> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-1%3A+Apache+Cassandra+Management+Process%28es%29+-+Deprecated>
>>> >> and proposes a design for safe, efficient, and operator friendly rolling
>>> >> restarts on Cassandra clusters, as well as an extensible approach for
>>> >> persisting future cluster-wide operations in Cassandra Sidecar. We hope to
>>> >> leverage this infrastructure in the future to implement upgrade automation.
>>> >>
>>> >> We welcome all feedback and discussion. Thank you in advance for your time
>>> >> and consideration of this proposal!
>>> >>
>>> >> Best,
>>> >> Andrés and Paulo
>>> >>
>>>
>>>