Thanks for the CEP! It fills the gaps left by CEP-40 as well.

A small question from my side: I see that the underlying assumption is that
Sidecar is able to query Cassandra instances before bouncing them and to
recognize when the bounce has completed. What if it cannot communicate with
a Cassandra instance (e.g., the binary protocol is disabled, the C* process
is experiencing issues, or the C* process is starting as part of a new DC)?

+1 to Francisco's points on pluggability. Different implementations should
be able to maintain state in their own way.
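
To make that concrete, the kind of seam I have in mind is roughly the
following (just a sketch; the interface name and signatures are placeholders
I made up, not anything from the CEP). The default implementation could
write to the proposed sidecar_internal tables, while other implementations
could keep the state wherever suits them:

    // Illustrative only: a minimal pluggable state store for
    // cluster-wide operations. Names are placeholders, not the CEP's API.
    import java.util.Optional;
    import java.util.UUID;

    public interface ClusterOperationStateStore {
        // Record the latest known state of a cluster-wide operation,
        // e.g. a rolling restart job moving from PENDING to RUNNING.
        void save(UUID jobId, String state);

        // Return the last recorded state, if any, so a restarted Sidecar
        // can resume the operation or report its progress.
        Optional<String> load(UUID jobId);
    }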

On Wed, Sep 3, 2025 at 11:20 PM Dinesh Joshi <[email protected]> wrote:

> The persistent state store for long-running operations is an open
> question. Although we have a Jira ticket (CASSSIDECAR-341), it does not
> articulate a specific plan.
>
> It is generally accepted practice to store the status of such operations
> outside of the database/cluster that is being operated on; otherwise we
> create a circular dependency.
>
> Consider a scenario where a rolling restart is in progress and part of
> the cluster goes down due to an unexpected hardware or network failure.
> In this case, if the state store is the cluster itself, manual
> intervention will be required to recover.
>
> Therefore the state storage should be made pluggable and its default
> implementation could leverage Cassandra. It doesn't need to be the same
> Cassandra cluster that is being managed by the Cassandra Sidecar.
>
> Thanks,
>
> Dinesh
>
> On Tue, Sep 2, 2025 at 11:58 AM Andrés Beck-Ruiz <[email protected]>
> wrote:
>
>> Thanks everyone for the feedback. +1 to using the term 'cluster-wide
>> operations'.
>>
>> > The only suggestion I have is to keep in mind the pluggability aspect
>> > of Sidecar. For example, for the Distributed Restart portion of the
>> > work, we should consider making interfaces that would allow us to
>> > potentially move the responsibility of keeping the state outside of
>> > Cassandra.
>>
>> Are you referring to tracking the state of a restart job (and
>> cluster-wide operations in general) outside of sidecar_internal Cassandra
>> tables?
>>
>> > What do you think about broadening the scope of the CEP to propose a
>> > way (API) to perform bulk operations, and propose the current rolling
>> > restarts as the first implementation for that bulk operations API? I’m
>> > proposing this as I see value in reusing this proposal for other bulk
>> > operations, such as enabling CDC (it requires enabling cdc in
>> > cassandra.yaml and some other operations), for better supporting
>> > CEP-44.
>>
>> We propose a way to persist and monitor cluster-wide operations in the
>> new sidecar_internal system tables (
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar#CEP53:CassandraRollingRestartsviaSidecar-CassandraSidecarSystemTables).
>> I think it would make sense to also generalize the API to apply to
>> cluster-wide operations. I'm curious about any feedback on whether this
>> should be a separate API from the current operational job framework and
>> live under the /cluster resource. We've discussed why we didn't propose to
>> use the existing API and how the current framework would need to be
>> extended here (
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar#CEP53:CassandraRollingRestartsviaSidecar-OperationalJobFramework
>> ).
>>
>> > I’m not quite sold on using a PATCH to move from the pending state to
>> > the running state. Quick question: what is the goal of the pending
>> > state? I see a PATCH operation as modifying part of an object’s data.
>> > In this case, modifying the state looks like a change to the
>> > operation’s state, not to its metadata. I’d love to hear your thoughts
>> > on this one.
>>
>> The "PENDING" state allows for an operator to double check a submitted
>> cluster-wide operation, which could have unintended consequences, before
>> starting it. For example, performing a rolling restart could prevent other
>> operations on the cluster that might be scheduled or needed, such as
>> replacing a Cassandra instance. While an operator should be able to abort a
>> restart job, I see value in having this guard against operator error.
>>
>> Given that we are applying a partial update to the resource, which in
>> this context would be the restart job, we chose PATCH for this API.
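>>
>> To illustrate (purely a sketch; the host, port, path, and JSON payload
>> below are placeholders I made up, not the API defined in the CEP),
>> confirming a PENDING restart job from an operator tool could look
>> roughly like this:
>>
>>     // Hypothetical sketch: PATCH a restart job from PENDING to RUNNING.
>>     // The URI, job id, and body are illustrative placeholders only.
>>     import java.net.URI;
>>     import java.net.http.HttpClient;
>>     import java.net.http.HttpRequest;
>>     import java.net.http.HttpResponse;
>>
>>     public class ConfirmRestartJob {
>>         public static void main(String[] args) throws Exception {
>>             String jobId = "123e4567-e89b-12d3-a456-426614174000";
>>             HttpClient client = HttpClient.newHttpClient();
>>             HttpRequest request = HttpRequest.newBuilder()
>>                 .uri(URI.create("http://localhost:9043/api/v1/cluster/restart-jobs/" + jobId))
>>                 .header("Content-Type", "application/json")
>>                 .method("PATCH", HttpRequest.BodyPublishers.ofString("{\"status\": \"RUNNING\"}"))
>>                 .build();
>>             HttpResponse<String> response =
>>                 client.send(request, HttpResponse.BodyHandlers.ofString());
>>             System.out.println(response.statusCode() + " " + response.body());
>>         }
>>     }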
>>
>> Best,
>> Andrés
>>
>>
>> On Tue, Sep 2, 2025 at 12:33 PM Dinesh Joshi <[email protected]> wrote:
>>
>>> I would like to chime in and say that we need to refine our vocabulary.
>>> The term 'bulk commands' was originally used in CEP-1. This is entirely
>>> my fault, as I originally wrote that down, but over time it has caused
>>> confusion. I believe 'cluster-wide operations' is a better term to
>>> describe those operations. We have also used 'Bulk' in the context of
>>> CEP-28, where it means something rather different, which adds to the
>>> confusion. So I propose using the term 'cluster-wide operations' for
>>> operations that have to be run across all nodes in the cluster.
>>>
>>> Thanks,
>>>
>>> Dinesh
>>>
>>>
>>> On Tue, Sep 2, 2025 at 9:21 AM Bernardo Botella <
>>> [email protected]> wrote:
>>>
>>>> This is an incredible contribution. Thanks a lot!
>>>>
>>>> Now, let me throw out some thoughts :-)
>>>>
>>>> Rolling restarts are a great example of a broader feature that could
>>>> be seen as bulk operations on a cluster via Sidecar.
>>>>
>>>> What do you think about broadening the scope of the CEP to propose a
>>>> way (API) to perform bulk operations, and propose the current rolling
>>>> restarts as the first implementation for that bulk operations API? I’m
>>>> proposing this as I see value in reusing this proposal for other bulk
>>>> operations, such as enabling CDC (it requires enabling cdc in
>>>> cassandra.yaml and some other operations), for better supporting CEP-44.
>>>>
>>>> I’m not quite sold on using a PATCH to move from the pending state to
>>>> the running state. Quick question: what is the goal of the pending
>>>> state? I see a PATCH operation as modifying part of an object’s data.
>>>> In this case, modifying the state looks like a change to the
>>>> operation’s state, not to its metadata. I’d love to hear your thoughts
>>>> on this one.
>>>>
>>>> Again, thanks a lot for the contribution!
>>>> Bernardo
>>>>
>>>>
>>>> > On Aug 30, 2025, at 7:02 AM, Francisco Guerrero <[email protected]>
>>>> > wrote:
>>>> >
>>>> > Thanks Andrés for the CEP. This is a great contribution to the
>>>> > project and aligns with the original intent of the Sidecar stated in
>>>> > CEP-1. I've gone over the CEP details and it is consistent with the
>>>> > internals of Sidecar.
>>>> >
>>>> > The only suggestion I have is to keep in mind the pluggability aspect
>>>> > of Sidecar. For example, for the Distributed Restart portion of the
>>>> > work, we should consider making interfaces that would allow us to
>>>> > potentially move the responsibility of keeping the state outside of
>>>> > Cassandra.
>>>> >
>>>> > Best,
>>>> > - Francisco
>>>> >
>>>> > On 2025/08/29 19:56:08 Andrés Beck-Ruiz wrote:
>>>> >> Hello everyone,
>>>> >>
>>>> >> We would like to propose CEP 53: Cassandra Rolling Restarts via
>>>> >> Sidecar (
>>>> >> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar
>>>> >> )
>>>> >>
>>>> >> This CEP builds off of CEP-1
>>>> >> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-1%3A+Apache+Cassandra+Management+Process%28es%29+-+Deprecated>
>>>> >> and proposes a design for safe, efficient, and operator-friendly
>>>> >> rolling restarts on Cassandra clusters, as well as an extensible
>>>> >> approach for persisting future cluster-wide operations in Cassandra
>>>> >> Sidecar. We hope to leverage this infrastructure in the future to
>>>> >> implement upgrade automation.
>>>> >>
>>>> >> We welcome all feedback and discussion. Thank you in advance for
>>>> >> your time and consideration of this proposal!
>>>> >>
>>>> >> Best,
>>>> >> Andrés and Paulo
>>>> >>
>>>>
>>>>
