> This proposal targets the use case in which only a few nodes in the
region fail, but not all.

If you're able to handle inconsistent results, why not use LOCAL_ONE? If
you prefer consistency but can tolerate occasional inconsistency,
DowngradingConsistencyRetryPolicy [1] might work for you. It's a
bit contentious in the community (some folks absolutely hate its
existence), but I've had a few use cases where it makes sense.
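
For what it's worth, wiring that in with the 4.x driver looks roughly like
this (a minimal sketch using programmatic config; it assumes the policy
class from [1] is available under that name in your driver version):

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
    import com.datastax.oss.driver.api.core.config.DriverConfigLoader;

    public class DowngradingRetryExample {
      public static void main(String[] args) {
        // Start from the default driver config, swapping in the
        // downgrading retry policy.
        try (CqlSession session = CqlSession.builder()
            .withConfigLoader(
                DriverConfigLoader.programmaticBuilder()
                    .withString(DefaultDriverOption.RETRY_POLICY_CLASS,
                        "DowngradingConsistencyRetryPolicy")
                    .build())
            .build()) {
          // On UnavailableException the policy retries at a lower CL
          // (e.g. LOCAL_QUORUM -> LOCAL_ONE) instead of failing fast.
          session.execute("SELECT release_version FROM system.local");
        }
      }
    }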

Or is this the scenario where 3 nodes fail and they happen to contain the
same token range? I'm curious what the math looks like. Are you using 256
vnodes by any chance?
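
My hunch is that with 256 vnodes, almost any 3 concurrent failures in a DC
overlap on some range. A toy Monte Carlo to check (random single-ring,
SimpleStrategy-style placement, ignoring racks; every parameter here is
invented):

    import java.util.*;

    public class VnodeQuorumSim {
      public static void main(String[] args) {
        int nodes = 30, rf = 3, vnodes = 256, failures = 3, trials = 1000;
        Random rnd = new Random(42);
        int quorumLost = 0;
        for (int t = 0; t < trials; t++) {
          int total = nodes * vnodes;
          long[] token = new long[total];
          int[] owner = new int[total];
          Integer[] ring = new Integer[total];
          for (int i = 0; i < total; i++) {
            token[i] = rnd.nextLong();   // random vnode token
            owner[i] = i / vnodes;       // node that owns this vnode
            ring[i] = i;
          }
          Arrays.sort(ring, Comparator.comparingLong(i -> token[i]));
          Set<Integer> down = new HashSet<>();
          while (down.size() < failures) down.add(rnd.nextInt(nodes));
          boolean lost = false;
          for (int i = 0; i < total && !lost; i++) {
            // Replicas of the range ending at position i: the next rf
            // distinct node owners walking clockwise around the ring.
            Set<Integer> replicas = new LinkedHashSet<>();
            for (int j = i; replicas.size() < rf; j++) {
              replicas.add(owner[ring[j % total]]);
            }
            int downReplicas = 0;
            for (int r : replicas) if (down.contains(r)) downReplicas++;
            lost = downReplicas > rf / 2; // majority down => no quorum
          }
          if (lost) quorumLost++;
        }
        System.out.printf("quorum lost somewhere in %.1f%% of trials%n",
            100.0 * quorumLost / trials);
      }
    }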

If you execute a query and all 3 replicas in your local DC are down,
currently the driver throws an exception. I don't see how putting this
logic on the server helps, since the driver would need to be aware of
REMOTE_QUORUM to route to the remote replicas anyway. At that point you're
just adding an extra hop between the app and the database.

If you have a database outage so severe that all 3 replicas are
unavailable, what's the state of the rest of the cluster? I'm genuinely
curious how often you'd have a case where 3 nodes with intersecting ranges
are offline and it's not a complete DC failure.

I agree with Patrick and Stefan here. If you want to stay in the DC, doing
it at the driver level gives you complete control. If your infra is failing
that spectacularly, evacuate.
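
If you do want automatic client-side spill-over rather than writing a
custom policy, recent 4.x drivers expose it as plain config. A sketch, with
option names as I remember them from the cross-datacenter failover page
Stefan linked below, so double-check against your driver version:

    datastax-java-driver {
      basic.load-balancing-policy {
        local-datacenter = dc1
      }
      advanced.load-balancing-policy.dc-failover {
        # let query plans fall back to up to 2 nodes per remote DC
        max-nodes-per-remote-dc = 2
      }
    }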

I also wonder if witness replicas would help you, if you're seeing that
much failure.

[1]
https://docs.datastax.com/en/drivers/java/4.17/com/datastax/oss/driver/api/core/retry/DowngradingConsistencyRetryPolicy.html


On Mon, Dec 1, 2025 at 7:37 PM Patrick McFadin <[email protected]> wrote:

> In my experience it usually makes more sense to evacuate out of a degraded
> DC at the application layer rather than having Cassandra silently fail over
> to a remote quorum. When a DC is in a bad state, it’s rarely “just a couple
> of Cassandra replicas are down.” The same underlying problem (network
> partitions, bad routing, DNS issues, load balancer problems, AZ failures,
> power incidents, I could go on...) is usually taking down everything else
> in that DC too.
> What I can see is that REMOTE_QUORUM will behave like a dynamic
> reconfiguration of the replica group for a given request. That’s
> fundamentally different from today’s NTS contract, where the replica set
> per DC is static. That would open up all sorts of new ops headaches.
>  - What happens to replicas in the degraded DC? Would hints get stored?
>  - I'm not sure if this would work with PAXOS since dynamic
> reconfiguration isn't supported. Two leaders possibly?
>  - Which now brings up Accord and TCM considerations...
>
> So my strong bias is, once a DC is degraded enough that Cassandra can’t
> achieve LOCAL_QUORUM, that DC is probably not a place you want to keep
> serving user traffic from. At that point, I’d rather have the application
> tier evacuate and talk to a healthy DC, instead of the database internally
> changing its consistency behavior under the covers. It's just a ton of work
> for a problem with a simpler solution.
>
> Patrick
>
>
> On Mon, Dec 1, 2025 at 7:07 PM Jaydeep Chovatia <
> [email protected]> wrote:
>
>> >For the record, there is a whole section about this here (sorry for
>> pasting the same link twice in my original mail)
>>
>> Thanks, Stefan, for sharing the details about the client-side failover
>> mechanism. This technique is definitely useful when the entire region is
>> down.
>>
>> This proposal targets the use case in which only a few nodes in the
>> region fail, but not all. Use cases that prioritize availability over
>> latency can opt into this option, and the server will automatically
>> fulfill those requests from the remote region.
>>
>> Jaydeep
>>
>> On Mon, Dec 1, 2025 at 2:58 PM Štefan Miklošovič <[email protected]>
>> wrote:
>>
>>> For the record, there is a whole section about this here (sorry for
>>> pasting the same link twice in my original mail)
>>>
>>>
>>> https://docs.datastax.com/en/developer/java-driver/4.17/manual/core/load_balancing/index.html#cross-datacenter-failover
>>>
>>> Interesting to see your perspective, I can see how doing something
>>> without application changes might seem appealing.
>>>
>>> I wonder what others think about this.
>>>
>>> Regards
>>>
>>> On Mon, Dec 1, 2025 at 11:39 PM Qc L <[email protected]> wrote:
>>>
>>>> Hi Stefan,
>>>>
>>>> Thanks for raising the question about using a custom
>>>> LoadBalancingPolicy! Remote quorum and LB-based failover aim to solve
>>>> similar problems (maintaining availability during datacenter
>>>> degradation), so I did a comparison study of the server-side remote
>>>> quorum solution versus relying on driver-side logic.
>>>>
>>>>
>>>>    - For our use cases, we found that client-side failover is expensive
>>>>    to operate and leads to fragmented behavior during incidents. We also
>>>>    prefer failover to be triggered only when needed, not automatically.
>>>>    A server-side mechanism gives us controlled, predictable behavior and
>>>>    makes the system easier to operate.
>>>>    - Remote quorum is implemented on the server, where the coordinator
>>>>    can intentionally form a quorum using replicas in a backup region
>>>>    when the local region cannot satisfy LOCAL_QUORUM or EACH_QUORUM. The
>>>>    database determines the fallback region, how replicas are selected,
>>>>    and ensures consistency guarantees still hold.
>>>>
>>>> By keeping this logic internal to the server, we provide a unified,
>>>> consistent behavior across all clients without requiring any application
>>>> changes.
>>>>
>>>> Happy to discuss further if helpful.
>>>>
>>>> On Mon, Dec 1, 2025 at 12:57 PM Štefan Miklošovič <
>>>> [email protected]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> before going deeper into your proposal ... I just have to ask, have
>>>>> you tried e.g. a custom LoadBalancingPolicy in the driver, if you use
>>>>> one?
>>>>>
>>>>> I can imagine that if the driver detects a node to be down, then based
>>>>> on its "distance" / the DC it belongs to, you might start to create a
>>>>> different query plan which talks to a remote DC, and similar.
>>>>>
>>>>>
>>>>> https://docs.datastax.com/en/drivers/java/4.2/com/datastax/oss/driver/api/core/loadbalancing/LoadBalancingPolicy.html
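>>>>>
>>>>> Roughly something like this, perhaps (a very rough sketch against the
>>>>> 4.x driver; the class, the hardcoded DC name and the fallback heuristic
>>>>> are made up for illustration, untested):
>>>>>
>>>>> import java.util.ArrayDeque;
>>>>> import java.util.Objects;
>>>>> import java.util.Queue;
>>>>> import com.datastax.oss.driver.api.core.context.DriverContext;
>>>>> import com.datastax.oss.driver.api.core.metadata.Node;
>>>>> import com.datastax.oss.driver.api.core.metadata.NodeState;
>>>>> import com.datastax.oss.driver.api.core.session.Request;
>>>>> import com.datastax.oss.driver.api.core.session.Session;
>>>>> import com.datastax.oss.driver.internal.core.loadbalancing.DefaultLoadBalancingPolicy;
>>>>>
>>>>> public class RemoteFallbackPolicy extends DefaultLoadBalancingPolicy {
>>>>>
>>>>>   private static final String LOCAL_DC = "dc1"; // hypothetical DC name
>>>>>
>>>>>   public RemoteFallbackPolicy(DriverContext context, String profileName) {
>>>>>     super(context, profileName);
>>>>>   }
>>>>>
>>>>>   @Override
>>>>>   public Queue<Node> newQueryPlan(Request request, Session session) {
>>>>>     Queue<Node> plan = super.newQueryPlan(request, session);
>>>>>     if (!plan.isEmpty() || session == null) {
>>>>>       return plan; // local DC still has live candidates
>>>>>     }
>>>>>     // Local DC exhausted: fall back to whatever remote nodes are up.
>>>>>     Queue<Node> fallback = new ArrayDeque<>();
>>>>>     for (Node node : session.getMetadata().getNodes().values()) {
>>>>>       if (!Objects.equals(node.getDatacenter(), LOCAL_DC)
>>>>>           && node.getState() == NodeState.UP) {
>>>>>         fallback.add(node);
>>>>>       }
>>>>>     }
>>>>>     return fallback;
>>>>>   }
>>>>> }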
>>>>>
>>>>> On Mon, Dec 1, 2025 at 8:54 PM Qc L <[email protected]> wrote:
>>>>>
>>>>>> Hello team,
>>>>>>
>>>>>> I’d like to propose adding a new REMOTE_QUORUM consistency level to
>>>>>> Cassandra, to handle simultaneous host unavailability in the local
>>>>>> data center that prevents it from reaching the required quorum.
>>>>>>
>>>>>> Background
>>>>>> NetworkTopologyStrategy is the most commonly used strategy at Uber,
>>>>>> and we use LOCAL_QUORUM for reads/writes in many use cases. Our
>>>>>> Cassandra deployment in each data center currently relies on a
>>>>>> majority of replicas being healthy to consistently achieve local
>>>>>> quorum.
>>>>>>
>>>>>> Current behavior
>>>>>> When a local data center in a Cassandra deployment experiences
>>>>>> outages, network isolation, or maintenance events, the EACH_QUORUM /
>>>>>> LOCAL_QUORUM consistency level will fail for both reads and writes if
>>>>>> enough replicas in the local data center are unavailable (with RF=3
>>>>>> per data center, a local quorum needs 2 of the 3 local replicas). In
>>>>>> this configuration, simultaneous host unavailability can temporarily
>>>>>> prevent the cluster from reaching the required quorum for reads and
>>>>>> writes. For applications that require high availability and a seamless
>>>>>> user experience, this can lead to service downtime and a noticeable
>>>>>> drop in overall availability. See attached Figure 1.
>>>>>> [image: figure1.png]
>>>>>>
>>>>>> Proposed Solution
>>>>>> To prevent this issue and ensure a seamless user experience, we can
>>>>>> use the Remote Quorum consistency level as a fallback mechanism in
>>>>>> scenarios where local replicas are unavailable. Remote Quorum in
>>>>>> Cassandra refers to a read or write operation that achieves quorum (a
>>>>>> majority of replicas) in the remote data center, rather than relying
>>>>>> solely on replicas within the local data center. See attached Figure 2.
>>>>>>
>>>>>> [image: figure2.png]
>>>>>>
>>>>>> We will add a feature to override the read/write consistency level
>>>>>> on the server side. When local replicas are not available, we will
>>>>>> override the server-side write consistency level from EACH_QUORUM to
>>>>>> REMOTE_QUORUM. Note that implementing this change on the client side
>>>>>> would require protocol changes in CQL; we only add it on the server
>>>>>> side, where it can only be used internally by the server.
>>>>>>
>>>>>> For example, given the following Cassandra setup:
>>>>>>
>>>>>> CREATE KEYSPACE ks WITH REPLICATION = {
>>>>>>   'class': 'NetworkTopologyStrategy',
>>>>>>   'cluster1': 3,
>>>>>>   'cluster2': 3,
>>>>>>   'cluster3': 3
>>>>>> };
>>>>>>
>>>>>> The selected approach for this design is to explicitly configure a
>>>>>> backup data center mapping for the local data center, where each data
>>>>>> center defines its preferred failover target. For example:
>>>>>>
>>>>>> remote_quorum_target_data_center:
>>>>>>   cluster1: cluster2
>>>>>>   cluster2: cluster3
>>>>>>   cluster3: cluster1
>>>>>>
>>>>>> Implementations
>>>>>> We propose the following features in Cassandra to address data center
>>>>>> failure scenarios
>>>>>>
>>>>>>    1. Introduce a new consistency level called REMOTE_QUORUM.
>>>>>>    2. A feature to override the read/write consistency level on the
>>>>>>    server side, controlled by a feature flag; a nodetool command turns
>>>>>>    the server-side fallback on/off (sketched below).
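>>>>>>
>>>>>> Conceptually, the coordinator-side override could look roughly like
>>>>>> this (all names and structure here are hypothetical, for illustration
>>>>>> only, not actual Cassandra internals):
>>>>>>
>>>>>> // Hypothetical coordinator-side fallback, for illustration only.
>>>>>> enum ConsistencyLevel { LOCAL_QUORUM, EACH_QUORUM, REMOTE_QUORUM }
>>>>>>
>>>>>> final class RemoteQuorumFallback {
>>>>>>   // Toggled at runtime, e.g. by the nodetool command in step 2.
>>>>>>   static volatile boolean enabled = false;
>>>>>>
>>>>>>   static ConsistencyLevel effective(ConsistencyLevel requested,
>>>>>>                                     boolean localQuorumReachable) {
>>>>>>     if (enabled
>>>>>>         && !localQuorumReachable
>>>>>>         && (requested == ConsistencyLevel.LOCAL_QUORUM
>>>>>>             || requested == ConsistencyLevel.EACH_QUORUM)) {
>>>>>>       // Form the quorum in the configured backup DC instead, per the
>>>>>>       // remote_quorum_target_data_center mapping above.
>>>>>>       return ConsistencyLevel.REMOTE_QUORUM;
>>>>>>     }
>>>>>>     return requested;
>>>>>>   }
>>>>>> }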
>>>>>>
>>>>>> Why remote quorum is useful
>>>>>> As shown in Figure 2, when one of the data centers fails, LOCAL_QUORUM
>>>>>> and EACH_QUORUM will fail, since two replicas are unavailable and
>>>>>> cannot meet the quorum requirement in the local data center.
>>>>>>
>>>>>> During the incident, we can use nodetool to enable failover to remote
>>>>>> quorum. With failover enabled, reads and writes fail over to the
>>>>>> remote data center, which avoids the availability drop.
>>>>>>
>>>>>> Any thoughts or concerns?
>>>>>> Thanks in advance for your feedback,
>>>>>>
>>>>>> Qiaochu
>>>>>>
>>>>>>
