It took me a bit to wrap my head around how this works, but now that I think I understand the idea, it sounds like a solid improvement. Being able to achieve the same results as a full quorum at a third less storage cost is a *big deal*, and I know several teams that would be interested.
One thing I'm curious about (and we can break it out into a separate discussion) is how all the functionality that requires coordination and global state (repaired vs. non-repaired) will affect backups. Without a synchronization primitive to take a cluster-wide snapshot, how can we safely restore from eventually consistent backups without risking consistency issues due to out-of-sync repaired status? I don't think we need to block any of the proposed work on this - it's just something that's been nagging at me, and I don't know enough about the nuances of Accord, Mutation Tracking, or Witness Replicas to say whether it affects things or not. If it does, let's make sure we have that documented [1].

Jon

[1] https://cassandra.apache.org/doc/latest/cassandra/managing/operating/backups.html

On Mon, May 5, 2025 at 10:21 AM Nate McCall <zznat...@gmail.com> wrote:

> This sounds like a modern feature that will benefit a lot of folks in cutting storage costs, particularly in large deployments.
>
> I'd like to see a note on the CEP about documentation overhead, as this is an important feature to communicate correctly, but that's just a nit. +1 on moving forward with this overall.
>
> On Sun, May 4, 2025 at 1:58 PM Jordan West <jw...@apache.org> wrote:
>
>> I’m generally supportive. The concept is one that I can see the benefits of, and I also think the current implementation adds a lot of complexity to the codebase for being stuck in experimental mode. It will be great to have a more robust version built on a better approach.
>>
>> On Sun, May 4, 2025 at 00:27 Benedict <bened...@apache.org> wrote:
>>
>>> +1
>>>
>>> This is an obviously good feature for operators that are storage-bound in multi-DC deployments but want to retain their latency characteristics during node maintenance. Log replicas are the right approach.
>>>
>>> > On 3 May 2025, at 23:42, sc...@paradoxica.net wrote:
>>> >
>>> > Hey everybody, bumping this CEP from Ariel in case you'd like some weekend reading.
>>> >
>>> > We’d like to finish witnesses and bring them out of “experimental” status now that Transactional Metadata and Mutation Tracking provide the building blocks needed to complete them.
>>> >
>>> > Witnesses are part of a family of approaches in replicated storage systems to maintain or boost availability and durability while reducing storage costs. Log replicas are a close relative. Both are used by leading cloud databases – for instance, Spanner implements witness replicas [1] while DynamoDB implements log replicas [2].
>>> >
>>> > Witness replicas are a great fit for topologies that replicate at greater than RF=3 – most commonly multi-DC/multi-region deployments. Today in Cassandra, all members of a voting quorum replicate all data forever. Witness replicas let users break this coupling. They allow one to define voting quorums that are larger than the number of copies of data stored in perpetuity.
>>> >
>>> > Take a 3× DC cluster replicated at RF=3 in each DC as an example. In this topology, Cassandra stores 9× copies of the database forever - huge storage amplification. Witnesses allow users to maintain a voting quorum of 9 members (3× per DC) but reduce the durable replicas to 2× per DC – e.g., two durable replicas and one witness. This maintains the availability properties of an RF=3×3 topology while reducing storage costs by 33%, going from 9× copies to 6×.
>>> >
>>> > The role of a witness is to "witness" a write and persist it until it has been reconciled among all durable replicas, and to respond to read requests for witnessed writes awaiting reconciliation. Note that witnesses don't introduce a dedicated role for a node – whether a node is a durable replica or a witness for a token just depends on its position in the ring.
>>> >
>>> > This CEP builds on CEP-45: Mutation Tracking to establish the safety property of the witness: guaranteeing that writes have been persisted to all durable replicas before becoming purgeable. CEP-45's journal and reconciliation design provide a great mechanism to ensure this while avoiding the write amplification of incremental repair and anticompaction.
>>> >
>>> > Take a look at the CEP if you're interested - happy to answer questions and discuss further: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking
>>> >
>>> > – Scott
>>> >
>>> > [1] https://cloud.google.com/spanner/docs/replication
>>> > [2] https://www.usenix.org/system/files/atc22-elhemali.pdf
>>> >
>>> >> On Apr 25, 2025, at 8:21 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>> >>
>>> >> Hi all,
>>> >>
>>> >> The CEP is available here: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=353601959
>>> >>
>>> >> We would like to propose CEP-46: Finish Transient Replication/Witnesses for adoption by the community. CEP-46 would rename transient replication to witnesses and leverage mutation tracking to implement witnesses as CEP-45 Mutation Tracking-based log replicas, as a replacement for incremental repair-based witnesses.
>>> >>
>>> >> For those not familiar with transient replication: the keyspace replication settings declare some replicas as transient, and when incremental repair runs, the transient replicas delete data instead of moving it into the repaired set.
>>> >>
>>> >> With log replicas, nodes only materialize mutations in their local LSM for ranges where they are full replicas and not witnesses. For witness ranges, a node will write mutations to its local mutation tracking log and participate in background and read-time reconciliation. This saves the compaction overhead of IR-based witnesses, which have to materialize and compact all mutations, even those applied to witness ranges.
>>> >>
>>> >> This would address one of the biggest issues with witnesses, which is the lack of monotonic reads. Implementation-complexity-wise, this would actually delete code compared to what would be required to complete IR-based witnesses, because most of the heavy lifting is already done by mutation tracking.
>>> >>
>>> >> Log replicas also make it much more practical to realize the cost savings of witnesses, because log replicas have easier-to-characterize resource consumption requirements (write rate × recovery/reconfiguration time) and target a 10x improvement in write throughput. This makes knowing how much capacity can be omitted safer and easier.
>>> >>
>>> >> Thanks,
>>> >> Ariel
>>> >
>>>
>>
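For anyone who wants to picture the 3× DC example Scott gives above, here is a minimal sketch using the DataStax Java driver and today's transient-replication keyspace syntax, where a per-DC factor of '3/1' means three voting replicas, one of which is transient (a witness, in CEP-46's terminology). The keyspace and datacenter names are illustrative, transient replication is still an experimental feature that must be enabled in cassandra.yaml, and the CEP may change the syntax or naming, so treat this as a sketch rather than the finished feature.

// Sketch only: keyspace/DC names are illustrative and CEP-46 may change the
// syntax or naming. '3/1' is today's transient-replication notation:
// 3 voting replicas per DC, 1 of them transient (a witness).
import com.datastax.oss.driver.api.core.CqlSession;

public class WitnessKeyspaceSketch {
    public static void main(String[] args) {
        // Connects to a local node (127.0.0.1:9042) using driver defaults.
        try (CqlSession session = CqlSession.builder().build()) {
            session.execute(
                "CREATE KEYSPACE IF NOT EXISTS app_data WITH replication = {"
                    + "'class': 'NetworkTopologyStrategy', "
                    + "'dc1': '3/1', 'dc2': '3/1', 'dc3': '3/1'}");
            // Voting quorum: 9 members (3 per DC); durable copies: 6 (2 per DC) -
            // the 33% storage reduction described in the thread.
        }
    }
}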
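And to make the log-replica write path Ariel describes more concrete, here is a purely hypothetical sketch; every type and method name below is a placeholder of my own, not Cassandra's actual internals. The idea it illustrates is that a node journals every mutation it votes on, but only materializes the mutation into its LSM for ranges where it is a full (durable) replica.

// Hypothetical sketch of the write path described above. All names and types
// are illustrative placeholders; this is not Cassandra's real implementation.
import java.util.ArrayList;
import java.util.List;

class LogReplicaWritePathSketch {
    record Mutation(String partitionKey, byte[] payload) {}

    // Placeholder stand-ins for the mutation-tracking journal and the memtable/LSM.
    private final List<Mutation> mutationTrackingLog = new ArrayList<>();
    private final List<Mutation> memtable = new ArrayList<>();

    void applyLocally(Mutation mutation, boolean fullReplicaForRange) {
        // Every voting replica journals the write so it can serve reads for
        // un-reconciled writes and take part in background/read-time reconciliation.
        mutationTrackingLog.add(mutation);

        if (fullReplicaForRange) {
            // Durable replicas also materialize the write (memtable -> sstables), as today.
            memtable.add(mutation);
        }
        // Witness ranges skip materialization, avoiding the compaction overhead of
        // IR-based witnesses; the journal entry becomes purgeable once reconciliation
        // confirms all durable replicas have persisted the write.
    }
}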