+1
> On May 6, 2025, at 10:53 AM, Dmitry Konstantinov <netud...@gmail.com> wrote:
>
> +1 (nb)
>
> On Tue, 6 May 2025 at 17:32, Aleksey Yeshchenko <alek...@apple.com> wrote:
>> +1
>>
>>> On 5 May 2025, at 23:24, Blake Eggleston <bl...@ultrablake.com> wrote:
>>>
>>> Regarding how mutation tracking relates to existing backup systems that
>>> account for repaired vs unrepaired sstables: mutation tracking will continue
>>> to promote sstables to repaired once we know the data they contain has been
>>> fully reconciled. The main difference is that they won't be promoted as part
>>> of an explicit range repair, but by compaction, as they become eligible.
>>>
>>> (also +1 to finishing witnesses)
>>>
>>> On Mon, May 5, 2025, at 11:45 AM, Benedict Elliott Smith wrote:
>>>> Consistent backup/restore is a fundamentally hard and unsolved problem for
>>>> Cassandra today (without any of the mentioned features). In particular, any
>>>> backup/restore process today breaks the real-time guarantee of the
>>>> linearizability property (most notably for LWTs) across partitions.
>>>>
>>>> Fixing this should be relatively straightforward for Accord, and something
>>>> we intend to address in follow-up work. Fixing it for eventually consistent
>>>> (or Paxos/LWT) operations is, I think, achievable with or without mutation
>>>> tracking (probably easier with it). I'm not aware of any plans to tackle
>>>> this, though.
>>>>
>>>> Witness replicas should not particularly matter at all to any of the above.
>>>>
>>>>> On 5 May 2025, at 18:49, Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>>>
>>>>> It took me a bit to wrap my head around how this works, but now that I
>>>>> think I understand the idea, it sounds like a solid improvement. Being
>>>>> able to achieve the same results as a full quorum while paying 1/3 less in
>>>>> storage is a *big deal*, and I know several teams that would be interested.
>>>>>
>>>>> One thing I'm curious about (and we can break it out into a separate
>>>>> discussion) is how all the functionality that requires coordination and
>>>>> global state (repaired vs unrepaired) will affect backups. Without a
>>>>> synchronization primitive to take a cluster-wide snapshot, how can we
>>>>> safely restore from eventually consistent backups without risking
>>>>> consistency issues due to out-of-sync repaired status?
>>>>>
>>>>> I don't think we need to block any of the proposed work on this - it's
>>>>> just something that's been nagging at me, and I don't know enough about
>>>>> the nuances of Accord, Mutation Tracking, or Witness Replicas to say
>>>>> whether it affects things or not. If it does, let's make sure we have it
>>>>> documented [1].
>>>>>
>>>>> Jon
>>>>>
>>>>> [1]
>>>>> https://cassandra.apache.org/doc/latest/cassandra/managing/operating/backups.html
>>>>>
>>>>>
>>>>>
>>>>> On Mon, May 5, 2025 at 10:21 AM Nate McCall <zznat...@gmail.com> wrote:
>>>>> This sounds like a modern feature that will benefit a lot of folks in
>>>>> cutting storage costs, particularly in large deployments.
>>>>>
>>>>> I'd like to see a note on the CEP about documentation overhead as this is
>>>>> an important feature to communicate correctly, but that's just a nit. +1
>>>>> on moving forward with this overall.
>>>>>
>>>>> On Sun, May 4, 2025 at 1:58 PM Jordan West <jw...@apache.org> wrote:
>>>>> I’m generally supportive. The concept is one whose benefits I can see,
>>>>> and I also think the current implementation adds a lot of complexity to
>>>>> the codebase for something stuck in experimental mode. It will be great
>>>>> to have a more robust version built on a better approach.
>>>>>
>>>>> On Sun, May 4, 2025 at 00:27 Benedict <bened...@apache.org> wrote:
>>>>> +1
>>>>>
>>>>> This is an obviously good feature for operators that are storage-bound in
>>>>> multi-DC deployments but want to retain their latency characteristics
>>>>> during node maintenance. Log replicas are the right approach.
>>>>>
>>>>> > On 3 May 2025, at 23:42, sc...@paradoxica.net wrote:
>>>>> >
>>>>> > Hey everybody, bumping this CEP from Ariel in case you'd like some
>>>>> > weekend reading.
>>>>> >
>>>>> > We’d like to finish witnesses and bring them out of “experimental”
>>>>> > status now that Transactional Metadata and Mutation Tracking provide
>>>>> > the building blocks needed to complete them.
>>>>> >
>>>>> > Witnesses are part of a family of approaches in replicated storage
>>>>> > systems to maintain or boost availability and durability while reducing
>>>>> > storage costs. Log replicas are a close relative. Both are used by
>>>>> > leading cloud databases – for instance, Spanner implements witness
>>>>> > replicas [1] while DynamoDB implements log replicas [2].
>>>>> >
>>>>> > Witness replicas are a great fit for topologies that replicate at
>>>>> > greater than RF=3, most commonly multi-DC/multi-region deployments.
>>>>> > Today in Cassandra, all members of a voting quorum replicate all data
>>>>> > forever. Witness replicas let users break this coupling: they allow one
>>>>> > to define voting quorums that are larger than the number of copies of
>>>>> > the data stored in perpetuity.
>>>>> >
>>>>> > Take a 3-DC cluster replicated at RF=3 in each DC as an example. In
>>>>> > this topology, Cassandra stores 9 copies of the database forever, which
>>>>> > is huge storage amplification. Witnesses allow users to keep a voting
>>>>> > quorum of 9 members (3 per DC) while reducing the durable replicas to 2
>>>>> > per DC, e.g. two durable replicas and one witness. This maintains the
>>>>> > availability properties of an RF=3×3 topology while reducing storage
>>>>> > costs by 33%, going from 9 copies to 6.
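>>>>> >
>>>>> > To make that arithmetic concrete, here is a small illustrative sketch
>>>>> > (plain Python; the function and numbers are only an illustration of the
>>>>> > copy counting above, not anything from the CEP):
>>>>> >
>>>>> >     def long_lived_copies(dcs, durable_per_dc):
>>>>> >         # Copies of the data retained in perpetuity across the cluster.
>>>>> >         return dcs * durable_per_dc
>>>>> >
>>>>> >     baseline = long_lived_copies(dcs=3, durable_per_dc=3)   # RF=3 in 3 DCs -> 9
>>>>> >     witnessed = long_lived_copies(dcs=3, durable_per_dc=2)  # 2 durable + 1 witness per DC -> 6
>>>>> >     savings = 1 - witnessed / baseline
>>>>> >     print(f"{baseline} -> {witnessed} copies ({savings:.0%} less long-term storage)")
>>>>> >     # 9 -> 6 copies (33% less long-term storage); the voting quorum stays at 9.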
>>>>> >
>>>>> > The role of a witness is to "witness" a write, persist it until it has
>>>>> > been reconciled among all durable replicas, and respond to read requests
>>>>> > for witnessed writes awaiting reconciliation. Note that witnesses don't
>>>>> > introduce a dedicated node role: whether a node is a durable replica or
>>>>> > a witness for a token depends only on its position in the ring.
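>>>>> >
>>>>> > A minimal sketch of that retention rule (hypothetical names, plain
>>>>> > Python, not the proposed implementation): a witnessed write only becomes
>>>>> > purgeable once every durable replica is known to have it.
>>>>> >
>>>>> >     from dataclasses import dataclass, field
>>>>> >
>>>>> >     @dataclass
>>>>> >     class WitnessedWrite:
>>>>> >         mutation_id: str
>>>>> >         durable_replicas: frozenset      # replicas that keep the data forever
>>>>> >         reconciled_to: set = field(default_factory=set)
>>>>> >
>>>>> >         def mark_reconciled(self, replica):
>>>>> >             self.reconciled_to.add(replica)
>>>>> >
>>>>> >         @property
>>>>> >         def purgeable(self):
>>>>> >             # Safety property: the witness may only drop the write once
>>>>> >             # it is known to exist on all durable replicas.
>>>>> >             return self.durable_replicas <= self.reconciled_to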
>>>>> >
>>>>> > This CEP builds on CEP-45: Mutation Tracking to establish the safety
>>>>> > property of the witness: guaranteeing that writes have been persisted
>>>>> > to all durable replicas before becoming purgeable. CEP-45's journal and
>>>>> > reconciliation design provide a great mechanism to ensure this while
>>>>> > avoiding the write amplification of incremental repair and
>>>>> > anticompaction.
>>>>> >
>>>>> > Take a look at the CEP if you're interested - happy to answer questions
>>>>> > and discuss further:
>>>>> > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking
>>>>> >
>>>>> > – Scott
>>>>> >
>>>>> > [1] https://cloud.google.com/spanner/docs/replication
>>>>> > [2] https://www.usenix.org/system/files/atc22-elhemali.pdf
>>>>> >
>>>>> >> On Apr 25, 2025, at 8:21 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>>> >>
>>>>> >> Hi all,
>>>>> >>
>>>>> >> The CEP is available here:
>>>>> >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=353601959
>>>>> >>
>>>>> >> We would like to propose CEP-46: Finish Transient Replication/Witnesses
>>>>> >> for adoption by the community. CEP-46 would rename transient replication
>>>>> >> to witnesses and leverage CEP-45 mutation tracking to implement witnesses
>>>>> >> as log replicas, replacing the incremental-repair-based witnesses.
>>>>> >>
>>>>> >> For those not familiar with transient replication: the keyspace
>>>>> >> replication settings declare some replicas as transient, and when
>>>>> >> incremental repair runs, the transient replicas delete their data
>>>>> >> instead of moving it into the repaired set.
>>>>> >>
>>>>> >> With log replicas, nodes only materialize mutations in their local LSM
>>>>> >> for ranges where they are full replicas rather than witnesses. For
>>>>> >> witness ranges, a node writes mutations to its local mutation tracking
>>>>> >> log and participates in background and read-time reconciliation. This
>>>>> >> saves the compaction overhead of IR-based witnesses, which have to
>>>>> >> materialize and compact all mutations, even those applied to witness
>>>>> >> ranges.
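>>>>> >>
>>>>> >> Roughly, the per-range write path described above looks like the
>>>>> >> following sketch (hypothetical names in plain Python, just to show the
>>>>> >> routing; it is not the actual implementation):
>>>>> >>
>>>>> >>     def apply_mutation(node, mutation):
>>>>> >>         # Every tracked write is recorded in the local mutation tracking log.
>>>>> >>         node.mutation_log.append(mutation)
>>>>> >>         if node.is_full_replica(mutation.token):
>>>>> >>             # Full-replica ranges: also materialize the write in the
>>>>> >>             # local LSM (memtable/sstables), as today.
>>>>> >>             node.memtable.apply(mutation)
>>>>> >>         # Witness ranges stop at the log: no memtable write and no
>>>>> >>         # compaction, which is the overhead saved vs IR-based witnesses.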
>>>>> >>
>>>>> >> This would address one of the biggest issues with witnesses: the lack
>>>>> >> of monotonic reads. Implementation-complexity-wise, this would actually
>>>>> >> delete code compared to what would be required to complete IR-based
>>>>> >> witnesses, because most of the heavy lifting is already done by mutation
>>>>> >> tracking.
>>>>> >>
>>>>> >> Log replicas also make it much more practical to realize the cost
>>>>> >> savings of witnesses, because log replicas have easier-to-characterize
>>>>> >> resource consumption requirements (roughly write rate *
>>>>> >> recovery/reconfiguration time) and target a 10x improvement in write
>>>>> >> throughput. This makes it safer and easier to know how much capacity
>>>>> >> can be omitted.
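>>>>> >>
>>>>> >> As a back-of-the-envelope illustration of that sizing shape (the
>>>>> >> numbers below are made up; only the write rate * recovery time form
>>>>> >> comes from the paragraph above):
>>>>> >>
>>>>> >>     # Illustrative capacity estimate for a witness/log replica.
>>>>> >>     write_rate_mb_per_s = 50        # assumed ingest for witness ranges
>>>>> >>     recovery_window_s = 6 * 3600    # assumed recovery/reconfiguration time
>>>>> >>
>>>>> >>     # Log retained on the witness is roughly bounded by the writes that
>>>>> >>     # accumulate while waiting for reconciliation to complete.
>>>>> >>     log_storage_gb = write_rate_mb_per_s * recovery_window_s / 1024
>>>>> >>     print(f"~{log_storage_gb:.0f} GB of log storage to provision")  # ~1055 GB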
>>>>> >>
>>>>> >> Thanks,
>>>>> >> Ariel
>>>>> >
>>>
>>
>
>
>
> --
> Dmitry Konstantinov