It took me a bit to wrap my head around how this works, but now that I think I understand the idea, it sounds like a solid improvement. Being able to achieve the same results as a full quorum at a third less storage cost is a *big deal*, and I know several teams that would be interested.
One thing I'm curious about (and we can break it out into a separate discussion) is how all the functionality that requires coordination and global state (repaired vs. non-repaired) will affect backups. Without a synchronization primitive to take a cluster-wide snapshot, how can we safely restore from eventually consistent backups without risking consistency issues due to out-of-sync repaired status? I don't think we need to block any of the proposed work on this - it's just something that's been nagging at me, and I don't know enough about the nuances of Accord, Mutation Tracking, or Witness Replicas to say whether it affects things or not. If it does, let's make sure we have that documented [1].

Jon

[1] https://cassandra.apache.org/doc/latest/cassandra/managing/operating/backups.html

On Mon, May 5, 2025 at 10:21 AM Nate McCall <zznat...@gmail.com> wrote:

> This sounds like a modern feature that will benefit a lot of folks in cutting storage costs, particularly in large deployments.
>
> I'd like to see a note on the CEP about documentation overhead, as this is an important feature to communicate correctly, but that's just a nit. +1 on moving forward with this overall.
>
> On Sun, May 4, 2025 at 1:58 PM Jordan West <jw...@apache.org> wrote:
>
>> I’m generally supportive. The concept is one that I can see the benefits of, and I also think the current implementation adds a lot of complexity to the codebase for being stuck in experimental mode. It will be great to have a more robust version built on a better approach.
>>
>> On Sun, May 4, 2025 at 00:27 Benedict <bened...@apache.org> wrote:
>>
>>> +1
>>>
>>> This is an obviously good feature for operators that are storage-bound in multi-DC deployments but want to retain their latency characteristics during node maintenance. Log replicas are the right approach.
>>>
>>> > On 3 May 2025, at 23:42, sc...@paradoxica.net wrote:
>>> >
>>> > Hey everybody, bumping this CEP from Ariel in case you'd like some weekend reading.
>>> >
>>> > We’d like to finish witnesses and bring them out of “experimental” status now that Transactional Metadata and Mutation Tracking provide the building blocks needed to complete them.
>>> >
>>> > Witnesses are part of a family of approaches in replicated storage systems to maintain or boost availability and durability while reducing storage costs. Log replicas are a close relative. Both are used by leading cloud databases – for instance, Spanner implements witness replicas [1] while DynamoDB implements log replicas [2].
>>> >
>>> > Witness replicas are a great fit for topologies that replicate at greater than RF=3 – most commonly multi-DC/multi-region deployments. Today in Cassandra, all members of a voting quorum replicate all data forever. Witness replicas let users break this coupling. They allow one to define voting quorums that are larger than the number of copies of data stored in perpetuity.
>>> >
>>> > Take a 3× DC cluster replicated at RF=3 in each DC as an example. In this topology, Cassandra stores 9× copies of the database forever - huge storage amplification. Witnesses allow users to maintain a voting quorum of 9 members (3× per DC) but reduce the durable replicas to 2× per DC – e.g., two durable replicas and one witness. This maintains the availability properties of an RF=3×3 topology while reducing storage costs by 33%, going from 9× copies to 6×.
>>> >
>>> > The role of a witness is to "witness" a write and persist it until it has been reconciled among all durable replicas, and to respond to read requests for witnessed writes awaiting reconciliation. Note that witnesses don't introduce a dedicated role for a node – whether a node is a durable replica or a witness for a token just depends on its position in the ring.
>>> >
>>> > This CEP builds on CEP-45: Mutation Tracking to establish the safety property of the witness: guaranteeing that writes have been persisted to all durable replicas before becoming purgeable. CEP-45's journal and reconciliation design provide a great mechanism to ensure this while avoiding the write amplification of incremental repair and anticompaction.
>>> >
>>> > Take a look at the CEP if you're interested - happy to answer questions and discuss further: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking
>>> >
>>> > – Scott
>>> >
>>> > [1] https://cloud.google.com/spanner/docs/replication
>>> > [2] https://www.usenix.org/system/files/atc22-elhemali.pdf
>>> >
>>> >> On Apr 25, 2025, at 8:21 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>> >>
>>> >> Hi all,
>>> >>
>>> >> The CEP is available here: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=353601959
>>> >>
>>> >> We would like to propose CEP-46: Finish Transient Replication/Witnesses for adoption by the community. CEP-46 would rename transient replication to witnesses and leverage mutation tracking to implement witnesses as CEP-45 Mutation Tracking-based log replicas, as a replacement for incremental repair-based witnesses.
>>> >>
>>> >> For those not familiar with transient replication: the keyspace replication settings declare some replicas as transient, and when incremental repair runs, the transient replicas delete data instead of moving it into the repaired set.
>>> >>
>>> >> With log replicas, nodes only materialize mutations in their local LSM for ranges where they are full replicas and not witnesses. For witness ranges, a node will write mutations to its local mutation tracking log and participate in background and read-time reconciliation. This saves the compaction overhead of IR-based witnesses, which have to materialize and compact all mutations, even those applied to witness ranges.
>>> >>
>>> >> This would address one of the biggest issues with witnesses, which is the lack of monotonic reads. Implementation-complexity-wise, this would actually delete code compared to what would be required to complete IR-based witnesses, because most of the heavy lifting is already done by mutation tracking.
>>> >>
>>> >> Log replicas also make it much more practical to realize the cost savings of witnesses, because log replicas have easier-to-characterize resource consumption requirements (write rate × recovery/reconfiguration time) and target a 10x improvement in write throughput. This makes knowing how much capacity can be omitted safer and easier.
>>> >>
>>> >> Thanks,
>>> >> Ariel
>>> >
>>>
>>
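For anyone who wants to picture the 3× DC example Scott gives above, here is a minimal sketch using the DataStax Java driver and today's transient-replication keyspace syntax, where a per-DC factor of '3/1' means three voting replicas, one of which is transient (a witness, in CEP-46's terminology). The keyspace and datacenter names are illustrative, transient replication is still an experimental feature that must be enabled in cassandra.yaml, and the CEP may change the syntax or naming, so treat this as a sketch rather than the finished feature.

// Sketch only: keyspace/DC names are illustrative and CEP-46 may change the
// syntax or naming. '3/1' is today's transient-replication notation:
// 3 voting replicas per DC, 1 of them transient (a witness).
import com.datastax.oss.driver.api.core.CqlSession;

public class WitnessKeyspaceSketch {
    public static void main(String[] args) {
        // Connects to a local node (127.0.0.1:9042) using driver defaults.
        try (CqlSession session = CqlSession.builder().build()) {
            session.execute(
                "CREATE KEYSPACE IF NOT EXISTS app_data WITH replication = {"
                    + "'class': 'NetworkTopologyStrategy', "
                    + "'dc1': '3/1', 'dc2': '3/1', 'dc3': '3/1'}");
            // Voting quorum: 9 members (3 per DC); durable copies: 6 (2 per DC) -
            // the 33% storage reduction described in the thread.
        }
    }
}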
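And to make the log-replica write path Ariel describes more concrete, here is a purely hypothetical sketch; every type and method name below is a placeholder of my own, not Cassandra's actual internals. The idea it illustrates is that a node journals every mutation it votes on, but only materializes the mutation into its LSM for ranges where it is a full (durable) replica.

// Hypothetical sketch of the write path described above. All names and types
// are illustrative placeholders; this is not Cassandra's real implementation.
import java.util.ArrayList;
import java.util.List;

class LogReplicaWritePathSketch {
    record Mutation(String partitionKey, byte[] payload) {}

    // Placeholder stand-ins for the mutation-tracking journal and the memtable/LSM.
    private final List<Mutation> mutationTrackingLog = new ArrayList<>();
    private final List<Mutation> memtable = new ArrayList<>();

    void applyLocally(Mutation mutation, boolean fullReplicaForRange) {
        // Every voting replica journals the write so it can serve reads for
        // un-reconciled writes and take part in background/read-time reconciliation.
        mutationTrackingLog.add(mutation);

        if (fullReplicaForRange) {
            // Durable replicas also materialize the write (memtable -> sstables), as today.
            memtable.add(mutation);
        }
        // Witness ranges skip materialization, avoiding the compaction overhead of
        // IR-based witnesses; the journal entry becomes purgeable once reconciliation
        // confirms all durable replicas have persisted the write.
    }
}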