Hi,

Planning to call a vote on Monday since there don't seem to be any major concerns.
Ariel

On Tue, May 6, 2025, at 4:32 PM, Bernardo Botella wrote:
> +1 (nb)
>
>> On May 6, 2025, at 1:19 PM, Josh McKenzie <jmcken...@apache.org> wrote:
>>
>> +1
>>
>> On Tue, May 6, 2025, at 4:06 PM, Yifan Cai wrote:
>>> +1 (nb)
>>>
>>> *From:* Ariel Weisberg <ar...@weisberg.ws>
>>> *Sent:* Tuesday, May 6, 2025 12:59:09 PM
>>> *To:* Claude Warren, Jr <dev@cassandra.apache.org>
>>> *Subject:* Re: [DISCUSS] CEP-46 Finish Transient Replication/Witnesses
>>>
>>> Hi,
>>>
>>> On Sun, May 4, 2025, at 4:57 PM, Jordan West wrote:
>>>> I’m generally supportive. The concept is one that I can see the benefits of, and I also think the current implementation adds a lot of complexity to the codebase for being stuck in experimental mode. It will be great to have a more robust version built on a better approach.
>>>
>>> One of the great things about this is that it actually deletes and simplifies implementation code - if you ignore the hat trick of mutation tracking making log-only replication possible, of course.
>>>
>>> So far it has mostly been deleted and changed lines to get the single partition read, range read, and write paths working. A lot of the code already exists for transient replication, so it's changed rather than new code. PaxosV2 and Accord will both need to become witness-aware, and that will be new code, but it's relatively straightforward in that it's just picking full replicas for reads.
>>>
>>> On Mon, May 5, 2025, at 1:21 PM, Nate McCall wrote:
>>>> I'd like to see a note on the CEP about documentation overhead as this is an important feature to communicate correctly, but that's just a nit. +1 on moving forward with this overall.
>>>
>>> There is documentation for transient replication at https://cassandra.apache.org/doc/4.0/cassandra/new/transientreplication.html which needs to be promoted out of "What's new", updated, and linked to the documentation for mutation tracking. I'll update the CEP to cover this.
>>>
>>> On Mon, May 5, 2025, at 1:49 PM, Jon Haddad wrote:
>>>> It took me a bit to wrap my head around how this works, but now that I think I understand the idea, it sounds like a solid improvement. Being able to achieve the same results as quorum but costing 1/3 less is a *big deal* and I know several teams that would be interested.
>>>
>>> 1/3rd is the "free" threshold where you don't increase your probability of experiencing data loss using quorums for common topologies. If you have a lot of replicas, say because you want copies in many places, you might be able to reduce further. Voting on what the value is is basically decoupled from how redundantly that value is stored long term.
>>>
>>>> One thing I'm curious about (and we can break it out into a separate discussion) is how all the functionality that requires coordination and global state (repaired vs non-repaired) will affect backups. Without a synchronization primitive to take a cluster-wide snapshot, how can we safely restore from eventually consistent backups without risking consistency issues due to out-of-sync repaired status?
>>>
>>> Witnesses don't make the consistency of backups better or worse, but they do add a little bit of complexity if your backups copy only the repaired data.
>>>
>>> The procedure you follow today - copy the repaired sstables for a range from a single replica and copy the unrepaired sstables from a quorum - would continue to apply. The added constraint with witnesses is that the single replica you pick to copy repaired sstables from needs to be a full replica, not a witness, for that range.
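To make that selection rule concrete, here is a rough sketch of a per-range backup plan under that constraint. Everything in it is hypothetical for illustration - the helper inputs (full_replicas, witnesses, quorum_size) and node names stand in for whatever replica metadata your backup tooling already has; they are not Cassandra APIs.

    def plan_backup(ranges, full_replicas, witnesses, quorum_size):
        """For each range, pick backup sources: repaired sstables come from exactly
        one FULL replica (never a witness); unrepaired sstables come from a quorum
        of the voting set (full replicas plus witnesses)."""
        plan = {}
        for rng in ranges:
            fulls = full_replicas[rng]        # durable replicas for this range
            voters = fulls + witnesses[rng]   # witnesses vote but don't keep data long term
            plan[rng] = {
                "repaired_from": fulls[0],                     # must not be a witness
                "unrepaired_from": voters[:quorum_size[rng]],  # any quorum of the voters
            }
        return plan

    # Example: one range with two full replicas, one witness, quorum of two.
    print(plan_backup(["(0,100]"],
                      {"(0,100]": ["n1", "n2"]},
                      {"(0,100]": ["n3"]},
                      {"(0,100]": 2}))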
>>>
>>> I don't think we have a way to get a consistent snapshot right now? Like there isn't even "run repair and repair will create a consistent snapshot for you to copy as a backup". And then, as Benedict points out, LWT (with async commit) and Accord (which also defaults to async commit and has multi-key transactions that can be torn) both don't make for consistent backups.
>>>
>>> We definitely need to follow up on leveraging new replication/transaction schemes to produce more consistent backups.
>>>
>>> Ariel
>>>>
>>>> On Sun, May 4, 2025 at 00:27 Benedict <bened...@apache.org> wrote:
>>>>> +1
>>>>>
>>>>> This is an obviously good feature for operators that are storage-bound in multi-DC deployments but want to retain their latency characteristics during node maintenance. Log replicas are the right approach.
>>>>>
>>>>> > On 3 May 2025, at 23:42, sc...@paradoxica.net wrote:
>>>>> >
>>>>> > Hey everybody, bumping this CEP from Ariel in case you'd like some weekend reading.
>>>>> >
>>>>> > We’d like to finish witnesses and bring them out of “experimental” status now that Transactional Metadata and Mutation Tracking provide the building blocks needed to complete them.
>>>>> >
>>>>> > Witnesses are part of a family of approaches in replicated storage systems to maintain or boost availability and durability while reducing storage costs. Log replicas are a close relative. Both are used by leading cloud databases – for instance, Spanner implements witness replicas [1] while DynamoDB implements log replicas [2].
>>>>> >
>>>>> > Witness replicas are a great fit for topologies that replicate at greater than RF=3 – most commonly multi-DC/multi-region deployments. Today in Cassandra, all members of a voting quorum replicate all data forever. Witness replicas let users break this coupling. They allow one to define voting quorums that are larger than the number of copies of data that are stored in perpetuity.
>>>>> >
>>>>> > Take a 3× DC cluster replicated at RF=3 in each DC as an example. In this topology, Cassandra stores 9× copies of the database forever - huge storage amplification. Witnesses allow users to maintain a voting quorum of 9 members (3× per DC) but reduce the durable replicas to 2× per DC – e.g., two durable replicas and one witness. This maintains the availability properties of an RF=3×3 topology while reducing storage costs by 33%, going from 9× copies to 6×.
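As a quick sanity check of the arithmetic in that example (the variable names below are just for illustration, and it assumes every DC uses the same two-durable-plus-one-witness split):

    # Long-term storage copies in the 3-DC, RF=3-per-DC example, before and after witnesses.
    dcs = 3
    voting_replicas_per_dc = 3      # quorum membership stays the same
    durable_replicas_per_dc = 2     # two durable replicas + one witness per DC

    copies_today = dcs * voting_replicas_per_dc            # 9 durable copies
    copies_with_witnesses = dcs * durable_replicas_per_dc  # 6 durable copies
    savings = 1 - copies_with_witnesses / copies_today     # ~0.33

    print(copies_today, copies_with_witnesses, f"{savings:.0%} less long-term storage")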
>>>>> > The role of a witness is to "witness" a write and persist it until it has been reconciled among all durable replicas, and to respond to read requests for witnessed writes awaiting reconciliation. Note that witnesses don't introduce a dedicated role for a node – whether a node is a durable replica or a witness for a token just depends on its position in the ring.
>>>>> >
>>>>> > This CEP builds on CEP-45: Mutation Tracking to establish the safety property of the witness: guaranteeing that writes have been persisted to all durable replicas before becoming purgeable. CEP-45's journal and reconciliation design provide a great mechanism to ensure this while avoiding the write amplification of incremental repair and anticompaction.
>>>>> >
>>>>> > Take a look at the CEP if you're interested - happy to answer questions and discuss further: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking
>>>>> >
>>>>> > – Scott
>>>>> >
>>>>> > [1] https://cloud.google.com/spanner/docs/replication
>>>>> > [2] https://www.usenix.org/system/files/atc22-elhemali.pdf
>>>>> >
>>>>> >> On Apr 25, 2025, at 8:21 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>>> >>
>>>>> >> Hi all,
>>>>> >>
>>>>> >> The CEP is available here: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=353601959
>>>>> >>
>>>>> >> We would like to propose CEP-46: Finish Transient Replication/Witnesses for adoption by the community. CEP-46 would rename transient replication to witnesses and leverage mutation tracking to implement witnesses as CEP-45 Mutation Tracking based log replicas, replacing incremental repair based witnesses.
>>>>> >>
>>>>> >> For those not familiar with transient replication: it has the keyspace replication settings declare some replicas as transient, and when incremental repair runs, the transient replicas delete data instead of moving it into the repaired set.
>>>>> >>
>>>>> >> With log replicas, nodes only materialize mutations in their local LSM for ranges where they are full replicas and not witnesses. For witness ranges, a node writes mutations to its local mutation tracking log and participates in background and read-time reconciliation. This saves the compaction overhead of IR-based witnesses, which have to materialize and compact all mutations, even those applied to witness ranges.
>>>>> >>
>>>>> >> This would address one of the biggest issues with witnesses, which is the lack of monotonic reads. Implementation-complexity-wise, this would actually delete code compared to what would be required to complete IR-based witnesses, because most of the heavy lifting is already done by mutation tracking.
>>>>> >>
>>>>> >> Log replicas also make it much more practical to realize the cost savings of witnesses, because log replicas have easier-to-characterize resource consumption requirements (write rate * recovery/reconfiguration time) and target a 10x improvement in write throughput. This makes knowing how much capacity can be omitted safer and easier.
>>>>> >>
>>>>> >> Thanks,
>>>>> >> Ariel
>>>>> >
>>>
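For anyone who hasn't tried the experimental feature, here is a minimal sketch of how transient replicas are declared today, based on the 4.0 transient replication docs linked earlier in the thread. The keyspace name, DC names, and contact point are made up, the yaml flag named in the comment is from memory, and the '<total>/<transient>' notation should be checked against the docs for your version; CEP-46 would keep the declaration model but rename transient replicas to witnesses.

    # Sketch only: a keyspace where each DC keeps 3 voting replicas, 1 of which is
    # transient (i.e., a witness), using the '<total>/<transient>' notation from the
    # 4.0 docs. Transient replication is experimental and must be enabled first - in
    # 4.0 the cassandra.yaml flag is, I believe, enable_transient_replication: true.
    from cassandra.cluster import Cluster  # DataStax Python driver

    session = Cluster(["127.0.0.1"]).connect()
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS witness_demo
        WITH replication = {
            'class': 'NetworkTopologyStrategy',
            'DC1': '3/1', 'DC2': '3/1', 'DC3': '3/1'
        }
    """)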