Hi,

Planning to call a vote on Monday since there don't seem to be any major concerns.
Ariel

On Tue, May 6, 2025, at 4:32 PM, Bernardo Botella wrote:
> +1 (nb)
>
>> On May 6, 2025, at 1:19 PM, Josh McKenzie <jmcken...@apache.org> wrote:
>>
>> +1
>>
>> On Tue, May 6, 2025, at 4:06 PM, Yifan Cai wrote:
>>> +1 (nb)
>>>
>>> *From:* Ariel Weisberg <ar...@weisberg.ws>
>>> *Sent:* Tuesday, May 6, 2025 12:59:09 PM
>>> *To:* Claude Warren, Jr <dev@cassandra.apache.org>
>>> *Subject:* Re: [DISCUSS] CEP-46 Finish Transient Replication/Witnesses
>>>
>>> Hi,
>>>
>>> On Sun, May 4, 2025, at 4:57 PM, Jordan West wrote:
>>>> I’m generally supportive. The concept is one that I can see the benefits of, and I also think the current implementation adds a lot of complexity to the codebase for being stuck in experimental mode. It will be great to have a more robust version built on a better approach.
>>>
>>> One of the great things about this is that it actually deletes and simplifies implementation code - if you ignore the hat trick of mutation tracking making log-only replication possible, of course.
>>>
>>> So far it has mostly been deleted and changed lines to get the single partition read, range read, and write paths working. A lot of the code already exists for transient replication, so it's changed rather than new code. PaxosV2 and Accord will both need to become witness-aware, and that will be new code, but it's relatively straightforward in that it's just picking full replicas for reads.
>>>
>>> On Mon, May 5, 2025, at 1:21 PM, Nate McCall wrote:
>>>> I'd like to see a note on the CEP about documentation overhead as this is an important feature to communicate correctly, but that's just a nit. +1 on moving forward with this overall.
>>>
>>> There is documentation for transient replication at https://cassandra.apache.org/doc/4.0/cassandra/new/transientreplication.html which needs to be promoted out of "What's new", updated, and linked to the documentation for mutation tracking. I'll update the CEP to cover this.
>>>
>>> On Mon, May 5, 2025, at 1:49 PM, Jon Haddad wrote:
>>>> It took me a bit to wrap my head around how this works, but now that I think I understand the idea, it sounds like a solid improvement. Being able to achieve the same results as quorum but costing 1/3 less is a *big deal* and I know several teams that would be interested.
>>>
>>> 1/3rd is the "free" threshold where you don't increase your probability of experiencing data loss using quorums for common topologies. If you have a lot of replicas, say because you want copies in many places, you might be able to reduce further. Voting on what the value is is basically decoupled from how redundantly that value is stored long term.
>>>
>>>> One thing I'm curious about (and we can break it out into a separate discussion) is how all the functionality that requires coordination and global state (repaired vs non-repaired) will affect backups. Without a synchronization primitive to take a cluster-wide snapshot, how can we safely restore from eventually consistent backups without risking consistency issues due to out-of-sync repaired status?
>>>
>>> Witnesses don't make the consistency of backups better or worse, but they do add a little bit of complexity if your backups copy only the repaired data.
>>>
>>> The procedure you follow today - copy the repaired sstables for a range from a single replica and copy the unrepaired sstables from a quorum - would continue to apply. The added constraint with witnesses is that the single replica you pick to copy repaired sstables from needs to be a full replica, not a witness, for that range.
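To make that selection rule concrete, here is a rough sketch of a per-range backup plan under that constraint. Everything in it is hypothetical for illustration - the helper inputs (full_replicas, witnesses, quorum_size) and node names stand in for whatever replica metadata your backup tooling already has; they are not Cassandra APIs.

    def plan_backup(ranges, full_replicas, witnesses, quorum_size):
        """For each range, pick backup sources: repaired sstables come from exactly
        one FULL replica (never a witness); unrepaired sstables come from a quorum
        of the voting set (full replicas plus witnesses)."""
        plan = {}
        for rng in ranges:
            fulls = full_replicas[rng]        # durable replicas for this range
            voters = fulls + witnesses[rng]   # witnesses vote but don't keep data long term
            plan[rng] = {
                "repaired_from": fulls[0],                     # must not be a witness
                "unrepaired_from": voters[:quorum_size[rng]],  # any quorum of the voters
            }
        return plan

    # Example: one range with two full replicas, one witness, quorum of two.
    print(plan_backup(["(0,100]"],
                      {"(0,100]": ["n1", "n2"]},
                      {"(0,100]": ["n3"]},
                      {"(0,100]": 2}))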
>>>
>>> I don't think we have a way to get a consistent snapshot right now? Like there isn't even "run repair and repair will create a consistent snapshot for you to copy as a backup". And then, as Benedict points out, LWT (with async commit) and Accord (which also defaults to async commit and has multi-key transactions that can be torn) both don't make for consistent backups.
>>>
>>> We definitely need to follow up on leveraging new replication/transaction schemes to produce more consistent backups.
>>>
>>> Ariel
>>>>
>>>> On Sun, May 4, 2025 at 00:27 Benedict <bened...@apache.org> wrote:
>>>>> +1
>>>>>
>>>>> This is an obviously good feature for operators that are storage-bound in multi-DC deployments but want to retain their latency characteristics during node maintenance. Log replicas are the right approach.
>>>>>
>>>>> > On 3 May 2025, at 23:42, sc...@paradoxica.net wrote:
>>>>> >
>>>>> > Hey everybody, bumping this CEP from Ariel in case you'd like some weekend reading.
>>>>> >
>>>>> > We’d like to finish witnesses and bring them out of “experimental” status now that Transactional Metadata and Mutation Tracking provide the building blocks needed to complete them.
>>>>> >
>>>>> > Witnesses are part of a family of approaches in replicated storage systems to maintain or boost availability and durability while reducing storage costs. Log replicas are a close relative. Both are used by leading cloud databases – for instance, Spanner implements witness replicas [1] while DynamoDB implements log replicas [2].
>>>>> >
>>>>> > Witness replicas are a great fit for topologies that replicate at greater than RF=3 – most commonly multi-DC/multi-region deployments. Today in Cassandra, all members of a voting quorum replicate all data forever. Witness replicas let users break this coupling. They allow one to define voting quorums that are larger than the number of copies of data that are stored in perpetuity.
>>>>> >
>>>>> > Take a 3× DC cluster replicated at RF=3 in each DC as an example. In this topology, Cassandra stores 9× copies of the database forever - huge storage amplification. Witnesses allow users to maintain a voting quorum of 9 members (3× per DC) but reduce the durable replicas to 2× per DC – e.g., two durable replicas and one witness. This maintains the availability properties of an RF=3×3 topology while reducing storage costs by 33%, going from 9× copies to 6×.
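As a quick sanity check of the arithmetic in that example (the variable names below are just for illustration, and it assumes every DC uses the same two-durable-plus-one-witness split):

    # Long-term storage copies in the 3-DC, RF=3-per-DC example, before and after witnesses.
    dcs = 3
    voting_replicas_per_dc = 3      # quorum membership stays the same
    durable_replicas_per_dc = 2     # two durable replicas + one witness per DC

    copies_today = dcs * voting_replicas_per_dc            # 9 durable copies
    copies_with_witnesses = dcs * durable_replicas_per_dc  # 6 durable copies
    savings = 1 - copies_with_witnesses / copies_today     # ~0.33

    print(copies_today, copies_with_witnesses, f"{savings:.0%} less long-term storage")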
>>>>> > The role of a witness is to "witness" a write and persist it until it has been reconciled among all durable replicas, and to respond to read requests for witnessed writes awaiting reconciliation. Note that witnesses don't introduce a dedicated role for a node – whether a node is a durable replica or a witness for a token just depends on its position in the ring.
>>>>> >
>>>>> > This CEP builds on CEP-45: Mutation Tracking to establish the safety property of the witness: guaranteeing that writes have been persisted to all durable replicas before becoming purgeable. CEP-45's journal and reconciliation design provide a great mechanism to ensure this while avoiding the write amplification of incremental repair and anticompaction.
>>>>> >
>>>>> > Take a look at the CEP if you're interested - happy to answer questions and discuss further: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking
>>>>> >
>>>>> > – Scott
>>>>> >
>>>>> > [1] https://cloud.google.com/spanner/docs/replication
>>>>> > [2] https://www.usenix.org/system/files/atc22-elhemali.pdf
>>>>> >
>>>>> >> On Apr 25, 2025, at 8:21 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>>> >>
>>>>> >> Hi all,
>>>>> >>
>>>>> >> The CEP is available here: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=353601959
>>>>> >>
>>>>> >> We would like to propose CEP-46: Finish Transient Replication/Witnesses for adoption by the community. CEP-46 would rename transient replication to witnesses and leverage mutation tracking to implement witnesses as CEP-45 Mutation Tracking based log replicas, replacing incremental repair based witnesses.
>>>>> >>
>>>>> >> For those not familiar with transient replication: it has the keyspace replication settings declare some replicas as transient, and when incremental repair runs, the transient replicas delete data instead of moving it into the repaired set.
>>>>> >>
>>>>> >> With log replicas, nodes only materialize mutations in their local LSM for ranges where they are full replicas and not witnesses. For witness ranges, a node writes mutations to its local mutation tracking log and participates in background and read-time reconciliation. This saves the compaction overhead of IR-based witnesses, which have to materialize and compact all mutations, even those applied to witness ranges.
>>>>> >>
>>>>> >> This would address one of the biggest issues with witnesses, which is the lack of monotonic reads. Implementation-complexity-wise, this would actually delete code compared to what would be required to complete IR-based witnesses, because most of the heavy lifting is already done by mutation tracking.
>>>>> >>
>>>>> >> Log replicas also make it much more practical to realize the cost savings of witnesses, because log replicas have easier-to-characterize resource consumption requirements (write rate * recovery/reconfiguration time) and target a 10x improvement in write throughput. This makes knowing how much capacity can be omitted safer and easier.
>>>>> >>
>>>>> >> Thanks,
>>>>> >> Ariel
>>>>> >
>>>
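For anyone who hasn't tried the experimental feature, here is a minimal sketch of how transient replicas are declared today, based on the 4.0 transient replication docs linked earlier in the thread. The keyspace name, DC names, and contact point are made up, the yaml flag named in the comment is from memory, and the '<total>/<transient>' notation should be checked against the docs for your version; CEP-46 would keep the declaration model but rename transient replicas to witnesses.

    # Sketch only: a keyspace where each DC keeps 3 voting replicas, 1 of which is
    # transient (i.e., a witness), using the '<total>/<transient>' notation from the
    # 4.0 docs. Transient replication is experimental and must be enabled first - in
    # 4.0 the cassandra.yaml flag is, I believe, enable_transient_replication: true.
    from cassandra.cluster import Cluster  # DataStax Python driver

    session = Cluster(["127.0.0.1"]).connect()
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS witness_demo
        WITH replication = {
            'class': 'NetworkTopologyStrategy',
            'DC1': '3/1', 'DC2': '3/1', 'DC3': '3/1'
        }
    """)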