Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-15 Thread Jon Haddad
> They are not two unordered sets, but rather two sets ordered by different
> keys.

I think this is a distinction without a difference. Merkle tree repair
works because the ordering of the data is mostly the same across nodes.


On Thu, May 15, 2025 at 9:27 AM Runtian Liu  wrote:

> > what we're trying to achieve here is comparing two massive unordered
> > sets.
>
> They are not two unordered sets, but rather two sets ordered by different
> keys. This means that when building Merkle trees for the base table and the
> materialized view (MV), we need to use different strategies to ensure the
> trees can be meaningfully compared.
>
> To address scalability concerns for MV repair, I’ve included a comparison
> between one round of full repair and MV repair in the table below. This
> comparison is also added to the CEP.
>
> n: number of rows to repair (Total rows in the table)
>
> d: depth of one Merkle tree for MV repair
>
> r: number of split ranges
>
> p: data compacted away
>
>
> This comparison focuses on the complexities of one round of full repair
> with a replication factor of 2 versus repairing a single MV based on one
> base table replica.
>
>                                         Full Repair   MV Repair
> Extra disk used                         0             O(2*p)
>   Since we take a snapshot at the beginning of the repair, any disk space
>   that would normally be freed by compaction will remain occupied until the
>   Merkle trees are successfully built and the snapshot is cleared.
> Data scan complexity                    O(2*n)        O(2*n)
>   Full repair scans n rows from the primary and n from replicas. MV repair
>   scans n rows from the base table primary replica only, and n from the MV
>   primary replica only.
> Merkle tree building time complexity    O(n)          O(n*d)
>   In full repair, Merkle tree building is O(1) per row - each hash is added
>   sequentially to the leaf nodes. In MV repair, each hash is inserted from
>   the root, making it O(d) per row. Since d is typically small (less than 20
>   and often smaller than in full repair), this isn’t a major concern.
> Total Merkle tree count                 O(2*r)        O(2*r^2)
>   MV repair needs to generate more, smaller Merkle trees, but this isn’t a
>   concern as they can be persisted to disk during the repair process.
> Merkle tree comparison complexity       O(n)          O(n)
>   Assuming one row maps to one leaf node, both repairs are equivalent.
> Stream time complexity                  O(n)          O(n)
>   Assuming all rows need to be streamed, both repairs are equivalent.
>
> In short: MV repair consumes temporary disk space and a small, usually
> negligible amount of extra CPU for tree construction; other costs match
> full repair.
>
> The core idea behind the proposed MV repair is as follows:
>
>    1. Take a snapshot to “freeze” the current state of both the base table
>       and its MV.
>    2. Gradually scan the data from both tables to build Merkle trees.
>    3. Identify the token ranges where inconsistencies exist.
>    4. Rebuild only the mismatched ranges rather than the entire MV.
>
> With transaction-backed MVs, step 4 should rarely be necessary.
>
> On Thu, May 15, 2025 at 7:54 AM Josh McKenzie 
> wrote:
>
>> I think in order to address this, the view should be propagated to the
>> base replicas *after* it's accepted by all or a majority of base replicas.
>> This is where I think mutation tracking could probably help.
>>
>> Yeah, the idea of "don't reflect in the MV until you hit the CL the user
>> requested for the base table". Introduces disjoint risk if you have
>> coordinator death mid-write where replicas got base-data but that 2nd step
>> didn't take place; think that's why Runtian et al. are looking at paxos
>> repair picking up those pieces for you after the fact to get you back into
>> consistency. Mutation tracking and Accord both have similar guarantees in
>> this space.
>>
>> I think this would ensure that as long as there's no data loss or
>> bit-rot, the base and view can be repaired independently. When there is
>> data loss or bit-rot in either the base table or the view, then it is the
>> same as 2i today: rebuild is required.
>>
>> And the repair as proposed in the CEP should resolve the bitrot and bug
>> dataloss case I think. Certainly has much higher time complexity but the
>> bounding of memory complexity to be comparable with regular repair doesn't
>> strike me as a dealbreaker.
>>
>> On Thu, May 15, 2025, at 10:24 AM, Paulo Motta wrote:
>>
>> >  I think requiring a rebuild is a deal breaker for most teams.  In most
>> instances it would be having to also expand the cluster to handle the
>> additional disk requirements.  It turns an inconsistency problem into a
>> major operational headache that can take weeks to resolve.
>>
>> Agreed. The rebuild would not be required during normal operations when
>> the cluster is properly maintained (ie. regular repair) - only in
>> catastrophic situations.   This is also the case for ordinary tables
>> currently: if there's data loss, 

Re: [VOTE] CEP-46: Finish Transient Replication/Witnesses

2025-05-15 Thread Ariel Weisberg
Hi,

With 15 binding +1s and no -1s the vote passes.

Thanks!
Ariel

On Tue, May 13, 2025, at 3:48 AM, Mick Semb Wever wrote:
>
>> The vote will be open for 72 hours. A vote passes if there are at least 3
>> binding +1s and no binding vetoes.
>
> +1


Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-15 Thread Runtian Liu
The previous table compared the complexity of full repair and MV repair
when reconciling one dataset with another. In production, we typically use
a replication factor of 3 in one datacenter. This means full repair
involves 3n rows, while MV repair involves comparing 6n rows (base + MV).
Below is an updated comparison table reflecting this scenario.

n: number of rows to repair (Total rows in the table)

d: depth of one Merkle tree for MV repair

r: number of split ranges

p: data compacted away


This comparison focuses on the complexities of one round of full repair
with a replication factor of 3 versus repairing a single MV based on one
base table with replication factor 3.

                                        Full Repair   MV Repair
Extra disk used                         0             O(2*p)
  Since we take a snapshot at the beginning of the repair, any disk space
  that would normally be freed by compaction will remain occupied until the
  Merkle trees are successfully built and the snapshot is cleared.
Data scan complexity                    O(3*n)        O(6*n)
  Full repair scans n rows from the primary and 2n from replicas (3n total).
  MV repair scans 3n rows from the base table and 3n from the MV (6n total).
Merkle tree building time complexity    O(3*n)        O(6*n*d)
  In full repair, Merkle tree building is O(1) per row - each hash is added
  sequentially to the leaf nodes. In MV repair, each hash is inserted from
  the root, making it O(d) per row. Since d is typically small (less than 20
  and often smaller than in full repair), this isn’t a major concern.
Total Merkle tree count                 O(3*r)        O(6*r^2)
  MV repair needs to generate more, smaller Merkle trees, but this isn’t a
  concern as they can be persisted to disk during the repair process.
Merkle tree comparison complexity       O(3*n)        O(3*n)
  Assuming one row maps to one leaf node, both repairs are equivalent.
Stream time complexity                  O(3*n)        O(3*n)
  Assuming all rows need to be streamed, both repairs are equivalent.

In short: Even for production use cases with RF=3 in a single datacenter, MV
repair consumes only temporary disk space and a small, usually negligible
amount of extra CPU for tree construction; other costs match full repair.

Additionally, with the online path proposed in this CEP, we expect
mismatches to be rare, which can lower the frequency of running this repair
process compared to full repair.


On Thu, May 15, 2025 at 9:53 AM Jon Haddad  wrote:

> > They are not two unordered sets, but rather two sets ordered by
> different keys.
>
> I think this is a distinction without a difference. Merkle tree repair
> works because the ordering of the data is mostly the same across nodes.
>
>
> On Thu, May 15, 2025 at 9:27 AM Runtian Liu  wrote:
>
>> > what we're trying to achieve here is comparing two massive unordered
>> sets.
>>
>> They are not two unordered sets, but rather two sets ordered by different
>> keys. This means that when building Merkle trees for the base table and the
>> materialized view (MV), we need to use different strategies to ensure the
>> trees can be meaningfully compared.
>>
>> To address scalability concerns for MV repair, I’ve included a comparison
>> between one round of full repair and MV repair in the table below. This
>> comparison is also added to the CEP.
>>
>> n: number of rows to repair (Total rows in the table)
>>
>> d: depth of one Merkle tree for MV repair
>>
>> r: number of split ranges
>>
>> p: data compacted away
>>
>>
>> This comparison focuses on the complexities of one round of full repair
>> with a replication factor of 2 versus repairing a single MV based on one
>> base table replica.
>>
>>                                         Full Repair   MV Repair
>> Extra disk used                         0             O(2*p)
>>   Since we take a snapshot at the beginning of the repair, any disk space
>>   that would normally be freed by compaction will remain occupied until the
>>   Merkle trees are successfully built and the snapshot is cleared.
>> Data scan complexity                    O(2*n)        O(2*n)
>>   Full repair scans n rows from the primary and n from replicas. MV repair
>>   scans n rows from the base table primary replica only, and n from the MV
>>   primary replica only.
>> Merkle tree building time complexity    O(n)          O(n*d)
>>   In full repair, Merkle tree building is O(1) per row - each hash is added
>>   sequentially to the leaf nodes. In MV repair, each hash is inserted from
>>   the root, making it O(d) per row. Since d is typically small (less than 20
>>   and often smaller than in full repair), this isn’t a major concern.
>> Total Merkle tree count                 O(2*r)        O(2*r^2)
>>   MV repair needs to generate more, smaller Merkle trees, but this isn’t a
>>   concern as they can be persisted to disk during the repair process.
>> Merkle tree comparison complexity       O(n)          O(n)
>>   Assuming one row maps to one leaf node, both repairs are equivalent.
>> Stream time complexity                  O(n)          O(n)
>>   Assuming all rows need to be streamed, both repairs are equivalent.
>>
>> In short: MV repair consumes temporary disk sp

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-15 Thread Josh McKenzie
There's bi-directional entropy issues with MV's - either orphaned view data or 
missing view data; that's why you kind of need a "bi-directional ETL" to make 
sure the 2 agree with each other. While normal repair would resolve the 
"missing data in MV" case, it wouldn't resolve the "data in MV that's not in 
base table anymore" case, which afaict all base consistency approaches (status 
quo, PaxosV2, Accord, Mutation Tracking) are vulnerable to.

It'd be correct (if operationally disappointing) to be able to just say "if you 
have data loss in your base table you need to rebuild the corresponding MV's", 
but the problem is operators aren't always going to know when that data loss 
occurs. Not everything is as visible as a lost quorum of replicas or blown up 
SSTables.
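
To make the two failure directions concrete, here is a toy illustration (plain
Java with simplified stand-in types; none of this is Cassandra code) of what a
bi-directional reconciliation has to distinguish:

-
import java.util.*;

public class ViewEntropyExample {
    // Simplified stand-in: a view row identified by its (v1, id) primary key.
    record ViewKey(int v1, int id) {}

    public static void main(String[] args) {
        // View rows derived from the current base table contents.
        Set<ViewKey> derivedFromBase = Set.of(new ViewKey(10, 1), new ViewKey(20, 2));
        // View rows actually present on the materialized view replica.
        Set<ViewKey> actualView = Set.of(new ViewKey(10, 1), new ViewKey(30, 3));

        // Missing view data: base has it, view does not -> normal repair direction.
        Set<ViewKey> missingInView = new HashSet<>(derivedFromBase);
        missingInView.removeAll(actualView);

        // Orphaned view data: view has it, base no longer does -> the reverse direction,
        // which plain anti-entropy against other view replicas cannot detect.
        Set<ViewKey> orphanedInView = new HashSet<>(actualView);
        orphanedInView.removeAll(derivedFromBase);

        System.out.println("missing in view:  " + missingInView);
        System.out.println("orphaned in view: " + orphanedInView);
    }
}
-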

On Wed, May 14, 2025, at 2:38 PM, Blake Eggleston wrote:
> Maybe, I’m not really familiar enough with how “classic” MV repair works to 
> say. You can’t mix normal repair and mutation reconciliation in the current 
> incarnation of mutation tracking though, so I wouldn’t assume it would work 
> with MVs.
> 
> On Wed, May 14, 2025, at 11:29 AM, Jon Haddad wrote:
>> In the case of bitrot / losing an SSTable, wouldn't a normal repair (just 
>> the MV against the other nodes) resolve the issue?
>> 
>> On Wed, May 14, 2025 at 11:27 AM Blake Eggleston  
>> wrote:
>>> __
>>> Mutation tracking is definitely an approach you could take for MVs. 
>>> Mutation reconciliation could be extended to ensure all changes have been 
>>> replicated to the views. When a base table received a mutation w/ an id it 
>>> would generate a view update. If you block marking a given mutation id as 
>>> reconciled until it’s been fully replicated to the base table and its view 
>>> updates have been fully replicated to the views, then all view updates will 
>>> eventually be applied as part of the log reconciliation process.
>>> 
>>> A mutation tracking implementation would also allow you to be more flexible 
>>> with the types of consistency levels you can work with, allowing users to 
>>> do things like use LOCAL_QUORUM without leaving themselves open to 
>>> introducing view inconsistencies.
>>> 
>>> That would more or less eliminate the need for any MV repair in normal 
>>> usage, but wouldn't address how to repair issues caused by bugs or data 
>>> loss, though you may be able to do something with comparing the latest 
>>> mutation ids for the base tables and its view ranges.
>>> 
>>> On Wed, May 14, 2025, at 10:19 AM, Paulo Motta wrote:
 I don't see mutation tracking [1] mentioned in this thread or in the 
 CEP-48 description. Not sure this would fit into the scope of this initial 
 CEP, but I have a feeling that mutation tracking could be potentially 
 helpful to reconcile base tables and views ?
 
 For example, when both base and view updates are acknowledged then this 
 could be somehow persisted in the view sstables mutation tracking 
 summary[2] or similar metadata ? Then these updates would be skipped 
 during view repair, considerably reducing the amount of work needed, since 
 only un-acknowledged views updates would need to be reconciled.
 
 [1] - https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking
 
 [2] - https://issues.apache.org/jira/browse/CASSANDRA-20336
 
 On Wed, May 14, 2025 at 12:59 PM Paulo Motta  
 wrote:
> > - The first thing I notice is that we're talking about repairing the 
> > entire table across the entire cluster all in one go.  It's been a 
> > *long* time since I tried to do a full repair of an entire table 
> > without using sub-ranges.  Is anyone here even doing that with clusters 
> > of non-trivial size?  How long does a full repair of a 100 node cluster 
> > with 5TB / node take even in the best case scenario?
> 
> I haven't checked the CEP yet so I may be missing something but I 
> think this effort doesn't need to be conflated with dense node support, 
> to make this more approachable. I think prospective users would be OK 
> with overprovisioning to make this feasible if needed. We could perhaps 
> have size guardrails that limit the maximum table size per node when MVs 
> are enabled. Ideally we should make it work for dense nodes if possible, 
> but this shouldn't be a reason not to support the feature if it can be 
> made to work reasonably with more resources.
> 
> I think the main issue with the current MV is about correctness, and the 
> ultimate goal of the CEP must be to provide correctness guarantees, even 
> if it has an inevitable performance hit. I think that the performance of 
> the repair process is definitely an important consideration and it would 
> be helpful to have some benchmarks to have an idea of how long this 
> repair 

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-15 Thread Paulo Motta
> There's bi-directional entropy issues with MV's - either orphaned view
data or missing view data; that's why you kind of need a "bi-directional
ETL" to make sure the 2 agree with each other. While normal repair would
resolve the "missing data in MV" case, it wouldn't resolve the "data in MV
that's not in base table anymore" case, which afaict all base consistency
approaches (status quo, PaxosV2, Accord, Mutation Tracking) are vulnerable
to.

I don't think that bi-directional reconciliation should be a requirement,
when the base table is assumed to be the source of truth as stated in the
CEP doc.

I think the main issue with the current MV implementation is that each view
replica is independently replicated by the base replica, before the base
write is acknowledged.

This creates a correctness issue in the write path, because a view update
can be created for a write that was not accepted by the coordinator in the
following scenario:

N=RF=3
CL=ONE
- Update U is propagated to view replica V; the coordinator, which is also base
replica B, dies before applying the base write or acknowledging the request to
the client. Now U exists in V but not in B.

I think in order to address this, the view should be propagated to the base
replicas *after* it's accepted by all or a majority of base replicas. This
is where I think mutation tracking could probably help.
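
A minimal sketch of that ordering (illustrative Java only; the interfaces,
method names and quorum handling are assumptions for the example, not the
actual write path):

-
import java.util.List;

// Assumed, simplified interfaces for the sake of the example.
interface Replica { boolean applyBaseWrite(Mutation m); void applyViewUpdate(Mutation m); }
record Mutation(String key, String value) {}

class CoordinatorSketch {
    private final List<Replica> baseReplicas;
    private final List<Replica> viewReplicas;
    private final int majority;

    CoordinatorSketch(List<Replica> base, List<Replica> view) {
        this.baseReplicas = base;
        this.viewReplicas = view;
        this.majority = base.size() / 2 + 1;
    }

    /** Only propagate to the view once the base write reached a majority of base replicas. */
    boolean write(Mutation m) {
        long acks = baseReplicas.stream().filter(r -> r.applyBaseWrite(m)).count();
        if (acks < majority)
            return false;   // base write not durable enough: no view update is ever emitted
        // Second step; a coordinator dying right here is the gap that mutation
        // tracking / repair would still need to close.
        viewReplicas.forEach(r -> r.applyViewUpdate(m));
        return true;
    }
}
-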

I think this would ensure that as long as there's no data loss or bit-rot,
the base and view can be repaired independently. When there is data loss or
bit-rot in either the base table or the view, then it is the same as 2i
today: rebuild is required.

>  It'd be correct (if operationally disappointing) to be able to just say
"if you have data loss in your base table you need to rebuild the
corresponding MV's", but the problem is operators aren't always going to
know when that data loss occurs. Not everything is as visible as a lost
quorum of replicas or blown up SSTables.

I think there are opportunities to improve rebuild speed, assuming the base
table as a source of truth. For example, rebuild only subranges when
data-loss is detected.

On Thu, May 15, 2025 at 8:07 AM Josh McKenzie  wrote:

> There's bi-directional entropy issues with MV's - either orphaned view
> data or missing view data; that's why you kind of need a "bi-directional
> ETL" to make sure the 2 agree with each other. While normal repair would
> resolve the "missing data in MV" case, it wouldn't resolve the "data in MV
> that's not in base table anymore" case, which afaict all base consistency
> approaches (status quo, PaxosV2, Accord, Mutation Tracking) are vulnerable
> to.
>
> It'd be correct (if operationally disappointing) to be able to just say
> "if you have data loss in your base table you need to rebuild the
> corresponding MV's", but the problem is operators aren't always going to
> know when that data loss occurs. Not everything is as visible as a lost
> quorum of replicas or blown up SSTables.
>
> On Wed, May 14, 2025, at 2:38 PM, Blake Eggleston wrote:
>
> Maybe, I’m not really familiar enough with how “classic” MV repair works
> to say. You can’t mix normal repair and mutation reconciliation in the
> current incarnation of mutation tracking though, so I wouldn’t assume it
> would work with MVs.
>
> On Wed, May 14, 2025, at 11:29 AM, Jon Haddad wrote:
>
> In the case of bitrot / losing an SSTable, wouldn't a normal repair (just
> the MV against the other nodes) resolve the issue?
>
> On Wed, May 14, 2025 at 11:27 AM Blake Eggleston 
> wrote:
>
>
> Mutation tracking is definitely an approach you could take for MVs.
> Mutation reconciliation could be extended to ensure all changes have been
> replicated to the views. When a base table received a mutation w/ an id it
> would generate a view update. If you block marking a given mutation id as
> reconciled until it’s been fully replicated to the base table and its view
> updates have been fully replicated to the views, then all view updates will
> eventually be applied as part of the log reconciliation process.
>
> A mutation tracking implementation would also allow you to be more
> flexible with the types of consistency levels you can work with, allowing
> users to do things like use LOCAL_QUORUM without leaving themselves open to
> introducing view inconsistencies.
>
> That would more or less eliminate the need for any MV repair in normal
> usage, but wouldn't address how to repair issues caused by bugs or data
> loss, though you may be able to do something with comparing the latest
> mutation ids for the base tables and its view ranges.
>
> On Wed, May 14, 2025, at 10:19 AM, Paulo Motta wrote:
>
> I don't see mutation tracking [1] mentioned in this thread or in the
> CEP-48 description. Not sure this would fit into the scope of this
> initial CEP, but I have a feeling that mutation tracking could be
> potentially helpful to reconcile base tables and views ?
>
> For example, when both base and view updates are acknowledged then this
> could be somehow persiste

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-15 Thread Jon Haddad
I think requiring a rebuild is a deal breaker for most teams.  In most
instances it would also mean having to expand the cluster to handle the
additional disk requirements.  It turns an inconsistency problem into a
major operational headache that can take weeks to resolve.





On Thu, May 15, 2025 at 7:02 AM Paulo Motta 
wrote:

> > There's bi-directional entropy issues with MV's - either orphaned view
> data or missing view data; that's why you kind of need a "bi-directional
> ETL" to make sure the 2 agree with each other. While normal repair would
> resolve the "missing data in MV" case, it wouldn't resolve the "data in MV
> that's not in base table anymore" case, which afaict all base consistency
> approaches (status quo, PaxosV2, Accord, Mutation Tracking) are vulnerable
> to.
>
> I don't think that bi-directional reconciliation should be a requirement,
> when the base table is assumed to be the source of truth as stated in the
> CEP doc.
>
> I think the main issue with the current MV implementation is that each
> view replica is independently replicated by the base replica, before the
> base write is acknowledged.
>
> This creates a correctness issue in the write path, because a view update
> can be created for a write that was not accepted by the coordinator in the
> following scenario:
>
> N=RF=3
> CL=ONE
> - Update U is propagated to view replica V, coordinator that is also base
> replica B dies before accepting base table write request to client. Now U
> exists in V but not in B.
>
> I think in order to address this, the view should be propagated to the
> base replicas *after* it's accepted by all or a majority of base replicas.
> This is where I think mutation tracking could probably help.
>
> I think this would ensure that as long as there's no data loss or bit-rot,
> the base and view can be repaired independently. When there is data loss or
> bit-rot in either the base table or the view, then it is the same as 2i
> today: rebuild is required.
>
> >  It'd be correct (if operationally disappointing) to be able to just say
> "if you have data loss in your base table you need to rebuild the
> corresponding MV's", but the problem is operators aren't always going to
> know when that data loss occurs. Not everything is as visible as a lost
> quorum of replicas or blown up SSTables.
>
> I think there are opportunities to improve rebuild speed, assuming the
> base table as a source of truth. For example, rebuild only subranges when
> data-loss is detected.
>
> On Thu, May 15, 2025 at 8:07 AM Josh McKenzie 
> wrote:
>
>> There's bi-directional entropy issues with MV's - either orphaned view
>> data or missing view data; that's why you kind of need a "bi-directional
>> ETL" to make sure the 2 agree with each other. While normal repair would
>> resolve the "missing data in MV" case, it wouldn't resolve the "data in MV
>> that's not in base table anymore" case, which afaict all base consistency
>> approaches (status quo, PaxosV2, Accord, Mutation Tracking) are vulnerable
>> to.
>>
>> It'd be correct (if operationally disappointing) to be able to just say
>> "if you have data loss in your base table you need to rebuild the
>> corresponding MV's", but the problem is operators aren't always going to
>> know when that data loss occurs. Not everything is as visible as a lost
>> quorum of replicas or blown up SSTables.
>>
>> On Wed, May 14, 2025, at 2:38 PM, Blake Eggleston wrote:
>>
>> Maybe, I’m not really familiar enough with how “classic” MV repair works
>> to say. You can’t mix normal repair and mutation reconciliation in the
>> current incarnation of mutation tracking though, so I wouldn’t assume it
>> would work with MVs.
>>
>> On Wed, May 14, 2025, at 11:29 AM, Jon Haddad wrote:
>>
>> In the case of bitrot / losing an SSTable, wouldn't a normal repair (just
>> the MV against the other nodes) resolve the issue?
>>
>> On Wed, May 14, 2025 at 11:27 AM Blake Eggleston 
>> wrote:
>>
>>
>> Mutation tracking is definitely an approach you could take for MVs.
>> Mutation reconciliation could be extended to ensure all changes have been
>> replicated to the views. When a base table received a mutation w/ an id it
>> would generate a view update. If you block marking a given mutation id as
>> reconciled until it’s been fully replicated to the base table and its view
>> updates have been fully replicated to the views, then all view updates will
>> eventually be applied as part of the log reconciliation process.
>>
>> A mutation tracking implementation would also allow you to be more
>> flexible with the types of consistency levels you can work with, allowing
>> users to do things like use LOCAL_QUORUM without leaving themselves open to
>> introducing view inconsistencies.
>>
>> That would more or less eliminate the need for any MV repair in normal
>> usage, but wouldn't address how to repair issues caused by bugs or data
>> loss, though you may be able to do something with comparing the latest
>> mutation ids for the 

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-15 Thread Paulo Motta
>  I think requiring a rebuild is a deal breaker for most teams.  In most
instances it would be having to also expand the cluster to handle the
additional disk requirements.  It turns an inconsistency problem into a
major operational headache that can take weeks to resolve.

Agreed. The rebuild would not be required during normal operations when the
cluster is properly maintained (i.e. regular repair) - only in catastrophic
situations. This is also the case for ordinary tables currently: if there's
data loss, then restoring from a backup is needed. This could be a possible
alternative to avoid requiring a rebuild in this extraordinary scenario.

On Thu, May 15, 2025 at 10:14 AM Jon Haddad  wrote:

> I think requiring a rebuild is a deal breaker for most teams.  In most
> instances it would be having to also expand the cluster to handle the
> additional disk requirements.  It turns an inconsistency problem into a
> major operational headache that can take weeks to resolve.
>
>
>
>
>
> On Thu, May 15, 2025 at 7:02 AM Paulo Motta 
> wrote:
>
>> > There's bi-directional entropy issues with MV's - either orphaned view
>> data or missing view data; that's why you kind of need a "bi-directional
>> ETL" to make sure the 2 agree with each other. While normal repair would
>> resolve the "missing data in MV" case, it wouldn't resolve the "data in MV
>> that's not in base table anymore" case, which afaict all base consistency
>> approaches (status quo, PaxosV2, Accord, Mutation Tracking) are vulnerable
>> to.
>>
>> I don't think that bi-directional reconciliation should be a requirement,
>> when the base table is assumed to be the source of truth as stated in the
>> CEP doc.
>>
>> I think the main issue with the current MV implementation is that each
>> view replica is independently replicated by the base replica, before the
>> base write is acknowledged.
>>
>> This creates a correctness issue in the write path, because a view update
>> can be created for a write that was not accepted by the coordinator in the
>> following scenario:
>>
>> N=RF=3
>> CL=ONE
>> - Update U is propagated to view replica V, coordinator that is also base
>> replica B dies before accepting base table write request to client. Now U
>> exists in V but not in B.
>>
>> I think in order to address this, the view should be propagated to the
>> base replicas *after* it's accepted by all or a majority of base replicas.
>> This is where I think mutation tracking could probably help.
>>
>> I think this would ensure that as long as there's no data loss or
>> bit-rot, the base and view can be repaired independently. When there is
>> data loss or bit-rot in either the base table or the view, then it is the
>> same as 2i today: rebuild is required.
>>
>> >  It'd be correct (if operationally disappointing) to be able to just
>> say "if you have data loss in your base table you need to rebuild the
>> corresponding MV's", but the problem is operators aren't always going to
>> know when that data loss occurs. Not everything is as visible as a lost
>> quorum of replicas or blown up SSTables.
>>
>> I think there are opportunities to improve rebuild speed, assuming the
>> base table as a source of truth. For example, rebuild only subranges when
>> data-loss is detected.
>>
>> On Thu, May 15, 2025 at 8:07 AM Josh McKenzie 
>> wrote:
>>
>>> There's bi-directional entropy issues with MV's - either orphaned view
>>> data or missing view data; that's why you kind of need a "bi-directional
>>> ETL" to make sure the 2 agree with each other. While normal repair would
>>> resolve the "missing data in MV" case, it wouldn't resolve the "data in MV
>>> that's not in base table anymore" case, which afaict all base consistency
>>> approaches (status quo, PaxosV2, Accord, Mutation Tracking) are vulnerable
>>> to.
>>>
>>> It'd be correct (if operationally disappointing) to be able to just say
>>> "if you have data loss in your base table you need to rebuild the
>>> corresponding MV's", but the problem is operators aren't always going to
>>> know when that data loss occurs. Not everything is as visible as a lost
>>> quorum of replicas or blown up SSTables.
>>>
>>> On Wed, May 14, 2025, at 2:38 PM, Blake Eggleston wrote:
>>>
>>> Maybe, I’m not really familiar enough with how “classic” MV repair works
>>> to say. You can’t mix normal repair and mutation reconciliation in the
>>> current incarnation of mutation tracking though, so I wouldn’t assume it
>>> would work with MVs.
>>>
>>> On Wed, May 14, 2025, at 11:29 AM, Jon Haddad wrote:
>>>
>>> In the case of bitrot / losing an SSTable, wouldn't a normal repair
>>> (just the MV against the other nodes) resolve the issue?
>>>
>>> On Wed, May 14, 2025 at 11:27 AM Blake Eggleston 
>>> wrote:
>>>
>>>
>>> Mutation tracking is definitely an approach you could take for MVs.
>>> Mutation reconciliation could be extended to ensure all changes have been
>>> replicated to the views. When a base table received a mutation w/ an id it
>>> would gener

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-15 Thread Josh McKenzie
> I think in order to address this, the view should be propagated to the base 
> replicas *after* it's accepted by all or a majority of base replicas. This is 
> where I think mutation tracking could probably help.
Yeah, the idea of "don't reflect in the MV until you hit the CL the user 
requested for the base table". Introduces disjoint risk if you have coordinator 
death mid-write where replicas got base-data but that 2nd step didn't take 
place; think that's why Runtian et al. are looking at paxos repair picking up 
those pieces for you after the fact to get you back into consistency. Mutation 
tracking and Accord both have similar guarantees in this space.

> I think this would ensure that as long as there's no data loss or bit-rot, 
> the base and view can be repaired independently. When there is data loss or 
> bit-rot in either the base table or the view, then it is the same as 2i 
> today: rebuild is required.
And the repair as proposed in the CEP should resolve the bitrot and bug 
dataloss case I think. Certainly has much higher time complexity but the 
bounding of memory complexity to be comparable with regular repair doesn't 
strike me as a dealbreaker.

On Thu, May 15, 2025, at 10:24 AM, Paulo Motta wrote:
> >  I think requiring a rebuild is a deal breaker for most teams.  In most 
> > instances it would be having to also expand the cluster to handle the 
> > additional disk requirements.  It turns an inconsistency problem into a 
> > major operational headache that can take weeks to resolve.
> 
> Agreed. The rebuild would not be required during normal operations when the 
> cluster is properly maintained (ie. regular repair) - only in catastrophic 
> situations.   This is also the case for ordinary tables currently: if there's 
> data loss, then restoring from a backup is needed. This could be a possible 
> alternative to not require a rebuild in this extraordinary scenario.
> 
> On Thu, May 15, 2025 at 10:14 AM Jon Haddad  wrote:
>> I think requiring a rebuild is a deal breaker for most teams.  In most 
>> instances it would be having to also expand the cluster to handle the 
>> additional disk requirements.  It turns an inconsistency problem into a 
>> major operational headache that can take weeks to resolve.
>> 
>> 
>> 
>> 
>> 
>> On Thu, May 15, 2025 at 7:02 AM Paulo Motta  wrote:
>>> > There's bi-directional entropy issues with MV's - either orphaned view 
>>> > data or missing view data; that's why you kind of need a "bi-directional 
>>> > ETL" to make sure the 2 agree with each other. While normal repair would 
>>> > resolve the "missing data in MV" case, it wouldn't resolve the "data in 
>>> > MV that's not in base table anymore" case, which afaict all base 
>>> > consistency approaches (status quo, PaxosV2, Accord, Mutation Tracking) 
>>> > are vulnerable to.
>>> 
>>> I don't think that bi-directional reconciliation should be a requirement, 
>>> when the base table is assumed to be the source of truth as stated in the 
>>> CEP doc.
>>> 
>>> I think the main issue with the current MV implementation is that each view 
>>> replica is independently replicated by the base replica, before the base 
>>> write is acknowledged.
>>> 
>>> This creates a correctness issue in the write path, because a view update 
>>> can be created for a write that was not accepted by the coordinator in the 
>>> following scenario:
>>> 
>>> N=RF=3
>>> CL=ONE
>>> - Update U is propagated to view replica V, coordinator that is also base 
>>> replica B dies before accepting base table write request to client. Now U 
>>> exists in V but not in B.
>>> 
>>> I think in order to address this, the view should be propagated to the base 
>>> replicas *after* it's accepted by all or a majority of base replicas. This 
>>> is where I think mutation tracking could probably help.
>>> 
>>> I think this would ensure that as long as there's no data loss or bit-rot, 
>>> the base and view can be repaired independently. When there is data loss or 
>>> bit-rot in either the base table or the view, then it is the same as 2i 
>>> today: rebuild is required.
>>> 
>>> >  It'd be correct (if operationally disappointing) to be able to just say 
>>> > "if you have data loss in your base table you need to rebuild the 
>>> > corresponding MV's", but the problem is operators aren't always going to 
>>> > know when that data loss occurs. Not everything is as visible as a lost 
>>> > quorum of replicas or blown up SSTables.
>>> 
>>> I think there are opportunities to improve rebuild speed, assuming the base 
>>> table as a source of truth. For example, rebuild only subranges when 
>>> data-loss is detected.
>>> 
>>> On Thu, May 15, 2025 at 8:07 AM Josh McKenzie  wrote:
 __
 There's bi-directional entropy issues with MV's - either orphaned view 
 data or missing view data; that's why you kind of need a "bi-directional 
 ETL" to make sure the 2 agree with each other. While normal repair would 
 resolve the "missing data in MV" 

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-15 Thread Jon Haddad
There's a lot here that's still confusing to me.  Maybe you can help me
understand it better?  Apologies in advance for the text wall :)

I'll use this schema as an example:

-
CREATE TABLE test.t1 (
id int PRIMARY KEY,
v1 int
);

create MATERIALIZED VIEW  test_mv as
SELECT v1, id from test.t1 where id is not null and v1 is not null primary
key (v1, id);
-

We've got (id, v1) in the base table and (v1, id) in the MV.

During the repair, we snapshot, and construct a whole bunch of merkle
trees.  CEP-48 says they will be persisted to disk.

* Do you intend on building all the Merkle trees in parallel?
* Will there be hundreds of files doing random IO to persist the trees to
disk, in addition to the sequential IO from repair?
* Is the intention of persisting the trees to disk to recover from failure,
or just to limit memory usage?
* Have you calculated the Merkle tree space requirements?
* When do we build the Merkle trees for the view?  Is that happening in
parallel with the base table?  Do we have the computational complexity of 2
full cluster repairs running simultaneously, or does it take twice as long?

I'm very curious to hear if anyone has run a full cluster repair recently
on a non-trivial dataset.  Every cluster I work with only does subrange
repair.  I can't even recall the last time I did a full repair on a large
cluster.  I may never have, now that I think about it.  Every time I've
done this in the past it's been plagued with issues, both in terms of
performance and reliability.  Subrange repair works because it can make
progress in 15-30 minute increments.

Anyways - moving on...

You suggest we read the base table and construct the Merkle trees based on
the transformed rows. Using my schema above, we take the v1 field and use
token(v1) to build the tree.  Assuming that a value for v1 appears many
times throughout the dataset across many partitions, how do you intend on
calculating its hash?  If you look at Validator.rowHash [1] and
Validator.add, you'll see it's taking an UnfilteredRowIterator for an
entire partition and calculates the hash based on that.  Here's the comment:

 /**
 * Called (in order) for every row present in the CF.
 * Hashes the row, and adds it to the tree being built.
 *
 * @param partition Partition to add hash
 */
public void add(UnfilteredRowIterator partition)

So it seems to me like you need to have the entire partition materialized
in memory before adding to the tree. Doing that per value of v1 without an
index is pretty much impossible - we'd have to scan the entire dataset once
per partition to pull out all the matching v1 values, or you'd need to
materialize the entire dataset into a local version of the MV for that
range. I don't know how you could do this.  Do you have a workaround for
this planned?  Maybe someone that knows the Merkle tree code better can
chime in.
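
To illustrate the ordering problem I mean (toy Java; token() here is just a
placeholder hash standing in for the partitioner, and the rows are made up):

-
import java.util.*;

public class OrderingProblem {
    record BaseRow(int id, int v1) {}

    // Placeholder for a partitioner token; any hash works to make the point.
    static long token(int key) { return Integer.toUnsignedLong(key * 0x9E3779B9); }

    public static void main(String[] args) {
        // Base rows as a scan would deliver them: ordered by token(id).
        List<BaseRow> scanOrder = new ArrayList<>(List.of(
                new BaseRow(1, 42), new BaseRow(2, 7), new BaseRow(3, 42), new BaseRow(4, 99)));
        scanOrder.sort(Comparator.comparingLong(r -> token(r.id())));

        // The MV tree wants leaves in token(v1) order, which the scan does not provide...
        System.out.println("token(v1) seen in base-scan order: "
                + scanOrder.stream().map(r -> token(r.v1())).toList());

        // ...so rows have to be buffered and re-sorted (or pre-indexed) by token(v1)
        // before they can be appended to a tree leaf by leaf.
        scanOrder.sort(Comparator.comparingLong(r -> token(r.v1())));
        System.out.println("rows re-sorted by token(v1):       " + scanOrder);
    }
}
-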

Maybe there's something else here I'm not aware of - please let me know
what I'm missing here if I am, it would be great to see this in the doc if
you have a solution.

For the sake of discussion, let's assume we've moved past this and we have
our trees for hundreds of ranges built from the base table & built for the
MV, now we move onto the comparison.

In the doc at this point, we delete the snapshot because we have the tree
structures and we compare Merkle trees.  Then we stream mismatched data.

So let's say we find a mismatch in a hash.  That indicates that there's
some range of data where we have an issue.  For some token range calculated
from the v1 field, we have a mismatch, right?  What do we do with that
information?

* Do we tell the node that owned the base table - hey, stream the data from
base where token(v1) is in range [X,Y) to me?
* That means we have to scan through the base again for all rows where
token(v1) in [X,Y) range, right?  Because without an index on the hashes of
v1, we're doing a full table scan and hashing every v1 value to find out if
it needs to be streamed back to the MV (roughly as sketched after this list).
* Are we doing this concurrently on all nodes?
* Will there be coordination between all nodes in the cluster to ensure you
don't have to do multiple scans?
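
If the answer to the scan question above is "yes", the per-mismatch work would
look roughly like this (illustrative Java; Range and token() are placeholders,
not the repair code):

-
import java.util.List;
import java.util.stream.Stream;

public class MismatchRescanSketch {
    record BaseRow(int id, int v1) {}
    record Range(long left, long right) {        // [left, right) in token space
        boolean contains(long t) { return t >= left && t < right; }
    }

    // Placeholder token function standing in for the partitioner.
    static long token(int key) { return Integer.toUnsignedLong(key * 0x9E3779B9); }

    /** One more full pass over the base replica, keeping only rows whose view-key token
     *  falls in the mismatched range; those are the candidates to re-stream to the MV. */
    static Stream<BaseRow> rowsToStream(Stream<BaseRow> baseScan, Range mismatched) {
        return baseScan.filter(row -> mismatched.contains(token(row.v1())));
    }

    public static void main(String[] args) {
        List<BaseRow> base = List.of(new BaseRow(1, 42), new BaseRow(2, 7), new BaseRow(3, 99));
        Range mismatched = new Range(0, Long.MAX_VALUE / 2);
        rowsToStream(base.stream(), mismatched).forEach(System.out::println);
    }
}
-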

I realize there are a lot of questions here, but unfortunately I'm having a
hard time seeing how we can workaround some of the core assumptions around
constructing Merkle trees and using them to resolve the differences in a
way that matches up with what's in the doc.  I have quite a few more things
to discuss, but I'll save them for a follow up once all these have been
sorted out.

Thanks in advance!
Jon

[1]
https://github.com/apache/cassandra/blob/08946652434edbce38a6395e71d4068898ea13fa/src/java/org/apache/cassandra/repair/Validator.java#L209



On Thu, May 15, 2025 at 10:10 AM Runtian Liu  wrote:

> The previous table compared the complexity of full repair and MV repair
> when reconciling one dataset with another. In production, we typically use
> a replication factor of 3 in one datacen

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-15 Thread Jon Haddad
One last thing.  I'm pretty sure building the tree requires the keys be
added in token order:
https://github.com/apache/cassandra/blob/08946652434edbce38a6395e71d4068898ea13fa/src/java/org/apache/cassandra/repair/Validator.java#L173

Which definitely introduces a bit of a problem, given that the tree would
be constructed from the transformed v1, which is a value unpredictable
enough to be considered random.

The only way I can think of to address this would be to maintain a local
index on v1.  See my previous email where I mentioned this.

Base Table -> Local Index -> Global Index

Still a really hard problem.
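
To make the local-index idea concrete, a minimal sketch (the index structure
and token() are assumptions for illustration, not a proposal for an actual
on-disk format):

-
import java.util.*;

public class LocalIndexSketch {
    record BaseRow(int id, int v1) {}

    static long token(int key) { return Integer.toUnsignedLong(key * 0x9E3779B9); }

    public static void main(String[] args) {
        // Local, per-node index: view token -> base rows, kept sorted so walking the
        // index yields rows in the order the MV Merkle tree wants its leaves.
        NavigableMap<Long, List<BaseRow>> localIndex = new TreeMap<>();
        for (BaseRow row : List.of(new BaseRow(1, 42), new BaseRow(2, 7), new BaseRow(3, 42)))
            localIndex.computeIfAbsent(token(row.v1()), t -> new ArrayList<>()).add(row);

        // Token(v1)-ordered traversal: leaves can now be appended sequentially, the same
        // property the existing Validator relies on for the base table's own token order.
        localIndex.forEach((viewToken, rows) -> System.out.println(viewToken + " -> " + rows));
    }
}
-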

Jon



On Thu, May 15, 2025 at 6:12 PM Jon Haddad  wrote:

> There's a lot here that's still confusing to me.  Maybe you can help me
> understand it better?  Apologies in advance for the text wall :)
>
> I'll use this schema as an example:
>
> -
> CREATE TABLE test.t1 (
> id int PRIMARY KEY,
> v1 int
> );
>
> create MATERIALIZED VIEW  test_mv as
> SELECT v1, id from test.t1 where id is not null and v1 is not null primary
> key (v1, id);
> -
>
> We've got (id, v1) in the base table and (v1, id) in the MV.
>
> During the repair, we snapshot, and construct a whole bunch of merkle
> trees.  CEP-48 says they will be persisted to disk.
>
> * Do you intend on building all the Merkle trees in parallel?
> * Will there be hundreds of files doing random IO to persist the trees to
> disk, in addition to the sequential IO from repair?
> * Is the intention of persisting the trees to disk to recover from
> failure, or just to limit memory usage?
> * Have you calculated the Merkle tree space requirements?
> * When do we build the Merkle trees for the view?  Is that happening in
> parallel with the base table?  Do we have the computational complexity of 2
> full cluster repairs running simultaneously, or does it take twice as long?
>
> I'm very curious to hear if anyone has run a full cluster repair recently
> on a non-trivial dataset.  Every cluster I work with only does subrange
> repair.  I can't even recall the last time I did a full repair on a large
> cluster.  I may never have, now that I think about it.  Every time I've
> done this in the past it's been plagued with issues, both in terms of
> performance and reliability.  Subrange repair works because it can make
> progress in 15-30 minute increments.
>
> Anyways - moving on...
>
> You suggest we read the base table and construct the Merkle trees based on
> the transformed rows. Using my schema above, we take the v1 field and use
> token(v1) to build the tree.  Assuming that a value for v1 appears many
> times throughout the dataset across many partitions, how do you intend on
> calculating its hash?  If you look at Validator.rowHash [1] and
> Validator.add, you'll see it's taking an UnfilteredRowIterator for an
> entire partition and calculates the hash based on that.  Here's the comment:
>
>  /**
>  * Called (in order) for every row present in the CF.
>  * Hashes the row, and adds it to the tree being built.
>  *
>  * @param partition Partition to add hash
>  */
> public void add(UnfilteredRowIterator partition)
>
> So it seems to me like you need to have the entire partition materialized
> in memory before adding to the tree. Doing that per value of v1 without an
> index is pretty much impossible - we'd have to scan the entire dataset once
> per partition to pull out all the matching v1 values, or you'd need to
> materialize the entire dataset into a local version of the MV for that
> range. I don't know how you could do this.  Do you have a workaround for
> this planned?  Maybe someone that knows the Merkle tree code better can
> chime in.
>
> Maybe there's something else here I'm not aware of - please let me know
> what I'm missing here if I am, it would be great to see this in the doc if
> you have a solution.
>
> For the sake of discussion, let's assume we've moved past this and we have
> our trees for hundreds of ranges built from the base table & built for the
> MV, now we move onto the comparison.
>
> In the doc at this point, we delete the snapshot because we have the tree
> structures and we compare Merkle trees.  Then we stream mismatched data.
>
> So let's say we find a mismatch in a hash.  That indicates that there's
> some range of data where we have an issue.  For some token range calculated
> from the v1 field, we have a mismatch, right?  What do we do with that
> information?
>
> * Do we tell the node that owned the base table - hey, stream the data
> from base where token(v1) is in range [X,Y) to me?
> * That means we have to scan through the base again for all rows where
> token(v1) in [X,Y) range, right?  Because without an index on the hashes of
> v1, we're doing a full table scan and hashing every v1 value to find out if
> it needs to be streamed back to the MV.
> * Are we doing this concurrently on all nodes?
> * Will there be coordination between all nodes in the cluster to ensu

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-15 Thread Runtian Liu
> what we're trying to achieve here is comparing two massive unordered
> sets.

They are not two unordered sets, but rather two sets ordered by different
keys. This means that when building Merkle trees for the base table and the
materialized view (MV), we need to use different strategies to ensure the
trees can be meaningfully compared.

To address scalability concerns for MV repair, I’ve included a comparison
between one round of full repair and MV repair in the table below. This
comparison is also added to the CEP.

n: number of rows to repair (Total rows in the table)

d: depth of one Merkle tree for MV repair

r: number of split ranges

p: data compacted away


This comparison focuses on the complexities of one round of full repair
with a replication factor of 2 versus repairing a single MV based on one
base table replica.

                                        Full Repair   MV Repair
Extra disk used                         0             O(2*p)
  Since we take a snapshot at the beginning of the repair, any disk space
  that would normally be freed by compaction will remain occupied until the
  Merkle trees are successfully built and the snapshot is cleared.
Data scan complexity                    O(2*n)        O(2*n)
  Full repair scans n rows from the primary and n from replicas. MV repair
  scans n rows from the base table primary replica only, and n from the MV
  primary replica only.
Merkle tree building time complexity    O(n)          O(n*d)
  In full repair, Merkle tree building is O(1) per row - each hash is added
  sequentially to the leaf nodes. In MV repair, each hash is inserted from
  the root, making it O(d) per row. Since d is typically small (less than 20
  and often smaller than in full repair), this isn’t a major concern.
Total Merkle tree count                 O(2*r)        O(2*r^2)
  MV repair needs to generate more, smaller Merkle trees, but this isn’t a
  concern as they can be persisted to disk during the repair process.
Merkle tree comparison complexity       O(n)          O(n)
  Assuming one row maps to one leaf node, both repairs are equivalent.
Stream time complexity                  O(n)          O(n)
  Assuming all rows need to be streamed, both repairs are equivalent.

In short: MV repair consumes temporary disk space and a small, usually
negligible amount of extra CPU for tree construction; other costs match
full repair.

The core idea behind the proposed MV repair is as follows:

   1. Take a snapshot to “freeze” the current state of both the base table and
      its MV.
   2. Gradually scan the data from both tables to build Merkle trees.
   3. Identify the token ranges where inconsistencies exist.
   4. Rebuild only the mismatched ranges rather than the entire MV.

With transaction-backed MVs, step 4 should rarely be necessary.
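
As a rough illustration of steps 2-3 (a toy fixed-depth hash tree in plain
Java; the hashing, ranges and XOR combine are simplifications, not the
Validator/MerkleTree implementation):

-
import java.util.*;

public class MvTreeSketch {
    /** Toy fixed-depth hash tree: each row hash is inserted from the root, O(d) per row,
     *  because rows arrive unordered with respect to the MV key. */
    static class ToyTree {
        final int depth;
        final long[] leafXor;                      // one bucket per leaf, combined by XOR

        ToyTree(int depth) { this.depth = depth; this.leafXor = new long[1 << depth]; }

        void addRowHash(long viewToken, long rowHash) {
            int leaf = 0;
            for (int level = 0; level < depth; level++)   // walk root -> leaf: O(d)
                leaf = (leaf << 1) | (int) ((viewToken >>> (63 - level)) & 1);
            leafXor[leaf] ^= rowHash;                      // order-independent combine
        }
    }

    public static void main(String[] args) {
        ToyTree baseSide = new ToyTree(4);
        ToyTree viewSide = new ToyTree(4);

        // Step 2: scan both (snapshot-frozen) tables and fill one tree per side.
        baseSide.addRowHash(123L << 55, 0xCAFEL);
        viewSide.addRowHash(123L << 55, 0xCAFEL);
        viewSide.addRowHash(42L << 58, 0xBEEFL);   // row present in the view only

        // Step 3: compare leaves; mismatched leaves identify the token ranges to rebuild.
        for (int i = 0; i < baseSide.leafXor.length; i++)
            if (baseSide.leafXor[i] != viewSide.leafXor[i])
                System.out.println("mismatch in leaf range " + i);
    }
}
-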

On Thu, May 15, 2025 at 7:54 AM Josh McKenzie  wrote:

> I think in order to address this, the view should be propagated to the
> base replicas *after* it's accepted by all or a majority of base replicas.
> This is where I think mutation tracking could probably help.
>
> Yeah, the idea of "don't reflect in the MV until you hit the CL the user
> requested for the base table". Introduces disjoint risk if you have
> coordinator death mid-write where replicas got base-data but that 2nd step
> didn't take place; think that's why Runtian et al. are looking at paxos
> repair picking up those pieces for you after the fact to get you back into
> consistency. Mutation tracking and Accord both have similar guarantees in
> this space.
>
> I think this would ensure that as long as there's no data loss or bit-rot,
> the base and view can be repaired independently. When there is data loss or
> bit-rot in either the base table or the view, then it is the same as 2i
> today: rebuild is required.
>
> And the repair as proposed in the CEP should resolve the bitrot and bug
> dataloss case I think. Certainly has much higher time complexity but the
> bounding of memory complexity to be comparable with regular repair doesn't
> strike me as a dealbreaker.
>
> On Thu, May 15, 2025, at 10:24 AM, Paulo Motta wrote:
>
> >  I think requiring a rebuild is a deal breaker for most teams.  In most
> instances it would be having to also expand the cluster to handle the
> additional disk requirements.  It turns an inconsistency problem into a
> major operational headache that can take weeks to resolve.
>
> Agreed. The rebuild would not be required during normal operations when
> the cluster is properly maintained (ie. regular repair) - only in
> catastrophic situations.   This is also the case for ordinary tables
> currently: if there's data loss, then restoring from a backup is needed.
> This could be a possible alternative to not require a rebuild in this
> extraordinary scenario.
>
> On Thu, May 15, 2025 at 10:14 AM Jon Haddad 
> wrote:
>
> I think requiring a rebuild is a deal breaker for most teams.  In most
> instances it would be having to also expand the cluster to handle the
> additional disk requirements.  It turns an inconsistency problem into a
> major operational headache that can take weeks to resolve.
>
>
>
>
>
> On Thu,