I think requiring a rebuild is a deal breaker for most teams. In most instances it would also mean expanding the cluster to handle the additional disk requirements. It turns an inconsistency problem into a major operational headache that can take weeks to resolve.
On Thu, May 15, 2025 at 7:02 AM Paulo Motta <pauloricard...@gmail.com> wrote: > > There's bi-directional entropy issues with MV's - either orphaned view > data or missing view data; that's why you kind of need a "bi-directional > ETL" to make sure the 2 agree with each other. While normal repair would > resolve the "missing data in MV" case, it wouldn't resolve the "data in MV > that's not in base table anymore" case, which afaict all base consistency > approaches (status quo, PaxosV2, Accord, Mutation Tracking) are vulnerable > to. > > I don't think that bi-directional reconciliation should be a requirement, > when the base table is assumed to be the source of truth as stated in the > CEP doc. > > I think the main issue with the current MV implementation is that each > view replica is independently replicated by the base replica, before the > base write is acknowledged. > > This creates a correctness issue in the write path, because a view update > can be created for a write that was not accepted by the coordinator in the > following scenario: > > N=RF=3 > CL=ONE > - Update U is propagated to view replica V; the coordinator, which is also base > replica B, dies before acknowledging the base table write request to the client. Now U > exists in V but not in B. > > I think in order to address this, the view update should be propagated to the > view replicas *after* the base write is accepted by all or a majority of base replicas. > This is where I think mutation tracking could probably help. > > I think this would ensure that as long as there's no data loss or bit-rot, > the base and view can be repaired independently. When there is data loss or > bit-rot in either the base table or the view, then it is the same as 2i > today: rebuild is required. > > > It'd be correct (if operationally disappointing) to be able to just say > "if you have data loss in your base table you need to rebuild the > corresponding MV's", but the problem is operators aren't always going to > know when that data loss occurs. Not everything is as visible as a lost > quorum of replicas or blown up SSTables. > > I think there are opportunities to improve rebuild speed, assuming the > base table as a source of truth. For example, rebuild only subranges when > data-loss is detected. > > On Thu, May 15, 2025 at 8:07 AM Josh McKenzie <jmcken...@apache.org> > wrote: > >> There's bi-directional entropy issues with MV's - either orphaned view >> data or missing view data; that's why you kind of need a "bi-directional >> ETL" to make sure the 2 agree with each other. While normal repair would >> resolve the "missing data in MV" case, it wouldn't resolve the "data in MV >> that's not in base table anymore" case, which afaict all base consistency >> approaches (status quo, PaxosV2, Accord, Mutation Tracking) are vulnerable >> to. >> >> It'd be correct (if operationally disappointing) to be able to just say >> "if you have data loss in your base table you need to rebuild the >> corresponding MV's", but the problem is operators aren't always going to >> know when that data loss occurs. Not everything is as visible as a lost >> quorum of replicas or blown up SSTables. >> >> On Wed, May 14, 2025, at 2:38 PM, Blake Eggleston wrote: >> >> Maybe, I’m not really familiar enough with how “classic” MV repair works >> to say. You can’t mix normal repair and mutation reconciliation in the >> current incarnation of mutation tracking though, so I wouldn’t assume it >> would work with MVs.
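To make Paulo's N=RF=3, CL=ONE scenario above concrete, here is a minimal Python sketch (illustrative only, not Cassandra internals) contrasting the current write path, where the base replica pushes the view update before the base write is acknowledged, with the ordering he suggests, where view updates are only generated once a majority of base replicas have accepted the write:

```python
# Illustrative sketch: replicas modeled as dicts, RF=3, CL=ONE (assumed toy setup).
def current_write_path(base, view, pk, ck, v, coordinator_dies):
    # View update is propagated *before* the base write is acknowledged.
    view[0][(ck, pk)] = v
    if coordinator_dies:
        return "write fails"          # base write never applied: U exists in V but not in B
    base[0][(pk, ck)] = v
    return "ack"

def proposed_write_path(base, view, pk, ck, v, coordinator_dies):
    # View update is generated only after a majority of base replicas accept.
    if coordinator_dies:
        return "write fails"          # nothing written anywhere: base and view still agree
    for replica in base[:2]:          # majority (2 of 3) of base replicas accept first
        replica[(pk, ck)] = v
    view[0][(ck, pk)] = v
    return "ack"

base_replicas = [{} for _ in range(3)]
view_replicas = [{} for _ in range(3)]
current_write_path(base_replicas, view_replicas, pk=1, ck=1, v=1, coordinator_dies=True)
print([r for r in view_replicas if r], [r for r in base_replicas if r])  # [{(1, 1): 1}] [] -> orphaned view row
```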
>> >> On Wed, May 14, 2025, at 11:29 AM, Jon Haddad wrote: >> >> In the case of bitrot / losing an SSTable, wouldn't a normal repair (just >> the MV against the other nodes) resolve the issue? >> >> On Wed, May 14, 2025 at 11:27 AM Blake Eggleston <bl...@ultrablake.com> >> wrote: >> >> >> Mutation tracking is definitely an approach you could take for MVs. >> Mutation reconciliation could be extended to ensure all changes have been >> replicated to the views. When a base table received a mutation w/ an id it >> would generate a view update. If you block marking a given mutation id as >> reconciled until it’s been fully replicated to the base table and its view >> updates have been fully replicated to the views, then all view updates will >> eventually be applied as part of the log reconciliation process. >> >> A mutation tracking implementation would also allow you to be more >> flexible with the types of consistency levels you can work with, allowing >> users to do things like use LOCAL_QUORUM without leaving themselves open to >> introducing view inconsistencies. >> >> That would more or less eliminate the need for any MV repair in normal >> usage, but wouldn't address how to repair issues caused by bugs or data >> loss, though you may be able to do something with comparing the latest >> mutation ids for the base table and its view ranges. >> >> On Wed, May 14, 2025, at 10:19 AM, Paulo Motta wrote: >> >> I don't see mutation tracking [1] mentioned in this thread or in the >> CEP-48 description. Not sure this would fit into the scope of this >> initial CEP, but I have a feeling that mutation tracking could be >> potentially helpful to reconcile base tables and views? >> >> For example, when both base and view updates are acknowledged then this >> could be somehow persisted in the view sstables mutation tracking >> summary [2] or similar metadata? Then these updates would be skipped during >> view repair, considerably reducing the amount of work needed, since only >> un-acknowledged view updates would need to be reconciled. >> >> [1] - >> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking >> [2] - https://issues.apache.org/jira/browse/CASSANDRA-20336 >> >> On Wed, May 14, 2025 at 12:59 PM Paulo Motta <pauloricard...@gmail.com> >> wrote: >> >> > - The first thing I notice is that we're talking about repairing the >> entire table across the entire cluster all in one go. It's been a *long* >> time since I tried to do a full repair of an entire table without using >> sub-ranges. Is anyone here even doing that with clusters of non-trivial >> size? How long does a full repair of a 100 node cluster with 5TB / node >> take even in the best case scenario? >> >> I haven't checked the CEP yet so I may be missing something but I >> think this effort doesn't need to be conflated with dense node support, to >> make this more approachable. I think prospective users would be OK with >> overprovisioning to make this feasible if needed. We could perhaps have >> size guardrails that limit the maximum table size per node when MVs are >> enabled. Ideally we should make it work for dense nodes if possible, but >> this shouldn't be a reason not to support the feature if it can be made to >> work reasonably with more resources.
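A rough sketch of the mutation-id gating Blake describes above, in plain Python rather than the actual CEP-45 machinery; every name here (ReconciliationTracker, PendingMutation, and so on) is hypothetical and only meant to show the ordering constraint, not a proposed API:

```python
# Hypothetical sketch of mutation-id gating for MV reconciliation; not the CEP-45 API.
from dataclasses import dataclass, field

@dataclass
class PendingMutation:
    base_acks: set = field(default_factory=set)   # base replicas that applied the mutation
    view_acks: set = field(default_factory=set)   # view replicas that applied the derived update

class ReconciliationTracker:
    def __init__(self, base_replicas, view_replicas):
        self.base_replicas = frozenset(base_replicas)
        self.view_replicas = frozenset(view_replicas)
        self.pending = {}

    def ack_base(self, mutation_id, replica):
        self.pending.setdefault(mutation_id, PendingMutation()).base_acks.add(replica)

    def ack_view(self, mutation_id, replica):
        self.pending.setdefault(mutation_id, PendingMutation()).view_acks.add(replica)

    def is_reconciled(self, mutation_id):
        # Block marking the id reconciled until it is fully replicated to the base
        # table *and* its derived view updates are fully replicated to the views.
        m = self.pending.get(mutation_id)
        return (m is not None
                and m.base_acks >= self.base_replicas
                and m.view_acks >= self.view_replicas)

tracker = ReconciliationTracker({"b1", "b2", "b3"}, {"v1", "v2", "v3"})
tracker.ack_base("m1", "b1"); tracker.ack_base("m1", "b2"); tracker.ack_base("m1", "b3")
print(tracker.is_reconciled("m1"))   # False: the view replicas have not acknowledged yet
```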
>> >> I think the main issue with the current MV is about correctness, and the >> ultimate goal of the CEP must be to provide correctness guarantees, even if >> it has an inevitable performance hit. I think that the performance of the >> repair process is definitely an important consideration and it would be >> helpful to have some benchmarks to have an idea of how long this repair >> process would take for lightweight and denser tables. >> >> On Wed, May 14, 2025 at 7:28 AM Jon Haddad <j...@rustyrazorblade.com> >> wrote: >> >> I've got several concerns around this repair process. >> >> - The first thing I notice is that we're talking about repairing the >> entire table across the entire cluster all in one go. It's been a *long* >> time since I tried to do a full repair of an entire table without using >> sub-ranges. Is anyone here even doing that with clusters of non-trivial >> size? How long does a full repair of a 100 node cluster with 5TB / node >> take even in the best case scenario? >> >> - Even in a scenario where sub-range repair is supported, you'd have to >> scan *every* sstable on the base table in order to construct the merkle >> tree, as we don't know in advance which SSTables contain the ranges that >> the MV will need. That means a subrange repair would have to do a *ton* of IO. >> Anyone who's mis-configured a sub-range incremental repair to use too many >> ranges will probably be familiar with how long it can take to anti-compact >> a bunch of SSTables. With MV sub-range repair, we'd have even more >> overhead, because we'd have to read in every SSTable, every time. If we do >> 10 subranges, we'll do 10x the IO of a normal repair. I don't think this >> is practical. >> >> - Merkle trees make sense when you're comparing tables with the same >> partition key, but I don't think they do when you're transforming a base >> table to a view. When there's a mis-match, what's transferred? We have a >> range of data in the MV, but now we have to go find that from the base >> table. That means the merkle tree needs to not just track the hashes and >> ranges, but the original keys it was transformed from, in order to go find >> all of the matching partitions in that mis-matched range. Either that or >> we end up rescanning the entire dataset in order to find the mismatches. >> >> Jon >> >> >> >> >> On Tue, May 13, 2025 at 10:29 AM Runtian Liu <curly...@gmail.com> wrote: >> >> > Looking at the details of the CEP it seems to describe Paxos as >> PaxosV1, but PaxosV2 works slightly differently (it can read during the >> prepare phase). I assume that supporting Paxos means supporting both V1 and >> V2 for materialized views? >> We are going to support Paxos V2. The CEP is not clear on that; we will >> update it to clarify. >> >> It looks like the online portion is now fairly well understood. For the >> offline repair part, I see two main concerns: one around the scalability of >> the proposed approach, and another regarding how it handles tombstones. >> >> *Scalability:* >> I have added a *section* >> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-48%3A+First-Class+Materialized+View+Support#CEP48:FirstClassMaterializedViewSupport-MVRepairVSFullRepairwithanExample> >> in the CEP with an example to compare full repair and the proposed MV >> repair; the overall scalability should not be a problem. >> >> Consider a dataset with tokens from 1 to 4 and a cluster of 4 nodes, >> where each node owns one token.
The base table uses (pk, ck) as its primary >> key, while the materialized view (MV) uses (ck, pk) as its primary key. >> Both tables include a value column v, which allows us to correlate rows >> between them. The dataset consists of 16 records, distributed as follows: >> >> >> *Base table* >> (pk, ck, v) >> (1, 1, 1), (1, 2, 2), (1, 3, 3), (1, 4, 4) // N1 >> (2, 1, 5), (2, 2, 6), (2, 3, 7), (2, 4, 8) // N2 >> (3, 1, 9), (3, 2, 10), (3, 3, 11), (3, 4, 12) // N3 >> (4, 1, 13), (4, 2, 14), (4, 3, 15), (4, 4, 16) // N4 >> >> >> >> *Materialized view* >> (ck, pk, v) >> (1, 1, 1), (1, 2, 5), (1, 3, 9), (1, 4, 13) // N1 >> (2, 1, 2), (2, 2, 6), (2, 3, 10), (2, 4, 14) // N2 >> (3, 1, 3), (3, 2, 7), (3, 3, 11), (3, 4, 15) // N3 >> (4, 1, 4), (4, 2, 8), (4, 3, 12), (4, 4, 16) // N4 >> >> >> The chart below compares one round of full repair with one round of MV >> repair. As shown, both scan the same total number of rows. However, MV >> repair has higher time complexity because its Merkle tree processes each >> row more intensively. To avoid all nodes scanning the entire table >> simultaneously, MV repair should use a snapshot-based approach, similar to >> normal repair with the --sequential option. The time complexity increase >> compared to full repair can be found in the "Complexity and Memory >> Management" section. >> >> >> n: number of rows >> >> d: depth of one Merkle tree for MV repair >> >> d': depth of one Merkle tree for full repair >> >> r: number of split ranges >> >> Assuming each leaf node covers the same number of rows, 2^d' = (2^d) * r. >> >> We can see that the space complexity is the same, while MV repair has >> higher time complexity. However, this should not pose a significant issue >> in production, as the Merkle tree depth and the number of split ranges are >> typically not large. >> >> >> 1 Round Merkle Tree Building Complexity >> Time complexity: Full Repair O(n), MV Repair O(n*d*log(r)) >> Space complexity: Full Repair O((2^d')*r), MV Repair O((2^d)*r^2) = O((2^d')*r) >> >> *Tombstone:* >> >> The current proposal focuses on rebuilding the MV for a granular token >> range where a mismatch is detected, rather than rebuilding the entire MV >> token range. Since the MV is treated as a regular table, standard full or >> incremental repair processes should still apply to both the base and MV >> tables to keep their replicas in sync. >> >> Regarding tombstones, if we introduce special tombstone types or handling >> mechanisms for the MV table, we may be able to support tombstone >> synchronization between the base table and the MV. I plan to spend more >> time exploring whether we can introduce changes to the base table that >> enable this synchronization. >> >> >> >> On Mon, May 12, 2025 at 11:35 AM Jaydeep Chovatia < >> chovatia.jayd...@gmail.com> wrote: >> >> >Like something doesn't add up here because if it always includes the >> base table's primary key columns that means >> >> The requirement for materialized views (MVs) to include the base table's >> primary key appears to be primarily a syntactic constraint specific to >> Apache Cassandra. For instance, in DynamoDB, the DDL for defining a Global >> Secondary Index does not mandate inclusion of the base table's primary key. >> This suggests that the syntax requirement in Cassandra could potentially be >> relaxed in the future (outside the scope of this CEP). As Benedict noted, >> the base table's primary key is optional when querying a materialized view.
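Restating the relation behind the complexity figures earlier in Runtian's message (one reading of the definitions of n, d, d' and r given there, so treat it as a sketch rather than part of the CEP): if each leaf covers the same number of rows then 2^d' = 2^d * r, which is why the two space bounds coincide.

```latex
2^{d'} = 2^{d} \cdot r
\;\Longrightarrow\;
\underbrace{O\!\left(2^{d} \cdot r^{2}\right)}_{\text{MV repair space}}
  = O\!\left((2^{d} \cdot r)\, r\right)
  = \underbrace{O\!\left(2^{d'} \cdot r\right)}_{\text{full repair space}},
\qquad
\underbrace{O(n \cdot d \cdot \log r)}_{\text{MV repair time}}
  \quad\text{vs.}\quad
\underbrace{O(n)}_{\text{full repair time}}
```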
>> >> Jaydeep >> On Mon, May 12, 2025 at 10:45 AM Jon Haddad <j...@rustyrazorblade.com> >> wrote: >> >> >> > Or compaction hasn’t made a mistake, or cell merge reconciliation >> hasn’t made a mistake, or volume bitrot hasn’t caused you to lose a file. >> > Repair isn’t just about “have all transaction commits landed”. It’s “is >> the data correct N days after it’s written”. >> >> Don't forget about restoring from a backup. >> >> Is there a way we could do some sort of hybrid compaction + incremental >> repair? Maybe have the MV verify its view while it's compacting, and when >> it's done, mark the view's SSTable as repaired? Then the repair process >> would only need to do an MV to MV repair. >> >> Jon >> >> >> On Mon, May 12, 2025 at 9:37 AM Benedict Elliott Smith < >> bened...@apache.org> wrote: >> >> Like something doesn't add up here because if it always includes the base >> table's primary key columns that means they could be storage attached by >> just forbidding additional columns and there doesn't seem to be much >> utility in including additional columns in the primary key? >> >> >> You can re-order the keys, and they only need to be a part of the primary >> key not the partition key. I think you can specify an arbitrary order to >> the keys also, so you can change the effective sort order. So, the basic >> idea is you stipulate something like PRIMARY KEY ((v1),(ck1,pk1)). >> >> This is basically a global index, with the restriction on single columns >> as keys only because we cannot cheaply read-before-write for eventually >> consistent operations. This restriction can easily be relaxed for Paxos and >> Accord based implementations, which can also safely include additional keys. >> >> That said, I am not at all sure why they are called materialised views if >> we don’t support including any other data besides the lookup column and the >> primary key. We should really rename them once they work, both to make some >> sense and to break with the historical baggage. >> >> I think this can be represented as a tombstone which can always be >> fetched from the base table on read or maybe some other arrangement? I >> agree it can't feasibly be represented as an enumeration of the deletions >> at least not synchronously and doing it async has its own problems. >> >> If the base table must be read on read of an index/view, then I think >> this proposal is approximately linearizable for the view as well (though, I >> do not at all warrant this statement). You still need to propagate this >> eventually so that the views can clean up. This also makes reads 2RT on >> read, which is rather costly. >> >> On 12 May 2025, at 16:10, Ariel Weisberg <ar...@weisberg.ws> wrote: >> >> Hi, >> >> I think it's worth taking a step back and looking at the current MV >> restrictions which are pretty onerous. >> >> A view must have a primary key and that primary key must conform to the >> following restrictions: >> >> - it must contain all the primary key columns of the base table. This >> ensures that every row of the view corresponds to exactly one row of the >> base table. >> - it can only contain a single column that is not a primary key >> column in the base table. >> >> At that point what exactly is the value in including anything except the >> original primary key in the MV's primary key columns unless you are using >> an ordered partitioner so you can iterate based on the leading primary key >> columns?
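A tiny, illustrative-only Python check of the two restrictions Ariel quotes above (the view key must contain every base primary key column, plus at most one other column); the ((v1),(ck1,pk1)) shape Benedict mentions passes the same check, and the column names are just placeholders:

```python
# Illustrative only: encodes the two documented MV primary key restrictions quoted above.
BASE_PRIMARY_KEY = {"pk1", "ck1"}          # placeholder base table primary key columns

def is_allowed_view_key(view_key_columns):
    extra = set(view_key_columns) - BASE_PRIMARY_KEY
    return BASE_PRIMARY_KEY <= set(view_key_columns) and len(extra) <= 1

print(is_allowed_view_key(["v1", "ck1", "pk1"]))        # True: the PRIMARY KEY ((v1),(ck1,pk1)) shape
print(is_allowed_view_key(["v1", "v2", "ck1", "pk1"]))  # False: more than one non-key column
print(is_allowed_view_key(["v1", "pk1"]))               # False: drops ck1, so view rows no longer map 1:1 to base rows
```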
>> >> Like something doesn't add up here because if it always includes the base >> table's primary key columns that means they could be storage attached by >> just forbidding additional columns and there doesn't seem to be much >> utility in including additional columns in the primary key? >> >> I'm not that clear on how much better it is to look something up in the >> MV vs just looking at the base table or some non-materialized view of it. >> How exactly are these MVs supposed to be used and what value do they >> provide? >> >> Jeff Jirsa wrote: >> >> There’s 2 things in this proposal that give me a lot of pause. >> >> >> Runtian Liu pointed out that the CEP is sort of divided into two parts. >> The first is the online part which is making reads/writes to MVs safer and >> more reliable using a transaction system. The second is offline which is >> repair. >> >> The story for the online portion I think is quite strong and worth >> considering on its own merits. >> >> The offline portion (repair) sounds a little less feasible to run in >> production, but I also think that MVs without any mechanism for checking >> their consistency are not viable to run in production. So it's kind of pay >> for what you use in terms of the feature? >> >> It's definitely worth thinking through if there is a way to fix one side >> of this equation so it works better. >> >> David Capwell wrote: >> >> As far as I can tell, being based off Accord means you don’t need to care >> about repair, as Accord will manage the consistency for you; you can’t get >> out of sync. >> >> I think a baseline requirement in C* for something to be in production is >> to be able to run preview repair and validate that the transaction system >> or any other part of Cassandra hasn't made a mistake. Divergence can have >> many sources including Accord. >> >> Runtian Liu wrote: >> >> For the example David mentioned, LWT cannot support it. Since LWTs operate >> on a single token, we'll need to restrict base-table updates to one >> partition—and ideally one row—at a time. A current MV base-table command >> can delete an entire partition, but doing so might touch hundreds of MV >> partitions, making consistency guarantees impossible. >> >> I think this can be represented as a tombstone which can always be >> fetched from the base table on read or maybe some other arrangement? I >> agree it can't feasibly be represented as an enumeration of the deletions >> at least not synchronously and doing it async has its own problems. >> >> Ariel >> >> On Fri, May 9, 2025, at 4:03 PM, Jeff Jirsa wrote: >> >> >> >> On May 9, 2025, at 12:59 PM, Ariel Weisberg <ar...@weisberg.ws> wrote: >> >> >> I am a *big* fan of getting repair really working with MVs. It does seem >> problematic that the number of merkle trees will be equal to the number of >> ranges in the cluster and repair of MVs would become an all-node >> operation. How would down nodes be handled and how many nodes would >> simultaneously be working to validate a given base table range at once? How >> many base table ranges could simultaneously be repairing MVs? >> >> If a row containing a column that creates an MV partition is deleted, and >> the MV isn't updated, then how does the merkle tree approach propagate the >> deletion to the MV? The CEP says that anti-compaction would remove extra >> rows, but I am not clear on how that works. When is anti-compaction >> performed in the repair process and what is/isn't included in the outputs?
>> >> >> I thought about these two points last night after I sent my email. >> >> There’s 2 things in this proposal that give me a lot of pause. >> >> One is the lack of tombstones / deletions in the merkle trees, which makes >> properly dealing with writes/deletes/inconsistency very hard (afaict). >> >> The second is the reality that repairing a single partition in the base >> table may repair all hosts/ranges in the MV table, and vice versa. >> Basically scanning either base or MV is effectively scanning the whole >> cluster (modulo what you can avoid in the clean/dirty repaired sets). This >> makes me really, really concerned with how it scales, and how likely it is >> to be able to schedule automatically without blowing up. >> >> The paxos vs accord comments so far are interesting in that I think both >> could be made to work, but I am very concerned about how the merkle tree >> comparisons are likely to work with wide partitions leading to massive >> fanout in ranges.
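To put a number on the fanout concern above, a minimal sketch reusing the 4-token example from earlier in the thread (the identity token mapping is an assumption, not Cassandra's partitioner): every row of a single base partition lands in a different view partition, so one base partition spans view ranges on all four nodes.

```python
# Toy illustration of base-partition -> view-partition fanout from the 4-node example above.
base_partition_pk1 = [(1, ck, v) for ck, v in [(1, 1), (2, 2), (3, 3), (4, 4)]]  # pk=1, lives on N1

def view_partition(pk, ck, v):
    return ck   # the MV is keyed by (ck, pk), so ck determines the owning token/node

touched = sorted({view_partition(*row) for row in base_partition_pk1})
print(touched)  # [1, 2, 3, 4] -- repairing this one base partition touches view ranges on every node
```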