It’s an additional piece of work. If you need to be able to rebuild this data, then you need the original proposal either way. This proposal to maintain a live-updating snapshot is therefore an additional feature on top of the proposed MVP.

I don’t think this new proposal is fully fleshed out; I have a lot more questions about it than I do about the original proposal. I also don’t think it is healthy to weigh down other contributors’ proposals that move the state of the database forward with our own design goals, without very strong justification. I think it is fine to say to users: repair exists, it shouldn’t ordinarily need to be run, but if you do run it, it will for now require a snapshotting process. This would be acceptable to me as an operator (workload dependent), and I’m sure it would be acceptable to others, including the contributors undertaking the work.

In the meantime there’s time to sketch out how such an online update process would work in more detail. I think it is achievable but much less obvious than building a snapshot.

On 21 May 2025, at 21:00, Blake Eggleston <bl...@ultrablake.com> wrote:


I don’t think it’s trivial, but I also don’t think it’s any more difficult than adding a mechanism to snapshot and build this Merkle tree grid. Remember, we can’t just start a full cluster-wide table scan at full blast every time we want to start a new repair cycle. There’s going to need to be some gradual build coordination.

I also don’t think this belongs in a follow-on task. I think that the original proposal is incomplete without a better repair story, and I’m not sure I’d support the CEP if it proceeded as-is.

On Wed, May 21, 2025, at 12:22 PM, Benedict wrote:

Depending on how long the grid structure takes to build, there is perhaps value anyway in being able to update the snapshot after construction, so that when the repair is performed it is as up to date as possible. But I don’t think this is trivial? I have some ideas about how this might be done, but they aren’t ideal, and could be costly or error prone. Have you sketched out a mechanism for this?

I like the idea that the same mechanism could be used to build a one-off snapshot or maintain a live one, leaving the operator to decide what they prefer.

Since this seems like an extension to the original proposal, I would suggest the original proposal is advanced and live updates to the snapshot are developed in follow-up work.


On 21 May 2025, at 17:45, Blake Eggleston <bl...@ultrablake.com> wrote:


1. Isn't this hybrid approach conceptually similar to the grid structure described in the proposal? The main distinction is that the original proposal involves recomputing the entire grid during each repair cycle. In contrast, the approach outlined below optimizes this by only reconstructing individual cells marked as dirty due to recent updates.

Yes, it is. FWIW I think the grid structure is the right approach conceptually, but it has some drawbacks as proposed that I think we should try to improve. The simple index approach takes the same high-level approach with a different set of tradeoffs.

2. If we adopt the dirty marker approach, it may not account for cases where there were no user writes, but inconsistencies still arose between the base and the view—such as those caused by SSTable bit rot, streaming anomalies, or other low-level issues.

That's true. You could force a rebuild of the marker table if you knew there was a problem, and it might even be a good idea to have a process that slowly rebuilds the table in the background. The important part is that there's a low upfront cost to starting a repair, and that individual range pairs can be repaired quickly (by repair standards).

Another advantage of having a more lightweight repair mechanism is that we can tolerate riskier client patterns. For instance, if we went with a Paxos-based approach, I think LOCAL_SERIAL writes would become acceptable, whereas I think we’d only be able to allow SERIAL with the heavier repair.

On Tue, May 20, 2025, at 8:37 PM, Jaydeep Chovatia wrote:
1. Isn't this hybrid approach conceptually similar to the grid structure described in the proposal? The main distinction is that the original proposal involves recomputing the entire grid during each repair cycle. In contrast, the approach outlined below optimizes this by only reconstructing individual cells marked as dirty due to recent updates.
2. If we adopt the dirty marker approach, it may not account for cases where there were no user writes, but inconsistencies still arose between the base and the view—such as those caused by SSTable bit rot, streaming anomalies, or other low-level issues.
<image.png>

Jaydeep

On Tue, May 20, 2025 at 5:31 PM Blake Eggleston <bl...@ultrablake.com> wrote:

I had an idea that’s kind of a hybrid between the index approach and the Merkle tree approach. Basically we keep something kind of like an index that only contains a hash of the data at the intersection of a base partition and a view partition. So it would structure data like this:

view_range -> base_token -> view_token -> contents_hash

Both the base and view would maintain identical structures, and instead of trying to keep the hash always up to date with the data on disk, like an index, we would just mark base/view token combos as dirty when we get a write to a given base/view token combo. When we do a repair on a base/view range intersection, we recompute the content hashes for any dirty entries. Possibly with a background job that makes sure we don’t accumulate too many dirty entries if repair isn’t running often or something.

So that’s 3 longs for each base/view intersection, comparable via sequential reads, and would allow us to quickly detect any inconsistencies between the base and view.
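A minimal sketch of how this might fit together, in Python. Everything here (class name, methods, hash choice) is invented for illustration and is not from any Cassandra implementation:

```python
import hashlib

class IntersectionIndex:
    """Toy model of the (base_token, view_token) -> contents_hash structure,
    with a dirty set that writes update instead of recomputing hashes."""

    def __init__(self):
        self.hashes = {}    # (base_token, view_token) -> digest bytes
        self.dirty = set()  # combos touched by writes since the last repair

    def on_write(self, base_token, view_token):
        # A write just marks the combo dirty; no hashing on the write path.
        self.dirty.add((base_token, view_token))

    def prepare_repair(self, read_rows):
        # At repair time, recompute hashes only for dirty combos, then
        # return the full set sorted so replicas can compare sequentially.
        for base_token, view_token in sorted(self.dirty):
            h = hashlib.sha256()
            for row in read_rows(base_token, view_token):
                h.update(row)
            self.hashes[(base_token, view_token)] = h.digest()
        self.dirty.clear()
        return sorted(self.hashes.items())
```

Two replicas holding identical structures would detect an inconsistency at the first (base_token, view_token) pair whose digests differ, which is the cheap sequential comparison described above.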

On Tue, May 20, 2025, at 2:33 PM, Jaydeep Chovatia wrote:
>* Consistency question: In the case where a base table gets a corrupt SSTable and is scrubbed, when it repairs against the view, without tracking the deletes against the secondary table, do we end up pushing the lack of data into the MV? 
>I think we'd still need to combine the output from the other replicas so that doesn't happen.

SSTable corruption is one potential issue, but there are additional scenarios to consider—such as a node missing data. For this reason, the example in the proposal includes all replicas when performing materialized view (MV) repair, as a precaution to ensure better Base<->MV consistency. Here's a snippet...

<image.png>

Jaydeep

On Tue, May 20, 2025 at 1:55 PM Blake Eggleston <bl...@ultrablake.com> wrote:

* Consistency question: In the case where a base table gets a corrupt SSTable and is scrubbed, when it repairs against the view, without tracking the deletes against the secondary table, do we end up pushing the lack of data into the MV? 

I think we'd still need to combine the output from the other replicas so that doesn't happen.

* I threw out the idea earlier in the thread that we could track tombstones in something attached to the SSTable (tombstone index?).  I'm curious what you all think about this.  Without it, we don't have a way of knowing what's a delete and what's missing.  With it, we simply record all the deletes that happened in the MV as a consequence of base table updates, and we store it as an SSTable component.

The issue with tracking tombstones is that you're not guaranteed to see all of them if they overlap. So you may over repair, though that's not necessarily a bad thing.

* Is there a case where with Accord, 2 transactions can get the same timestamp?  If the MVs are managed with Accord (or Paxos), can we just rely on a calculation from the most recent cell's timestamps to determine if we've got missed writes?

Within a single partition, timestamps should always be unique and incrementing. I think that would work on a per-column level.

* With the two SSTable components, a tombstone index & a MV index, (if all my assumptions above are correct) I think we should have all the data we need to detect inconsistencies.

I think we can detect inconsistencies with just an index.

* Seems like we should be repairing against the base table before we repair the MV, or at least do in conjunction when we repair a range segment.

Yeah that seems reasonable

On Tue, May 20, 2025, at 10:28 AM, Jon Haddad wrote:
More questions and thoughts...

* Consistency question: In the case where a base table gets a corrupt SSTable and is scrubbed, when it repairs against the view, without tracking the deletes against the secondary table, do we end up pushing the lack of data into the MV? 

* I threw out the idea earlier in the thread that we could track tombstones in something attached to the SSTable (tombstone index?).  I'm curious what you all think about this.  Without it, we don't have a way of knowing what's a delete and what's missing.  With it, we simply record all the deletes that happened in the MV as a consequence of base table updates, and we store it as an SSTable component.

* Is there a case where with Accord, 2 transactions can get the same timestamp?  If the MVs are managed with Accord (or Paxos), can we just rely on a calculation from the most recent cell's timestamps to determine if we've got missed writes?

* If the above is true, then could the index simply be the view's orientation of the data + the timestamp of the last cell write?  That should compress exceptionally well, especially if we are using tries.  Since the cells would be written and merged in the MV's partition order, we could still create a Merkle tree and it could maintain all of its current insertion properties; we'd just have a different scheme for generating the hashes, using cell timestamps.

* With the two SSTable components, a tombstone index & a MV index, (if all my assumptions above are correct) I think we should have all the data we need to detect inconsistencies.

* Seems like we should be repairing against the base table before we repair the MV, or at least do in conjunction when we repair a range segment.

If we can do the 2 components it should significantly cut down on repair time and give us better consistency.  The downside is that it doesn't give us a good path to fix existing tables.

Thoughts?

Jon


On Mon, May 19, 2025 at 4:20 PM Blake Eggleston <bl...@ultrablake.com> wrote:

Right, we can’t literally use Merkle trees. What I mean is that it’s worth looking into alternate index schemes that could work for our use case. There are a lot of encoding schemes out there designed to detect errors. Even something probabilistic that caused us to over-repair by some amount might be ok.

On Mon, May 19, 2025, at 2:15 PM, Runtian Liu wrote:
> You don’t need to duplicate the full data set in the index, you just need enough info to detect that something is missing.
Could you please explain how this would work?
If we build Merkle trees or compute hashes at the SSTable level, how would this case be handled?
For example, consider the following table schema:
CREATE TABLE t (
  pk int PRIMARY KEY,
  v1 int,
  v2 int,
  v3 int
);


Suppose the data is stored as follows:
Node 1


  • SSTable1: (1, 1, null, 1)

  • SSTable2: (1, null, 1, 1)

Node 2

  • SSTable3: (1, 1, 1, 1)

How can we ensure that a hash or Merkle tree computed at the SSTable level would produce the same result on both nodes for this row?
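As a concrete toy illustration of the question (Python, not Cassandra code; the merge here is a simplified non-null-wins reconciliation that ignores timestamps): hashing each SSTable independently gives different results on the two nodes, while hashing the merged row agrees.

```python
import hashlib

def row_hash(cells):
    # Hash one (pk, v1, v2, v3) row representation.
    h = hashlib.sha256()
    for c in cells:
        h.update(repr(c).encode())
    return h.hexdigest()

def merge(*versions):
    # Simplified reconciliation: a later version's non-null cell wins.
    # (Real Cassandra compares cell timestamps.)
    merged = list(versions[0])
    for v in versions[1:]:
        merged = [b if b is not None else a for a, b in zip(merged, v)]
    return merged

node1_sstable1 = [1, 1, None, 1]  # (pk, v1, v2, v3)
node1_sstable2 = [1, None, 1, 1]
node2_sstable3 = [1, 1, 1, 1]

# Per-SSTable hashes disagree across the two nodes...
assert row_hash(node1_sstable1) != row_hash(node2_sstable3)
# ...but hashing the merged row gives the same answer on both nodes.
assert row_hash(merge(node1_sstable1, node1_sstable2)) == row_hash(node2_sstable3)
```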


On Mon, May 19, 2025 at 1:54 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
We could also track the deletes that need to be made to the view, in another SSTable component on the base.  That way you can actually do repair with tombstones.




On Mon, May 19, 2025 at 11:37 AM Blake Eggleston <bl...@ultrablake.com> wrote:

If we went the storage attached route then I think you’d just need more memory for the memtable, compaction would just be combining 2 sorted sets, though there would probably be some additional work related to deletes, overwrites, and tombstone purging.

Regarding the size of the index, I think Jon was on the right track with his sstable attached merkle tree idea. You don’t need to duplicate the full data set in the index, you just need enough info to detect that something is missing. If you can detect that view partition x is missing data from base partition y, then you could start comparing the actual partition data and figure out who’s missing what.

On Sun, May 18, 2025, at 9:20 PM, Runtian Liu wrote:
> If you had a custom SAI index or something, this isn’t something you’d need to worry about
This is what I missed.

I think this could be a potential solution, but comparing indexes alone isn’t sufficient—it only handles cases where the MV has extra or missing rows. It doesn’t catch data mismatches for rows that exist in both the base table and MV. To address that, we may need to extend SAI for MV to store the entire selected dataset in the index file, applying the same approach to MV as we do for the base table. This would increase storage to roughly 4x per MV, compared to the current 2x, but it would help avoid random disk access during repair. I’m not sure if this would introduce any memory issues during compaction.



On Sun, May 18, 2025 at 8:09 PM Blake Eggleston <bl...@ultrablake.com> wrote:

It might be more efficient, but it’s also more brittle. I think it would be more fault tolerant and less trouble overall to repair intersecting token ranges. So you’re not repairing a view partition, you’re repairing the parts of a view partition that intersect with a base table token range.

The issues I see with the global snapshot are:

1. Requiring a global snapshot means that you can’t start a new repair cycle if there’s a node down. 
2. These Merkle trees can’t all be calculated at once, so we’ll need a coordination mechanism to spread out scans of the snapshots.
3. By requiring a global snapshot and then building merkle trees from that snapshot, you’re introducing a delay of however long it takes you to do a full scan of both tables. So if you’re repairing your cluster every 3 days, it means the last range to get repaired is repairing based on a state that’s now 3 days old. This makes your repair horizon 2x your scheduling cadence and puts an upper bound on how up to date you can keep your view.

With an index based approach, much of the work is just built into the write and compaction paths and repair is just a scan of the intersecting index segments from the base and view tables. You’re also repairing from the state that existed when you started your repair, so your repair horizon matches your scheduling cadence.

On Sun, May 18, 2025, at 7:45 PM, Jaydeep Chovatia wrote:
>Isn’t the reality here that repairing a single partition in the base table is potentially a full cluster-wide scan of the MV, if you also want to detect rows in the MV that don’t exist in the base table (e.g. resurrection or a missed delete)?
Exactly. Since materialized views (MVs) are partitioned differently from their base tables, there doesn’t appear to be a more efficient way to repair them in a targeted manner—meaning we can’t restrict the repair to only a small portion of the data.

Jaydeep

On Sun, May 18, 2025 at 5:57 PM Jeff Jirsa <jji...@gmail.com> wrote:

Isn’t the reality here that repairing a single partition in the base table is potentially a full cluster-wide scan of the MV, if you also want to detect rows in the MV that don’t exist in the base table (e.g. resurrection or a missed delete)?

There’s no getting around that. Keeping an extra index doesn’t avoid that scan, it just moves the problem around to another tier. 



On May 18, 2025, at 4:59 PM, Blake Eggleston <bl...@ultrablake.com> wrote:

Whether it’s index based repair or another mechanism, I think the proposed repair design needs to be refined. The requirement of a global snapshot and merkle tree build before we can start detecting and fixing problems is a pretty big limitation.

> Data scans during repair would become random disk accesses instead of sequential ones, which can degrade performance.

You’d only be reading and comparing the index files, not the sstable contents. Reads would still be sequential.

> Most importantly, I decided against this approach due to the complexity of ensuring index consistency. Introducing secondary indexes opens up new challenges, such as keeping them in sync with the actual data.

I think this is mostly a solved problem in C*? If you had a custom SAI index or something, this isn’t something you’d need to worry about AFAIK.

On Sat, May 17, 2025, at 4:57 PM, Runtian Liu wrote:
> I think you could exploit this to improve your MV repair design. Instead of taking global snapshots and persisting merkle trees, you could implement a set of secondary indexes on the base and view tables that you could quickly compare the contents of for repair. 

We actually considered this approach while designing the MV repair. However, there are several downsides:

  1. It requires additional storage for the index files.

  2. Data scans during repair would become random disk accesses instead of sequential ones, which can degrade performance.

  3. Most importantly, I decided against this approach due to the complexity of ensuring index consistency. Introducing secondary indexes opens up new challenges, such as keeping them in sync with the actual data.

The goal of the design is to provide a catch-all mismatch detection mechanism that targets the dataset users query during the online path. I did consider adding indexes at the SSTable level to guarantee consistency between indexes and data. 
> sorted by base table partition order, but segmented by view partition ranges
If the indexes are at the SSTable level, they will be less flexible: we would need to rewrite the SSTables if we decide to change the view partition ranges.
I didn’t explore this direction further due to the issues listed above.

> The transformative repair could be done against the local index, and the local index can repair against the global index. It opens up a lot of possibilities, query wise, as well. 
This is something I’m not entirely sure about—how exactly do we use the local index to support the global index (i.e., the MV)? If the MV relies on local indexes during the query path, we can definitely dig deeper into how repair could work with that design.

The proposed design in this CEP aims to treat the base table and its MV like any other regular tables, so that operations such as compaction and repair can be handled in the same way in most cases.


On Sat, May 17, 2025 at 2:42 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
Yeah, this is exactly what i suggested in a different part of the thread. The transformative repair could be done against the local index, and the local index can repair against the global index. It opens up a lot of possibilities, query wise, as well. 


On Sat, May 17, 2025 at 1:47 PM Blake Eggleston <bl...@ultrablake.com> wrote:

> They are not two unordered sets, but rather two sets ordered by different keys.

I think you could exploit this to improve your MV repair design. Instead of taking global snapshots and persisting merkle trees, you could implement a set of secondary indexes on the base and view tables that you could quickly compare the contents of for repair. 

The indexes would have their contents sorted by base table partition order, but segmented by view partition ranges. Then any view <-> base repair would compare the intersecting index slices. That would allow you to repair data more quickly and with less operational complexity.
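A toy sketch of that segmentation in Python (the range boundaries and entries are invented for illustration): entries stay in base-token order inside each bucket, and each bucket corresponds to one view token range, so a view <-> base repair reads only the intersecting slice.

```python
import bisect

VIEW_RANGE_BOUNDS = [10, 20, 30]  # hypothetical view token range upper bounds

def segment(view_token):
    # Which view range bucket a view token falls into.
    return bisect.bisect_left(VIEW_RANGE_BOUNDS, view_token)

# One list per view segment, each kept sorted by base token.
index = {s: [] for s in range(len(VIEW_RANGE_BOUNDS))}
for base_token, view_token in [(1, 1), (2, 11), (12, 1), (23, 1)]:
    bisect.insort(index[segment(view_token)], (base_token, view_token))

# Repairing one view range touches only its slice of the index:
assert index[1] == [(2, 11)]
assert index[0] == [(1, 1), (12, 1), (23, 1)]
```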

On Fri, May 16, 2025, at 12:32 PM, Runtian Liu wrote:

<image.png>
For example, in the chart above, each cell represents a Merkle tree that covers data belonging to a specific base table range and a specific MV range. When we scan a base table range, we can generate the Merkle trees marked in red. When we scan an MV range, we can generate the Merkle trees marked in green. The cells that can be compared are marked in blue.

To save time and CPU resources, we persist the Merkle trees created during a scan so we don’t need to regenerate them later. This way, when other nodes scan and build Merkle trees based on the same “frozen” snapshot, we can reuse the existing Merkle trees for comparison. 


On Fri, May 16, 2025 at 12:22 PM Runtian Liu <curly...@gmail.com> wrote:
Unfortunately, no. When building Merkle trees for small token ranges in the base table, those ranges may span the entire MV token range. As a result, we need to scan the entire MV to generate all the necessary Merkle trees. For efficiency, we perform this as a single pass over the entire table rather than scanning a small range of the base or MV table individually. As you mentioned, with storage becoming increasingly affordable, this approach helps us save time and CPU resources.

On Fri, May 16, 2025 at 12:11 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
I spoke too soon - endless questions are not over :)

Since the data that's going to be repaired only covers a range, I wonder if it makes sense to have the ability to issue a minimalist snapshot that only hardlinks SSTables that are in a token range.  Based on what you (Runtian) have said above, only a small percentage of the data would actually be repaired at any given time. 

Just a thought to save a little filesystem churn.


On Fri, May 16, 2025 at 10:55 AM Jon Haddad <j...@rustyrazorblade.com> wrote:
Never mind about the height thing, I guess it’s the same property.

I’m done for now :)

Thanks for entertaining my endless questions. My biggest concerns about repair have been alleviated. 

Jon

On Fri, May 16, 2025 at 10:34 AM Jon Haddad <j...@rustyrazorblade.com> wrote:
That’s the critical bit I was missing, thank you Blake.

I guess we’d need to have unlimited height trees then, since you’d need to be able to update the hashes of individual partitions, and we’d also need to propagate the hashes up every time as well. I’m curious what the cost will look like with that. 

At least it’s a CPU problem, not an I/O one.

Jon


On Fri, May 16, 2025 at 10:04 AM Blake Eggleston <bl...@ultrablake.com> wrote:

The Merkle tree XORs the individual row hashes together, which is commutative. So you should be able to build a tree in view token order while reading in base table token order, and vice versa.
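That property is easy to demonstrate with a few of the sample (v1, id) rows used elsewhere in this thread: because XOR is commutative and associative, the combined hash is the same whether rows are visited in base (id) order or view (v1) order.

```python
import hashlib

def h(row):
    # 64-bit row hash derived from SHA-256 (illustrative choice).
    return int.from_bytes(hashlib.sha256(repr(row).encode()).digest()[:8], "big")

rows = [(11, 10), (14, 1), (10, 19), (14, 2), (14, 3)]  # (v1, id)

base_order = 0
for row in sorted(rows, key=lambda r: r[1]):  # scan in base (id) order
    base_order ^= h(row)

view_order = 0
for row in sorted(rows, key=lambda r: r[0]):  # scan in view (v1) order
    view_order ^= h(row)

assert base_order == view_order  # order of combination doesn't matter
```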

On Fri, May 16, 2025, at 9:54 AM, Jon Haddad wrote:
Thanks for the explanation, I appreciate it.  I think you might still be glossing over an important point - which I'll make singularly here.  There's a number of things I'm concerned about, but this is a big one.

Calculating the hash of a partition for a Merkle tree needs to be done on the fully materialized, sorted partition.

The examples you're giving are simple, to the point where they hide the problem.  Here's a better example, where the MV has a clustering column. In the MV's partition it'll have multiple rows, but in the base table it'll be stored in different pages or different SSTables entirely:

CREATE TABLE test.t1 (
    id int PRIMARY KEY,
    v1 int
);

CREATE MATERIALIZED VIEW test.test_mv AS
    SELECT v1, id
    FROM test.t1
    WHERE id IS NOT NULL AND v1 IS NOT NULL
    PRIMARY KEY (v1, id)
    WITH CLUSTERING ORDER BY (id ASC);


Let's say we have some test data:

cqlsh:test> select id, v1 from t1;

 id | v1
----+----
 10 | 11
  1 | 14
 19 | 10
  2 | 14
  3 | 14

When we transform the data by iterating over the base table, we get this representation (note v1=14):

cqlsh:test> select v1, id from t1;

 v1 | id
----+----
 11 | 10
 14 |  1   <------
 10 | 19
 14 |  2 <------
 14 |  3  <------


The partition key in the new table is v1.  If you simply iterate, transform, and calculate Merkle trees on the fly, you'll hit v1=14 with id=1, but you'll miss id=2 and id=3.  You need to get them all up front, and in sorted order, before you calculate the hash.  You actually need to transform the data to this, prior to calculating the tree:

v1 | id
----+----
 11 | 10
 14 |  1, 2, 3
 10 | 19

Without an index you need to do one of the following over a dataset that's hundreds of GB:

* for each partition, scan the entire range for all the data, then sort that partition in memory, then calculate the hash
* collect the entire dataset in memory, transform and sort it
* use a local index which has the keys already sorted
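The grouping being described can be sketched in Python using the sample data above (this shows just the transformation, ignoring the memory cost that makes it impractical at scale):

```python
from collections import defaultdict

base_rows = [(10, 11), (1, 14), (19, 10), (2, 14), (3, 14)]  # (id, v1)

# Regroup by the view's partition key (v1), then sort each partition by
# the clustering column (id) so it can be hashed as one sorted unit.
view = defaultdict(list)
for id_, v1 in base_rows:
    view[v1].append(id_)
for v1 in view:
    view[v1].sort()

assert view[14] == [1, 2, 3]  # all three v1=14 rows gathered before hashing
```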

A similar problem exists when trying to resolve the mismatches.

Unless I'm missing some critical detail, I can't see how this will work without requiring nodes have hundreds of GB of RAM or we do several orders of magnitude more I/O than a normal repair.

Jon



On Thu, May 15, 2025 at 9:09 PM Runtian Liu <curly...@gmail.com> wrote:
Thank you for the thoughtful questions, Jon. I really appreciate them—let me go through them one by one.
Do you intend on building all the Merkle trees in parallel?

Since we take a snapshot to "freeze" the dataset, we don’t need to build all Merkle trees in parallel.


* Will there be hundreds of files doing random IO to persist the trees to disk, in addition to the sequential IO from repair? 

The Merkle tree will only be persisted after the entire range scan is complete.


* Is the intention of persisting the trees to disk to recover from failure, or just to limit memory usage?

This is primarily to limit memory usage. As you may have noticed, MV repair needs to coordinate across the entire cluster rather than just a few nodes. This process may take a very long time, and a node may restart or perform other operations while it is running.


Have you calculated the Merkle tree space requirements?
This is a very good question—I'll add it to the CEP as well. Each leaf node stores a 32-byte hash. With a tree depth of 15 (which is on the higher end—smaller datasets might use fewer than 10 levels), a single Merkle tree would be approximately 32 × 2¹⁵ bytes, or 1 MB. If we split the tokens into 10 ranges per node, we’ll end up with around 100 Merkle trees per node, totaling roughly 100 MB.
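As a quick check of that arithmetic (the 100-tree count follows from the 10 x 10 base-range by MV-range grid described later in this reply):

```python
depth = 15       # tree depth from the estimate above
leaf_bytes = 32  # bytes per leaf hash
r = 10           # subranges per node

tree_bytes = leaf_bytes * 2**depth
assert tree_bytes == 2**20                # 1 MiB per tree
assert r * r == 100                       # trees per node (r x r grid)
assert r * r * tree_bytes == 100 * 2**20  # roughly 100 MB per node in total
```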
* When do we build the Merkle trees for the view?  Is that happening in parallel with the base table?  Do we have the computational complexity of 2 full cluster repairs running simultaneously, or does it take twice as long?

As mentioned earlier, this can be done in parallel with the base table or after building the base table’s Merkle tree, since we’re using a snapshot to “freeze” the data.

> I'm very curious to hear if anyone has run a full cluster repair recently on a non-trivial dataset.  Every cluster I work with only does subrange repair.  I can't even recall the last time I did a full repair on a large cluster.  I may never have, now that I think about it.  Every time I've done this in the past it's been plagued with issues, both in terms of performance and reliability.  Subrange repair works because it can make progress in 15-30 minute increments. 

When we run a full repair, we trigger subrange repair on one node, then proceed to the next subrange, and continue this way until the node's entire primary range is repaired. After that, we move on to the next node—correct? The complexity comparison between full repair and the proposed MV repair is meant to compare the cost of repairing the entire dataset, not just a subrange.

For the example you mentioned, let me explain how it works using the schema—without needing to create an index to build the Merkle trees.

Suppose we have a node that owns the token range 1–30, and we have a few records in the base table and its corresponding MV:

  • Base table: (1, 1), (2, 11), (12, 1), (23, 1)

  • MV: (1, 1), (1, 12), (1, 23), (2, 11)

When we run a full repair, we divide the node’s range into subranges of size 10, giving r = 3 ranges in total.

First, we repair the range (1–10). The records (1, 1) and (2, 11) fall into this range and are used to build the first Merkle tree, which is then compared with the corresponding tree from another replica.

Next, we repair the range (11–20). Here, the record (12, 1) is used to build the second Merkle tree.

Finally, we repair the range (21–30), using the record (23, 1) to build the third Merkle tree, which is again compared with a replica's version.

In MV repair, we still use a subrange size of 10. The key difference is that each Merkle tree is responsible for data not just based on the base table's partition key, but also on the MV's partition key.

For example, when scanning the base table over the range (1–10):

  • In full repair, we generate one Merkle tree for that subrange.

  • In MV repair, we generate r = 3 Merkle trees, one for each MV partition key range.


This means the record (1, 1) will go into the first tree because the MV partition key is 1, while (2, 11) will go into the second tree because its MV key is 11. The third tree will be empty because there is no record with a base table key in (1–10) and an MV key in (21–30).

After scanning the base table range (1–10), we proceed to the next range, (11–20), and again generate 3 Merkle trees, followed by the last range. This is why the total number of Merkle trees is r × r: in this case, 9 trees need to be built for the entire table.

A similar idea applies when scanning the MV to build Merkle trees. Essentially, for MV repair, each Merkle tree represents two-dimensional data, unlike normal repair where it only represents one dimension. Each Merkle tree represents the data that maps to Range(x) in the base table and Range(y) in the MV.
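The two-dimensional bucketing in this example can be sketched as follows (illustrative Python; the keys and range size are taken from the example above):

```python
def cell(base_key, mv_key, range_size=10):
    # Which (base subrange, MV subrange) tree a row belongs to,
    # for ranges 1-10, 11-20, 21-30.
    return (base_key - 1) // range_size, (mv_key - 1) // range_size

base_rows = [(1, 1), (2, 11), (12, 1), (23, 1)]  # (base key, MV key)

grid = {}
for b, m in base_rows:
    grid.setdefault(cell(b, m), []).append((b, m))

assert cell(1, 1) == (0, 0)   # first base subrange, first MV subrange
assert cell(2, 11) == (0, 1)  # first base subrange, second MV subrange
assert (0, 2) not in grid     # the empty third tree for base range 1-10
```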


In full repair, tokens must be sorted when adding to the Merkle tree because the tree is built from the leaves—records are added sequentially from left to right.
For MV repair, since the leaf nodes are sorted by the MV partition key, a base table row can be inserted into any leaf node. This means we must insert each hash starting from the root instead of directly at the leaf.
As noted in the comparison table, this increases complexity:


  • In full repair, Merkle tree building is O(1) per row—each hash is added sequentially to the leaf nodes.

  • In MV repair, each hash must be inserted from the root, making it O(d) per row.


Since d (the tree depth) is typically small—less than 20 and often smaller than in full repair—this added complexity isn’t a major concern in practice. The reason it is smaller than full repair is that, with the above example, we use 3 trees to represent the same amount of data while full repair uses 1 tree.




Note that within each leaf node, the order in which hashes are added doesn’t matter. Cassandra repair currently enforces sorted input only to ensure that leaf nodes are built from left to right.
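A sketch of the root-down insertion (illustrative Python; the heap-array tree layout and token range are invented for the example). Each row hash descends d levels from the root to find its leaf, and leaves combine hashes with XOR, so arrival order doesn't matter:

```python
def insert(tree, token, row_hash, depth, lo=0, hi=1 << 32):
    """Route a row hash from the root to its leaf: O(depth) per row."""
    node = 0  # root of a binary tree stored in heap-array layout
    for _ in range(depth):
        mid = (lo + hi) // 2
        if token < mid:
            node, hi = 2 * node + 1, mid  # descend left
        else:
            node, lo = 2 * node + 2, mid  # descend right
    tree[node] ^= row_hash  # commutative combine at the leaf

depth = 3
tree = [0] * (2 ** (depth + 1) - 1)
insert(tree, 5, 0xAB, depth)
insert(tree, 5, 0xAB, depth)  # inserting the same hash twice cancels out
assert all(x == 0 for x in tree)
```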

> So let's say we find a mismatch in a hash.  That indicates that there's some range of data where we have an issue.  For some token range calculated from the v1 field, we have a mismatch, right?  What do we do with that information? 

With the above example being said, when we identify a range mismatch, it means we’ve found that data within the base table primary key range (a–b) and MV primary key range (m–n) has inconsistencies. We only need to rebuild this specific data.
This allows us to easily locate the base table node that owns range (a–b) and rebuild only the affected MV partition key range (m–n).

* Will there be coordination between all nodes in the cluster to ensure you don't have to do multiple scans?


Yes, coordination is important for this type of repair. With the proposed solution, we can detect mismatches between the base table and the MV by scanning data from each of them just once.
However, this doesn't mean all nodes need to be healthy during the repair. You can think of all the Merkle trees as forming a 2D matrix—if one node is down, it corresponds to one row and one column being unavailable for comparison. The remaining cells can still be used for mismatch detection.
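That fault-tolerance claim can be sketched as follows (toy Python, assuming for simplicity that one grid row and one column correspond to the down node):

```python
r = 3     # subranges per dimension, as in the earlier example
down = 1  # index of the row/column owned by the unavailable node

# Cells still comparable when node `down` is unreachable:
comparable = [(i, j) for i in range(r) for j in range(r)
              if i != down and j != down]

assert len(comparable) == (r - 1) ** 2  # 4 of the 9 cells remain usable
assert (1, 1) not in comparable
```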



Please don’t hesitate to let me know if anything is unclear or if you have any further questions or concerns—I’d be happy to discuss them.


Thanks,

Runtian




On Thu, May 15, 2025 at 6:34 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
One last thing.  I'm pretty sure building the tree requires the keys be added in token order: https://github.com/apache/cassandra/blob/08946652434edbce38a6395e71d4068898ea13fa/src/java/org/apache/cassandra/repair/Validator.java#L173

That definitely introduces a bit of a problem, given that the tree would be constructed from the transformed v1, a value unpredictable enough to be considered random.

The only way I can think of to address this would be to maintain a local index on v1.  See my previous email where I mentioned this.

Base Table -> Local Index -> Global Index

Still a really hard problem.
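For what it's worth, the local-index idea can be sketched like this (entirely hypothetical; `token` here is a stand-in for the partitioner): keeping a structure sorted by token(v1) lets tree construction consume rows left to right, even though v1 itself is effectively random relative to the base table's ordering.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a local index on v1: the base replica maintains a
// structure sorted by token(v1), so Merkle tree construction for the MV
// dimension can be fed in token order (left-to-right leaf building).
public class LocalIndexSketch {
    // Stand-in for token(v1); Cassandra would use the partitioner (e.g. Murmur3).
    static long token(int v1) {
        return Long.rotateLeft(0x9E3779B97F4A7C15L * v1, 31);
    }

    public static void main(String[] args) {
        // Base rows as (id, v1) pairs, in id order, not token(v1) order.
        int[][] baseRows = {{1, 42}, {2, 7}, {3, 42}, {4, 19}};

        // Local index: token(v1) -> ids, kept sorted as rows are written.
        TreeMap<Long, List<Integer>> index = new TreeMap<>();
        for (int[] row : baseRows)
            index.computeIfAbsent(token(row[1]), k -> new ArrayList<>()).add(row[0]);

        // A validator could now be fed strictly in token order.
        for (Map.Entry<Long, List<Integer>> e : index.entrySet())
            System.out.println(Long.toHexString(e.getKey()) + " -> ids " + e.getValue());
    }
}
```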

Jon



On Thu, May 15, 2025 at 6:12 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
There's a lot here that's still confusing to me.  Maybe you can help me understand it better?  Apologies in advance for the text wall :)

I'll use this schema as an example:

---------
CREATE TABLE test.t1 (
    id int PRIMARY KEY,
    v1 int
);

create MATERIALIZED VIEW  test_mv as 
SELECT v1, id from test.t1 where id is not null and v1 is not null primary key (v1, id);
---------

We've got (id, v1) in the base table and (v1, id) in the MV. 

During the repair, we snapshot, and construct a whole bunch of merkle trees.  CEP-48 says they will be persisted to disk. 

* Do you intend on building all the Merkle trees in parallel?
* Will there be hundreds of files doing random IO to persist the trees to disk, in addition to the sequential IO from repair? 
* Is the intention of persisting the trees to disk to recover from failure, or just to limit memory usage?
* Have you calculated the Merkle tree space requirements?
* When do we build the Merkle trees for the view?  Is that happening in parallel with the base table?  Do we have the computational complexity of 2 full cluster repairs running simultaneously, or does it take twice as long?

I'm very curious to hear if anyone has run a full cluster repair recently on a non-trivial dataset.  Every cluster I work with only does subrange repair.  I can't even recall the last time I did a full repair on a large cluster.  I may never have, now that I think about it.  Every time I've done this in the past it's been plagued with issues, both in terms of performance and reliability.  Subrange repair works because it can make progress in 15-30 minute increments. 

Anyways - moving on...

You suggest we read the base table and construct the Merkle trees based on the transformed rows. Using my schema above, we take the v1 field and use token(v1) to build the tree.  Assuming that a value for v1 appears many times throughout the dataset across many partitions, how do you intend on calculating its hash?  If you look at Validator.rowHash [1] and Validator.add, you'll see it's taking an UnfilteredRowIterator for an entire partition and calculates the hash based on that.  Here's the comment:

    /**
     * Called (in order) for every row present in the CF.
     * Hashes the row, and adds it to the tree being built.
     *
     * @param partition Partition to add hash
     */
    public void add(UnfilteredRowIterator partition)

So it seems to me like you need to have the entire partition materialized in memory before adding to the tree. Doing that per value of v1 without an index is pretty much impossible: we'd have to scan the entire dataset once per partition to pull out all the matching v1 values, or you'd need to materialize the entire dataset into a local version of the MV for that range. I don't know how you could do this.  Do you have a workaround for this planned?  Maybe someone who knows the Merkle tree code better can chime in.

Maybe there's something else here I'm not aware of - please let me know what I'm missing here if I am, it would be great to see this in the doc if you have a solution. 
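To illustrate the constraint (a hypothetical stand-in, not the actual Validator code): the hash added to the tree covers every row of a partition, so on the MV side all ids sharing a given v1 would have to be gathered together before the hash can be computed at all.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the partition-level hashing that Validator.add
// implies: one digest over all rows of a partition. For the MV dimension,
// the "partition" is every (v1, id) pair sharing a v1 value.
public class PartitionHash {
    static byte[] hashPartition(int v1, List<Integer> ids) {
        try {
            MessageDigest d = MessageDigest.getInstance("SHA-256");
            d.update(Integer.toString(v1).getBytes(StandardCharsets.UTF_8));
            for (int id : ids)  // every row of the would-be MV partition, in clustering order
                d.update(Integer.toString(id).getBytes(StandardCharsets.UTF_8));
            return d.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args) {
        // Gathering all ids for v1 = 42 is the hard part: without an index it
        // means scanning the entire base table. Here we just assume we have them.
        byte[] h1 = hashPartition(42, List.of(1, 3));
        byte[] h2 = hashPartition(42, List.of(1, 3));
        System.out.println(Arrays.equals(h1, h2));  // true: same rows, same hash
    }
}
```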

For the sake of discussion, let's assume we've moved past this and we have our trees for hundreds of ranges built from the base table and from the MV, and now we move on to the comparison.

In the doc at this point, we delete the snapshot because we have the tree structures and we compare Merkle trees.  Then we stream mismatched data.

So let's say we find a mismatch in a hash.  That indicates that there's some range of data where we have an issue.  For some token range calculated from the v1 field, we have a mismatch, right?  What do we do with that information? 

* Do we tell the node that owned the base table - hey, stream the data from base where token(v1) is in range [X,Y) to me?  
* That means we have to scan through the base again for all rows where token(v1) in [X,Y) range, right?  Because without an index on the hashes of v1, we're doing a full table scan and hashing every v1 value to find out if it needs to be streamed back to the MV.
* Are we doing this concurrently on all nodes? 
* Will there be coordination between all nodes in the cluster to ensure you don't have to do multiple scans?

I realize there are a lot of questions here, but unfortunately I'm having a hard time seeing how we can work around some of the core assumptions about constructing Merkle trees and using them to resolve the differences in a way that matches up with what's in the doc.  I have quite a few more things to discuss, but I'll save them for a follow up once all these have been sorted out.

Thanks in advance!
Jon




On Thu, May 15, 2025 at 10:10 AM Runtian Liu <curly...@gmail.com> wrote:

The previous table compared the complexity of full repair and MV repair when reconciling one dataset with another. In production, we typically use a replication factor of 3 in one datacenter. This means full repair involves 3n rows, while MV repair involves comparing 6n rows (base + MV). Below is an updated comparison table reflecting this scenario.

n: number of rows to repair (Total rows in the table)

d: depth of one Merkle tree for MV repair

r: number of split ranges

p: data compacted away


This comparison focuses on the complexities of one round of full repair with a replication factor of 3 versus repairing a single MV based on one base table with replication factor 3.


                                     | Full Repair | MV Repair
Extra disk used                      | 0           | O(2*p)
Data scan complexity                 | O(3*n)      | O(6*n)
Merkle tree building time complexity | O(3*n)      | O(6*n*d)
Total Merkle tree count              | O(3*r)      | O(6*r^2)
Merkle tree comparison complexity    | O(3*n)      | O(3*n)
Stream time complexity               |             |

Comments:

Extra disk used: since we take a snapshot at the beginning of the repair, any disk space that would normally be freed by compaction remains occupied until the Merkle trees are successfully built and the snapshot is cleared.

Data scan complexity: full repair scans n rows from the primary and 2n from replicas; MV repair scans 3n rows from the base table and 3n from the MV.

Merkle tree building time: in full repair, Merkle tree building is O(1) per row, as each hash is added sequentially to the leaf nodes. In MV repair, each hash is inserted from the root, making it O(d) per row. Since d is typically small (less than 20, and often smaller than in full repair), this isn't a major concern.

Total Merkle tree count: MV repair needs to generate more, smaller Merkle trees, but this isn't a concern as they can be persisted to disk during the repair process.

Merkle tree comparison: assuming one row maps to one leaf node, both repairs are equivalent.

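A quick back-of-the-envelope check of the figures above, using illustrative values (n = 10^9 rows, d = 20, r = 100; these numbers are assumptions for the example, not from the thread):

```java
// Illustrative arithmetic for the complexity comparison: tree counts and
// hash-insertion counts for full repair vs MV repair, with assumed n, d, r.
public class ComplexityCheck {
    static long fullTrees(long r)          { return 3 * r; }      // O(3r) trees
    static long mvTrees(long r)            { return 6 * r * r; }  // O(6r^2) trees
    static long fullBuildOps(long n)       { return 3 * n; }      // O(3n): O(1) per row
    static long mvBuildOps(long n, long d) { return 6 * n * d; }  // O(6nd): O(d) per row

    public static void main(String[] args) {
        long n = 1_000_000_000L, d = 20, r = 100;
        System.out.println(fullTrees(r) + " vs " + mvTrees(r) + " trees");
        System.out.println(fullBuildOps(n) + " vs " + mvBuildOps(n, d) + " hash insertions");
    }
}
```

With these assumed values, MV repair builds 60,000 small trees where full repair builds 300, and performs 40x more hash work per repair cycle (6nd vs 3n with d = 20).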
