I’ve added a new section on isolation and consistency <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-48%3A+First-Class+Materialized+View+Support#CEP48:FirstClassMaterializedViewSupport-IsolationandConsistency>. In our current design, materialized-view tables stay eventually consistent, while the base table offers linearizability. Here, “strict consistency” refers to linearizable base-table updates, with every successful write ensuring that the corresponding MV change is applied and visible.
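To make that guarantee concrete from an application's point of view, here is a minimal sketch using the Python driver. The keyspace, table, and view names (ks, base, mv) and the contact point are purely illustrative, and the behavior shown assumes the strict-consistency mode described in that section rather than today's eventually consistent MV writes:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Illustrative names: keyspace "ks", base table base(pk, ck, v) and its
# MV "mv" keyed by (ck, pk), matching the example further down this thread.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("ks")

# The application, not the server, picks the consistency level for the
# base-table write.
session.execute(SimpleStatement(
    "UPDATE base SET v = 42 WHERE pk = 0 AND ck = 50",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM))

# Under the strict-consistency mode described above, a successful base write
# implies the matching MV change has been applied and is visible, so this
# read is expected to see it; with plain eventual consistency it could lag.
row = session.execute(SimpleStatement(
    "SELECT v FROM mv WHERE ck = 50 AND pk = 0",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM)).one()
assert row is not None and row.v == 42

The only point of the sketch is that the base write succeeds at the consistency level the application requested, and that success carries the MV visibility guarantee with it.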
> Why mandate `LOCAL_QUORUM` instead of using the consistency level requested by the application? If they want to use `LOCAL_QUORUM` they can always request it.

I think you meant LOCAL_SERIAL? Right, LOCAL_SERIAL should not be mandatory, and users should be able to select which consistency level to use. I have updated the page for this.

For the example David mentioned, LWT cannot support it. Since LWTs operate on a single token, we will need to restrict base-table updates to one partition (and ideally one row) at a time. Today a base-table command can delete an entire partition, but doing so might touch hundreds of MV partitions, making consistency guarantees impossible. Limiting each operation's scope lets us ensure that every successful base-table write is accurately propagated to its MV. Even with an Accord-backed MV, I think we will need to limit the number of rows that get modified in each operation.

Regarding repair: due to bugs, operator errors, or hardware faults, MVs can become out of sync with their base tables, regardless of the synchronization method chosen during writes. The purpose of MV repair is to detect and resolve these mismatches using the base table as the source of truth. As a result, if data resurrection occurs in the base table, the repair process will propagate that resurrected data to the MV.

> One is the lack of tombstones / deletions in the merkle trees, which makes properly dealing with writes/deletes/inconsistency very hard (afaict)

Tombstones are excluded because a base-table update can produce a tombstone in the MV, for example when the updated cell is part of the MV's primary key. Since such tombstones may not exist in the base table, we can only compare live data during MV repair.

> repairing a single partition in the base table may repair all hosts/ranges in the MV table

That's correct. To avoid repeatedly scanning both tables, the proposed solution is for all nodes to take a snapshot first. Then each node scans the base table once and the MV table once, generating a list of Merkle trees from each scan. These lists are then compared to identify mismatches. This means MV repair must be performed at the table level, rather than one token range at a time, to be efficient.

> If a row containing a column that creates an MV partition is deleted, and the MV isn't updated, then how does the merkle tree approach propagate the deletion to the MV? The CEP says that anti-compaction would remove extra rows, but I am not clear on how that works. When is anti-compaction performed in the repair process and what is/isn't included in the outputs?

Let me illustrate this with an example. We have the following base table and MV:

CREATE TABLE base (pk int, ck int, v int, PRIMARY KEY (pk, ck));
CREATE MATERIALIZED VIEW mv AS
  SELECT * FROM base
  WHERE pk IS NOT NULL AND ck IS NOT NULL
  PRIMARY KEY (ck, pk);

Assume there are 100 rows in the base table (e.g., (1,1), (2,2), ..., (100,100)), and accordingly, the MV also has 100 rows. Now suppose the row (55,55) is deleted from the base table but, due to some issue, it still exists in the MV. Let's say each Merkle tree covers 20 rows in both the base and MV tables, so we have a 5x5 grid: 25 Merkle tree comparisons in total. Suppose the repair job detects a mismatch in the cell covering base range (40–59) and MV range (40–59).
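To make the grid comparison concrete, here is a rough, single-process sketch; it is illustrative only, not the proposed implementation. One scan of each table buckets rows by (base range, MV range), folds per-row hashes into a digest per cell as a simple stand-in for per-range Merkle trees, and the two grids are then compared cell by cell. Keys run 0–99 here only so that the 20-row buckets line up with the 40–59 range used above.

import hashlib
from collections import defaultdict

ROWS_PER_RANGE = 20          # each Merkle tree covers 20 rows in this toy setup
EMPTY = bytes(32)            # digest of an empty cell

def bucket(key):
    # Map a key in 0..99 to one of five ranges: 0-19, 20-39, 40-59, 60-79, 80-99.
    return key // ROWS_PER_RANGE

def row_hash(pk, ck, v):
    return hashlib.sha256(f"{pk}:{ck}:{v}".encode()).digest()

def scan(rows):
    # One scan of a table yields a digest per (base range, MV range) cell.
    # XOR-folding row hashes is an order-independent stand-in for building
    # a real Merkle tree per cell.
    grid = defaultdict(lambda: EMPTY)
    for pk, ck, v in rows:
        cell = (bucket(pk), bucket(ck))        # (base range, MV range)
        grid[cell] = bytes(a ^ b for a, b in zip(grid[cell], row_hash(pk, ck, v)))
    return grid

# Toy data: base rows (0,0,0) .. (99,99,99); row (55,55) was deleted from the
# base table, but the MV still (incorrectly) contains it.
base_rows = [(i, i, i) for i in range(100) if i != 55]
mv_rows = [(i, i, i) for i in range(100)]

base_grid = scan(base_rows)   # scan of the base table
mv_grid = scan(mv_rows)       # scan of the MV table

# Compare the 5x5 grid cell by cell; only base(40-59) vs MV(40-59) differs.
for cell in sorted(set(base_grid) | set(mv_grid)):
    if base_grid.get(cell, EMPTY) != mv_grid.get(cell, EMPTY):
        b, m = cell
        print(f"mismatch: base range {b * 20}-{b * 20 + 19} "
              f"vs MV range {m * 20}-{m * 20 + 19}")

In the actual proposal the buckets would be token ranges and each cell a real Merkle tree built during the snapshot-based scans, but the shape of the comparison is the same: one scan of each table, then a grid of per-range comparisons.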
On the node that owns the MV range (40–59), anti-compaction will be triggered. If all 100 rows were in a single SSTable, it would be split into two SSTables: one containing the 20 rows in the (40–59) range, and the other containing the remaining 80 rows.

On the base table side, the node will scan the (40–59) range, identify all rows that map to the MV range (40–59) (19 rows in this example), and stream them to the MV node. Once streaming completes, the MV node can safely mark the 20-row SSTable as obsolete. In this way, the extra row in the MV is removed. The core idea is to reconstruct the MV data for base range (40–59) and MV range (40–59) using the corresponding base-table range as the source of truth.

On Fri, May 9, 2025 at 2:26 PM David Capwell <dcapw...@apple.com> wrote:

> The MV repair tool in Cassandra is intended to address inconsistencies
> that may occur in materialized views due to various factors. This component
> is the most complex and demanding part of the development effort,
> representing roughly 70% of the overall work.
>
>
> but I am very concerned about how the merkle tree comparisons are likely
> to work with wide partitions leading to massive fanout in ranges.
>
>
> As far as I can tell, being based off Accord means you don’t need to care
> about repair, as Accord will manage the consistency for you; you can’t get
> out of sync.
>
> Being based off accord also means you can deal with multiple
> partitions/tokens, where as LWT is limited to a single token. I am not
> sure how the following would work with the proposed design and LWT
>
> CREATE TABLE tbl (pk int, ck int, v int, PRIMARY KEY (pk, ck));
> CREATE MATERIALIZED VIEW tbl2
> AS SELECT * FROM tbl WHERE ck > 42 PRIMARY KEY(pk, ck)
>
> — mutations
> UPDATE tbl SET v=42 WHERE pk IN (0, 1) AND ck IN (50, 74); — this touches
> 2 partition keys
> BEGIN BATCH — also touches 2 partition keys
> INSERT INTO tbl (pk, ck, v) VALUES (0, 47, 0);
> INSERT INTO tbl (pk, ck, v) VALUES (1, 48, 0);
> END BATCH
>
>
>
> On May 9, 2025, at 1:03 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>
>
>
> On May 9, 2025, at 12:59 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>
>
> I am *big* fan of getting repair really working with MVs. It does seem
> problematic that the number of merkle trees will be equal to the number of
> ranges in the cluster and repair of MVs would become an all node
> operation. How would down nodes be handled and how many nodes would
> simultaneously working to validate a given base table range at once? How
> many base table ranges could simultaneously be repairing MVs?
>
> If a row containing a column that creates an MV partition is deleted, and
> the MV isn't updated, then how does the merkle tree approach propagate the
> deletion to the MV? The CEP says that anti-compaction would remove extra
> rows, but I am not clear on how that works. When is anti-compaction
> performed in the repair process and what is/isn't included in the outputs?
>
>
>
> I thought about these two points last night after I sent my email.
>
> There’s 2 things in this proposal that give me a lot of pause.
>
> One is the lack of tombstones / deletions in the merkle trees, which makes
> properly dealing with writes/deletes/inconsistency very hard (afaict)
>
> The second is the reality that repairing a single partition in the base
> table may repair all hosts/ranges in the MV table, and vice versa.
> Basically scanning either base or MV is effectively scanning the whole
> cluster (modulo what you can avoid in the clean/dirty repaired sets). This
> makes me really, really concerned with how it scales, and how likely it is
> to be able to schedule automatically without blowing up.
>
> The paxos vs accord comments so far are interesting in that I think both
> could be made to work, but I am very concerned about how the merkle tree
> comparisons are likely to work with wide partitions leading to massive
> fanout in ranges.