[Committer/reviewer needed] Request for Review of Cassandra PRs

2025-01-16 Thread Runtian Liu
Hi all,

I hope you are doing well. I would like to request your review of two pull
requests I've submitted earlier:

   1. *Inbound stream throttler* [CASSANDRA-11303
   <https://issues.apache.org/jira/browse/CASSANDRA-11303>]
   2. *Clear existing data when reset_bootstrap_progress is true*
   [CASSANDRA-20097 <https://issues.apache.org/jira/browse/CASSANDRA-20097>]

I'd really appreciate any thoughts you have on these changes. If you see
anything that needs tweaking, just let me know, and I'll jump right on it!

Thanks,

Runtian


Re: Welcome Jaydeepkumar Chovatia as Cassandra committer

2025-04-30 Thread Runtian Liu
congratulations!

On Wed, Apr 30, 2025 at 9:09 AM Francisco Guerrero 
wrote:

> Congrats, Jaydeep!
>
> On 2025/04/30 11:43:00 Josh McKenzie wrote:
> > Hey there Cassandra Devs!
> >
> > The Apache Cassandra PMC is very happy to announce that Jaydeep Chovatia
> has
> > accepted the invitation to become a committer!
> >
> > Jaydeep has been busy on Cassandra for a good while now, most recently
> spearheading the contribution of automated repair scheduling via CEP-37 <
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+Apache+Cassandra+Unified+Repair+Solution>
> and CASSANDRA-19918.
> >
> > Please join us in congratulating and welcoming Jaydeep.
> >
> > Josh McKenzie on behalf of the Apache Cassandra PMC
>


Re: CEP permission

2025-04-29 Thread Runtian Liu
Awesome, thank you so much!

On Tue, Apr 29, 2025 at 9:14 AM C. Scott Andreas 
wrote:

> Hi Runtian, thanks for reaching out.
>
> Your username should now have permission to create new wiki pages for CEPs
> in ASF Confluence.
>
> Cheers,
>
> – Scott
>
> On Apr 29, 2025, at 9:04 AM, Runtian Liu  wrote:
>
>
> Hi,
>
> Could you please grant me permission to create a new CEP? My ID is
> curlylrt. Thank you!
> Thanks,
> Runtian
>
>
>
>


CEP permission

2025-04-29 Thread Runtian Liu
Hi,

Could you please grant me permission to create a new CEP? My ID is
curlylrt. Thank you!

Thanks,
Runtian


Re: [Committer/reviewer needed] Request for Review of Cassandra PRs

2025-04-21 Thread Runtian Liu
Hi,
A gentle reminder on the following two changes:

   1. *Inbound stream throttler* [CASSANDRA-11303
   <https://issues.apache.org/jira/browse/CASSANDRA-11303>]
   2. *Clear existing data when reset_bootstrap_progress is true*
   [CASSANDRA-20097 <https://issues.apache.org/jira/browse/CASSANDRA-20097>]

I could use some help reviewing the inbound stream throttler patch. This
change is particularly useful for deployments using vnodes, where node
replacements often involve streaming from multiple sources.
The second change has already received a +1. Could we please get another
committer to review it?

Thanks,
Runtian


On Thu, Jan 16, 2025 at 10:14 PM guo Maxwell  wrote:

> I have added you as a reviewer of CASSANDRA-11303, you can do that for
> CASSANDRA-20097 <https://issues.apache.org/jira/browse/CASSANDRA-20097>
> by yourself.
>
> guo Maxwell wrote on Fri, Jan 17, 2025 at 14:10:
>
>> Hi Wang,
>> I think Runtian should change the ticket's status, submit a PR, and then
>> you can add yourself as a reviewer.
>>
>>
>> Cheng Wang via dev wrote on Fri, Jan 17, 2025 at 14:05:
>>
>>> I can review CASSANDRA-19248 first since it looks straightforward.
>>> For CASSANDRA-11303 and CASSANDRA-20097
>>> <https://issues.apache.org/jira/browse/CASSANDRA-20097> I will try but
>>> can't promise anything yet. How do I add myself as a reviewer?
>>>
>>> On Thu, Jan 16, 2025 at 7:33 PM Paulo Motta  wrote:
>>>
>>>> Thanks for this message Runtian.
>>>>
>>>> I realized I was previously assigned as reviewer for CASSANDRA−11303
>>>> but I will not be able to review it at this time.
>>>>
>>>> There's also CASSANDRA-19248 that is waiting for review in case someone
>>>> has cycles to review it.
>>>>
>>>> Reviewing community patches is a great way to learn more about the
>>>> codebase and get involved with the project, even if you are not a
>>>> committer. Any contributor can help by doing a first-pass review to triage
>>>> the issue, run tests and check suggested items of the review checklist [1].
>>>>
>>>> Paulo
>>>>
>>>> [1] - https://cassandra.apache.org/_/development/how_to_review.html
>>>>
>>>> On Thu, Jan 16, 2025 at 7:34 PM Runtian Liu  wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I hope you are doing well. I would like to request your review of two
>>>>> pull requests I've submitted earlier:
>>>>>
>>>>>1. *Inbound stream throttler* [CASSANDRA-11303
>>>>>   <https://issues.apache.org/jira/browse/CASSANDRA-11303>]
>>>>>2. *Clear existing data when reset_bootstrap_progress is true*
>>>>>   [CASSANDRA-20097 <https://issues.apache.org/jira/browse/CASSANDRA-20097>]
>>>>>
>>>>> I'd really appreciate any thoughts you have on these changes. If you
>>>>> see anything that needs tweaking, just let me know, and I'll jump right on
>>>>> it!
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Runtian
>>>>>
>>>>


Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-09 Thread Runtian Liu
I’ve added a new section on isolation and consistency.
In our current design, materialized-view tables stay eventually consistent,
while the base table offers linearizability. Here, “strict consistency”
refers to linearizable base-table updates, with every successful write
ensuring that the corresponding MV change is applied and visible.

>Why mandate `LOCAL_QUORUM` instead of using the consistency level
requested by the application? If they want to use `LOCAL_QUORUM` they can
always request it.

I think you meant LOCAL_SERIAL? Right, LOCAL_SERIAL should not be mandatory
and users should be able to select which consistency to use. Updated the
page for this one.
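
For illustration only, here is how an application could choose both the
regular and the serial consistency level per statement using the DataStax
Python driver; this is a generic driver sketch, not CEP-48 code, and the
contact point, keyspace, and table are placeholders:

from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

session = Cluster(['127.0.0.1']).connect('ks')   # placeholder contact point / keyspace

# LWT-style conditional write: the application selects the serial consistency
# level itself instead of having LOCAL_SERIAL mandated for it.
stmt = SimpleStatement(
    "UPDATE base SET v = 2 WHERE pk = 1 AND ck = 1 IF v = 1",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
    serial_consistency_level=ConsistencyLevel.LOCAL_SERIAL)  # or SERIAL
session.execute(stmt)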

For the example David mentioned, LWT cannot support it. Since LWTs operate
on a single token, we’ll need to restrict base-table updates to one
partition—and ideally one row—at a time. A current MV base-table command
can delete an entire partition, but doing so might touch hundreds of MV
partitions, making consistency guarantees impossible. Limiting each
operation’s scope lets us ensure that every successful base-table write is
accurately propagated to its MV. Even with Accord-backed MVs, I think we
will need to limit the number of rows that get modified each time.

Regarding repair, due to bugs, operator errors, or hardware faults, MVs can
become out of sync with their base tables—regardless of the chosen
synchronization method during writes. The purpose of MV repair is to detect
and resolve these mismatches using the base table as the source of truth.
As a result, if data resurrection occurs in the base table, the repair
process will propagate that resurrected data to the MV.

>One is the lack of tombstones / deletions in the merkle trees, which makes
properly dealing with writes/deletes/inconsistency very hard (afaict)

Tombstones are excluded because a base table update can produce a tombstone
in the MV—for example, when the updated cell is part of the MV's primary
key. Since such tombstones may not exist in the base table, we can only
compare live data during MV repair.
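
As a toy illustration (not Cassandra's actual write path; the function and
tuple encoding below are made up), this is what the view-update logic
conceptually emits when a base column that is part of the MV primary key
changes: a view tombstone that has no counterpart in the base table, which
is why repair only compares live data.

# Toy model of the view updates generated by changing a view-key column.
def view_mutations(pk, old_v, new_v, ts):
    muts = []
    if old_v is not None and old_v != new_v:
        # tombstone for the old view row (v, pk) -- exists only in the MV
        muts.append(('DELETE', (old_v, pk), ts))
    if new_v is not None:
        muts.append(('INSERT', (new_v, pk), ts))
    return muts

print(view_mutations(pk=1, old_v='a', new_v='b', ts=100))
# [('DELETE', ('a', 1), 100), ('INSERT', ('b', 1), 100)]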

> repairing a single partition in the base table may repair all
hosts/ranges in the MV table,

That’s correct. To avoid repeatedly scanning both tables, the proposed
solution is for all nodes to take a snapshot first. Then, each node scans
the base table once and the MV table once, generating a list of Merkle
trees from each scan. These lists are then compared to identify mismatches.
This means MV repair must be performed at the table level rather than one
token range at a time to be efficient.
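
A much-simplified sketch of that flow, with made-up helpers: the real
proposal builds proper Merkle trees per grid cell, while here each cell is
reduced to one XOR-accumulated hash and rng() stands in for mapping a token
to one of the split ranges. One scan of each table fills a whole side of
the grid, and only mismatching cells need further work (compare with the
5x5 example below).

import hashlib
from collections import defaultdict

NUM_RANGES = 5                                   # 5x5 grid, as in the example below

def rng(key):
    return key % NUM_RANGES                      # stand-in for token -> split range

def row_hash(pk, ck, v):
    return int(hashlib.md5(f"{pk}:{ck}:{v}".encode()).hexdigest(), 16)

def summarize(rows):
    # one sequential scan produces a summary for every (base range, MV range) cell
    grid = defaultdict(int)
    for pk, ck, v in rows:
        grid[(rng(pk), rng(v))] ^= row_hash(pk, ck, v)
    return grid

base_rows = [(i, i, i) for i in range(100)]
mv_rows = list(base_rows)
base_rows.remove((55, 55, 55))                   # (55,55) deleted from base; MV missed it

base_grid, mv_grid = summarize(base_rows), summarize(mv_rows)
cells = set(base_grid) | set(mv_grid)
print([c for c in cells if base_grid.get(c, 0) != mv_grid.get(c, 0)])
# -> one mismatching cell: only that slice of both tables needs detailed comparison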

>If a row containing a column that creates an MV partition is deleted, and
the MV isn't updated, then how does the merkle tree approach propagate the
deletion to the MV? The CEP says that anti-compaction would remove extra
rows, but I am not clear on how that works. When is anti-compaction
performed in the repair process and what is/isn't included in the outputs?

Let me illustrate this with an example:

We have the following base table and MV:

CREATE TABLE base (pk int, ck int, v int, PRIMARY KEY (pk, ck));
CREATE MATERIALIZED VIEW mv AS SELECT * FROM base
  WHERE pk IS NOT NULL AND ck IS NOT NULL PRIMARY KEY (ck, pk);

Assume there are 100 rows in the base table (e.g., (1,1), (2,2), ...,
(100,100)), and accordingly, the MV also has 100 rows. Now, suppose the row
(55,55) is deleted from the base table, but due to some issue, it still
exists in the MV.

Let's say each Merkle tree covers 20 rows in both the base and MV tables,
so we have a 5x5 grid—25 Merkle tree comparisons in total. Suppose the
repair job detects a mismatch in the range base(40–59) vs MV(40–59).

On the node that owns the MV range (40–59), anti-compaction will be
triggered. If all 100 rows were in a single SSTable, it would be split into
two SSTables: one containing the 20 rows in the (40–59) range, and the
other containing the remaining 80 rows.

On the base table side, the node will scan the (40–59) range, identify all
rows that map to the MV range (40–59)—which in this example would be 19
rows—and stream them to the MV node. Once streaming completes, the MV node
can safely mark the 20-row SSTable as obsolete. In this way, the extra row
in MV is removed.

The core idea is to reconstruct the MV data for base range (40–59) and MV
range (40–59) using the corresponding base table range as the source of
truth.
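
Continuing the example, here is a rough sketch of how the base-side node
could pick the rows to re-stream for the mismatching cell; the range check
and helper names are made up and stand in for real token-range logic.

# Base table rows as (pk, v); (55, 55) has been deleted from the base.
base_rows = [(k, k) for k in range(1, 101)]
base_rows.remove((55, 55))

def in_range(key, lo, hi):
    return lo <= key <= hi                      # stand-in for a token-range check

# Rows the base node streams for the mismatching cell base(40-59) x MV(40-59);
# the MV node then drops its anti-compacted 20-row SSTable for that range.
to_stream = [(v, pk) for (pk, v) in base_rows
             if in_range(pk, 40, 59) and in_range(v, 40, 59)]
print(len(to_stream))                           # 19 -- the extra MV row (55,55) is not rebuilt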


On Fri, May 9, 2025 at 2:26 PM David Capwell  wrote:

> The MV repair tool in Cassandra is intended to address inconsistencies
> that may occur in materialized views due to various factors. This component
> is the most complex and demanding part of the development effort,
> representing roughly 70% of the overall work.
>
>
> but I am very concerned about how the merkle tree comparisons are likely
> to work with wide partitions leading to massive fanout in ranges.
>
>
> As far as I can tell, bein

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-08 Thread Runtian Liu
Here’s my perspective:

#1 Accord vs. LWT round trips

Based on the insights shared by the Accord experts, it appears that
implementing MV using Accord can achieve a comparable number of round trips
as the LWT solution proposed in CEP-48. Additionally, it seems that the
number of WAN RTTs might be fewer than the LWT solution through Accord.
This suggests that Accord is either equivalent or better in terms of
performance for CEP-48.

Given this, it seems appropriate to set aside performance as a deciding
factor when evaluating LWT versus Accord. I've also updated the CEP-48 page
to reflect this clarification.

#2 Accord vs. LWT current state

Accord

Accord is poised to significantly reshape Apache Cassandra's future and
stands out as one of the most impactful developments on the horizon. The
community is genuinely excited about its potential.

That said, the recent mailing list update on
Accord (CEP-15) highlights that substantial work remains to mature the
protocol entirely. In addition, real-world testing is still needed to
validate its readiness. Beyond that, users will require additional time to
evaluate and adopt Cassandra 6.x in their environments.

LWT

On the other hand, LWT is proven and has been running in production at
scale for many years.

#3 Dev work for CEP-48

The CEP-48 design has two major components.

   1. Online path (CQL Mutations)

This section focuses on the LWT code path where any mutation to a base
table (via CQL insert, update, or delete) reliably triggers the
corresponding materialized view (MV) update. The development effort
required for this part is relatively limited, accounting for approximately
30% of the total work.

If we need to implement this on Accord, the effort would be similar to the
LWT path.

   2. Offline path (MV Data Repair)

The MV repair tool in Cassandra is intended to address inconsistencies that
may occur in materialized views due to various factors. This component is
the most complex and demanding part of the development effort, representing
roughly 70% of the overall work.

#4 Accord is mentioned as a Future Alternative in CEP-48

Accord has always been top of mind, and we genuinely appreciate the thought
and effort that has gone into its design and implementation. We’re
excited about the changes, and if you look at the CEP-48 proposal, Accord
is listed as a 'Future Alternative' — not as a 'Rejected Alternative' — to
make clear that we continue to see value in its approach and are not
opposed to it.


Based on #1, #2, #3, and #4, here is my thinking:

*Scenario#1*: CEP-15 prod takes longer than CEP-48 merge

Since we're starting with LWT, there is no dependency on the progress of
CEP-15. This means the community can benefit from CEP-48 independently of
CEP-15's timeline. Additionally, it's possible to backport the changes from
trunk to the current broadly adopted Cassandra release (4.1.x), enabling
adoption before upgrading to 6.x.

*Scenario#2*: CEP-15 prod qualified before CEP-48 merge

As noted in #3, developing on top of Accord is a relatively small part of
the overall CEP-48 effort. Therefore, we can implement it using Accord before
merging CEP-48 into trunk, allowing us to forgo the LWT-based approach.

Given that the work required to support Accord is relatively limited and
that it would eliminate a dependency on a feature that is still maturing,
proceeding with LWT is the most reliable path forward. Please feel free to
share your thoughts.


On Thu, May 8, 2025 at 9:00 AM Jon Haddad  wrote:

> Based on David and Blake’s responses, it sounds like we don’t need to
> block on anything.
>
> I realize you may be making a broader point, but in this instance it
> sounds like there’s nothing here preventing an accord based MV
> implementation. Now that i understand more about how it would be done, it
> also sounds a lot simpler.
>
>
>
>
> On Thu, May 8, 2025 at 8:50 AM Josh McKenzie  wrote:
>
>> IMHO, focus should be on accord-based MVs.  Even if that means it's
>> blocked on first adding support for multiple conditions.
>>
>> Strongly disagree here. We should develop features to be as loosely
>> coupled w/one another as possible w/an eye towards future compatibility and
>> leverage but not block development of one functionality on something else
>> unless absolutely required for the feature to work (I'm defining "work"
>> here as "hits user requirements with affordances consistent w/the rest of
>> our ecosystem").
>>
>> With the logic of deferring to another feature, it would have been quite
>> reasonable for someone to make this same statement back in fall of '23 when
>> we were discussing delaying 5.0 for Accord's merge. But things come up, the
>> space we're in is complex, and cutting edge distributed things are Hard.
>>
>>
>> On Thu, May 8, 2025, at 11:13 AM, Mick Semb Wever wrote:
>>
>>
>>
>>
>> Curious what others think though.  I'm +1 on the spirit of getting MVs to
>>

[DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-06 Thread Runtian Liu
Hi everyone,

We’d like to propose a new Cassandra Enhancement Proposal: CEP-48:
First-Class Materialized View Support
<https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-48%3A+First-Class+Materialized+View+Support>.

This CEP focuses on addressing the long-standing consistency issues in the
current Materialized View (MV) implementation by introducing a new
architecture that keeps base tables and MVs reliably in sync. It also adds
a new validation and repair type to Cassandra’s repair process to support
MV repair based on the base table. The goal is to make MV a first-class,
production-ready feature that users can depend on—without relying on
external reconciliation tools or custom workarounds.

We’d really appreciate your feedback—please keep the discussion on this
mailing list thread.

Thanks,
Runtian


Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-06 Thread Runtian Liu
Thanks for the questions. A few clarifications:

   - *Performance impact & opt-in model:* The new MV synchronization
   mechanism is fully opt-in. We understand that LWT-backed writes may
   introduce performance overhead, so users who prefer higher throughput over
   strict consistency can continue using the existing MV implementation. The
   new strict consistency mode can be toggled via a table-level option.
   - *Support for both implementations:* Even if this CEP is accepted, the
   current MV behavior will remain available. Users will have the flexibility
   to enable or disable the new mode as needed.
   - *Repair frequency:* MV inconsistency detection and repair is integrated
   with Cassandra’s existing repair framework. It can be triggered manually
   via nodetool or scheduled using the auto-repair infrastructure (per
   CEP-37), allowing operators to control how frequently repairs run.


On Tue, May 6, 2025 at 7:09 PM guo Maxwell  wrote:

> If the entire write operation involves additional LWTs to change the MV,
> it is uncertain whether users can accept the performance loss of such write
> operations.
>
> If this CEP is finally accepted, I think users should at least be given
> the choice of whether to use the old method or the new method, because
> after all, some users pursue performance rather than strict data
> consistency(we can provide the ability of disabling or enabling the new mv
> mv synchronization mechanism).
>
> Another question : What  is the frequency of inconsistency detection and
> repair  for mv and base table ?
>
> Runtian Liu wrote on Wed, May 7, 2025 at 06:51:
>
>> Hi everyone,
>>
>> We’d like to propose a new Cassandra Enhancement Proposal: CEP-48:
>> First-Class Materialized View Support
>> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-48%3A+First-Class+Materialized+View+Support>
>> .
>>
>> This CEP focuses on addressing the long-standing consistency issues in
>> the current Materialized View (MV) implementation by introducing a new
>> architecture that keeps base tables and MVs reliably in sync. It also adds
>> a new validation and repair type to Cassandra’s repair process to support
>> MV repair based on the base table. The goal is to make MV a first-class,
>> production-ready feature that users can depend on—without relying on
>> external reconciliation tools or custom workarounds.
>>
>> We’d really appreciate your feedback—please keep the discussion on this
>> mailing list thread.
>>
>> Thanks,
>> Runtian
>>
>


Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-13 Thread Runtian Liu
and when
>> it's done, mark the view's SSTable as repaired?  Then the repair process
>> would only need to do a MV to MV repair.
>>
>> Jon
>>
>>
>> On Mon, May 12, 2025 at 9:37 AM Benedict Elliott Smith <
>> bened...@apache.org> wrote:
>>
>>> Like something doesn't add up here because if it always includes the
>>> base table's primary key columns that means they could be storage attached
>>> by just forbidding additional columns and there doesn't seem to be much
>>> utility in including additional columns in the primary key?
>>>
>>>
>>> You can re-order the keys, and they only need to be a part of the
>>> primary key not the partition key. I think you can specify an arbitrary
>>> order to the keys also, so you can change the effective sort order. So, the
>>> basic idea is you stipulate something like PRIMARY KEY ((v1),(ck1,pk1)).
>>>
>>> This is basically a global index, with the restriction on single columns
>>> as keys only because we cannot cheaply read-before-write for eventually
>>> consistent operations. This restriction can easily be relaxed for Paxos and
>>> Accord based implementations, which can also safely include additional keys.
>>>
>>> That said, I am not at all sure why they are called materialised views
>>> if we don’t support including any other data besides the lookup column and
>>> the primary key. We should really rename them once they work, both to make
>>> some sense and to break with the historical baggage.
>>>
>>> I think this can be represented as a tombstone which can always be
>>> fetched from the base table on read or maybe some other arrangement? I
>>> agree it can't feasibly be represented as an enumeration of the deletions
>>> at least not synchronously and doing it async has its own problems.
>>>
>>>
>>> If the base table must be read on read of an index/view, then I think
>>> this proposal is approximately linearizable for the view as well (though, I
>>> do not at all warrant this statement). You still need to propagate this
>>> eventually so that the views can cleanup. This also makes reads 2RT on
>>> read, which is rather costly.
>>>
>>> On 12 May 2025, at 16:10, Ariel Weisberg  wrote:
>>>
>>> Hi,
>>>
>>> I think it's worth taking a step back and looking at the current MV
>>> restrictions which are pretty onerous.
>>>
>>> A view must have a primary key and that primary key must conform to the
>>> following restrictions:
>>>
>>>- it must contain all the primary key columns of the base table.
>>>This ensures that every row of the view correspond to exactly one row of
>>>the base table.
>>>- it can only contain a single column that is not a primary key
>>>column in the base table.
>>>
>>> At that point what exactly is the value in including anything except the
>>> original primary key in the MV's primary key columns unless you are using
>>> an ordered partitioner so you can iterate based on the leading primary key
>>> columns?
>>>
>>> Like something doesn't add up here because if it always includes the
>>> base table's primary key columns that means they could be storage attached
>>> by just forbidding additional columns and there doesn't seem to be much
>>> utility in including additional columns in the primary key?
>>>
>>> I'm not that clear on how much better it is to look something up in the
>>> MV vs just looking at the base table or some non-materialized view of it.
>>> How exactly are these MVs supposed to be used and what value do they
>>> provide?
>>>
>>> Jeff Jirsa wrote:
>>>
>>> There’s 2 things in this proposal that give me a lot of pause.
>>>
>>>
>>> Runtian Liu pointed out that the CEP is sort of divided into two parts.
>>> The first is the online part which is making reads/writes to MVs safer and
>>> more reliable using a transaction system. The second is offline which is
>>> repair.
>>>
>>> The story for the online portion I think is quite strong and worth
>>> considering on its own merits.
>>>
>>> The offline portion (repair) sounds a little less feasible to run in
>>> production, but I also think that MVs without any mechanism for checking
>>> their consistency are not viable to run in production. So it's kind of pay

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-16 Thread Runtian Liu
Unfortunately, no. When building Merkle trees for small token ranges in the
base table, the rows in those ranges may map anywhere in the MV token range.
As a result, we need to scan the entire MV to generate all the necessary
Merkle trees.
For efficiency, we perform this as a single pass over the entire table
rather than scanning a small range of the base or MV table individually. As
you mentioned, with storage becoming increasingly affordable, this approach
helps us save time and CPU resources.
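
The single pass works because of the property Blake points out below: leaf
hashes are combined with XOR, so the digest for a given range comes out the
same whether rows are visited in base-table or MV token order. A tiny
illustration of that property (not project code; the key derivation is
arbitrary):

import hashlib
from functools import reduce

rows = [(pk, pk * 7 % 13) for pk in range(1000)]      # (base key, view key) pairs
h = lambda r: int(hashlib.md5(repr(r).encode()).hexdigest(), 16)
digest = lambda rs: reduce(lambda acc, r: acc ^ h(r), rs, 0)

base_order = sorted(rows, key=lambda r: r[0])         # scanned in base token order
view_order = sorted(rows, key=lambda r: r[1])         # scanned in MV token order
print(digest(base_order) == digest(view_order))       # True: order does not matter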

On Fri, May 16, 2025 at 12:11 PM Jon Haddad  wrote:

> I spoke too soon - endless questions are not over :)
>
> Since the data that's going to be repaired only covers a range, I wonder
> if it makes sense to have the ability to issue a minimalist snapshot that
> only hardlinks SSTables that are in a token range.  Based on what you
> (Runtian) have said above, only a small percentage of the data would
> actually be repaired at any given time.
>
> Just a thought to save a little filesystem churn.
>
>
> On Fri, May 16, 2025 at 10:55 AM Jon Haddad 
> wrote:
>
>> Nevermind about the height thing i guess its the same property.
>>
>> I’m done for now :)
>>
>> Thanks for entertaining my endless questions. My biggest concerns about
>> repair have been alleviated.
>>
>> Jon
>>
>> On Fri, May 16, 2025 at 10:34 AM Jon Haddad 
>> wrote:
>>
>>> Thats the critical bit i was missing, thank you Blake.
>>>
>>> I guess we’d need to have unlimited height trees then, since you’d need
>>> to be able to update the hashes of individual partitions, and we’d also
>>> need to propagate the hashes up every time as well. I’m curious what the
>>> cost will look like with that.
>>>
>>> At least it’s a cpu problem not an I/O one.
>>>
>>> Jon
>>>
>>>
>>> On Fri, May 16, 2025 at 10:04 AM Blake Eggleston 
>>> wrote:
>>>
>>>> The merkle tree XORs the individual row hashes together, which is
>>>> commutative. So you should be able to build a tree in the view token order
>>>> while reading in base table token order and vice versa.
>>>>
>>>> On Fri, May 16, 2025, at 9:54 AM, Jon Haddad wrote:
>>>>
>>>> Thanks for the explanation, I appreciate it.  I think you might still
>>>> be glossing over an important point - which I'll make singularly here.
>>>> There's a number of things I'm concerned about, but this is a big one.
>>>>
>>>> Calculating the hash of a partition for a Merkle tree needs to be done
>>>> on the fully materialized, sorted partition.
>>>>
>>>> The examples you're giving are simple, to the point where they hide the
>>>> problem.  Here's a better example, where the MV has a clustering column. In
>>>> the MV's partition it'll have multiple rows, but in the base table it'll be
>>>> stored in different pages or different SSTables entirely:
>>>>
>>>> CREATE TABLE test.t1 (
>>>> id int PRIMARY KEY,
>>>> v1 int
>>>> );
>>>>
>>>> CREATE MATERIALIZED VIEW test.test_mv AS
>>>> SELECT v1, id
>>>> FROM test.t1
>>>> WHERE id IS NOT NULL AND v1 IS NOT NULL
>>>> PRIMARY KEY (v1, id)
>>>>  WITH CLUSTERING ORDER BY (id ASC);
>>>>
>>>>
>>>> Let's say we have some test data:
>>>>
>>>> cqlsh:test> select id, v1 from t1;
>>>>
>>>>  id | v1
>>>> ----+----
>>>>  10 | 11
>>>>   1 | 14
>>>>  19 | 10
>>>>   2 | 14
>>>>   3 | 14
>>>>
>>>> When we transform the data by iterating over the base table, we get
>>>> this representation (note v1=14):
>>>>
>>>> cqlsh:test> select v1, id from t1;
>>>>
>>>>  v1 | id
>>>> ----+----
>>>>  11 | 10
>>>>  14 |  1   <--
>>>>  10 | 19
>>>>  14 |  2 <--
>>>>  14 |  3  <--
>>>>
>>>>
>>>> The partiton key in the new table is v1.  If you simply iterate and
>>>> transform and calculate merkle trees on the fly, you'll hit v1=14 with
>>>> id=1, but you'll miss id=2 and id=3.  You need to get them all up front,
>>>> and in sorted order, before you calculate the hash.  You actually need to
>>>> transform the data to this, prior to calculating the tree:
>>>&

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-15 Thread Runtian Liu
The previous table compared the complexity of full repair and MV repair
when reconciling one dataset with another. In production, we typically use
a replication factor of 3 in one datacenter. This means full repair
involves 3n rows, while MV repair involves comparing 6n rows (base + MV).
Below is an updated comparison table reflecting this scenario.

n: number of rows to repair (Total rows in the table)

d: depth of one Merkle tree for MV repair

r: number of split ranges

p: data compacted away


This comparison focuses on the complexities of one round of full repair
with a replication factor of 3 versus repairing a single MV based on one
base table with replication factor 3.

Extra disk used: Full Repair 0; MV Repair O(2*p)
  Since we take a snapshot at the beginning of the repair, any disk space
  that would normally be freed by compaction will remain occupied until the
  Merkle trees are successfully built and the snapshot is cleared.

Data scan complexity: Full Repair O(3*n); MV Repair O(6*n)
  Full repair scans n rows from the primary and 2n from replicas.
  MV repair scans 3n rows from the base table and 3n from the MV.

Merkle tree building time complexity: Full Repair O(3*n); MV Repair O(6*n*d)
  In full repair, Merkle tree building is O(1) per row—each hash is added
  sequentially to the leaf nodes. In MV repair, each hash is inserted from
  the root, making it O(d) per row. Since d is typically small (less than 20
  and often smaller than in full repair), this isn’t a major concern.

Total Merkle tree count: Full Repair O(3*r); MV Repair O(6*r^2)
  MV repair needs to generate more, smaller Merkle trees, but this isn’t a
  concern as they can be persisted to disk during the repair process.

Merkle tree comparison complexity: Full Repair O(3*n); MV Repair O(3*n)
  Assuming one row maps to one leaf node, both repairs are equivalent.

Stream time complexity: Full Repair O(3*n); MV Repair O(3*n)
  Assuming all rows need to be streamed, both repairs are equivalent.

In short: Even for production use cases having RF=3 in one data center, we
can see that the MV repair consumes temporary disk space and a small,
usually negligible amount of extra CPU for tree construction; other costs
match full repair.

Additionally, with the online path proposed in this CEP, we expect
mismatches to be rare, which can lower the frequency of running this repair
process compared to full repair.
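
As a back-of-envelope reading of the comparison above (the concrete n and d
below are made-up example values, not measurements):

n = 1_000_000_000        # rows in the table (illustrative assumption)
d = 15                   # Merkle tree depth for MV repair (illustrative assumption)

scan_ratio = (6 * n) / (3 * n)       # rows scanned: MV repair vs full repair
build_ratio = (6 * n * d) / (3 * n)  # hash insertions: MV repair vs full repair
print(scan_ratio, build_ratio)       # 2.0 30.0 -> 2x the scan I/O, 2*d the (CPU-only) hash work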


On Thu, May 15, 2025 at 9:53 AM Jon Haddad  wrote:

> > They are not two unordered sets, but rather two sets ordered by
> different keys.
>
> I think this is a distinction without a difference. Merkle tree repair
> works because the ordering of the data is mostly the same across nodes.
>
>
> On Thu, May 15, 2025 at 9:27 AM Runtian Liu  wrote:
>
>> > what we're trying to achieve here is comparing two massive unordered
>> sets.
>>
>> They are not two unordered sets, but rather two sets ordered by different
>> keys. This means that when building Merkle trees for the base table and the
>> materialized view (MV), we need to use different strategies to ensure the
>> trees can be meaningfully compared.
>>
>> To address scalability concerns for MV repair, I’ve included a comparison
>> between one round of full repair and MV repair in the table below. This
>> comparison is also added to the CEP.
>>
>> n: number of rows to repair (Total rows in the table)
>>
>> d: depth of one Merkle tree for MV repair
>>
>> r: number of split ranges
>>
>> p: data compacted away
>>
>>
>> This comparison focuses on the complexities of one round of full repair
>> with a replication factor of 2 versus repairing a single MV based on one
>> base table replica.
>>
>> Extra disk used: Full Repair 0; MV Repair O(2*p)
>>   Since we take a snapshot at the beginning of the repair, any disk space
>>   that would normally be freed by compaction will remain occupied until the
>>   Merkle trees are successfully built and the snapshot is cleared.
>>
>> Data scan complexity: Full Repair O(2*n); MV Repair O(2*n)
>>   Full repair scans n rows from the primary and n from replicas.
>>   MV repair scans n rows from the base table primary replica only, and n
>>   from the MV primary replica only.
>>
>> Merkle tree building time complexity: Full Repair O(n); MV Repair O(n*d)
>>   In full repair, Merkle tree building is O(1) per row—each hash is added
>>   sequentially to the leaf nodes. In MV repair, each hash is inserted from
>>   the root, making it O(d) per row. Since d is typically small (less than
>>   20 and often smaller than in full repair), this isn’t a major concern.
>>
>> Total Merkle tree count: Full Repair O(2*r); MV Repair O(2*r^2)
>

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-17 Thread Runtian Liu
For each row, when calculating its hash, we first need to merge all the
SSTables that contain that row. We cannot attach a Merkle tree directly to
each SSTable, because merging per-SSTable Merkle trees would produce
different hash values for the same data when the compaction states differ.

On Sat, May 17, 2025 at 5:48 PM Jon Haddad  wrote:

> Could we could do that for regular repair as well? which would make a
> validation possible with barely any IO?
>
> Sstable attached merkle trees?
>
>
>
>
> On Sat, May 17, 2025 at 5:36 PM Jon Haddad 
> wrote:
>
>> What if you built the merkle tree for each sstable as a storage attached
>> index?
>>
>> Then your repair is merging merkle tables.
>>
>>
>> On Sat, May 17, 2025 at 4:57 PM Runtian Liu  wrote:
>>
>>> > I think you could exploit this to improve your MV repair design.
>>> Instead of taking global snapshots and persisting merkle trees, you could
>>> implement a set of secondary indexes on the base and view tables that you
>>> could quickly compare the contents of for repair.
>>>
>>> We actually considered this approach while designing the MV repair.
>>> However, there are several downsides:
>>>
>>>1. It requires additional storage for the index files.
>>>2. Data scans during repair would become random disk accesses instead
>>>   of sequential ones, which can degrade performance.
>>>3. Most importantly, I decided against this approach due to the
>>>   complexity of ensuring index consistency. Introducing secondary indexes
>>>   opens up new challenges, such as keeping them in sync with the actual
>>>   data.
>>>
>>> The goal of the design is to provide a catch-all mismatch detection
>>> mechanism that targets the dataset users query during the online path. I
>>> did consider adding indexes at the SSTable level to guarantee consistency
>>> between indexes and data.
>>> > sorted by base table partition order, but segmented by view partition
>>> ranges
>>> If the indexes are at the SSTable level, it will be less flexible: we
>>> would need to rewrite the SSTables if we decide to change the view
>>> partition ranges.
>>> I didn’t explore this direction further due to the issues listed above.
>>>
>>> > The transformative repair could be done against the local index, and
>>> the local index can repair against the global index. It opens up a lot of
>>> possibilities, query wise, as well.
>>> This is something I’m not entirely sure about—how exactly do we use the
>>> local index to support the global index (i.e., the MV)? If the MV relies on
>>> local indexes during the query path, we can definitely dig deeper into how
>>> repair could work with that design.
>>>
>>> The proposed design in this CEP aims to treat the base table and its MV
>>> like any other regular tables, so that operations such as compaction and
>>> repair can be handled in the same way in most cases.
>>>
>>> On Sat, May 17, 2025 at 2:42 PM Jon Haddad 
>>> wrote:
>>>
>>>> Yeah, this is exactly what i suggested in a different part of the
>>>> thread. The transformative repair could be done against the local index,
>>>> and the local index can repair against the global index. It opens up a lot
>>>> of possibilities, query wise, as well.
>>>>
>>>>
>>>>
>>>> On Sat, May 17, 2025 at 1:47 PM Blake Eggleston 
>>>> wrote:
>>>>
>>>>> > They are not two unordered sets, but rather two sets ordered by
>>>>> different keys.
>>>>>
>>>>> I think you could exploit this to improve your MV repair design.
>>>>> Instead of taking global snapshots and persisting merkle trees, you could
>>>>> implement a set of secondary indexes on the base and view tables that you
>>>>> could quickly compare the contents of for repair.
>>>>>
>>>>> The indexes would have their contents sorted by base table partition
>>>>> order, but segmented by view partition ranges. Then any view <-> base
>>>>> repair would compare the intersecting index slices. That would allow you 
>>>>> to
>>>>> repair data more quickly and with less operational complexity.
>>>>>
>>>>> On Fri, May 16, 2025, at 12:32 PM, Runtian Liu wrote:
>>>>>
>>>>>

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-17 Thread Runtian Liu
> I think you could exploit this to improve your MV repair design. Instead
of taking global snapshots and persisting merkle trees, you could implement
a set of secondary indexes on the base and view tables that you could
quickly compare the contents of for repair.

We actually considered this approach while designing the MV repair.
However, there are several downsides:

   1. It requires additional storage for the index files.
   2. Data scans during repair would become random disk accesses instead of
   sequential ones, which can degrade performance.
   3. Most importantly, I decided against this approach due to the complexity
   of ensuring index consistency. Introducing secondary indexes opens up new
   challenges, such as keeping them in sync with the actual data.

The goal of the design is to provide a catch-all mismatch detection
mechanism that targets the dataset users query during the online path. I
did consider adding indexes at the SSTable level to guarantee consistency
between indexes and data.
> sorted by base table partition order, but segmented by view partition
ranges
If the indexes are at the SSTable level, it will be less flexible: we would
need to rewrite the SSTables if we decide to change the view partition
ranges.
I didn’t explore this direction further due to the issues listed above.

> The transformative repair could be done against the local index, and the
local index can repair against the global index. It opens up a lot of
possibilities, query wise, as well.
This is something I’m not entirely sure about—how exactly do we use the
local index to support the global index (i.e., the MV)? If the MV relies on
local indexes during the query path, we can definitely dig deeper into how
repair could work with that design.

The proposed design in this CEP aims to treat the base table and its MV
like any other regular tables, so that operations such as compaction and
repair can be handled in the same way in most cases.

On Sat, May 17, 2025 at 2:42 PM Jon Haddad  wrote:

> Yeah, this is exactly what i suggested in a different part of the thread.
> The transformative repair could be done against the local index, and the
> local index can repair against the global index. It opens up a lot of
> possibilities, query wise, as well.
>
>
>
> On Sat, May 17, 2025 at 1:47 PM Blake Eggleston 
> wrote:
>
>> > They are not two unordered sets, but rather two sets ordered by
>> different keys.
>>
>> I think you could exploit this to improve your MV repair design. Instead
>> of taking global snapshots and persisting merkle trees, you could implement
>> a set of secondary indexes on the base and view tables that you could
>> quickly compare the contents of for repair.
>>
>> The indexes would have their contents sorted by base table partition
>> order, but segmented by view partition ranges. Then any view <-> base
>> repair would compare the intersecting index slices. That would allow you to
>> repair data more quickly and with less operational complexity.
>>
>> On Fri, May 16, 2025, at 12:32 PM, Runtian Liu wrote:
>>
>> [chart omitted: a grid of Merkle trees, one per (base table range, MV
>> range) cell]
>> For example, in the chart above, each cell represents a Merkle tree that
>> covers data belonging to a specific base table range and a specific MV
>> range. When we scan a base table range, we can generate the Merkle trees
>> marked in red. When we scan an MV range, we can generate the Merkle trees
>> marked in green. The cells that can be compared are marked in blue.
>>
>> To save time and CPU resources, we persist the Merkle trees created
>> during a scan so we don’t need to regenerate them later. This way, when
>> other nodes scan and build Merkle trees based on the same “frozen”
>> snapshot, we can reuse the existing Merkle trees for comparison.
>>
>> On Fri, May 16, 2025 at 12:22 PM Runtian Liu  wrote:
>>
>> Unfortunately, no. When building Merkle trees for small toke

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-18 Thread Runtian Liu
> If you had a custom SAI index or something, this isn’t something you’d
need to worry about
This is what I missed.

I think this could be a potential solution, but comparing indexes alone
isn’t sufficient—it only handles cases where the MV has extra or missing
rows. It doesn’t catch data mismatches for rows that exist in both the base
table and MV. To address that, we may need to extend SAI for MV to store
the entire selected dataset in the index file, applying the same approach
to MV as we do for the base table. This would increase storage to roughly
4x per MV, compared to the current 2x, but it would help avoid random disk
access during repair. I’m not sure if this would introduce any memory
issues during compaction.


On Sun, May 18, 2025 at 8:09 PM Blake Eggleston 
wrote:

> It *might* be more efficient, but it’s also more brittle. I think it
> would be more fault tolerant and less trouble overall to repair
> intersecting token ranges. So you’re not repairing a view partition, you’re
> repairing the parts of a view partition that intersect with a base table
> token range.
>
> The issues I see with the global snapshot are:
>
> 1. Requiring a global snapshot means that you can’t start a new repair
> cycle if there’s a node down.
> 2. These merkle trees can’t all be calculated at once, so we’ll need a
> coordination mechanism to spread out scans of the snapshots
> 3. By requiring a global snapshot and then building merkle trees from that
> snapshot, you’re introducing a delay of however long it takes you to do a
> full scan of both tables. So if you’re repairing your cluster every 3 days,
> it means the last range to get repaired is repairing based on a state
> that’s now 3 days old. This makes your repair horizon 2x your scheduling
> cadence and puts an upper bound on how up to date you can keep your view.
>
> With an index based approach, much of the work is just built into the
> write and compaction paths and repair is just a scan of the intersecting
> index segments from the base and view tables. You’re also repairing from
> the state that existed when you started your repair, so your repair horizon
> matches your scheduling cadence.
>
> On Sun, May 18, 2025, at 7:45 PM, Jaydeep Chovatia wrote:
>
> >Isn’t the reality here is that repairing a single partition in the base
> table is potentially a full cluster-wide scan of the MV if you also want to
> detect rows in the MV that don’t exist in the base table (eg resurrection
> or a missed delete)
> Exactly. Since materialized views (MVs) are partitioned differently from
> their base tables, there doesn’t appear to be a more efficient way to
> repair them in a targeted manner—meaning we can’t restrict the repair to
> only a small portion of the data.
>
> Jaydeep
>
> On Sun, May 18, 2025 at 5:57 PM Jeff Jirsa  wrote:
>
>
> Isn’t the reality here is that repairing a single partition in the base
> table is potentially a full cluster-wide scan of the MV if you also want to
> detect rows in the MV that don’t exist in the base table (eg resurrection
> or a missed delete)
>
> There’s no getting around that. Keeping an extra index doesn’t avoid that
> scan, it just moves the problem around to another tier.
>
>
>
> On May 18, 2025, at 4:59 PM, Blake Eggleston  wrote:
>
> 
> Whether it’s index based repair or another mechanism, I think the proposed
> repair design needs to be refined. The requirement of a global snapshot and
> merkle tree build before we can start detecting and fixing problems is a
> pretty big limitation.
>
> > Data scans during repair would become random disk accesses instead of
> sequential ones, which can degrade performance.
>
> You’d only be reading and comparing the index files, not the sstable
> contents. Reads would still be sequential.
>
> > Most importantly, I decided against this approach due to the complexity
> of ensuring index consistency. Introducing secondary indexes opens up new
> challenges, such as keeping them in sync with the actual data.
>
> I think this is mostly a solved problem in C*? If you had a custom SAI
> index or something, this isn’t something you’d need to worry about AFAIK.
>
> On Sat, May 17, 2025, at 4:57 PM, Runtian Liu wrote:
>
> > I think you could exploit this to improve your MV repair design. Instead
> of taking global snapshots and persisting merkle trees, you could implement
> a set of secondary indexes on the base and view tables that you could
> quickly compare the contents of for repair.
>
> We actually considered this approach while designing the MV repair.
> However, there are several downsides:
>
>1. It requires additional storage for the index files.
>2. Data scans during repair would become random di

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-19 Thread Runtian Liu
> You don’t need to duplicate the full data set in the index, you just need
enough info to detect that something is missing.
Could you please explain how this would work?
If we build Merkle trees or compute hashes at the SSTable level, how would
this case be handled?
For example, consider the following table schema:
CREATE TABLE t (
  pk int PRIMARY KEY,
  v1 int,
  v2 int,
  v3 int
);

Suppose the data is stored as follows:
*Node 1*

   - SSTable1: (1, 1, null, 1)
   - SSTable2: (1, null, 1, 1)

*Node 2*

   - SSTable3: (1, 1, 1, 1)

How can we ensure that a hash or Merkle tree computed at the SSTable level
would produce the same result on both nodes for this row?
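
To make the point concrete, here is a small sketch (pure illustration, not
Cassandra code; reconciliation is reduced to "first non-null cell wins"
instead of real timestamps). Hashing the SSTable fragments directly gives
different results on the two nodes, while hashing the row after merging its
fragments agrees.

import hashlib

def row_hash(cells):
    return hashlib.md5(repr(cells).encode()).hexdigest()

def merge(fragments):
    # per-column reconciliation; real Cassandra picks cells by timestamp
    return tuple(next((v for v in col if v is not None), None)
                 for col in zip(*fragments))

node1 = [(1, 1, None, 1), (1, None, 1, 1)]    # pk=1 split across two SSTables
node2 = [(1, 1, 1, 1)]                        # pk=1 already compacted into one SSTable

print([row_hash(f) for f in node1] == [row_hash(f) for f in node2])   # False
print(row_hash(merge(node1)) == row_hash(merge(node2)))               # True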

On Mon, May 19, 2025 at 1:54 PM Jon Haddad  wrote:

> We could also track the deletes that need to be made to the view, in
> another SSTable component on the base.  That way you can actually do repair
> with tombstones.
>
>
>
>
> On Mon, May 19, 2025 at 11:37 AM Blake Eggleston 
> wrote:
>
>> If we went the storage attached route then I think you’d just need more
>> memory for the memtable, compaction would just be combining 2 sorted sets,
>> though there would probably be some additional work related to deletes,
>> overwrites, and tombstone purging.
>>
>> Regarding the size of the index, I think Jon was on the right track with
>> his sstable attached merkle tree idea. You don’t need to duplicate the full
>> data set in the index, you just need enough info to detect that something
>> is missing. If you can detect that view partition x is missing data from
>> base partition y, then you could start comparing the actual partition data
>> and figure out who’s missing what.
>>
>> On Sun, May 18, 2025, at 9:20 PM, Runtian Liu wrote:
>>
>> > If you had a custom SAI index or something, this isn’t something you’d
>> need to worry about
>> This is what I missed.
>>
>> I think this could be a potential solution, but comparing indexes alone
>> isn’t sufficient—it only handles cases where the MV has extra or missing
>> rows. It doesn’t catch data mismatches for rows that exist in both the base
>> table and MV. To address that, we may need to extend SAI for MV to store
>> the entire selected dataset in the index file, applying the same approach
>> to MV as we do for the base table. This would increase storage to roughly
>> 4x per MV, compared to the current 2x, but it would help avoid random disk
>> access during repair. I’m not sure if this would introduce any memory
>> issues during compaction.
>>
>>
>>
>> On Sun, May 18, 2025 at 8:09 PM Blake Eggleston 
>> wrote:
>>
>>
>> It *might* be more efficient, but it’s also more brittle. I think it
>> would be more fault tolerant and less trouble overall to repair
>> intersecting token ranges. So you’re not repairing a view partition, you’re
>> repairing the parts of a view partition that intersect with a base table
>> token range.
>>
>> The issues I see with the global snapshot are:
>>
>> 1. Requiring a global snapshot means that you can’t start a new repair
>> cycle if there’s a node down.
>> 2. These merkle trees can’t all be calculated at once, so we’ll need a
>> coordination mechanism to spread out scans of the snapshots
>> 3. By requiring a global snapshot and then building merkle trees from
>> that snapshot, you’re introducing a delay of however long it takes you to
>> do a full scan of both tables. So if you’re repairing your cluster every 3
>> days, it means the last range to get repaired is repairing based on a state
>> that’s now 3 days old. This makes your repair horizon 2x your scheduling
>> cadence and puts an upper bound on how up to date you can keep your view.
>>
>> With an index based approach, much of the work is just built into the
>> write and compaction paths and repair is just a scan of the intersecting
>> index segments from the base and view tables. You’re also repairing from
>> the state that existed when you started your repair, so your repair horizon
>> matches your scheduling cadence.
>>
>> On Sun, May 18, 2025, at 7:45 PM, Jaydeep Chovatia wrote:
>>
>> >Isn’t the reality here is that repairing a single partition in the base
>> table is potentially a full cluster-wide scan of the MV if you also want to
>> detect rows in the MV that don’t exist in the base table (eg resurrection
>> or a missed delete)
>> Exactly. Since materialized views (MVs) are partitioned differently from
>> their base tables, there doesn’t appear to be a more efficient way to
>> repair them in a targeted manner—meaning we can’t restrict the repair to
>> onl

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-05-15 Thread Runtian Liu
s like use LOCAL_QUORUM without leaving themselves open to
> introducing view inconsistencies.
>
> That would more or less eliminate the need for any MV repair in normal
> usage, but wouldn't address how to repair issues caused by bugs or data
> loss, though you may be able to do something with comparing the latest
> mutation ids for the base tables and its view ranges.
>
> On Wed, May 14, 2025, at 10:19 AM, Paulo Motta wrote:
>
> I don't see mutation tracking [1] mentioned in this thread or in the
> CEP-48 description. Not sure this would fit into the scope of this
> initial CEP, but I have a feeling that mutation tracking could be
> potentially helpful to reconcile base tables and views ?
>
> For example, when both base and view updates are acknowledged then this
> could be somehow persisted in the view sstables mutation tracking
> summary[2] or similar metadata ? Then these updates would be skipped during
> view repair, considerably reducing the amount of work needed, since only
> un-acknowledged views updates would need to be reconciled.
>
> [1] -
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking
> [2] - https://issues.apache.org/jira/browse/CASSANDRA-20336
>
> On Wed, May 14, 2025 at 12:59 PM Paulo Motta 
> wrote:
>
> > - The first thing I notice is that we're talking about repairing the
> entire table across the entire cluster all in one go.  It's been a *long*
> time since I tried to do a full repair of an entire table without using
> sub-ranges.  Is anyone here even doing that with clusters of non-trivial
> size?  How long does a full repair of a 100 node cluster with 5TB / node
> take even in the best case scenario?
>
> I haven't checked the CEP yet so I may be missing out something but I
> think this effort doesn't need to be conflated with dense node support, to
> make this more approachable. I think prospective users would be OK with
> overprovisioning to make this feasible if needed. We could perhaps have
> size guardrails that limit the maximum table size per node when MVs are
> enabled. Ideally we should make it work for dense nodes if possible, but
> this shouldn't be a reason not to support the feature if it can be made to
> work reasonably with more resources.
>
> I think the main issue with the current MV is about correctness, and the
> ultimate goal of the CEP must be to provide correctness guarantees, even if
> it has an inevitable performance hit. I think that the performance of the
> repair process is definitely an important consideration and it would be
> helpful to have some benchmarks to have an idea of how long this repair
> process would take for lightweight and denser tables.
>
> On Wed, May 14, 2025 at 7:28 AM Jon Haddad 
> wrote:
>
> I've got several concerns around this repair process.
>
> - The first thing I notice is that we're talking about repairing the
> entire table across the entire cluster all in one go.  It's been a *long*
> time since I tried to do a full repair of an entire table without using
> sub-ranges.  Is anyone here even doing that with clusters of non trivial
> size?  How long does a full repair of a 100 node cluster with 5TB / node
> take even in the best case scenario?
>
> - Even in a scenario where sub-range repair is supported, you'd have to
> scan *every* sstable on the base table in order to construct the a merkle
> tree, as we don't know in advance which SSTables contain the ranges that
> the MV will.  That means a subrange repair would have to do a *ton* of IO.
> Anyone who's mis-configured a sub-range incremental repair to use too many
> ranges will probably be familiar with how long it can take to anti-compact
> a bunch of SSTables.  With MV sub-range repair, we'd have even more
> overhead, because we'd have to read in every SSTable, every time.  If we do
> 10 subranges, we'll do 10x the IO of a normal repair.  I don't think this
> is practical.
>
> - Merkle trees make sense when you're comparing tables with the same
> partition key, but I don't think they do when you're transforming a base
> table to a view.  When there's a mis-match, what's transferred?  We have a
> range of data in the MV, but now we have to go find that from the base
> table.  That means the merkle tree needs to not just track the hashes and
> ranges, but the original keys it was transformed from, in order to go find
> all of the matching partitions in that mis-matched range.  Either that or
> we end up rescanning the entire dataset in order to find the mismatches.
>
> Jon
>
>
>

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-22 Thread Runtian Liu
> visible. I’m not sure how this would work, and the idea to use
> anti-compaction makes a lot more sense now (in principle - I don’t think
> it’s workable in practice). I guess you could add some sort of assassin
> cell that is meant to remove a cell with a specific timestamp and value,
> but is otherwise invisible. This seems dangerous though, since it’s likely
> there’s a replication problem that may resolve itself and the repair
> process would actually be removing data that the user intended to write.
>
> Paulo - I don’t think storage changes are off the table, but they do
> expand the scope and risk of the proposal, so we should be careful.
>
> On Wed, Jun 11, 2025, at 4:44 PM, Paulo Motta wrote:
>
>  > I’m not sure if this is the only edge case—there may be other issues as
> well. I’m also unsure whether we should redesign the tombstone handling for
> MVs, since that would involve changes to the storage engine. To minimize
> impact there, the original proposal was to rebuild the affected ranges
> using anti-compaction, just to be safe.
>
> I haven't been following the discussion but I think one of the issues with
> the materialized view "strict liveness" fix[1] is that we avoided making
> invasive changes to the storage engine at the time, but this was considered
> by Zhao on [1]. I think we shouldn't be trying to avoid updates to the
> storage format as part of the MV implementation, if this is what it takes
> to make MVs V2 reliable.
>
> [1] -
> https://issues.apache.org/jira/browse/CASSANDRA-11500?focusedCommentId=16101603&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16101603
>
> On Wed, Jun 11, 2025 at 7:02 PM Runtian Liu  wrote:
>
> The current design leverages strict liveness to shadow the old view row.
> When the view-indexed value changes from 'a' to 'b', no tombstone is
> written; instead, the old row is marked as expired by updating its liveness
> info with the timestamp of the change. If the column is later set back to
> 'a', the view row is re-inserted with a new, non-expired liveness info
> reflecting the latest timestamp.
>
> To delete an extra row in the materialized view (MV), we can likely use
> the same approach—marking it as shadowed by updating the liveness info.
> However, in the case of inconsistencies where a column in the MV has a
> higher timestamp than the corresponding column in the base table, this
> row-level liveness mechanism is insufficient.
>
> Even for the case where we delete the row by marking its liveness info as
> expired during repair, there are concerns. Since this introduces a data
> mutation as part of the repair process, it’s unclear whether there could be
> edge cases we’re missing. This approach may risk unexpected side effects if
> the repair logic is not carefully aligned with write path semantics.
>
> On Thu, Jun 12, 2025 at 3:59 AM Blake Eggleston 
> wrote:
>
>
> That’s a good point, although as described I don’t think that could ever
> work properly, even in normal operation. Either we’re misunderstanding
> something, or this is a flaw in the current MV design.
>
> Assuming changing the view indexed column results in a tombstone being
> applied to the view row for the previous value, if we wrote the other base
> columns (the non view indexed ones) to the view with the same timestamps
> they have on the base, then changing the view indexed value from ‘a’ to
> ‘b’, then back to ‘a’ would always cause this problem. I think you’d need
> to always update the column timestamps on the view to be >= the view column
> timestamp on the base
>
> On Tue, Jun 10, 2025, at 11:38 PM, Runtian Liu wrote:
>
> > In the case of a missed update, we'll have a new value and we can send
> a tombstone to the view with the timestamp of the most recent update.
>
> > then something has gone wrong and we should issue a tombstone using the
> paxos repair timestamp as the tombstone timestamp.
>
> The current MV implementation uses “strict liveness” to determine whether
> a row is live. I believe that using regular tombstones during repair could
> cause problems. For example, consider a base table with schema (pk, ck, v1,
> v2) and a materialized view with schema (v1, pk, ck) -> v2. If, for some
> reason, we detect an extra row in the MV and delete it using a tombstone
> with the latest update timestamp, we may run into issues. Suppose we later
> update the base table’s v1 field to match the MV row we previously deleted,
> and the v2 value now has an older timestamp. In that case, the previously
> issued tombstone could still shadow the v2 column, which is unintended.
> That is why I was asking if we are going to introduce a 

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-06 Thread Runtian Liu
/05 03:58:59 Blake Eggleston wrote:
> > You can detect and fix the mismatch in a single round of repair, but the
> amount of work needed to do it is _significantly_ higher with snapshot
> repair. Consider a case where we have a 300 node cluster w/ RF 3, where
> each view partition contains entries mapping to every token range in the
> cluster - so 100 ranges. If we lose a view sstable, it will affect an
> entire row/column of the grid. Repair is going to scan all data in the
> mismatching view token ranges 100 times, and each base range once. So
> you’re looking at 200 range scans.
> >
> > Now, you may argue that you can merge the duplicate view scans into a
> single scan while you repair all token ranges in parallel. I’m skeptical
> that’s going to be achievable in practice, but even if it is, we’re now
> talking about the view replica hypothetically doing a pairwise repair with
> every other replica in the cluster at the same time. Neither of these
> options is workable.
> >
> > Let’s take a step back though, because I think we’re getting lost in the
> weeds.
> >
> > The repair design in the CEP has some high level concepts that make a
> lot of sense, the idea of repairing a grid is really smart. However, it has
> some significant drawbacks that remain unaddressed. I want this CEP to
> succeed, and I know Jon does too, but the snapshot repair design is not a
> viable path forward. It’s the first iteration of a repair design. We’ve
> proposed a second iteration, and we’re open to a third iteration. This part
> of the CEP process is meant to identify and address shortcomings, I don’t
> think that continuing to dissect the snapshot repair design is making
> progress in that direction.
> >
> > On Wed, Jun 4, 2025, at 2:04 PM, Runtian Liu wrote:
> > > >  We potentially have to do it several times on each node, depending
> on the size of the range. Smaller ranges increase the size of the board
> exponentially, larger ranges increase the number of SSTables that would be
> involved in each compaction.
> > > As described in the CEP example, this can be handled in a single round
> of repair. We first identify all the points in the grid that require
> repair, then perform anti-compaction and stream data based on a second scan
> over those identified points. This applies to the snapshot-based
> solution—without an index, repairing a single point in that grid requires
> scanning the entire base table partition (token range). In contrast, with
> the index-based solution—as in the example you referenced—if a large block
> of data is corrupted, even though the index is used for comparison, many
> key mismatches may occur. This can lead to random disk access to the
> original data files, which could cause performance issues. For the case you
> mentioned for the snapshot-based solution, it should not take months to
> repair all the data; one round of repair should be enough. The actual repair
> phase is split from the detection phase.
> > >
> > >
> > > On Thu, Jun 5, 2025 at 12:12 AM Jon Haddad 
> wrote:
> > >> > This isn’t really the whole story. The amount of wasted scans on
> index repairs is negligible. If a difference is detected with snapshot
> repairs though, you have to read the entire partition from both the view
> and base table to calculate what needs to be fixed.
> > >>
> > >> You nailed it.
> > >>
> > >> When the base table is converted to a view, and sent to the view, the
> information we have is that one of the view's partition keys needs a
> repair.  That's going to be different from the partition key of the base
> table.  As a result, on the base table, for each affected range, we'd have
> to issue another compaction across the entire set of sstables that could
> have the data the view needs (potentially many GB), in order to send over
> the corrected version of the partition, then send it over to the view.
> Without an index in place, we have to do yet another scan, per-affected
> range.
> > >>
> > >> Consider the case of a single corrupted SSTable on the view that's
> removed from the filesystem, or the data is simply missing after being
> restored from an inconsistent backup.  It presumably contains lots of
> partitions, which maps to base partitions all over the cluster, in a lot of
> different token ranges.  For every one of those ranges (hundreds, to tens
> of thousands of them given the checkerboard design), when finding the
> missing data in the base, you'll have to perform a compaction across all
> the SSTables that potentially contain the missing data just to rebuild the
> view-oriented partitions that need to be s

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-10 Thread Runtian Liu
ases in size. That's the part that doesn't scale well.
>
> This is also one of the benefits of the index design. Since it stores data in
> segments that roughly correspond to points on the grid, you’re not
> rereading the same data over and over. A repair for a given grid point only
> reads an amount of data proportional to the data in common for the
> base/view grid point, and it’s stored in a small enough granularity that
> the base can calculate what data needs to be sent to the view without
> having to read the entire view partition.
>
> On Sat, Jun 7, 2025, at 7:42 PM, Runtian Liu wrote:
>
> Thanks, Blake. I’m open to iterating on the design, and hopefully we can
> come up with a solution that everyone agrees on.
>
> My main concern with the index-based solution is the overhead it adds to
> the hot path, as well as having to build indexes periodically. As mentioned
> earlier, this MV repair should be an infrequent operation, but the
> index-based approach shifts some of the work to the hot path in order to
> allow repairs that touch only a few nodes.
>
> I’m wondering if it’s possible to enable or disable index building
> dynamically so that we don’t always incur the cost for something that’s
> rarely needed.
>
> > it degrades operators’ ability to react to data problems by imposing a
> significant upfront processing burden on repair, and that it doesn’t scale
> well with cluster size
>
> I’m not sure what you mean by “data problems” here. Also, this does scale
> with cluster size—I’ve compared it to full repair, and this MV repair
> should behave similarly. That means as long as full repair works, this
> repair should work as well.
>
> For example, regardless of how large the cluster is, you can always enable
> Merkle tree building on 10% of the nodes at a time until all the trees are
> ready.
>
> I understand that coordinating this type of repair is harder than what we
> currently support, but with CEP-37, we should be able to handle this
> coordination without adding too much burden on the operator side.
>
> On Sat, Jun 7, 2025 at 8:28 AM Blake Eggleston 
> wrote:
>
>
> I don't see any outcome here that is good for the community though. Either
> Runtian caves and adopts your design that he (and I) consider inferior, or
> he is prevented from contributing this work.
>
>
> Hey Runtian, fwiw, these aren't the only 2 options. This isn’t a
> competition. We can collaborate and figure out the best approach to the
> problem. I’d like to keep discussing it if you’re open to iterating on the
> design.
>
> I’m not married to our proposal, it’s just the cleanest way we could think
> of to address what Jon and I both see as blockers in the current proposal.
> It’s not set in stone though.
>
> On Fri, Jun 6, 2025, at 1:32 PM, Benedict Elliott Smith wrote:
>
> Hmm, I am very surprised as I helped write that and I distinctly recall a
> specific goal was avoiding binding vetoes as they're so toxic.
>
> Ok, I guess you can block this work if you like.
>
> I don't see any outcome here that is good for the community though. Either
> Runtian caves and adopts your design that he (and I) consider inferior, or
> he is prevented from contributing this work. That isn't a functioning
> community in my mind, so I'll be noping out for a while, as I don't see
> much value here right now.
>
>
> On 2025/06/06 18:31:08 Blake Eggleston wrote:
> > Hi Benedict, that’s actually not true.
> >
> > Here’s a link to the project governance page:
> > https://cwiki.apache.org/confluence/display/CASSANDRA/Cassandra+Project+Governance
> >
> > The CEP section says:
> >
> > “*Once the proposal is finalized and any major committer dissent
> reconciled, call a [VOTE] on the ML to have the proposal adopted. The
> criteria for acceptance is consensus (3 binding +1 votes and no binding
> vetoes). The vote should remain open for 72 hours.*”
> >
> > So they’re definitely vetoable.
> >
> > Also note the part about “*Once the proposal is finalized and any major
> committer dissent reconciled,*” being a prerequisite for moving a CEP to
> [VOTE]. Given the as yet unreconciled committer dissent, it wouldn’t even
> be appropriate to move to a VOTE until we get to the bottom of this repair
> discussion.
> >
> > On Fri, Jun 6, 2025, at 12:31 AM, Benedict Elliott Smith wrote:
> > > > but the snapshot repair design is not a viable path forward. It’s
> the first iteration of a repair design. We’ve proposed a second iteration,
> and we’re open to a third iteration.
> > >
> > > I shan't be participating further in discussion, but I want to make a
> point of

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-23 Thread Runtian Liu
In the second option, we use the repair timestamp to re-update any cell or
row we want to fix in the base table. This approach is problematic because
it alters the write time of user-supplied data. Although Cassandra does not
allow users to set timestamps for LWT writes, users may still rely on the
update time. A key limitation of this approach is that it cannot fix cases
where a view cell ends up in a future state while the base table remains
correct. I now understand your point that Cassandra cannot handle this
scenario today. However, as I mentioned earlier, the important distinction
is that when this issue occurs in the base table, we accept the "incorrect"
data as valid—but this is not acceptable for materialized views, since the
source of truth (the base table) still holds the correct data.
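
To make the concern concrete, here is a minimal CQL sketch of what that kind of
repair write would look like and how it changes the write time a user observes.
The keyspace, table, and timestamps are invented for illustration; an actual
implementation would generate the mutation internally rather than through CQL.

    -- Hypothetical base table used only for this example.
    CREATE TABLE ks.base (pk int, ck int, v1 text, v2 text, PRIMARY KEY (pk, ck));

    -- Original user write at timestamp 1000.
    INSERT INTO ks.base (pk, ck, v1, v2) VALUES (1, 1, 'a', 'x') USING TIMESTAMP 1000;

    -- Repair re-asserts the same value, stamped with the repair timestamp 5000,
    -- so that a stale view cell with a higher timestamp can no longer win.
    UPDATE ks.base USING TIMESTAMP 5000 SET v1 = 'a' WHERE pk = 1 AND ck = 1;

    -- The user-visible write time has changed even though the value has not.
    SELECT WRITETIME(v1) FROM ks.base WHERE pk = 1 AND ck = 1;  -- now 5000, not 1000

It also shows the limitation noted above: forcing the base write only helps when
the repair timestamp exceeds the bad view cell’s timestamp, so a view cell stamped
with a timestamp from the future would still shadow the repaired data.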

On Mon, Jun 23, 2025 at 12:05 PM Blake Eggleston 
wrote:

> > Sorry, Blake—I was traveling last week and couldn’t reply to your email
> sooner.
>
> No worries, I’ll be taking some time off soon as well.
>
> > I don’t think the first or second option is ideal. We should treat the
> base table as the source of truth. Modifying it—or forcing an update on it,
> even if it’s just a timestamp change—is not a good approach and won’t solve
> all problems.
>
> I agree the first option probably isn’t the right way to go. Could you say
> a bit more about why the second option is not a good approach and which
> problems it won’t solve?
>
> On Sun, Jun 22, 2025, at 6:09 PM, Runtian Liu wrote:
>
> Sorry, Blake—I was traveling last week and couldn’t reply to your email
> sooner.
>
> > First - we interpret view data with higher timestamps than the base
> table as data that’s missing from the base and replicate it into the base
> table. The timestamp of the missing data may be below the paxos timestamp
> low bound so we’d have to adjust the paxos coordination logic to allow that
> in this case. Depending on how the view got this way it may also tear
> writes to the base table, breaking the write atomicity promise.
>
> As discussed earlier, we want this MV repair mechanism to handle all edge
> cases. However, it would be difficult to design it in a way that detects
> the root cause of each mismatch and repairs it accordingly. Additionally,
> as you mentioned, this approach could introduce other issues, such as
> violating the write atomicity guarantee.
>
> > Second - If this happens it means that we’ve either lost base table data
> or paxos metadata. If that happened, we could force a base table update
> that rewrites the current base state with new timestamps making the extra
> view data removable. However this wouldn’t fix the case where the view cell
> has a timestamp from the future - although that’s not a case that C* can
> fix today either.
>
> I don’t think the first or second option is ideal. We should treat the
> base table as the source of truth. Modifying it—or forcing an update on it,
> even if it’s just a timestamp change—is not a good approach and won’t solve
> all problems.
>
> > the idea to use anti-compaction makes a lot more sense now (in principle
> - I don’t think it’s workable in practice)
>
> I have one question regarding anti-compaction. Is the main concern that
> processing too much data during anti-compaction could cause issues for the
> cluster?
>
> > I guess you could add some sort of assassin cell that is meant to remove
> a cell with a specific timestamp and value, but is otherwise invisible.
>
> The idea of the assassination cell is interesting. To prevent data from
> being incorrectly removed during the repair process, we need to ensure a
> quorum of nodes is available and agrees on the same value before repairing
> a materialized view (MV) row or cell. However, this could be very
> expensive, as it requires coordination to repair even a single row.
>
> I think there are a few key differences between MV repair and normal
> anti-entropy repair:
>
>1.
>
>For normal anti-entropy repair, there is no single source of truth.
>All replicas act as sources of truth, and eventual consistency is
>achieved by merging data across them. Even if some data is lost or bugs
>result in future timestamps, the replicas will eventually converge on the
>same (possibly incorrect) value, and that value becomes the accepted truth.
>In contrast, MV repair relies on the base table as the source of truth and
>modifies the MV if inconsistencies are detected. This means that in some
>cases, simply merging data won’t resolve the issue, since Cassandra
>resolves conflicts using timestamps and there’s no guarantee the base table
>will always "win" unless we change the MV merging logic—which I’ve been
>trying to avoid.
> 

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-04 Thread Runtian Liu
;> be fully contained within a single node as the topology changes, some data
>> scans may be wasted.
>> > Snapshots: No extra data scan
>>
>> This isn’t really the whole story. The amount of wasted scans on index
>> repairs is negligible. If a difference is detected with snapshot repairs
>> though, you have to read the entire partition from both the view and base
>> table to calculate what needs to be fixed.
>>
>> On Tue, Jun 3, 2025, at 10:27 AM, Jon Haddad wrote:
>>
>> One practical aspect that isn't immediately obvious is the disk space
>> consideration for snapshots.
>>
>> When you have a table with a mixed workload using LCS or UCS with scaling
>> parameters like L10 and initiate a repair, the disk usage will increase as
>> long as the snapshot persists and the table continues to receive writes.
>> This aspect is understood and factored into the design.
>>
>> However, a more nuanced point is the necessity to maintain sufficient
>> disk headroom specifically for running repairs. This echoes the challenge
>> with STCS compaction, where enough space must be available to accommodate
>> the largest SSTables, even when they are not being actively compacted.
>>
>> For example, if a repair involves rewriting 100GB of SSTable data, you'll
>> consistently need to reserve 100GB of free space to facilitate this.
>>
>> Therefore, while the snapshot-based approach leads to variable disk space
>> utilization, operators must provision storage as if the maximum potential
>> space will be used at all times to ensure repairs can be executed.
>>
>> This introduces a rate of churn dynamic, where the write throughput
>> dictates the required extra disk space, rather than the existing on-disk
>> data volume.
>>
>> If 50% of your SSTables are rewritten during a snapshot, you would need
>> 50% free disk space. Depending on the workload, the snapshot method could
>> consume significantly more disk space than an index-based approach.
>> Conversely, for relatively static workloads, the index method might require
>> more space. It's not as straightforward as stating "No extra disk space
>> needed".
>>
>> Jon
>>
>> On Mon, Jun 2, 2025 at 2:49 PM Runtian Liu  wrote:
>>
>> > Regarding your comparison between approaches, I think you also need to
>> take into account the other dimensions that have been brought up in this
>> thread. Things like minimum repair times and vulnerability to outages and
>> topology changes are the first that come to mind.
>>
>> Sure, I added a few more points.
>>
>> *Perspective*: *Index-Based Solution* vs. *Snapshot-Based Solution*
>>
>> 1. Hot path overhead
>>    Index-based: Adds overhead in the hot path due to maintaining indexes;
>>    extra memory is needed during the write path and compaction.
>>    Snapshot-based: No impact on the hot path.
>>
>> 2. Extra disk usage when repair is not running
>>    Index-based: Requires additional disk space to store persistent indexes.
>>    Snapshot-based: No extra disk space needed.
>>
>> 3. Extra disk usage during repair
>>    Index-based: Minimal or no additional disk usage.
>>    Snapshot-based: Requires additional disk space for snapshots.
>>
>> 4. Fine-grained repair to deal with emergency situations / topology changes
>>    Index-based: Supports fine-grained repairs by targeting specific index
>>    ranges. This allows repair to be retried on smaller data sets, enabling
>>    incremental progress when repairing the entire table. This is especially
>>    helpful when there are down nodes or topology changes during repair,
>>    which are common in day-to-day operations.
>>    Snapshot-based: Coordination across all nodes is required over a long
>>    period of time. For each round of repair, if all replica nodes are down
>>    or if there is a topology change, the data ranges that were not covered
>>    will need to be repaired in the next round.
>>
>> 5. Validating data used in reads directly
>>    Index-based: Verifies index content, not actual data—may miss
>>    low-probability errors like bit flips.
>>    Snapshot-based: Verifies actual data content, providing stronger
>>    correctness guarantees.
>>
>> 6. Extra data scan during inconsistency detection
>>    Index-based: Since the data covered by certain indexes is not guaranteed
>>    to be fully contained within a single node as the topology changes, some
>>    data scans may be wasted.
>>    Snapshot-based: No extra data scan.
>>
>> 7. The overhead of actual data repair after an inconsistency is detected
>>    Index-based: Only indexes are streamed to 

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-07 Thread Runtian Liu
t; you’re looking at 200 range scans.
> > > >
> > > > Now, you may argue that you can merge the duplicate view scans into
> a single scan while you repair all token ranges in parallel. I’m skeptical
> that’s going to be achievable in practice, but even if it is, we’re now
> talking about the view replica hypothetically doing a pairwise repair with
> every other replica in the cluster at the same time. Neither of these
> options is workable.
> > > >
> > > > Let’s take a step back though, because I think we’re getting lost in
> the weeds.
> > > >
> > > > The repair design in the CEP has some high level concepts that make
> a lot of sense, the idea of repairing a grid is really smart. However, it
> has some significant drawbacks that remain unaddressed. I want this CEP to
> succeed, and I know Jon does too, but the snapshot repair design is not a
> viable path forward. It’s the first iteration of a repair design. We’ve
> proposed a second iteration, and we’re open to a third iteration. This part
> of the CEP process is meant to identify and address shortcomings, I don’t
> think that continuing to dissect the snapshot repair design is making
> progress in that direction.
> > > >
> > > > On Wed, Jun 4, 2025, at 2:04 PM, Runtian Liu wrote:
> > > > > >  We potentially have to do it several times on each node,
> depending on the size of the range. Smaller ranges increase the size of the
> board exponentially, larger ranges increase the number of SSTables that
> would be involved in each compaction.
> > > > > As described in the CEP example, this can be handled in a single
> round of repair. We first identify all the points in the grid that require
> repair, then perform anti-compaction and stream data based on a second scan
> over those identified points. This applies to the snapshot-based
> solution—without an index, repairing a single point in that grid requires
> scanning the entire base table partition (token range). In contrast, with
> the index-based solution—as in the example you referenced—if a large block
> of data is corrupted, even though the index is used for comparison, many
> key mismatches may occur. This can lead to random disk access to the
> original data files, which could cause performance issues. For the case you
> mentioned for the snapshot-based solution, it should not take months to
> repair all the data; one round of repair should be enough. The actual repair
> phase is split from the detection phase.
> > > > >
> > > > >
> > > > > On Thu, Jun 5, 2025 at 12:12 AM Jon Haddad <
> j...@rustyrazorblade.com> wrote:
> > > > >> > This isn’t really the whole story. The amount of wasted scans
> on index repairs is negligible. If a difference is detected with snapshot
> repairs though, you have to read the entire partition from both the view
> and base table to calculate what needs to be fixed.
> > > > >>
> > > > >> You nailed it.
> > > > >>
> > > > >> When the base table is converted to a view, and sent to the view,
> the information we have is that one of the view's partition keys needs a
> repair.  That's going to be different from the partition key of the base
> table.  As a result, on the base table, for each affected range, we'd have
> to issue another compaction across the entire set of sstables that could
> have the data the view needs (potentially many GB), in order to send over
> the corrected version of the partition, then send it over to the view.
> Without an index in place, we have to do yet another scan, per-affected
> range.
> > > > >>
> > > > >> Consider the case of a single corrupted SSTable on the view
> that's removed from the filesystem, or the data is simply missing after
> being restored from an inconsistent backup.  It presumably contains lots of
> partitions, which maps to base partitions all over the cluster, in a lot of
> different token ranges.  For every one of those ranges (hundreds, to tens
> of thousands of them given the checkerboard design), when finding the
> missing data in the base, you'll have to perform a compaction across all
> the SSTables that potentially contain the missing data just to rebuild the
> view-oriented partitions that need to be sent to the view.  The complexity
> of this operation can be looked at as O(N*M) where N and M are the number
> of ranges in the base table and the view affected by the corruption,
> respectively.  Without an index in place, finding the missing data is very
> expensive.  We potentially have to do it several times on each node,
>

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-11 Thread Runtian Liu
> In the case of a missed update, we'll have a new value and we can send a
tombstone to the view with the timestamp of the most recent update.

> then something has gone wrong and we should issue a tombstone using the
paxos repair timestamp as the tombstone timestamp.

The current MV implementation uses “strict liveness” to determine whether a
row is live. I believe that using regular tombstones during repair could
cause problems. For example, consider a base table with schema (pk, ck, v1,
v2) and a materialized view with schema (v1, pk, ck) -> v2. If, for some
reason, we detect an extra row in the MV and delete it using a tombstone
with the latest update timestamp, we may run into issues. Suppose we later
update the base table’s v1 field to match the MV row we previously deleted,
and the v2 value now has an older timestamp. In that case, the previously
issued tombstone could still shadow the v2 column, which is unintended.
That is why I was asking if we are going to introduce a new kind of
tombstone. I’m not sure if this is the only edge case—there may be other
issues as well. I’m also unsure whether we should redesign the tombstone
handling for MVs, since that would involve changes to the storage engine.
To minimize impact there, the original proposal was to rebuild the affected
ranges using anti-compaction, just to be safe.
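
As a concrete illustration of that edge case, here is a hypothetical CQL timeline.
Schema names and timestamps are invented, and it assumes non-key base columns reach
the view carrying their original base-table timestamps.

    -- Illustrative schema only.
    CREATE TABLE ks.base (pk int, ck int, v1 text, v2 text, PRIMARY KEY (pk, ck));
    CREATE MATERIALIZED VIEW ks.mv AS
      SELECT v1, pk, ck, v2 FROM ks.base
      WHERE v1 IS NOT NULL AND pk IS NOT NULL AND ck IS NOT NULL
      PRIMARY KEY (v1, pk, ck);

    -- t=10: user write. View row ('a', 1, 1) is created; its v2 cell keeps timestamp 10.
    INSERT INTO ks.base (pk, ck, v1, v2) VALUES (1, 1, 'a', 'x') USING TIMESTAMP 10;

    -- t=20: v1 changes, so the 'a' view row should disappear. Suppose this view update
    -- is lost, leaving ('a', 1, 1) behind as an extra row in the MV.
    UPDATE ks.base USING TIMESTAMP 20 SET v1 = 'b' WHERE pk = 1 AND ck = 1;

    -- Repair, in the regular-tombstone approach discussed here, would delete the extra
    -- view row ('a', 1, 1) with a row tombstone at the latest update timestamp, t=20.

    -- t=30: v1 is set back to 'a'. The view row ('a', 1, 1) should reappear with
    -- liveness at t=30, but its v2 cell still carries timestamp 10, which is below the
    -- repair tombstone at t=20, so v2 remains unintentionally shadowed in the view.
    UPDATE ks.base USING TIMESTAMP 30 SET v1 = 'a' WHERE pk = 1 AND ck = 1;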

On Wed, Jun 11, 2025 at 1:20 AM Blake Eggleston 
wrote:

>  Extra row in MV (assuming the tombstone is gone in the base table) — how
> should we fix this?
>
>
>
> This would mean that the base table had either updated or deleted a row
> and the view didn't receive the corresponding delete.
>
> In the case of a missed update, we'll have a new value and we can send a
> tombstone to the view with the timestamp of the most recent update. Since
> timestamps issued by paxos and accord writes are always increasing
> monotonically and don't have collisions, this is safe.
>
> In the case of a row deletion, we'd also want to send a tombstone with the
> same timestamp, however since tombstones can be purged, we may not have
> that information and would have to treat it like the view has a higher
> timestamp than the base table.
>
> Inconsistency (timestamps don’t match) — it’s easy to fix when the base
> table has higher timestamps, but how do we resolve it when the MV columns
> have higher timestamps?
>
>
> There are 2 ways this could happen. First is that a write failed and paxos
> repair hasn't completed it, which is expected, and the second is a
> replication bug or base table data loss. You'd need to compare the view
> timestamp to the paxos repair history to tell which it is. If the view
> timestamp is higher than the most recent paxos repair timestamp for the
> key, then it may just be a failed write and we should do nothing. If the
> view timestamp is less than the most recent paxos repair timestamp for that
> key and higher than the base timestamp, then something has gone wrong and
> we should issue a tombstone using the paxos repair timestamp as the
> tombstone timestamp. This is safe to do because the paxos repair timestamps
> act as a low bound for ballots paxos will process, so it wouldn't be
> possible for a legitimate write to be shadowed by this tombstone.
>
> Do we need to introduce a new kind of tombstone to shadow the rows in the
> MV for cases 2 and 3? If yes, how will this tombstone work? If no, how
> should we fix the MV data?
>
>
> No, a normal tombstone would work.
>
> On Tue, Jun 10, 2025, at 2:42 AM, Runtian Liu wrote:
>
> Okay, let’s put the efficiency discussion on hold for now. I want to make
> sure the actual repair process after detecting inconsistencies will work
> with the index-based solution.
>
> When a mismatch is detected, the MV replica will need to stream its index
> file to the base table replica. The base table will then perform a
> comparison between the two files.
>
> There are three cases we need to handle:
>
>1.
>
>Missing row in MV — this is straightforward; we can propagate the data
>to the MV.
>2.
>
>Extra row in MV (assuming the tombstone is gone in the base table) —
>how should we fix this?
>3.
>
>Inconsistency (timestamps don’t match) — it’s easy to fix when the
>base table has higher timestamps, but how do we resolve it when the MV
>columns have higher timestamps?
>
> Do we need to introduce a new kind of tombstone to shadow the rows in the
> MV for cases 2 and 3? If yes, how will this tombstone work? If no, how
> should we fix the MV data?
>
> On Mon, Jun 9, 2025 at 11:00 AM Blake Eggleston 
> wrote:
>
>
> > hopefully we can come up with a solution that everyone agrees on.
>
> I’m sure we can, I think we’ve 

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-11 Thread Runtian Liu
The current design leverages strict liveness to shadow the old view row.
When the view-indexed value changes from 'a' to 'b', no tombstone is
written; instead, the old row is marked as expired by updating its liveness
info with the timestamp of the change. If the column is later set back to
'a', the view row is re-inserted with a new, non-expired liveness info
reflecting the latest timestamp.
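
A compact sketch of that sequence, using the (pk, ck, v1, v2) base table and
(v1, pk, ck) view shape discussed in this thread. Timestamps are invented, and the
comments describe internal view-side state rather than anything issued through CQL.

    -- Assumed shape: base table ks.base(pk, ck, v1, v2), view keyed by (v1, pk, ck).

    -- t=1: creates view row ('a', 1, 1) with liveness timestamp 1.
    INSERT INTO ks.base (pk, ck, v1, v2) VALUES (1, 1, 'a', 'x') USING TIMESTAMP 1;

    -- t=2: no tombstone is written for the 'a' view row; its liveness info is rewritten
    -- as expired at t=2, and a new live view row ('b', 1, 1) is created at t=2.
    UPDATE ks.base USING TIMESTAMP 2 SET v1 = 'b' WHERE pk = 1 AND ck = 1;

    -- t=3: the 'a' view row is re-inserted with fresh, non-expired liveness at t=3, so
    -- it is visible again under strict liveness.
    UPDATE ks.base USING TIMESTAMP 3 SET v1 = 'a' WHERE pk = 1 AND ck = 1;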

To delete an extra row in the materialized view (MV), we can likely use the
same approach—marking it as shadowed by updating the liveness info.
However, in the case of inconsistencies where a column in the MV has a
higher timestamp than the corresponding column in the base table, this
row-level liveness mechanism is insufficient.

Even for the case where we delete the row by marking its liveness info as
expired during repair, there are concerns. Since this introduces a data
mutation as part of the repair process, it’s unclear whether there could be
edge cases we’re missing. This approach may risk unexpected side effects if
the repair logic is not carefully aligned with write path semantics.


On Thu, Jun 12, 2025 at 3:59 AM Blake Eggleston 
wrote:

> That’s a good point, although as described I don’t think that could ever
> work properly, even in normal operation. Either we’re misunderstanding
> something, or this is a flaw in the current MV design.
>
> Assuming changing the view indexed column results in a tombstone being
> applied to the view row for the previous value, if we wrote the other base
> columns (the non view indexed ones) to the view with the same timestamps
> they have on the base, then changing the view indexed value from ‘a’ to
> ‘b’, then back to ‘a’ would always cause this problem. I think you’d need
> to always update the column timestamps on the view to be >= the view column
> timestamp on the base
>
> On Tue, Jun 10, 2025, at 11:38 PM, Runtian Liu wrote:
>
> > In the case of a missed update, we'll have a new value and we can send
> a tombstone to the view with the timestamp of the most recent update.
>
> > then something has gone wrong and we should issue a tombstone using the
> paxos repair timestamp as the tombstone timestamp.
>
> The current MV implementation uses “strict liveness” to determine whether
> a row is live. I believe that using regular tombstones during repair could
> cause problems. For example, consider a base table with schema (pk, ck, v1,
> v2) and a materialized view with schema (v1, pk, ck) -> v2. If, for some
> reason, we detect an extra row in the MV and delete it using a tombstone
> with the latest update timestamp, we may run into issues. Suppose we later
> update the base table’s v1 field to match the MV row we previously deleted,
> and the v2 value now has an older timestamp. In that case, the previously
> issued tombstone could still shadow the v2 column, which is unintended.
> That is why I was asking if we are going to introduce a new kind of
> tombstone. I’m not sure if this is the only edge case—there may be other
> issues as well. I’m also unsure whether we should redesign the tombstone
> handling for MVs, since that would involve changes to the storage engine.
> To minimize impact there, the original proposal was to rebuild the affected
> ranges using anti-compaction, just to be safe.
>
> On Wed, Jun 11, 2025 at 1:20 AM Blake Eggleston 
> wrote:
>
>
>  Extra row in MV (assuming the tombstone is gone in the base table) — how
> should we fix this?
>
>
>
> This would mean that the base table had either updated or deleted a row
> and the view didn't receive the corresponding delete.
>
> In the case of a missed update, we'll have a new value and we can send a
> tombstone to the view with the timestamp of the most recent update. Since
> timestamps issued by paxos and accord writes are always increasing
> monotonically and don't have collisions, this is safe.
>
> In the case of a row deletion, we'd also want to send a tombstone with the
> same timestamp, however since tombstones can be purged, we may not have
> that information and would have to treat it like the view has a higher
> timestamp than the base table.
>
> Inconsistency (timestamps don’t match) — it’s easy to fix when the base
> table has higher timestamps, but how do we resolve it when the MV columns
> have higher timestamps?
>
>
> There are 2 ways this could happen. First is that a write failed and paxos
> repair hasn't completed it, which is expected, and the second is a
> replication bug or base table data loss. You'd need to compare the view
> timestamp to the paxos repair history to tell which it is. If the view
> timestamp is higher than the most recent paxos repair timestamp for the
> key, then it may just be a failed