> I think requiring a rebuild is a deal breaker for most teams. In most instances it would also mean having to expand the cluster to handle the additional disk requirements. It turns an inconsistency problem into a major operational headache that can take weeks to resolve.
Agreed. The rebuild would not be required during normal operations when the cluster is properly maintained (i.e. regular repair) - only in catastrophic situations. This is also the case for ordinary tables currently: if there's data loss, then restoring from a backup is needed. Restoring from a backup could be a possible alternative that avoids requiring a rebuild in this extraordinary scenario.

On Thu, May 15, 2025 at 10:14 AM Jon Haddad <j...@rustyrazorblade.com> wrote:

> I think requiring a rebuild is a deal breaker for most teams. In most instances it would also mean having to expand the cluster to handle the additional disk requirements. It turns an inconsistency problem into a major operational headache that can take weeks to resolve.
>
> On Thu, May 15, 2025 at 7:02 AM Paulo Motta <pauloricard...@gmail.com> wrote:
>
>> > There's bi-directional entropy issues with MV's - either orphaned view data or missing view data; that's why you kind of need a "bi-directional ETL" to make sure the 2 agree with each other. While normal repair would resolve the "missing data in MV" case, it wouldn't resolve the "data in MV that's not in base table anymore" case, which afaict all base consistency approaches (status quo, PaxosV2, Accord, Mutation Tracking) are vulnerable to.
>>
>> I don't think that bi-directional reconciliation should be a requirement when the base table is assumed to be the source of truth, as stated in the CEP doc.
>>
>> I think the main issue with the current MV implementation is that each view replica is independently replicated by the base replica, before the base write is acknowledged.
>>
>> This creates a correctness issue in the write path, because a view update can be created for a write that was not accepted by the coordinator in the following scenario:
>>
>> N=RF=3, CL=ONE
>> - Update U is propagated to view replica V; the coordinator, which is also base replica B, dies before acknowledging the base table write to the client. Now U exists in V but not in B.
>>
>> I think in order to address this, the view update should be propagated to the view replicas *after* the base write is accepted by all or a majority of base replicas. This is where I think mutation tracking could probably help.
>>
>> I think this would ensure that, as long as there's no data loss or bit-rot, the base and view can be repaired independently. When there is data loss or bit-rot in either the base table or the view, then it is the same as 2i today: a rebuild is required.
>>
>> > It'd be correct (if operationally disappointing) to be able to just say "if you have data loss in your base table you need to rebuild the corresponding MV's", but the problem is operators aren't always going to know when that data loss occurs. Not everything is as visible as a lost quorum of replicas or blown up SSTables.
>>
>> I think there are opportunities to improve rebuild speed, assuming the base table is the source of truth. For example, rebuild only the subranges where data loss is detected.
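A minimal sketch of the ordering Paulo proposes, in Python; all names here (Replica, derive_view_update, coordinate_write) are illustrative stand-ins, not Cassandra internals:

```python
# Minimal sketch of the proposed write path: emit view updates only after
# the base write has been accepted by a majority of base replicas.

class Replica:
    def __init__(self, name, up=True):
        self.name, self.up, self.data = name, up, []

    def apply(self, update):
        if self.up:
            self.data.append(update)
        return self.up  # ack only if the replica is alive

def derive_view_update(mutation):
    pk, ck, v = mutation
    return (ck, pk, v)  # view keyed by (ck, pk), as in the CEP example

def coordinate_write(mutation, base_replicas, view_replicas):
    majority = len(base_replicas) // 2 + 1
    acks = sum(r.apply(mutation) for r in base_replicas)
    if acks < majority:
        # Base write not durable: no view update is ever emitted, so the
        # failure scenario above (U in V but not in B) cannot occur.
        raise TimeoutError("base write rejected")
    view_update = derive_view_update(mutation)
    for r in view_replicas:
        r.apply(view_update)  # stragglers are repairable from the base
    return "ACK"  # client ack implies a base majority holds the write

base = [Replica("b1"), Replica("b2"), Replica("b3", up=False)]
view = [Replica("v1"), Replica("v2"), Replica("v3")]
coordinate_write((1, 2, 42), base, view)  # succeeds: 2 of 3 base acks
```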
>> On Thu, May 15, 2025 at 8:07 AM Josh McKenzie <jmcken...@apache.org> wrote:
>>
>>> There's bi-directional entropy issues with MV's - either orphaned view data or missing view data; that's why you kind of need a "bi-directional ETL" to make sure the 2 agree with each other. While normal repair would resolve the "missing data in MV" case, it wouldn't resolve the "data in MV that's not in the base table anymore" case, which afaict all base consistency approaches (status quo, PaxosV2, Accord, Mutation Tracking) are vulnerable to.
>>>
>>> It'd be correct (if operationally disappointing) to be able to just say "if you have data loss in your base table you need to rebuild the corresponding MV's", but the problem is operators aren't always going to know when that data loss occurs. Not everything is as visible as a lost quorum of replicas or blown up SSTables.
>>>
>>> On Wed, May 14, 2025, at 2:38 PM, Blake Eggleston wrote:
>>>
>>> Maybe, I'm not really familiar enough with how "classic" MV repair works to say. You can't mix normal repair and mutation reconciliation in the current incarnation of mutation tracking though, so I wouldn't assume it would work with MVs.
>>>
>>> On Wed, May 14, 2025, at 11:29 AM, Jon Haddad wrote:
>>>
>>> In the case of bitrot / losing an SSTable, wouldn't a normal repair (just the MV against the other nodes) resolve the issue?
>>>
>>> On Wed, May 14, 2025 at 11:27 AM Blake Eggleston <bl...@ultrablake.com> wrote:
>>>
>>> Mutation tracking is definitely an approach you could take for MVs. Mutation reconciliation could be extended to ensure all changes have been replicated to the views. When a base table received a mutation w/ an id, it would generate a view update. If you block marking a given mutation id as reconciled until it's been fully replicated to the base table and its view updates have been fully replicated to the views, then all view updates will eventually be applied as part of the log reconciliation process.
>>>
>>> A mutation tracking implementation would also allow you to be more flexible with the types of consistency levels you can work with, allowing users to do things like use LOCAL_QUORUM without leaving themselves open to introducing view inconsistencies.
>>>
>>> That would more or less eliminate the need for any MV repair in normal usage, but wouldn't address how to repair issues caused by bugs or data loss, though you may be able to do something with comparing the latest mutation ids for the base tables and their view ranges.
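A minimal sketch of the gating Blake describes: a mutation id retires only once the base mutation and its derived view updates are all fully replicated. The ReconciliationLog class and callback names are hypothetical, not the CEP-45 API:

```python
# Sketch: a mutation id is only marked reconciled once the base mutation
# *and* the view updates it generated are fully replicated, so the log
# reconciliation process keeps retrying any outstanding view updates.

from dataclasses import dataclass, field

@dataclass
class MutationStatus:
    base_replicated: bool = False
    pending_view_updates: set = field(default_factory=set)

class ReconciliationLog:
    def __init__(self):
        self.status = {}

    def on_base_mutation(self, mutation_id, view_update_ids):
        self.status[mutation_id] = MutationStatus(
            pending_view_updates=set(view_update_ids))

    def on_base_fully_replicated(self, mutation_id):
        self.status[mutation_id].base_replicated = True

    def on_view_update_replicated(self, mutation_id, view_update_id):
        self.status[mutation_id].pending_view_updates.discard(view_update_id)

    def reconciled(self, mutation_id):
        s = self.status[mutation_id]
        # Gate: the id cannot retire until both conditions hold.
        return s.base_replicated and not s.pending_view_updates

log = ReconciliationLog()
log.on_base_mutation("m1", {"vu1", "vu2"})
log.on_base_fully_replicated("m1")
log.on_view_update_replicated("m1", "vu1")
assert not log.reconciled("m1")        # vu2 still outstanding
log.on_view_update_replicated("m1", "vu2")
assert log.reconciled("m1")
```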
>>> On Wed, May 14, 2025, at 10:19 AM, Paulo Motta wrote:
>>>
>>> I don't see mutation tracking [1] mentioned in this thread or in the CEP-48 description. Not sure this would fit into the scope of this initial CEP, but I have a feeling that mutation tracking could be potentially helpful to reconcile base tables and views?
>>>
>>> For example, when both base and view updates are acknowledged, then this could be somehow persisted in the view sstables mutation tracking summary [2] or similar metadata? Then these updates would be skipped during view repair, considerably reducing the amount of work needed, since only un-acknowledged view updates would need to be reconciled.
>>>
>>> [1] - https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking
>>> [2] - https://issues.apache.org/jira/browse/CASSANDRA-20336
>>>
>>> On Wed, May 14, 2025 at 12:59 PM Paulo Motta <pauloricard...@gmail.com> wrote:
>>>
>>> > - The first thing I notice is that we're talking about repairing the entire table across the entire cluster all in one go. It's been a *long* time since I tried to do a full repair of an entire table without using sub-ranges. Is anyone here even doing that with clusters of non-trivial size? How long does a full repair of a 100 node cluster with 5TB / node take even in the best case scenario?
>>>
>>> I haven't checked the CEP yet so I may be missing something, but I think this effort doesn't need to be conflated with dense node support, to make this more approachable. I think prospective users would be OK with overprovisioning to make this feasible if needed. We could perhaps have size guardrails that limit the maximum table size per node when MVs are enabled. Ideally we should make it work for dense nodes if possible, but this shouldn't be a reason not to support the feature if it can be made to work reasonably with more resources.
>>>
>>> I think the main issue with the current MV is about correctness, and the ultimate goal of the CEP must be to provide correctness guarantees, even if it has an inevitable performance hit. The performance of the repair process is definitely an important consideration, and it would be helpful to have some benchmarks to get an idea of how long this repair process would take for lightweight and denser tables.
>>>
>>> On Wed, May 14, 2025 at 7:28 AM Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>
>>> I've got several concerns around this repair process.
>>>
>>> - The first thing I notice is that we're talking about repairing the entire table across the entire cluster all in one go. It's been a *long* time since I tried to do a full repair of an entire table without using sub-ranges. Is anyone here even doing that with clusters of non-trivial size? How long does a full repair of a 100 node cluster with 5TB / node take even in the best case scenario?
>>>
>>> - Even in a scenario where sub-range repair is supported, you'd have to scan *every* sstable on the base table in order to construct a Merkle tree, as we don't know in advance which SSTables contain the ranges that the MV covers. That means a subrange repair would have to do a *ton* of IO. Anyone who's mis-configured a sub-range incremental repair to use too many ranges will probably be familiar with how long it can take to anti-compact a bunch of SSTables. With MV sub-range repair, we'd have even more overhead, because we'd have to read in every SSTable, every time. If we do 10 subranges, we'll do 10x the IO of a normal repair. I don't think this is practical.
>>>
>>> - Merkle trees make sense when you're comparing tables with the same partition key, but I don't think they do when you're transforming a base table to a view. When there's a mismatch, what's transferred? We have a range of data in the MV, but now we have to go find that from the base table. That means the Merkle tree needs to not just track the hashes and ranges, but the original keys it was transformed from, in order to go find all of the matching partitions in that mismatched range. Either that or we end up rescanning the entire dataset in order to find the mismatches.
>>>
>>> Jon
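A minimal sketch of what Jon's last point implies for the tree format: each view-side Merkle leaf would have to carry the base partition keys that fed it, so a mismatched range can be traced back to base partitions without a full rescan. ViewMerkleLeaf is illustrative, not the proposed structure:

```python
# Sketch: a view-repair Merkle leaf that records the *base* partition keys
# hashed into it, alongside the usual range digest, so a mismatch can be
# resolved by re-reading only those base partitions.

import hashlib

class ViewMerkleLeaf:
    def __init__(self, token_range):
        self.token_range = token_range   # view token range covered
        self.digest = hashlib.md5()
        self.source_keys = set()         # extra bookkeeping vs. a normal
                                         # repair leaf

    def add(self, view_row, base_pk):
        self.digest.update(repr(view_row).encode())
        self.source_keys.add(base_pk)

# On a mismatch, the coordinator re-reads only these base partitions:
leaf = ViewMerkleLeaf((0, 100))
leaf.add(view_row=(1, 3, 9), base_pk=3)   # MV row is (ck, pk, v)
leaf.add(view_row=(1, 4, 13), base_pk=4)
print(leaf.source_keys)                   # {3, 4} -> partitions to rescan
```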
>>> On Tue, May 13, 2025 at 10:29 AM Runtian Liu <curly...@gmail.com> wrote:
>>>
>>> > Looking at the details of the CEP it seems to describe Paxos as PaxosV1, but PaxosV2 works slightly differently (it can read during the prepare phase). I assume that supporting Paxos means supporting both V1 and V2 for materialized views?
>>>
>>> We are going to support Paxos V2. The CEP is not clear on that; we will update it to clarify.
>>>
>>> It looks like the online portion is now fairly well understood. For the offline repair part, I see two main concerns: one around the scalability of the proposed approach, and another regarding how it handles tombstones.
>>>
>>> *Scalability:*
>>>
>>> I have added a *section* <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-48%3A+First-Class+Materialized+View+Support#CEP48:FirstClassMaterializedViewSupport-MVRepairVSFullRepairwithanExample> in the CEP with an example comparing full repair and the proposed MV repair; the overall scalability should not be a problem.
>>>
>>> Consider a dataset with tokens from 1 to 4 and a cluster of 4 nodes, where each node owns one token. The base table uses (pk, ck) as its primary key, while the materialized view (MV) uses (ck, pk) as its primary key. Both tables include a value column v, which allows us to correlate rows between them. The dataset consists of 16 records, distributed as follows:
>>>
>>> *Base table* (pk, ck, v)
>>> (1, 1, 1), (1, 2, 2), (1, 3, 3), (1, 4, 4) // N1
>>> (2, 1, 5), (2, 2, 6), (2, 3, 7), (2, 4, 8) // N2
>>> (3, 1, 9), (3, 2, 10), (3, 3, 11), (3, 4, 12) // N3
>>> (4, 1, 13), (4, 2, 14), (4, 3, 15), (4, 4, 16) // N4
>>>
>>> *Materialized view* (ck, pk, v)
>>> (1, 1, 1), (1, 2, 5), (1, 3, 9), (1, 4, 13) // N1
>>> (2, 1, 2), (2, 2, 6), (2, 3, 10), (2, 4, 14) // N2
>>> (3, 1, 3), (3, 2, 7), (3, 3, 11), (3, 4, 15) // N3
>>> (4, 1, 4), (4, 2, 8), (4, 3, 12), (4, 4, 16) // N4
>>>
>>> The table below compares one round of full repair with one round of MV repair. As shown, both scan the same total number of rows. However, MV repair has higher time complexity because its Merkle tree processes each row more intensively. To avoid all nodes scanning the entire table simultaneously, MV repair should use a snapshot-based approach, similar to normal repair with the --sequential option. The time complexity increase compared to full repair can be found in the "Complexity and Memory Management" section.
>>>
>>> n: number of rows
>>> d: depth of one Merkle tree for MV repair
>>> d': depth of one Merkle tree for full repair
>>> r: number of split ranges
>>>
>>> Assuming each leaf node covers the same number of rows, 2^d' = (2^d) * r.
>>>
>>> 1 Round Merkle Tree Building Complexity:
>>>
>>>                      Full Repair      MV Repair
>>> Time complexity      O(n)             O(n * d * log(r))
>>> Space complexity     O((2^d') * r)    O((2^d) * r^2) = O((2^d') * r)
>>>
>>> We can see that the space complexity is the same, while MV repair has higher time complexity. However, this should not pose a significant issue in production, as the Merkle tree depth and the number of split ranges are typically not large.
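As a quick check of the identities above, with illustrative numbers:

```latex
% Worked check of the complexity identities above (illustrative numbers).
\[
2^{d'} = 2^{d}\, r \;\iff\; d' = d + \log_2 r
\quad\Longrightarrow\quad
O\!\left(2^{d} r^{2}\right) = O\!\left(2^{d'} r\right).
\]
% Example: with $d = 10$ and $r = 8$, $d' = 13$, and
% $2^{10} \cdot 8^{2} = 65536 = 2^{13} \cdot 8$, so the trees occupy the
% same space, while build time grows from $O(n)$ by roughly a factor of
% $d \log_2 r = 30$.
```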
>>> *Tombstone:*
>>>
>>> The current proposal focuses on rebuilding the MV for a granular token range where a mismatch is detected, rather than rebuilding the entire MV token range. Since the MV is treated as a regular table, standard full or incremental repair processes should still apply to both the base and MV tables to keep their replicas in sync.
>>>
>>> Regarding tombstones, if we introduce special tombstone types or handling mechanisms for the MV table, we may be able to support tombstone synchronization between the base table and the MV. I plan to spend more time exploring whether we can introduce changes to the base table that enable this synchronization.
>>>
>>> On Mon, May 12, 2025 at 11:35 AM Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:
>>>
>>> > Like something doesn't add up here because if it always includes the base table's primary key columns that means
>>>
>>> The requirement for materialized views (MVs) to include the base table's primary key appears to be primarily a syntactic constraint specific to Apache Cassandra. For instance, in DynamoDB, the DDL for defining a Global Secondary Index does not mandate inclusion of the base table's primary key. This suggests that the syntax requirement in Cassandra could potentially be relaxed in the future (outside the scope of this CEP). As Benedict noted, the base table's primary key is optional when querying a materialized view.
>>>
>>> Jaydeep
>>>
>>> On Mon, May 12, 2025 at 10:45 AM Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>
>>> > Or compaction hasn't made a mistake, or cell merge reconciliation hasn't made a mistake, or volume bitrot hasn't caused you to lose a file.
>>> > Repair isn't just about "have all transaction commits landed". It's "is the data correct N days after it's written".
>>>
>>> Don't forget about restoring from a backup.
>>>
>>> Is there a way we could do some sort of hybrid compaction + incremental repair? Maybe have the MV verify its view while it's compacting, and when it's done, mark the view's SSTable as repaired? Then the repair process would only need to do an MV to MV repair.
>>>
>>> Jon
>>>
>>> On Mon, May 12, 2025 at 9:37 AM Benedict Elliott Smith <bened...@apache.org> wrote:
>>>
>>> > Like something doesn't add up here because if it always includes the base table's primary key columns that means they could be storage attached by just forbidding additional columns and there doesn't seem to be much utility in including additional columns in the primary key?
>>>
>>> You can re-order the keys, and they only need to be a part of the primary key, not the partition key. I think you can specify an arbitrary order for the keys also, so you can change the effective sort order. So, the basic idea is you stipulate something like PRIMARY KEY ((v1),(ck1,pk1)).
>>>
>>> This is basically a global index, with the restriction to single columns as keys only because we cannot cheaply read-before-write for eventually consistent operations. This restriction can easily be relaxed for Paxos and Accord based implementations, which can also safely include additional keys.
>>>
>>> That said, I am not at all sure why they are called materialised views if we don't support including any other data besides the lookup column and the primary key. We should really rename them once they work, both to make some sense and to break with the historical baggage.
>>>
>>> > I think this can be represented as a tombstone which can always be fetched from the base table on read or maybe some other arrangement? I agree it can't feasibly be represented as an enumeration of the deletions, at least not synchronously, and doing it async has its own problems.
>>>
>>> If the base table must be read on read of an index/view, then I think this proposal is approximately linearizable for the view as well (though I do not at all warrant this statement). You still need to propagate this eventually so that the views can clean up. This also makes reads 2RT on read, which is rather costly.
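A toy model of the "global index" reading of PRIMARY KEY ((v1),(ck1,pk1)): the view is partitioned by the looked-up value with the base primary key appended for uniqueness, and the overwrite path shows the read-before-write Benedict mentions. Plain dicts stand in for tables:

```python
# Toy model: view keyed by the looked-up value first, base primary key
# appended to keep view rows unique. Purely illustrative data structures.

from collections import defaultdict

base = {}                  # (pk1, ck1) -> v1
view = defaultdict(set)    # v1 -> {(ck1, pk1)}  ("partition" by value)

def write(pk1, ck1, v1):
    old = base.get((pk1, ck1))
    if old is not None:
        view[old].discard((ck1, pk1))   # the read-before-write step that
                                        # is expensive for EC operations
    base[(pk1, ck1)] = v1
    view[v1].add((ck1, pk1))

write(pk1=1, ck1=2, v1=42)
write(pk1=3, ck1=1, v1=42)
print(sorted(view[42]))    # lookup by value: [(1, 3), (2, 1)]
write(pk1=1, ck1=2, v1=7)  # overwrite relocates the view entry
print(sorted(view[42]))    # [(1, 3)]
```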
>>> On 12 May 2025, at 16:10, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>
>>> Hi,
>>>
>>> I think it's worth taking a step back and looking at the current MV restrictions, which are pretty onerous.
>>>
>>> A view must have a primary key, and that primary key must conform to the following restrictions:
>>>
>>> - it must contain all the primary key columns of the base table. This ensures that every row of the view corresponds to exactly one row of the base table.
>>> - it can only contain a single column that is not a primary key column in the base table.
>>>
>>> At that point, what exactly is the value in including anything except the original primary key in the MV's primary key columns, unless you are using an ordered partitioner so you can iterate based on the leading primary key columns?
>>>
>>> Like something doesn't add up here, because if it always includes the base table's primary key columns, that means they could be storage attached by just forbidding additional columns, and there doesn't seem to be much utility in including additional columns in the primary key?
>>>
>>> I'm not that clear on how much better it is to look something up in the MV vs just looking at the base table or some non-materialized view of it. How exactly are these MVs supposed to be used and what value do they provide?
>>>
>>> Jeff Jirsa wrote:
>>>
>>> > There's 2 things in this proposal that give me a lot of pause.
>>>
>>> Runtian Liu pointed out that the CEP is sort of divided into two parts. The first is the online part, which is making reads/writes to MVs safer and more reliable using a transaction system. The second is offline, which is repair.
>>>
>>> The story for the online portion I think is quite strong and worth considering on its own merits.
>>>
>>> The offline portion (repair) sounds a little less feasible to run in production, but I also think that MVs without any mechanism for checking their consistency are not viable to run in production. So it's kind of pay for what you use in terms of the feature?
>>>
>>> It's definitely worth thinking through if there is a way to fix one side of this equation so it works better.
>>>
>>> David Capwell wrote:
>>>
>>> > As far as I can tell, being based off Accord means you don't need to care about repair, as Accord will manage the consistency for you; you can't get out of sync.
>>>
>>> I think a baseline requirement in C* for something to be in production is to be able to run preview repair and validate that the transaction system or any other part of Cassandra hasn't made a mistake. Divergence can have many sources, including Accord.
>>>
>>> Runtian Liu wrote:
>>>
>>> > For the example David mentioned, LWT cannot support. Since LWTs operate on a single token, we'll need to restrict base-table updates to one partition—and ideally one row—at a time. A current MV base-table command can delete an entire partition, but doing so might touch hundreds of MV partitions, making consistency guarantees impossible.
>>>
>>> I think this can be represented as a tombstone which can always be fetched from the base table on read, or maybe some other arrangement? I agree it can't feasibly be represented as an enumeration of the deletions, at least not synchronously, and doing it async has its own problems.
>>>
>>> Ariel
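A toy sketch of the "fetch from the base table on read" idea: the view read confirms each candidate row against the base table and filters orphans, at the cost of the extra round trip noted above. Illustrative only, not the CEP's read path:

```python
# Sketch: instead of synchronously enumerating deletions into the view,
# a view read checks each candidate row against the base table and drops
# rows the base no longer contains (orphans), cleaning them up lazily.

def read_view(v1, view, base):
    live, orphans = [], []
    for (ck1, pk1) in view.get(v1, set()):
        # Second round trip: confirm the base row still maps to this value.
        if base.get((pk1, ck1)) == v1:
            live.append((ck1, pk1))
        else:
            orphans.append((ck1, pk1))
    for key in orphans:
        view[v1].discard(key)   # eventual cleanup, as noted above
    return live

base = {(1, 2): 42}
view = {42: {(2, 1), (9, 9)}}   # (9, 9) is a stale/orphaned entry
print(read_view(42, view, base))  # [(2, 1)]: the orphan is filtered out
```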
>>> On Fri, May 9, 2025, at 4:03 PM, Jeff Jirsa wrote:
>>>
>>> On May 9, 2025, at 12:59 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>
>>> > I am a *big* fan of getting repair really working with MVs. It does seem problematic that the number of Merkle trees will be equal to the number of ranges in the cluster, and repair of MVs would become an all-node operation. How would down nodes be handled, and how many nodes would simultaneously be working to validate a given base table range at once? How many base table ranges could simultaneously be repairing MVs?
>>> >
>>> > If a row containing a column that creates an MV partition is deleted, and the MV isn't updated, then how does the Merkle tree approach propagate the deletion to the MV? The CEP says that anti-compaction would remove extra rows, but I am not clear on how that works. When is anti-compaction performed in the repair process, and what is/isn't included in the outputs?
>>>
>>> I thought about these two points last night after I sent my email.
>>>
>>> There's 2 things in this proposal that give me a lot of pause.
>>>
>>> One is the lack of tombstones / deletions in the Merkle trees, which makes properly dealing with writes/deletes/inconsistency very hard (afaict).
>>>
>>> The second is the reality that repairing a single partition in the base table may repair all hosts/ranges in the MV table, and vice versa. Basically, scanning either base or MV is effectively scanning the whole cluster (modulo what you can avoid in the clean/dirty repaired sets). This makes me really, really concerned with how it scales, and how likely it is to be able to schedule automatically without blowing up.
>>>
>>> The paxos vs accord comments so far are interesting in that I think both could be made to work, but I am very concerned about how the Merkle tree comparisons are likely to work with wide partitions leading to massive fanout in ranges.
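To make the fanout concern concrete, here is a toy illustration reusing Runtian's 4-token example; the owner function stands in for Cassandra's partitioner and replica placement:

```python
# Toy illustration of the fanout Jeff describes, reusing the 4-token
# example above: every row of one base partition lands on a different
# view node, so repairing one base partition touches the whole cluster.

def owner(token):
    return f"N{token}"          # node Nk owns token k in the example

base_partition_pk1 = [(1, 1, 1), (1, 2, 2), (1, 3, 3), (1, 4, 4)]

view_nodes = set()
for pk, ck, v in base_partition_pk1:
    view_nodes.add(owner(ck))   # MV is keyed by ck, so ck decides placement

print(sorted(view_nodes))       # ['N1', 'N2', 'N3', 'N4']: all 4 nodes
```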