> I think in order to address this, the view update should be propagated to the 
> view replicas *after* the base write is accepted by all or a majority of base 
> replicas. This is where I think mutation tracking could probably help.
Yeah, the idea of "don't reflect in the MV until you hit the CL the user 
requested for the base table". It introduces a risk of the base and view 
diverging if you have coordinator death mid-write, where replicas got the base 
data but that 2nd step didn't take place; I think that's why Runtian et al. are 
looking at paxos repair picking up those pieces for you after the fact to get 
you back into consistency. Mutation tracking and Accord both have similar 
guarantees in this space.
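To make that window concrete, here is a minimal, hedged sketch (toy Java, not Cassandra code; all class and method names are invented) of the ordering being discussed: the coordinator only derives and sends view updates after the base write has met the requested CL, and a coordinator crash between the two steps is exactly the gap that paxos repair, mutation tracking or Accord would have to close later.

    import java.util.ArrayList;
    import java.util.List;

    // Toy model of the ordering above: base write first, view fan-out second.
    public class ClGatedViewWrite {
        static class Replica {
            final List<String> rows = new ArrayList<>();
            void apply(String mutation) { rows.add(mutation); }
        }

        public static void main(String[] args) {
            int rf = 3, requiredAcks = 2; // e.g. QUORUM requested by the client
            List<Replica> base = new ArrayList<>();
            List<Replica> view = new ArrayList<>();
            for (int i = 0; i < rf; i++) { base.add(new Replica()); view.add(new Replica()); }

            // Step 1: apply the base mutation and count acks against the requested CL.
            String baseMutation = "pk=1,ck=1,v=1";
            int acks = 0;
            for (Replica r : base) { r.apply(baseMutation); acks++; }
            if (acks < requiredAcks) return; // base write failed; views never touched

            // Step 2: only now derive and send the view update.
            // A coordinator crash between step 1 and step 2 leaves base replicas with
            // the write and the views without it -- the divergence window discussed above.
            String viewUpdate = "ck=1,pk=1,v=1";
            for (Replica r : view) { r.apply(viewUpdate); }

            System.out.println("base=" + base.get(0).rows + " view=" + view.get(0).rows);
        }
    }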

> I think this would ensure that as long as there's no data loss or bit-rot, 
> the base and view can be repaired independently. When there is data loss or 
> bit-rot in either the base table or the view, then it is the same as 2i 
> today: rebuild is required.
And the repair as proposed in the CEP should resolve the bitrot and bug-induced 
data loss cases, I think. It certainly has much higher time complexity, but 
bounding the memory complexity to be comparable with regular repair doesn't 
strike me as a dealbreaker.

On Thu, May 15, 2025, at 10:24 AM, Paulo Motta wrote:
> >  I think requiring a rebuild is a deal breaker for most teams.  In most 
> > instances it would be having to also expand the cluster to handle the 
> > additional disk requirements.  It turns an inconsistency problem into a 
> > major operational headache that can take weeks to resolve.
> 
> Agreed. The rebuild would not be required during normal operations when the 
> cluster is properly maintained (i.e. regular repair) - only in catastrophic 
> situations. This is also the case for ordinary tables currently: if there's 
> data loss, then restoring from a backup is needed. Restoring from a backup 
> could be a possible alternative to avoid a rebuild in this extraordinary 
> scenario.
> 
> On Thu, May 15, 2025 at 10:14 AM Jon Haddad <j...@rustyrazorblade.com> wrote:
>> I think requiring a rebuild is a deal breaker for most teams.  In most 
>> instances it would be having to also expand the cluster to handle the 
>> additional disk requirements.  It turns an inconsistency problem into a 
>> major operational headache that can take weeks to resolve.
>> 
>> 
>> 
>> 
>> 
>> On Thu, May 15, 2025 at 7:02 AM Paulo Motta <pauloricard...@gmail.com> wrote:
>>> > There's bi-directional entropy issues with MV's - either orphaned view 
>>> > data or missing view data; that's why you kind of need a "bi-directional 
>>> > ETL" to make sure the 2 agree with each other. While normal repair would 
>>> > resolve the "missing data in MV" case, it wouldn't resolve the "data in 
>>> > MV that's not in base table anymore" case, which afaict all base 
>>> > consistency approaches (status quo, PaxosV2, Accord, Mutation Tracking) 
>>> > are vulnerable to.
>>> 
>>> I don't think that bi-directional reconciliation should be a requirement, 
>>> when the base table is assumed to be the source of truth as stated in the 
>>> CEP doc.
>>> 
>>> I think the main issue with the current MV implementation is that each view 
>>> replica is independently replicated by the base replica, before the base 
>>> write is acknowledged.
>>> 
>>> This creates a correctness issue in the write path, because a view update 
>>> can be created for a write that was not accepted by the coordinator in the 
>>> following scenario:
>>> 
>>> N=RF=3
>>> CL=ONE
>>> - Update U is propagated to view replica V; the coordinator, which is also 
>>> base replica B, dies before acknowledging the base table write to the 
>>> client. Now U exists in V but not in B.
>>> 
>>> I think in order to address this, the view update should be propagated to 
>>> the view replicas *after* the base write is accepted by all or a majority 
>>> of base replicas. This is where I think mutation tracking could probably help.
>>> 
>>> I think this would ensure that as long as there's no data loss or bit-rot, 
>>> the base and view can be repaired independently. When there is data loss or 
>>> bit-rot in either the base table or the view, then it is the same as 2i 
>>> today: rebuild is required.
>>> 
>>> >  It'd be correct (if operationally disappointing) to be able to just say 
>>> > "if you have data loss in your base table you need to rebuild the 
>>> > corresponding MV's", but the problem is operators aren't always going to 
>>> > know when that data loss occurs. Not everything is as visible as a lost 
>>> > quorum of replicas or blown up SSTables.
>>> 
>>> I think there are opportunities to improve rebuild speed, assuming the base 
>>> table is the source of truth. For example, rebuilding only the subranges 
>>> where data loss is detected.
>>> 
>>> On Thu, May 15, 2025 at 8:07 AM Josh McKenzie <jmcken...@apache.org> wrote:
>>>> There's bi-directional entropy issues with MV's - either orphaned view 
>>>> data or missing view data; that's why you kind of need a "bi-directional 
>>>> ETL" to make sure the 2 agree with each other. While normal repair would 
>>>> resolve the "missing data in MV" case, it wouldn't resolve the "data in MV 
>>>> that's not in base table anymore" case, which afaict all base consistency 
>>>> approaches (status quo, PaxosV2, Accord, Mutation Tracking) are vulnerable 
>>>> to.
>>>> 
>>>> It'd be correct (if operationally disappointing) to be able to just say 
>>>> "if you have data loss in your base table you need to rebuild the 
>>>> corresponding MV's", but the problem is operators aren't always going to 
>>>> know when that data loss occurs. Not everything is as visible as a lost 
>>>> quorum of replicas or blown up SSTables.
>>>> 
>>>> On Wed, May 14, 2025, at 2:38 PM, Blake Eggleston wrote:
>>>>> Maybe, I’m not really familiar enough with how “classic” MV repair works 
>>>>> to say. You can’t mix normal repair and mutation reconciliation in the 
>>>>> current incarnation of mutation tracking though, so I wouldn’t assume it 
>>>>> would work with MVs.
>>>>> 
>>>>> On Wed, May 14, 2025, at 11:29 AM, Jon Haddad wrote:
>>>>>> In the case of bitrot / losing an SSTable, wouldn't a normal repair 
>>>>>> (just the MV against the other nodes) resolve the issue?
>>>>>> 
>>>>>> On Wed, May 14, 2025 at 11:27 AM Blake Eggleston <bl...@ultrablake.com> 
>>>>>> wrote:
>>>>>>> Mutation tracking is definitely an approach you could take for MVs. 
>>>>>>> Mutation reconciliation could be extended to ensure all changes have 
>>>>>>> been replicated to the views. When a base table received a mutation w/ 
>>>>>>> an id it would generate a view update. If you block marking a given 
>>>>>>> mutation id as reconciled until it’s been fully replicated to the base 
>>>>>>> table and its view updates have been fully replicated to the views, 
>>>>>>> then all view updates will eventually be applied as part of the log 
>>>>>>> reconciliation process.
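As an illustration only (invented names and shapes, not the CEP-45 implementation), the gating described above might look roughly like this: a mutation id becomes eligible to be marked reconciled only once both its base replication and the view updates derived from it are fully replicated.

    import java.util.HashMap;
    import java.util.Map;

    // Toy gate: a mutation id counts as reconciled only when both the base mutation
    // and the view updates derived from it are fully replicated.
    public class MutationIdGate {
        static class State { int baseAcks; int viewAcks; }

        private final int rf;
        private final Map<Long, State> pending = new HashMap<>();

        MutationIdGate(int rf) { this.rf = rf; }

        void onBaseReplicated(long mutationId) { track(mutationId).baseAcks++; }
        void onViewReplicated(long mutationId) { track(mutationId).viewAcks++; }

        private State track(long id) { return pending.computeIfAbsent(id, k -> new State()); }

        // Only once this returns true could the id be marked reconciled / dropped from the log.
        boolean isReconciled(long mutationId) {
            State s = pending.get(mutationId);
            return s != null && s.baseAcks >= rf && s.viewAcks >= rf;
        }

        public static void main(String[] args) {
            MutationIdGate gate = new MutationIdGate(3);
            for (int i = 0; i < 3; i++) gate.onBaseReplicated(42L);
            System.out.println(gate.isReconciled(42L)); // false: view updates not yet replicated
            for (int i = 0; i < 3; i++) gate.onViewReplicated(42L);
            System.out.println(gate.isReconciled(42L)); // true
        }
    }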
>>>>>>> 
>>>>>>> A mutation tracking implementation would also allow you to be more 
>>>>>>> flexible with the types of consistency levels you can work with, 
>>>>>>> allowing users to do things like use LOCAL_QUORUM without leaving 
>>>>>>> themselves open to introducing view inconsistencies.
>>>>>>> 
>>>>>>> That would more or less eliminate the need for any MV repair in normal 
>>>>>>> usage, but wouldn't address how to repair issues caused by bugs or data 
>>>>>>> loss, though you may be able to do something with comparing the latest 
>>>>>>> mutation ids for the base tables and its view ranges.
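And a rough sketch of that last point, comparing the latest mutation ids for a base range and its view range as a cheap divergence signal (again purely hypothetical names, just to illustrate the idea):

    import java.util.Map;

    // Toy divergence signal: compare the latest mutation id seen for a base range
    // with the latest id reflected in its corresponding view range.
    public class MutationIdDivergenceCheck {
        static boolean needsDeeperRepair(long latestBaseId, long latestViewId) {
            return latestViewId < latestBaseId;
        }

        public static void main(String[] args) {
            // range -> { latest base id, latest view id }
            Map<String, long[]> ranges = Map.of(
                    "(0,100]",   new long[]{1042, 1042},
                    "(100,200]", new long[]{1042, 950});
            ranges.forEach((range, ids) ->
                    System.out.println(range + " needsDeeperRepair=" + needsDeeperRepair(ids[0], ids[1])));
        }
    }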
>>>>>>> 
>>>>>>> On Wed, May 14, 2025, at 10:19 AM, Paulo Motta wrote:
>>>>>>>> I don't see mutation tracking [1] mentioned in this thread or in the 
>>>>>>>> CEP-48 description. Not sure this would fit into the scope of this 
>>>>>>>> initial CEP, but I have a feeling that mutation tracking could be 
>>>>>>>> potentially helpful to reconcile base tables and views?
>>>>>>>> 
>>>>>>>> For example, when both base and view updates are acknowledged then 
>>>>>>>> this could be somehow persisted in the view sstables mutation tracking 
>>>>>>>> summary [2] or similar metadata? Then these updates would be skipped 
>>>>>>>> during view repair, considerably reducing the amount of work needed, 
>>>>>>>> since only un-acknowledged views updates would need to be reconciled.
>>>>>>>> 
>>>>>>>> [1] - https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking
>>>>>>>> [2] - https://issues.apache.org/jira/browse/CASSANDRA-20336
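For illustration, a tiny sketch of the skip described above, assuming some hypothetical per-sstable summary of mutation ids already known to be acknowledged on both base and view; repair would then only look at the remainder:

    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    // Toy filter: given the mutation ids touching a view range and a persisted summary
    // of ids already acknowledged on both base and view, repair only the remainder.
    public class ViewRepairFilter {
        static List<Long> idsNeedingRepair(List<Long> idsInRange, Set<Long> reconciledSummary) {
            return idsInRange.stream()
                    .filter(id -> !reconciledSummary.contains(id))
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            List<Long> inRange = List.of(1L, 2L, 3L, 4L);
            Set<Long> reconciled = Set.of(1L, 2L, 3L);
            System.out.println(idsNeedingRepair(inRange, reconciled)); // [4]
        }
    }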
>>>>>>>> 
>>>>>>>> On Wed, May 14, 2025 at 12:59 PM Paulo Motta 
>>>>>>>> <pauloricard...@gmail.com> wrote:
>>>>>>>>> > - The first thing I notice is that we're talking about repairing 
>>>>>>>>> > the entire table across the entire cluster all in one go.  It's 
>>>>>>>>> > been a *long* time since I tried to do a full repair of an entire 
>>>>>>>>> > table without using sub-ranges.  Is anyone here even doing that 
>>>>>>>>> > with clusters of non-trivial size?  How long does a full repair of 
>>>>>>>>> > a 100 node cluster with 5TB / node take even in the best case 
>>>>>>>>> > scenario?
>>>>>>>>> 
>>>>>>>>> I haven't checked the CEP yet so I may be missing something, but I 
>>>>>>>>> think this effort doesn't need to be conflated with dense node 
>>>>>>>>> support, to make this more approachable. I think prospective users 
>>>>>>>>> would be OK with overprovisioning to make this feasible if needed. We 
>>>>>>>>> could perhaps have size guardrails that limit the maximum table size 
>>>>>>>>> per node when MVs are enabled. Ideally we should make it work for 
>>>>>>>>> dense nodes if possible, but this shouldn't be a reason not to 
>>>>>>>>> support the feature if it can be made to work reasonably with more 
>>>>>>>>> resources.
>>>>>>>>> 
>>>>>>>>> I think the main issue with the current MV is about correctness, and 
>>>>>>>>> the ultimate goal of the CEP must be to provide correctness 
>>>>>>>>> guarantees, even if it has an inevitable performance hit. I think 
>>>>>>>>> that the performance of the repair process is definitely an important 
>>>>>>>>> consideration and it would be helpful to have some benchmarks to have 
>>>>>>>>> an idea of how long this repair process would take for lightweight 
>>>>>>>>> and denser tables.
>>>>>>>>> 
>>>>>>>>> On Wed, May 14, 2025 at 7:28 AM Jon Haddad <j...@rustyrazorblade.com> 
>>>>>>>>> wrote:
>>>>>>>>>> I've got several concerns around this repair process.
>>>>>>>>>> 
>>>>>>>>>> - The first thing I notice is that we're talking about repairing the 
>>>>>>>>>> entire table across the entire cluster all in one go.  It's been a 
>>>>>>>>>> *long* time since I tried to do a full repair of an entire table 
>>>>>>>>>> without using sub-ranges.  Is anyone here even doing that with 
>>>>>>>>>> clusters of non-trivial size?  How long does a full repair of a 100 
>>>>>>>>>> node cluster with 5TB / node take even in the best case scenario?
>>>>>>>>>> 
>>>>>>>>>> - Even in a scenario where sub-range repair is supported, you'd have 
>>>>>>>>>> to scan *every* sstable in the base table in order to construct 
>>>>>>>>>> a merkle tree, as we don't know in advance which SSTables contain 
>>>>>>>>>> the ranges that the MV will need.  That means a subrange repair would 
>>>>>>>>>> have to do a *ton* of IO.  Anyone who's mis-configured a sub-range 
>>>>>>>>>> incremental repair to use too many ranges will probably be familiar 
>>>>>>>>>> with how long it can take to anti-compact a bunch of SSTables.  With 
>>>>>>>>>> MV sub-range repair, we'd have even more overhead, because we'd have 
>>>>>>>>>> to read in every SSTable, every time.  If we do 10 subranges, we'll 
>>>>>>>>>> do 10x the IO of a normal repair.  I don't think this is practical.
>>>>>>>>>> 
>>>>>>>>>> - Merkle trees make sense when you're comparing tables with the same 
>>>>>>>>>> partition key, but I don't think they do when you're transforming a 
>>>>>>>>>> base table to a view.  When there's a mis-match, what's transferred? 
>>>>>>>>>>  We have a range of data in the MV, but now we have to go find that 
>>>>>>>>>> from the base table.  That means the merkle tree needs to not just 
>>>>>>>>>> track the hashes and ranges, but the original keys it was 
>>>>>>>>>> transformed from, in order to go find all of the matching partitions 
>>>>>>>>>> in that mis-matched range.  Either that or we end up rescanning the 
>>>>>>>>>> entire dataset in order to find the mismatches. 
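A hypothetical sketch of the extra bookkeeping being pointed at here: a merkle leaf over a view range that also remembers which base partitions produced its rows, so a mismatch can be traced back without rescanning the whole base table (toy Java, invented names):

    import java.util.HashSet;
    import java.util.Set;

    // Toy merkle leaf for a view token range that also remembers the base partition
    // keys its rows came from, so a mismatch can be resolved by re-reading just those
    // base partitions instead of rescanning everything.
    public class ViewMerkleLeaf {
        long rollingHash;                                   // hash of view rows in the range
        final Set<String> sourceBaseKeys = new HashSet<>(); // extra bookkeeping a plain tree lacks

        void addViewRow(String baseKey, long rowHash) {
            rollingHash ^= rowHash;      // toy combine; a real tree hashes properly
            sourceBaseKeys.add(baseKey); // where to re-read on mismatch
        }

        public static void main(String[] args) {
            ViewMerkleLeaf leaf = new ViewMerkleLeaf();
            leaf.addViewRow("pk=1", 0xABCL);
            leaf.addViewRow("pk=7", 0xDEFL);
            // On a mismatch for this leaf, repair would go back to base partitions pk=1 and pk=7.
            System.out.println(leaf.sourceBaseKeys);
        }
    }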
>>>>>>>>>> 
>>>>>>>>>> Jon
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Tue, May 13, 2025 at 10:29 AM Runtian Liu <curly...@gmail.com> 
>>>>>>>>>> wrote:
>>>>>>>>>>> > Looking at the details of the CEP it seems to describe Paxos as 
>>>>>>>>>>> > PaxosV1, but PaxosV2 works slightly differently (it can read 
>>>>>>>>>>> > during the prepare phase). I assume that supporting Paxos means 
>>>>>>>>>>> > supporting both V1 and V2 for materialized views?
>>>>>>>>>>> We are going to support Paxos V2. The CEP is not clear on that; we 
>>>>>>>>>>> will update it to clarify.
>>>>>>>>>>> It looks like the online portion is now fairly well understood.  
>>>>>>>>>>> For the offline repair part, I see two main concerns: one around 
>>>>>>>>>>> the scalability of the proposed approach, and another regarding how 
>>>>>>>>>>> it handles tombstones.
>>>>>>>>>>> 
>>>>>>>>>>> *Scalability:*
>>>>>>>>>>> I have added a section 
>>>>>>>>>>> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-48%3A+First-Class+Materialized+View+Support#CEP48:FirstClassMaterializedViewSupport-MVRepairVSFullRepairwithanExample>
>>>>>>>>>>> in the CEP with an example to compare full repair and the proposed MV 
>>>>>>>>>>> repair, the overall scalability should not be a problem.
>>>>>>>>>>> 
>>>>>>>>>>> Consider a dataset with tokens from 1 to 4 and a cluster of 4 
>>>>>>>>>>> nodes, where each node owns one token. The base table uses (pk, ck) 
>>>>>>>>>>> as its primary key, while the materialized view (MV) uses (ck, pk) 
>>>>>>>>>>> as its primary key. Both tables include a value column v, which 
>>>>>>>>>>> allows us to correlate rows between them. The dataset consists of 
>>>>>>>>>>> 16 records, distributed as follows:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> *Base table* (pk, ck, v)
>>>>>>>>>>> (1, 1, 1), (1, 2, 2), (1, 3, 3), (1, 4, 4)  // N1
>>>>>>>>>>> (2, 1, 5), (2, 2, 6), (2, 3, 7), (2, 4, 8)  // N2
>>>>>>>>>>> (3, 1, 9), (3, 2, 10), (3, 3, 11), (3, 4, 12)  // N3
>>>>>>>>>>> (4, 1, 13), (4, 2, 14), (4, 3, 15), (4, 4, 16)  // N4
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> *Materialized view* (ck, pk, v)
>>>>>>>>>>> (1, 1, 1), (1, 2, 5), (1, 3, 9), (1, 4, 13)  // N1
>>>>>>>>>>> (2, 1, 2), (2, 2, 6), (2, 3, 10), (2, 4, 14)  // N2
>>>>>>>>>>> (3, 1, 3), (3, 2, 7), (3, 3, 11), (3, 4, 15)  // N3
>>>>>>>>>>> (4, 1, 4), (4, 2, 8), (4, 3, 12), (4, 4, 16)  // N4
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> The chart below compares one round of full repair with one round of 
>>>>>>>>>>> MV repair. As shown, both scan the same total number of rows. 
>>>>>>>>>>> However, MV repair has higher time complexity because its Merkle 
>>>>>>>>>>> tree processes each row more intensively. To avoid all nodes 
>>>>>>>>>>> scanning the entire table simultaneously, MV repair should use a 
>>>>>>>>>>> snapshot-based approach, similar to normal repair with the 
>>>>>>>>>>> `--sequential` option. The time complexity increase compared to full 
>>>>>>>>>>> repair can be found in the "Complexity and Memory Management" 
>>>>>>>>>>> section.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> n: number of rows
>>>>>>>>>>> 
>>>>>>>>>>> d: depth of one Merkle tree for MV repair
>>>>>>>>>>> 
>>>>>>>>>>> d': depth of one Merkle tree for full repair
>>>>>>>>>>> 
>>>>>>>>>>> r: number of split ranges
>>>>>>>>>>> 
>>>>>>>>>>> Assuming each leaf node covers the same number of rows, 2^d' = (2^d) * 
>>>>>>>>>>> r. 
>>>>>>>>>>> 
>>>>>>>>>>> We can see that the space complexity is the same, while MV repair 
>>>>>>>>>>> has higher time complexity. However, this should not pose a 
>>>>>>>>>>> significant issue in production, as the Merkle tree depth and the 
>>>>>>>>>>> number of split ranges are typically not large.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 1 Round Merkle Tree Building Complexity:
>>>>>>>>>>>                    | Full Repair   | MV Repair
>>>>>>>>>>> Time complexity    | O(n)          | O(n*d*log(r))
>>>>>>>>>>> Space complexity   | O((2^d')*r)   | O((2^d)*r^2) = O((2^d')*r)
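As a quick check (my own substitution, not from the CEP) that the two space columns agree, using the leaf assumption stated above:

    2^{d'} = 2^{d} \cdot r
    \;\Rightarrow\;
    O\!\left((2^{d}) \cdot r^{2}\right) = O\!\left((2^{d} \cdot r) \cdot r\right) = O\!\left((2^{d'}) \cdot r\right)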
>>>>>>>>>>> *Tombstone:*
>>>>>>>>>>> 
>>>>>>>>>>> The current proposal focuses on rebuilding the MV for a granular 
>>>>>>>>>>> token range where a mismatch is detected, rather than rebuilding 
>>>>>>>>>>> the entire MV token range. Since the MV is treated as a regular 
>>>>>>>>>>> table, standard full or incremental repair processes should still 
>>>>>>>>>>> apply to both the base and MV tables to keep their replicas in sync.
>>>>>>>>>>> 
>>>>>>>>>>> Regarding tombstones, if we introduce special tombstone types or 
>>>>>>>>>>> handling mechanisms for the MV table, we may be able to support 
>>>>>>>>>>> tombstone synchronization between the base table and the MV. I plan 
>>>>>>>>>>> to spend more time exploring whether we can introduce changes to 
>>>>>>>>>>> the base table that enable this synchronization.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, May 12, 2025 at 11:35 AM Jaydeep Chovatia 
>>>>>>>>>>> <chovatia.jayd...@gmail.com> wrote:
>>>>>>>>>>>> >Like something doesn't add up here because if it always includes 
>>>>>>>>>>>> >the base table's primary key columns that means
>>>>>>>>>>>> The requirement for materialized views (MVs) to include the base 
>>>>>>>>>>>> table's primary key appears to be primarily a syntactic constraint 
>>>>>>>>>>>> specific to Apache Cassandra. For instance, in DynamoDB, the DDL 
>>>>>>>>>>>> for defining a Global Secondary Index does not mandate inclusion 
>>>>>>>>>>>> of the base table's primary key. This suggests that the syntax 
>>>>>>>>>>>> requirement in Cassandra could potentially be relaxed in the 
>>>>>>>>>>>> future (outside the scope of this CEP). As Benedict noted, the 
>>>>>>>>>>>> base table's primary key is optional when querying a materialized 
>>>>>>>>>>>> view.
>>>>>>>>>>>> 
>>>>>>>>>>>> Jaydeep
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, May 12, 2025 at 10:45 AM Jon Haddad 
>>>>>>>>>>>> <j...@rustyrazorblade.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> > Or compaction hasn’t made a mistake, or cell merge 
>>>>>>>>>>>>> > reconciliation hasn’t made a mistake, or volume bitrot hasn’t 
>>>>>>>>>>>>> > caused you to lose a file.
>>>>>>>>>>>>> > Repair isnt’ just about “have all transaction commits landed”. 
>>>>>>>>>>>>> > It’s “is the data correct N days after it’s written”.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Don't forget about restoring from a backup.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Is there a way we could do some sort of hybrid compaction + 
>>>>>>>>>>>>> incremental repair?  Maybe have the MV verify its view while 
>>>>>>>>>>>>> it's compacting, and when it's done, mark the view's SSTable as 
>>>>>>>>>>>>> repaired?  Then the repair process would only need to do a MV to 
>>>>>>>>>>>>> MV repair.
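A very rough sketch of that hybrid idea (hypothetical names, not an existing hook): during compaction of a view sstable, confirm each view row against the base table and only mark the output repaired if everything checked out.

    import java.util.List;
    import java.util.function.Predicate;

    // Toy version of the hybrid: verify view rows against the base table while compacting,
    // and only mark the compaction output repaired if every row was confirmed.
    public class CompactAndVerify {
        static class ViewRow {
            final String baseKey;
            ViewRow(String baseKey) { this.baseKey = baseKey; }
        }

        // existsInBase stands in for a lookup against the base table replicas.
        static boolean compactWithVerification(List<ViewRow> rows, Predicate<String> existsInBase) {
            boolean allConfirmed = true;
            for (ViewRow row : rows) {
                if (!existsInBase.test(row.baseKey)) {
                    allConfirmed = false; // orphaned view row: leave the output unrepaired
                }
                // ... normal compaction of the row would happen here ...
            }
            return allConfirmed; // caller would mark the output sstable repaired
        }

        public static void main(String[] args) {
            List<ViewRow> rows = List.of(new ViewRow("pk=1"), new ViewRow("pk=2"));
            boolean markRepaired = compactWithVerification(rows, k -> !k.equals("pk=2"));
            System.out.println("mark repaired: " + markRepaired); // false: pk=2 missing in base
        }
    }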
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Jon
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Mon, May 12, 2025 at 9:37 AM Benedict Elliott Smith 
>>>>>>>>>>>>> <bened...@apache.org> wrote:
>>>>>>>>>>>>>>> Like something doesn't add up here because if it always 
>>>>>>>>>>>>>>> includes the base table's primary key columns that means they 
>>>>>>>>>>>>>>> could be storage attached by just forbidding additional columns 
>>>>>>>>>>>>>>> and there doesn't seem to be much utility in including 
>>>>>>>>>>>>>>> additional columns in the primary key?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> You can re-order the keys, and they only need to be a part of 
>>>>>>>>>>>>>> the primary key not the partition key. I think you can specify 
>>>>>>>>>>>>>> an arbitrary order to the keys also, so you can change the 
>>>>>>>>>>>>>> effective sort order. So, the basic idea is you stipulate 
>>>>>>>>>>>>>> something like PRIMARY KEY ((v1),(ck1,pk1)).
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This is basically a global index, with the restriction on single 
>>>>>>>>>>>>>> columns as keys only because we cannot cheaply read-before-write 
>>>>>>>>>>>>>> for eventually consistent operations. This restriction can 
>>>>>>>>>>>>>> easily be relaxed for Paxos and Accord based implementations, 
>>>>>>>>>>>>>> which can also safely include additional keys.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> That said, I am not at all sure why they are called materialised 
>>>>>>>>>>>>>> views if we don’t support including any other data besides the 
>>>>>>>>>>>>>> lookup column and the primary key. We should really rename them 
>>>>>>>>>>>>>> once they work, both to make some sense and to break with the 
>>>>>>>>>>>>>> historical baggage.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I think this can be represented as a tombstone which can always 
>>>>>>>>>>>>>>> be fetched from the base table on read or maybe some other 
>>>>>>>>>>>>>>> arrangement? I agree it can't feasibly be represented as an 
>>>>>>>>>>>>>>> enumeration of the deletions at least not synchronously and 
>>>>>>>>>>>>>>> doing it async has its own problems.
>>>>>>>>>>>>>> If the base table must be read on read of an index/view, then I 
>>>>>>>>>>>>>> think this proposal is approximately linearizable for the view 
>>>>>>>>>>>>>> as well (though, I do not at all warrant this statement). You 
>>>>>>>>>>>>>> still need to propagate this eventually so that the views can 
>>>>>>>>>>>>>> cleanup. This also makes reads 2RT on read, which is rather 
>>>>>>>>>>>>>> costly.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 12 May 2025, at 16:10, Ariel Weisberg <ar...@weisberg.ws> 
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I think it's worth taking a step back and looking at the 
>>>>>>>>>>>>>>> current MV restrictions which are pretty onerous.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> A view must have a primary key and that primary key must 
>>>>>>>>>>>>>>> conform to the following restrictions:
>>>>>>>>>>>>>>>  • it must contain all the primary key columns of the base 
>>>>>>>>>>>>>>> table. This ensures that every row of the view corresponds to 
>>>>>>>>>>>>>>> exactly one row of the base table.
>>>>>>>>>>>>>>>  • it can only contain a single column that is not a primary 
>>>>>>>>>>>>>>> key column in the base table.
>>>>>>>>>>>>>>> At that point what exactly is the value in including anything 
>>>>>>>>>>>>>>> except the original primary key in the MV's primary key columns 
>>>>>>>>>>>>>>> unless you are using an ordered partitioner so you can iterate 
>>>>>>>>>>>>>>> based on the leading primary key columns?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Like something doesn't add up here because if it always 
>>>>>>>>>>>>>>> includes the base table's primary key columns that means they 
>>>>>>>>>>>>>>> could be storage attached by just forbidding additional columns 
>>>>>>>>>>>>>>> and there doesn't seem to be much utility in including 
>>>>>>>>>>>>>>> additional columns in the primary key?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I'm not that clear on how much better it is to look something 
>>>>>>>>>>>>>>> up in the MV vs just looking at the base table or some 
>>>>>>>>>>>>>>> non-materialized view of it. How exactly are these MVs supposed 
>>>>>>>>>>>>>>> to be used and what value do they provide?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Jeff Jirsa wrote:
>>>>>>>>>>>>>>>> There’s 2 things in this proposal that give me a lot of pause.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Runtian Liu pointed out that the CEP is sort of divided into 
>>>>>>>>>>>>>>> two parts. The first is the online part which is making 
>>>>>>>>>>>>>>> reads/writes to MVs safer and more reliable using a transaction 
>>>>>>>>>>>>>>> system. The second is offline which is repair.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The story for the online portion I think is quite strong and 
>>>>>>>>>>>>>>> worth considering on its own merits.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The offline portion (repair) sounds a little less feasible to 
>>>>>>>>>>>>>>> run in production, but I also think that MVs without any 
>>>>>>>>>>>>>>> mechanism for checking their consistency are not viable to run 
>>>>>>>>>>>>>>> in production. So it's kind of pay for what you use in terms of 
>>>>>>>>>>>>>>> the feature?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> It's definitely worth thinking through if there is a way to fix 
>>>>>>>>>>>>>>> one side of this equation so it works better.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> David Capwell wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> As far as I can tell, being based off Accord means you don’t 
>>>>>>>>>>>>>>>> need to care about repair, as Accord will manage the 
>>>>>>>>>>>>>>>> consistency for you; you can’t get out of sync.
>>>>>>>>>>>>>>> I think a baseline requirement in C* for something to be in 
>>>>>>>>>>>>>>> production is to be able to run preview repair and validate 
>>>>>>>>>>>>>>> that the transaction system or any other part of Cassandra 
>>>>>>>>>>>>>>> hasn't made a mistake. Divergence can have many sources 
>>>>>>>>>>>>>>> including Accord.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Runtian Liu wrote:
>>>>>>>>>>>>>>>> For the example David mentioned, LWT cannot support. Since 
>>>>>>>>>>>>>>>> LWTs operate on a single token, we’ll need to restrict 
>>>>>>>>>>>>>>>> base-table updates to one partition—and ideally one row—at a 
>>>>>>>>>>>>>>>> time. A current MV base-table command can delete an entire 
>>>>>>>>>>>>>>>> partition, but doing so might touch hundreds of MV partitions, 
>>>>>>>>>>>>>>>> making consistency guarantees impossible. 
>>>>>>>>>>>>>>> I think this can be represented as a tombstone which can always 
>>>>>>>>>>>>>>> be fetched from the base table on read or maybe some other 
>>>>>>>>>>>>>>> arrangement? I agree it can't feasibly be represented as an 
>>>>>>>>>>>>>>> enumeration of the deletions at least not synchronously and 
>>>>>>>>>>>>>>> doing it async has its own problems.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Ariel
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Fri, May 9, 2025, at 4:03 PM, Jeff Jirsa wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On May 9, 2025, at 12:59 PM, Ariel Weisberg 
>>>>>>>>>>>>>>>>> <ar...@weisberg.ws> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I am a *big* fan of getting repair really working with MVs. It 
>>>>>>>>>>>>>>>>> does seem problematic that the number of merkle trees will be 
>>>>>>>>>>>>>>>>> equal to the number of ranges in the cluster and repair of 
>>>>>>>>>>>>>>>>> MVs would become an all node operation.  How would down nodes 
>>>>>>>>>>>>>>>>> be handled, and how many nodes would be simultaneously working to 
>>>>>>>>>>>>>>>>> validate a given base table range at once? How many base 
>>>>>>>>>>>>>>>>> table ranges could simultaneously be repairing MVs?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> If a row containing a column that creates an MV partition is 
>>>>>>>>>>>>>>>>> deleted, and the MV isn't updated, then how does the merkle 
>>>>>>>>>>>>>>>>> tree approach propagate the deletion to the MV? The CEP says 
>>>>>>>>>>>>>>>>> that anti-compaction would remove extra rows, but I am not 
>>>>>>>>>>>>>>>>> clear on how that works. When is anti-compaction performed in 
>>>>>>>>>>>>>>>>> the repair process and what is/isn't included in the outputs?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I thought about these two points last night after I sent my 
>>>>>>>>>>>>>>>> email.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> There’s 2 things in this proposal that give me a lot of pause.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> One is the lack of tombstones / deletions in the merkle trees, 
>>>>>>>>>>>>>>>> which makes properly dealing with writes/deletes/inconsistency 
>>>>>>>>>>>>>>>> very hard (afaict)
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The second is the reality that repairing a single partition in 
>>>>>>>>>>>>>>>> the base table may repair all hosts/ranges in the MV table, 
>>>>>>>>>>>>>>>> and vice versa. Basically scanning either base or MV is 
>>>>>>>>>>>>>>>> effectively scanning the whole cluster (modulo what you can 
>>>>>>>>>>>>>>>> avoid in the clean/dirty repaired sets). This makes me really, 
>>>>>>>>>>>>>>>> really concerned with how it scales, and how likely it is to 
>>>>>>>>>>>>>>>> be able to schedule automatically without blowing up. 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The paxos vs accord comments so far are interesting in that I 
>>>>>>>>>>>>>>>> think both could be made to work, but I am very concerned 
>>>>>>>>>>>>>>>> about how the merkle tree comparisons are likely to work with 
>>>>>>>>>>>>>>>> wide partitions leading to massive fanout in ranges. 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
