> On May 12, 2025, at 8:10 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>
> Hi,
>
> I think it's worth taking a step back and looking at the current MV restrictions, which are pretty onerous.
>
> A view must have a primary key, and that primary key must conform to the following restrictions:
> - It must contain all the primary key columns of the base table. This ensures that every row of the view corresponds to exactly one row of the base table.
> - It can only contain a single column that is not a primary key column in the base table.
>
> At that point, what exactly is the value in including anything except the original primary key in the MV's primary key columns, unless you are using an ordered partitioner so you can iterate based on the leading primary key columns?
>
> Like something doesn't add up here, because if it always includes the base table's primary key columns, that means they could be storage attached by just forbidding additional columns, and there doesn't seem to be much utility in including additional columns in the primary key?
>
> I'm not that clear on how much better it is to look something up in the MV vs just looking at the base table or some non-materialized view of it. How exactly are these MVs supposed to be used and what value do they provide?
>
> Jeff Jirsa wrote:
>> There’s 2 things in this proposal that give me a lot of pause.
>
> Runtian Liu pointed out that the CEP is sort of divided into two parts. The first is the online part, which is making reads/writes to MVs safer and more reliable using a transaction system. The second is offline, which is repair.
>
> The story for the online portion I think is quite strong and worth considering on its own merits.
>
> The offline portion (repair) sounds a little less feasible to run in production, but I also think that MVs without any mechanism for checking their consistency are not viable to run in production. So it's kind of pay for what you use in terms of the feature?
>
> It's definitely worth thinking through if there is a way to fix one side of this equation so it works better.
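(To make the restrictions quoted above concrete, here is a minimal sketch with a hypothetical base table and view; the users / users_by_email schema and its column names are illustrative only, not taken from the CEP:)

    -- Hypothetical base table.
    CREATE TABLE users (
        org_id  uuid,
        user_id uuid,
        email   text,
        name    text,
        PRIMARY KEY (org_id, user_id)
    );

    -- The view's primary key must include every base primary key column
    -- (org_id, user_id) and may add at most one non-primary-key column
    -- (email here). Cassandra also requires IS NOT NULL filters on every
    -- view primary key column.
    CREATE MATERIALIZED VIEW users_by_email AS
        SELECT * FROM users
        WHERE email IS NOT NULL AND org_id IS NOT NULL AND user_id IS NOT NULL
        PRIMARY KEY (email, org_id, user_id);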
Agree that we need a solution. I just don’t think a massive number of merkle trees without tombstones is actually going to be materially better (or rather, it’s a massive foot gun; it’s going to blow up people who read the CEP as “now it’s safe to use”).

> David Capwell wrote:
>> As far as I can tell, being based off Accord means you don’t need to care about repair, as Accord will manage the consistency for you; you can’t get out of sync.
>
> I think a baseline requirement in C* for something to be in production is to be able to run preview repair and validate that the transaction system or any other part of Cassandra hasn't made a mistake. Divergence can have many sources, including Accord.

Or compaction hasn’t made a mistake, or cell merge reconciliation hasn’t made a mistake, or volume bitrot hasn’t caused you to lose a file. Repair isn’t just about “have all transaction commits landed”. It’s “is the data correct N days after it’s written”.

> Runtian Liu wrote:
>> For the example David mentioned, LWT cannot support it. Since LWTs operate on a single token, we’ll need to restrict base-table updates to one partition—and ideally one row—at a time. A current MV base-table command can delete an entire partition, but doing so might touch hundreds of MV partitions, making consistency guarantees impossible.
>
> I think this can be represented as a tombstone which can always be fetched from the base table on read, or maybe some other arrangement? I agree it can't feasibly be represented as an enumeration of the deletions, at least not synchronously, and doing it async has its own problems.
>
> Ariel
>
> On Fri, May 9, 2025, at 4:03 PM, Jeff Jirsa wrote:
>>
>>> On May 9, 2025, at 12:59 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>
>>> I am a *big* fan of getting repair really working with MVs. It does seem problematic that the number of merkle trees will be equal to the number of ranges in the cluster and repair of MVs would become an all-node operation. How would down nodes be handled, and how many nodes would simultaneously be working to validate a given base table range at once? How many base table ranges could simultaneously be repairing MVs?
>>>
>>> If a row containing a column that creates an MV partition is deleted, and the MV isn't updated, then how does the merkle tree approach propagate the deletion to the MV? The CEP says that anti-compaction would remove extra rows, but I am not clear on how that works. When is anti-compaction performed in the repair process and what is/isn't included in the outputs?
>>
>> I thought about these two points last night after I sent my email.
>>
>> There’s 2 things in this proposal that give me a lot of pause.
>>
>> One is the lack of tombstones / deletions in the merkle trees, which makes properly dealing with writes/deletes/inconsistency very hard (afaict).
>>
>> The second is the reality that repairing a single partition in the base table may repair all hosts/ranges in the MV table, and vice versa. Basically, scanning either base or MV is effectively scanning the whole cluster (modulo what you can avoid in the clean/dirty repaired sets). This makes me really, really concerned with how it scales, and how likely it is to be able to schedule automatically without blowing up.
>>
>> The paxos vs accord comments so far are interesting in that I think both could be made to work, but I am very concerned about how the merkle tree comparisons are likely to work with wide partitions leading to massive fanout in ranges.
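(For completeness, a minimal sketch of the partition-delete fanout discussed above, reusing the hypothetical users / users_by_email schema sketched earlier; this is illustrative only, not how the CEP specifies it:)

    -- Deleting a single base-table partition:
    DELETE FROM users WHERE org_id = ?;

    -- The view is partitioned by email, so this one base delete logically
    -- implies a deletion in as many distinct view partitions as there were
    -- distinct emails in the base partition. Clients cannot write to a view
    -- directly; the server would have to derive something like
    --   DELETE FROM users_by_email WHERE email = ? AND org_id = ? AND user_id = ?;
    -- once per base row, which is why enumerating the deletions synchronously
    -- is not feasible and a base-side tombstone consulted on read is the
    -- alternative raised above.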