> On May 12, 2025, at 8:10 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>
> Hi,
>
> I think it's worth taking a step back and looking at the current MV restrictions, which are pretty onerous.
>
> A view must have a primary key, and that primary key must conform to the following restrictions:
> - It must contain all the primary key columns of the base table. This ensures that every row of the view corresponds to exactly one row of the base table.
> - It can only contain a single column that is not a primary key column in the base table.
>
> At that point, what exactly is the value in including anything except the original primary key in the MV's primary key columns, unless you are using an ordered partitioner so you can iterate based on the leading primary key columns?
>
> Like something doesn't add up here, because if it always includes the base table's primary key columns, that means they could be storage attached by just forbidding additional columns, and there doesn't seem to be much utility in including additional columns in the primary key?
>
> I'm not that clear on how much better it is to look something up in the MV vs just looking at the base table or some non-materialized view of it. How exactly are these MVs supposed to be used and what value do they provide?
>
> Jeff Jirsa wrote:
>> There’s 2 things in this proposal that give me a lot of pause.
>
> Runtian Liu pointed out that the CEP is sort of divided into two parts. The first is the online part, which is making reads/writes to MVs safer and more reliable using a transaction system. The second is offline, which is repair.
>
> The story for the online portion I think is quite strong and worth considering on its own merits.
>
> The offline portion (repair) sounds a little less feasible to run in production, but I also think that MVs without any mechanism for checking their consistency are not viable to run in production. So it's kind of pay for what you use in terms of the feature?
>
> It's definitely worth thinking through if there is a way to fix one side of this equation so it works better.
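(To make the restrictions quoted above concrete, here is a minimal sketch with a hypothetical base table and view; the users / users_by_email schema and its column names are illustrative only, not taken from the CEP:)

    -- Hypothetical base table.
    CREATE TABLE users (
        org_id  uuid,
        user_id uuid,
        email   text,
        name    text,
        PRIMARY KEY (org_id, user_id)
    );

    -- The view's primary key must include every base primary key column
    -- (org_id, user_id) and may add at most one non-primary-key column
    -- (email here). Cassandra also requires IS NOT NULL filters on every
    -- view primary key column.
    CREATE MATERIALIZED VIEW users_by_email AS
        SELECT * FROM users
        WHERE email IS NOT NULL AND org_id IS NOT NULL AND user_id IS NOT NULL
        PRIMARY KEY (email, org_id, user_id);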
Agree that we need a solution. I just don’t think a massive number of merkle trees without tombstones is actually going to be materially better (or rather, it’s a massive foot gun; it’s going to blow up people who read the CEP as “now it’s safe to use”).

> David Capwell wrote:
>> As far as I can tell, being based off Accord means you don’t need to care about repair, as Accord will manage the consistency for you; you can’t get out of sync.
>
> I think a baseline requirement in C* for something to be in production is to be able to run preview repair and validate that the transaction system or any other part of Cassandra hasn't made a mistake. Divergence can have many sources, including Accord.

Or compaction hasn’t made a mistake, or cell merge reconciliation hasn’t made a mistake, or volume bitrot hasn’t caused you to lose a file. Repair isn’t just about “have all transaction commits landed”. It’s “is the data correct N days after it’s written”.

> Runtian Liu wrote:
>> For the example David mentioned, LWT cannot support it. Since LWTs operate on a single token, we’ll need to restrict base-table updates to one partition—and ideally one row—at a time. A current MV base-table command can delete an entire partition, but doing so might touch hundreds of MV partitions, making consistency guarantees impossible.
>
> I think this can be represented as a tombstone which can always be fetched from the base table on read, or maybe some other arrangement? I agree it can't feasibly be represented as an enumeration of the deletions, at least not synchronously, and doing it async has its own problems.
>
> Ariel
>
> On Fri, May 9, 2025, at 4:03 PM, Jeff Jirsa wrote:
>>
>>> On May 9, 2025, at 12:59 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>
>>> I am a *big* fan of getting repair really working with MVs. It does seem problematic that the number of merkle trees will be equal to the number of ranges in the cluster and repair of MVs would become an all-node operation. How would down nodes be handled, and how many nodes would simultaneously be working to validate a given base table range at once? How many base table ranges could simultaneously be repairing MVs?
>>>
>>> If a row containing a column that creates an MV partition is deleted, and the MV isn't updated, then how does the merkle tree approach propagate the deletion to the MV? The CEP says that anti-compaction would remove extra rows, but I am not clear on how that works. When is anti-compaction performed in the repair process and what is/isn't included in the outputs?
>>
>> I thought about these two points last night after I sent my email.
>>
>> There’s 2 things in this proposal that give me a lot of pause.
>>
>> One is the lack of tombstones / deletions in the merkle trees, which makes properly dealing with writes/deletes/inconsistency very hard (afaict).
>>
>> The second is the reality that repairing a single partition in the base table may repair all hosts/ranges in the MV table, and vice versa. Basically, scanning either base or MV is effectively scanning the whole cluster (modulo what you can avoid in the clean/dirty repaired sets). This makes me really, really concerned with how it scales, and how likely it is to be able to schedule automatically without blowing up.
>>
>> The paxos vs accord comments so far are interesting in that I think both could be made to work, but I am very concerned about how the merkle tree comparisons are likely to work with wide partitions leading to massive fanout in ranges.
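(For completeness, a minimal sketch of the partition-delete fanout discussed above, reusing the hypothetical users / users_by_email schema sketched earlier; this is illustrative only, not how the CEP specifies it:)

    -- Deleting a single base-table partition:
    DELETE FROM users WHERE org_id = ?;

    -- The view is partitioned by email, so this one base delete logically
    -- implies a deletion in as many distinct view partitions as there were
    -- distinct emails in the base partition. Clients cannot write to a view
    -- directly; the server would have to derive something like
    --   DELETE FROM users_by_email WHERE email = ? AND org_id = ? AND user_id = ?;
    -- once per base row, which is why enumerating the deletions synchronously
    -- is not feasible and a base-side tombstone consulted on read is the
    -- alternative raised above.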