Those are both fair points. Once you start dealing with data loss though, 
maintaining guarantees is often impossible, so I’m not sure that torn writes or 
updated timestamps are dealbreakers, but I’m fine with tabling option 2 for now 
and seeing if we can figure out something better.

Regarding the assassin cells, if you wanted to avoid explicitly agreeing on a 
value, you might be able to only issue them for repaired base data, which has 
been implicitly agreed upon.

I think that or something like it is worth exploring. The idea would be to 
solve this issue as completely as anti-compaction would - but without having to 
rewrite sstables. I’d be interested to hear any ideas you have about how that 
might work.

You basically need a mechanism to erase some piece of data that was written 
before a given wall clock time - regardless of cell timestamp, and without 
precluding future updates (in wall clock time) with earlier timestamps.
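
For the assassin cell flavour of this, the property we'd be after is roughly 
the sketch below (purely illustrative Java - nothing like this exists in 
Cassandra today, and all the names are made up):

    import java.util.Objects;

    // Hypothetical marker written at repair time: "erase the cell that had
    // exactly this timestamp and value", where the value comes from repaired
    // (implicitly agreed upon) base data.
    final class AssassinCell
    {
        final long targetTimestamp; // cell timestamp we want to erase
        final Object targetValue;   // value we expect that cell to hold

        AssassinCell(long targetTimestamp, Object targetValue)
        {
            this.targetTimestamp = targetTimestamp;
            this.targetValue = targetValue;
        }

        // Key property: it only suppresses the one cell it was issued for.
        // Unlike a tombstone, a later write with a lower timestamp is not
        // shadowed, so future updates stay visible.
        boolean suppresses(long cellTimestamp, Object cellValue)
        {
            return cellTimestamp == targetTimestamp
                   && Objects.equals(cellValue, targetValue);
        }
    }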

On Mon, Jun 23, 2025, at 4:28 PM, Runtian Liu wrote:
> In the second option, we use the repair timestamp to re-update any cell or 
> row we want to fix in the base table. This approach is problematic because it 
> alters the write time of user-supplied data. Although Cassandra does not 
> allow users to set timestamps for LWT writes, users may still rely on the 
> update time. A key limitation of this approach is that it cannot fix cases 
> where a view cell ends up in a future state while the base table remains 
> correct. I now understand your point that Cassandra cannot handle this 
> scenario today. However, as I mentioned earlier, the important distinction is 
> that when this issue occurs in the base table, we accept the "incorrect" data 
> as valid—but this is not acceptable for materialized views, since the source 
> of truth (the base table) still holds the correct data.
> 
> On Mon, Jun 23, 2025 at 12:05 PM Blake Eggleston <bl...@ultrablake.com> wrote:
>> > Sorry, Blake—I was traveling last week and couldn’t reply to your email 
>> > sooner.
>> 
>> No worries, I’ll be taking some time off soon as well.
>> 
>> > I don’t think the first or second option is ideal. We should treat the 
>> > base table as the source of truth. Modifying it—or forcing an update on 
>> > it, even if it’s just a timestamp change—is not a good approach and won’t 
>> > solve all problems.
>> 
>> I agree the first option probably isn’t the right way to go. Could you say a 
>> bit more about why the second option is not a good approach and which 
>> problems it won’t solve?
>> 
>> On Sun, Jun 22, 2025, at 6:09 PM, Runtian Liu wrote:
>>> Sorry, Blake—I was traveling last week and couldn’t reply to your email 
>>> sooner.
>>> 
>>> > First - we interpret view data with higher timestamps than the base table 
>>> > as data that’s missing from the base and replicate it into the base 
>>> > table. The timestamp of the missing data may be below the paxos timestamp 
>>> > low bound so we’d have to adjust the paxos coordination logic to allow 
>>> > that in this case. Depending on how the view got this way it may also 
>>> > tear writes to the base table, breaking the write atomicity promise.
>>> 
>>> As discussed earlier, we want this MV repair mechanism to handle all edge 
>>> cases. However, it would be difficult to design it in a way that detects 
>>> the root cause of each mismatch and repairs it accordingly. Additionally, 
>>> as you mentioned, this approach could introduce other issues, such as 
>>> violating the write atomicity guarantee.
>>> 
>>> > Second - If this happens it means that we’ve either lost base table data 
>>> > or paxos metadata. If that happened, we could force a base table update 
>>> > that rewrites the current base state with new timestamps making the extra 
>>> > view data removable. However this wouldn’t fix the case where the view 
>>> > cell has a timestamp from the future - although that’s not a case that C* 
>>> > can fix today either.
>>> 
>>> I don’t think the first or second option is ideal. We should treat the base 
>>> table as the source of truth. Modifying it—or forcing an update on it, even 
>>> if it’s just a timestamp change—is not a good approach and won’t solve all 
>>> problems.
>>> 
>>> > the idea to use anti-compaction makes a lot more sense now (in principle 
>>> > - I don’t think it’s workable in practice)
>>> 
>>> I have one question regarding anti-compaction. Is the main concern that 
>>> processing too much data during anti-compaction could cause issues for the 
>>> cluster? 
>>> 
>>> > I guess you could add some sort of assassin cell that is meant to remove 
>>> > a cell with a specific timestamp and value, but is otherwise invisible. 
>>> 
>>> The idea of the assassin cell is interesting. To prevent data from 
>>> being incorrectly removed during the repair process, we need to ensure a 
>>> quorum of nodes is available and agrees on the same value before repairing 
>>> a materialized view (MV) row or cell. However, this could be very 
>>> expensive, as it requires coordination to repair even a single row.
>>> 
>>> I think there are a few key differences between MV repair and normal 
>>> anti-entropy repair:
>>> 
>>>  1. For normal anti-entropy repair, there is no single source of truth.
>>> All replicas act as sources of truth, and eventual consistency is achieved 
>>> by merging data across them. Even if some data is lost or bugs result in 
>>> future timestamps, the replicas will eventually converge on the same 
>>> (possibly incorrect) value, and that value becomes the accepted truth. In 
>>> contrast, MV repair relies on the base table as the source of truth and 
>>> modifies the MV if inconsistencies are detected. This means that in some 
>>> cases, simply merging data won’t resolve the issue, since Cassandra 
>>> resolves conflicts using timestamps and there’s no guarantee the base table 
>>> will always "win" unless we change the MV merging logic—which I’ve been 
>>> trying to avoid.
>>> 
>>>  2. I see MV repair as more of an on-demand operation, whereas normal 
>>> anti-entropy repair needs to run regularly. This means we shouldn’t treat 
>>> MV repair the same way as existing repairs. When an operator initiates MV 
>>> repair, they need to ensure that sufficient resources are available to 
>>> support it.
>>> 
>>> 
>>> On Thu, Jun 12, 2025 at 8:53 AM Blake Eggleston <bl...@ultrablake.com> 
>>> wrote:
>>>> Got it, thanks for clearing that up. I’d seen the strict liveness code 
>>>> around but didn’t realize it was MV related and hadn’t dug into what it 
>>>> did or how it worked. 
>>>> 
>>>> I think you’re right about the row liveness update working for extra data 
>>>> with timestamps lower than the most recent base table update.
>>>> 
>>>> I see what you mean about the timestamp from the future case. I thought of 
>>>> 3 options:
>>>> 
>>>> First - we interpret view data with higher timestamps than the base table 
>>>> as data that’s missing from the base and replicate it into the base table. 
>>>> The timestamp of the missing data may be below the paxos timestamp low 
>>>> bound so we’d have to adjust the paxos coordination logic to allow that in 
>>>> this case. Depending on how the view got this way it may also tear writes 
>>>> to the base table, breaking the write atomicity promise.
>>>> 
>>>> Second - If this happens it means that we’ve either lost base table data 
>>>> or paxos metadata. If that happened, we could force a base table update 
>>>> that rewrites the current base state with new timestamps making the extra 
>>>> view data removable. However this wouldn’t fix the case where the view 
>>>> cell has a timestamp from the future - although that’s not a case that C* 
>>>> can fix today either.
>>>> 
>>>> Third - we add a new tombstone type or some mechanism to delete specific 
>>>> cells that doesn’t preclude correct writes with lower timestamps from 
>>>> being visible. I’m not sure how this would work, and the idea to use 
>>>> anti-compaction makes a lot more sense now (in principle - I don’t think 
>>>> it’s workable in practice). I guess you could add some sort of assassin 
>>>> cell that is meant to remove a cell with a specific timestamp and value, 
>>>> but is otherwise invisible. This seems dangerous though, since it’s likely 
>>>> there’s a replication problem that may resolve itself and the repair 
>>>> process would actually be removing data that the user intended to write.
>>>> 
>>>> Paulo - I don’t think storage changes are off the table, but they do 
>>>> expand the scope and risk of the proposal, so we should be careful.
>>>> 
>>>> On Wed, Jun 11, 2025, at 4:44 PM, Paulo Motta wrote:
>>>>>  > I’m not sure if this is the only edge case—there may be other issues 
>>>>> as well. I’m also unsure whether we should redesign the tombstone 
>>>>> handling for MVs, since that would involve changes to the storage engine. 
>>>>> To minimize impact there, the original proposal was to rebuild the 
>>>>> affected ranges using anti-compaction, just to be safe.
>>>>> 
>>>>> I haven't been following the discussion but I think one of the issues 
>>>>> with the materialized view "strict liveness" fix[1] is that we avoided 
>>>>> making invasive changes to the storage engine at the time, but this was 
>>>>> considered by Zhao on [1]. I think we shouldn't be trying to avoid 
>>>>> updates to the storage format as part of the MV implementation, if this 
>>>>> is what it takes to make MVs V2 reliable.
>>>>> 
>>>>> [1] - 
>>>>> https://issues.apache.org/jira/browse/CASSANDRA-11500?focusedCommentId=16101603&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16101603
>>>>> 
>>>>> On Wed, Jun 11, 2025 at 7:02 PM Runtian Liu <curly...@gmail.com> wrote:
>>>>>> The current design leverages strict liveness to shadow the old view row. 
>>>>>> When the view-indexed value changes from 'a' to 'b', no tombstone is 
>>>>>> written; instead, the old row is marked as expired by updating its 
>>>>>> liveness info with the timestamp of the change. If the column is later 
>>>>>> set back to 'a', the view row is re-inserted with a new, non-expired 
>>>>>> liveness info reflecting the latest timestamp.
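>>>>>>
>>>>>> In rough pseudocode terms, the idea is something like the sketch below
>>>>>> (stand-in types for illustration only - not the actual view write path
>>>>>> or the real LivenessInfo class):
>>>>>>
>>>>>>     // Stand-ins, illustration only.
>>>>>>     record LivenessInfo(long timestamp, boolean expired) {}
>>>>>>     record ViewRow(String indexedValue, LivenessInfo liveness) {}
>>>>>>
>>>>>>     final class StrictLivenessSketch
>>>>>>     {
>>>>>>         // Base update changes the indexed column 'a' -> 'b' at changeTs:
>>>>>>         // the old view row is shadowed by marking its liveness expired,
>>>>>>         // with no tombstone written.
>>>>>>         static ViewRow shadowOldRow(ViewRow old, long changeTs)
>>>>>>         {
>>>>>>             return new ViewRow(old.indexedValue(),
>>>>>>                                new LivenessInfo(changeTs, true));
>>>>>>         }
>>>>>>
>>>>>>         // The column is later set back to 'a' at newTs: the view row is
>>>>>>         // re-inserted with fresh, non-expired liveness.
>>>>>>         static ViewRow reinsert(String value, long newTs)
>>>>>>         {
>>>>>>             return new ViewRow(value, new LivenessInfo(newTs, false));
>>>>>>         }
>>>>>>
>>>>>>         // Strict liveness: row visibility is driven by the liveness info.
>>>>>>         static boolean isLive(ViewRow row)
>>>>>>         {
>>>>>>             return !row.liveness().expired();
>>>>>>         }
>>>>>>     }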
>>>>>> 
>>>>>> To delete an extra row in the materialized view (MV), we can likely use 
>>>>>> the same approach—marking it as shadowed by updating the liveness info. 
>>>>>> However, in the case of inconsistencies where a column in the MV has a 
>>>>>> higher timestamp than the corresponding column in the base table, this 
>>>>>> row-level liveness mechanism is insufficient.
>>>>>> 
>>>>>> Even for the case where we delete the row by marking its liveness info 
>>>>>> as expired during repair, there are concerns. Since this introduces a 
>>>>>> data mutation as part of the repair process, it’s unclear whether there 
>>>>>> could be edge cases we’re missing. This approach may risk unexpected 
>>>>>> side effects if the repair logic is not carefully aligned with write 
>>>>>> path semantics.
>>>>>> 
>>>>>> 
>>>>>> On Thu, Jun 12, 2025 at 3:59 AM Blake Eggleston <bl...@ultrablake.com> 
>>>>>> wrote:
>>>>>>> That’s a good point, although as described I don’t think that could 
>>>>>>> ever work properly, even in normal operation. Either we’re 
>>>>>>> misunderstanding something, or this is a flaw in the current MV design.
>>>>>>> 
>>>>>>> Assuming changing the view indexed column results in a tombstone being 
>>>>>>> applied to the view row for the previous value, if we wrote the other 
>>>>>>> base columns (the non view indexed ones) to the view with the same 
>>>>>>> timestamps they have on the base, then changing the view indexed value 
>>>>>>> from ‘a’ to ‘b’, then back to ‘a’ would always cause this problem. I 
>>>>>>> think you’d need to always update the column timestamps on the view to 
>>>>>>> be >= the view column timestamp on the base.
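>>>>>>>
>>>>>>> As a rough sketch of that timestamp rule (made-up names, not the actual
>>>>>>> view update code):
>>>>>>>
>>>>>>>     // Pick the timestamp for a non view indexed column written into the
>>>>>>>     // view row, so an 'a' -> 'b' -> 'a' cycle can't leave view cells
>>>>>>>     // shadowed behind an earlier view-row deletion.
>>>>>>>     final class ViewTimestampRule
>>>>>>>     {
>>>>>>>         static long viewCellTimestamp(long baseCellTs,
>>>>>>>                                       long baseViewIndexedColumnTs)
>>>>>>>         {
>>>>>>>             return Math.max(baseCellTs, baseViewIndexedColumnTs);
>>>>>>>         }
>>>>>>>     }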
>>>>>>> 
>>>>>>> On Tue, Jun 10, 2025, at 11:38 PM, Runtian Liu wrote:
>>>>>>>> > In the case of a missed update, we'll have a new value and we can 
>>>>>>>> > send a tombstone to the view with the timestamp of the most recent 
>>>>>>>> > update.
>>>>>>>> 
>>>>>>>> > then something has gone wrong and we should issue a tombstone using 
>>>>>>>> > the paxos repair timestamp as the tombstone timestamp.
>>>>>>>> 
>>>>>>>> The current MV implementation uses “strict liveness” to determine 
>>>>>>>> whether a row is live. I believe that using regular tombstones during 
>>>>>>>> repair could cause problems. For example, consider a base table with 
>>>>>>>> schema (pk, ck, v1, v2) and a materialized view with schema (v1, pk, 
>>>>>>>> ck) -> v2. If, for some reason, we detect an extra row in the MV and 
>>>>>>>> delete it using a tombstone with the latest update timestamp, we may 
>>>>>>>> run into issues. Suppose we later update the base table’s v1 field to 
>>>>>>>> match the MV row we previously deleted, and the v2 value now has an 
>>>>>>>> older timestamp. In that case, the previously issued tombstone could 
>>>>>>>> still shadow the v2 column, which is unintended. That is why I was 
>>>>>>>> asking if we are going to introduce a new kind of tombstones. I’m not 
>>>>>>>> sure if this is the only edge case—there may be other issues as well. 
>>>>>>>> I’m also unsure whether we should redesign the tombstone handling for 
>>>>>>>> MVs, since that would involve changes to the storage engine. To 
>>>>>>>> minimize impact there, the original proposal was to rebuild the 
>>>>>>>> affected ranges using anti-compaction, just to be safe.
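>>>>>>>>
>>>>>>>> To make the concern concrete, a small worked example with made-up
>>>>>>>> timestamps and plain last-write-wins reconciliation:
>>>>>>>>
>>>>>>>>     // A regular tombstone issued by repair at the "latest update"
>>>>>>>>     // timestamp keeps shadowing a legitimately re-inserted v2 cell
>>>>>>>>     // whose timestamp is older.
>>>>>>>>     final class TombstoneShadowExample
>>>>>>>>     {
>>>>>>>>         public static void main(String[] args)
>>>>>>>>         {
>>>>>>>>             long repairTombstoneTs = 100; // tombstone from MV repair
>>>>>>>>             long reinsertedV2Ts = 90;     // v2's original, older
>>>>>>>>                                           // timestamp, carried back
>>>>>>>>                                           // when v1 flips to 'a' again
>>>>>>>>
>>>>>>>>             // Cells survive a deletion only with a strictly higher
>>>>>>>>             // timestamp, so this prints false: v2 stays shadowed.
>>>>>>>>             boolean v2Visible = reinsertedV2Ts > repairTombstoneTs;
>>>>>>>>             System.out.println("v2 visible? " + v2Visible);
>>>>>>>>         }
>>>>>>>>     }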
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Wed, Jun 11, 2025 at 1:20 AM Blake Eggleston <bl...@ultrablake.com> 
>>>>>>>> wrote:
>>>>>>>>>>  Extra row in MV (assuming the tombstone is gone in the base table) 
>>>>>>>>>> — how should we fix this?
>>>>>>>>>>
>>>>>>>>> 
>>>>>>>>> This would mean that the base table had either updated or deleted a 
>>>>>>>>> row and the view didn't receive the corresponding delete.
>>>>>>>>> 
>>>>>>>>> In the case of a missed update, we'll have a new value and we can 
>>>>>>>>> send a tombstone to the view with the timestamp of the most recent 
>>>>>>>>> update. Since timestamps issued by paxos and accord writes are always 
>>>>>>>>> increasing monotonically and don't have collisions, this is safe.
>>>>>>>>> 
>>>>>>>>> In the case of a row deletion, we'd also want to send a tombstone 
>>>>>>>>> with the same timestamp, however since tombstones can be purged, we 
>>>>>>>>> may not have that information and would have to treat it like the 
>>>>>>>>> view has a higher timestamp than the base table.
>>>>>>>>> 
>>>>>>>>>> Inconsistency (timestamps don’t match) — it’s easy to fix when the 
>>>>>>>>>> base table has higher timestamps, but how do we resolve it when the 
>>>>>>>>>> MV columns have higher timestamps?
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> There are 2 ways this could happen. First is that a write failed and 
>>>>>>>>> paxos repair hasn't completed it, which is expected, and the second 
>>>>>>>>> is a replication bug or base table data loss. You'd need to compare 
>>>>>>>>> the view timestamp to the paxos repair history to tell which it is. 
>>>>>>>>> If the view timestamp is higher than the most recent paxos repair 
>>>>>>>>> timestamp for the key, then it may just be a failed write and we 
>>>>>>>>> should do nothing. If the view timestamp is less than the most recent 
>>>>>>>>> paxos repair timestamp for that key and higher than the base 
>>>>>>>>> timestamp, then something has gone wrong and we should issue a 
>>>>>>>>> tombstone using the paxos repair timestamp as the tombstone 
>>>>>>>>> timestamp. This is safe to do because the paxos repair timestamps act 
>>>>>>>>> as a low bound for ballots paxos will process, so it wouldn't be 
>>>>>>>>> possible for a legitimate write to be shadowed by this tombstone.
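>>>>>>>>>
>>>>>>>>> In code-ish terms, the check would be roughly this (names made up for
>>>>>>>>> illustration, not an existing API):
>>>>>>>>>
>>>>>>>>>     enum ViewCellAction { DO_NOTHING, TOMBSTONE_AT_PAXOS_REPAIR_TS }
>>>>>>>>>
>>>>>>>>>     final class HigherViewTimestampCheck
>>>>>>>>>     {
>>>>>>>>>         // Only called when the view cell's timestamp is higher than
>>>>>>>>>         // the corresponding base timestamp.
>>>>>>>>>         static ViewCellAction classify(long viewTs, long baseTs,
>>>>>>>>>                                        long paxosRepairTs)
>>>>>>>>>         {
>>>>>>>>>             if (viewTs > paxosRepairTs)
>>>>>>>>>                 // may just be a failed write that paxos repair hasn't
>>>>>>>>>                 // completed yet - leave it alone
>>>>>>>>>                 return ViewCellAction.DO_NOTHING;
>>>>>>>>>
>>>>>>>>>             // viewTs <= paxosRepairTs and viewTs > baseTs: something
>>>>>>>>>             // went wrong; a tombstone at paxosRepairTs is safe because
>>>>>>>>>             // repair timestamps are a low bound for ballots paxos will
>>>>>>>>>             // process, so no legitimate write can be shadowed by it.
>>>>>>>>>             return ViewCellAction.TOMBSTONE_AT_PAXOS_REPAIR_TS;
>>>>>>>>>         }
>>>>>>>>>     }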
>>>>>>>>> 
>>>>>>>>>> Do we need to introduce a new kind of tombstone to shadow the rows 
>>>>>>>>>> in the MV for cases 2 and 3? If yes, how will this tombstone work? 
>>>>>>>>>> If no, how should we fix the MV data?
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> No, a normal tombstone would work.
>>>>>>>>> 
>>>>>>>>> On Tue, Jun 10, 2025, at 2:42 AM, Runtian Liu wrote:
>>>>>>>>>> Okay, let’s put the efficiency discussion on hold for now. I want to 
>>>>>>>>>> make sure the actual repair process after detecting inconsistencies 
>>>>>>>>>> will work with the index-based solution.
>>>>>>>>>> 
>>>>>>>>>> When a mismatch is detected, the MV replica will need to stream its 
>>>>>>>>>> index file to the base table replica. The base table will then 
>>>>>>>>>> perform a comparison between the two files.
>>>>>>>>>> 
>>>>>>>>>> There are three cases we need to handle:
>>>>>>>>>> 
>>>>>>>>>>  1. Missing row in MV — this is straightforward; we can propagate 
>>>>>>>>>> the data to the MV.
>>>>>>>>>> 
>>>>>>>>>>  2. Extra row in MV (assuming the tombstone is gone in the base 
>>>>>>>>>> table) — how should we fix this?
>>>>>>>>>> 
>>>>>>>>>>  3. Inconsistency (timestamps don’t match) — it’s easy to fix when 
>>>>>>>>>> the base table has higher timestamps, but how do we resolve it when 
>>>>>>>>>> the MV columns have higher timestamps?
>>>>>>>>>> 
>>>>>>>>>> Do we need to introduce a new kind of tombstone to shadow the rows 
>>>>>>>>>> in the MV for cases 2 and 3? If yes, how will this tombstone work? 
>>>>>>>>>> If no, how should we fix the MV data?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Mon, Jun 9, 2025 at 11:00 AM Blake Eggleston 
>>>>>>>>>> <bl...@ultrablake.com> wrote:
>>>>>>>>>>> > hopefully we can come up with a solution that everyone agrees on.
>>>>>>>>>>> 
>>>>>>>>>>> I’m sure we can, I think we’ve been making good progress
>>>>>>>>>>> 
>>>>>>>>>>> > My main concern with the index-based solution is the overhead it 
>>>>>>>>>>> > adds to the hot path, as well as having to build indexes 
>>>>>>>>>>> > periodically.
>>>>>>>>>>> 
>>>>>>>>>>> So the additional overhead of maintaining a storage attached index 
>>>>>>>>>>> on the client write path is pretty minimal - it’s basically adding 
>>>>>>>>>>> data to an in memory trie. It’s a little extra work and memory 
>>>>>>>>>>> usage, but there isn’t any extra io or other blocking associated 
>>>>>>>>>>> with it. I’d expect the latency impact to be negligible.
>>>>>>>>>>> 
>>>>>>>>>>> > As mentioned earlier, this MV repair should be an infrequent 
>>>>>>>>>>> > operation
>>>>>>>>>>> 
>>>>>>>>>>> I don't think that's a safe assumption. There are a lot of 
>>>>>>>>>>> situations outside of data loss bugs where repair would need to be 
>>>>>>>>>>> run. 
>>>>>>>>>>> 
>>>>>>>>>>> These use cases could probably be handled by repairing the view 
>>>>>>>>>>> with other view replicas:
>>>>>>>>>>> 
>>>>>>>>>>> Scrubbing corrupt sstables
>>>>>>>>>>> Node replacement via backup
>>>>>>>>>>> 
>>>>>>>>>>> These use cases would need an actual MV repair to check consistency 
>>>>>>>>>>> with the base table:
>>>>>>>>>>> 
>>>>>>>>>>> Restoring a cluster from a backup
>>>>>>>>>>> Imported sstables via nodetool import
>>>>>>>>>>> Data loss from operator error
>>>>>>>>>>> Proactive consistency checks - ie preview repairs
>>>>>>>>>>> 
>>>>>>>>>>> Even if it is an infrequent operation, when operators need it, it 
>>>>>>>>>>> needs to be available and reliable.
>>>>>>>>>>> 
>>>>>>>>>>> It’s a fact that there are clusters where non-incremental repairs 
>>>>>>>>>>> are run on a cadence of a week or more to manage the overhead of 
>>>>>>>>>>> validation compactions. Assuming the cluster doesn’t have any 
>>>>>>>>>>> additional headroom, that would mean that any one of the above 
>>>>>>>>>>> events could cause views to remain out of sync for up to a week 
>>>>>>>>>>> while the full set of merkle trees is being built.
>>>>>>>>>>> 
>>>>>>>>>>> This delay eliminates a lot of the value of repair as a risk 
>>>>>>>>>>> mitigation tool. If I had to make a recommendation where a bad call 
>>>>>>>>>>> could cost me my job, the prospect of a 7 day delay on repair would 
>>>>>>>>>>> mean a strong no.
>>>>>>>>>>> 
>>>>>>>>>>> Some users also run preview repair continuously to detect data 
>>>>>>>>>>> consistency errors, so at least a subset of users will probably be 
>>>>>>>>>>> running MV repairs continuously - at least in preview mode.
>>>>>>>>>>> 
>>>>>>>>>>> That’s why I say that the replication path should be designed to 
>>>>>>>>>>> never need repair, and MV repair should be designed to be prepared 
>>>>>>>>>>> for the worst.
>>>>>>>>>>> 
>>>>>>>>>>> > I’m wondering if it’s possible to enable or disable index 
>>>>>>>>>>> > building dynamically so that we don’t always incur the cost for 
>>>>>>>>>>> > something that’s rarely needed.
>>>>>>>>>>> 
>>>>>>>>>>> I think this would be a really reasonable compromise as long as the 
>>>>>>>>>>> default is on. That way it’s as safe as possible by default, but 
>>>>>>>>>>> users who don’t care or have a separate system for repairing MVs 
>>>>>>>>>>> can opt out.
>>>>>>>>>>> 
>>>>>>>>>>> > I’m not sure what you mean by “data problems” here.
>>>>>>>>>>> 
>>>>>>>>>>> I mean out of sync views - either due to bugs, operator error, 
>>>>>>>>>>> corruption, etc
>>>>>>>>>>> 
>>>>>>>>>>> > Also, this does scale with cluster size—I’ve compared it to full 
>>>>>>>>>>> > repair, and this MV repair should behave similarly. That means as 
>>>>>>>>>>> > long as full repair works, this repair should work as well.
>>>>>>>>>>> 
>>>>>>>>>>> You could build the merkle trees at about the same cost as a full 
>>>>>>>>>>> repair, but the actual data repair path is completely different for 
>>>>>>>>>>> MVs, and that's the part that doesn't scale well. As you know, with 
>>>>>>>>>>> normal repair, we just stream data for ranges detected as out of 
>>>>>>>>>>> sync. For MVs, since the data isn't in base partition order, the 
>>>>>>>>>>> view data for an out of sync view range needs to be read out and 
>>>>>>>>>>> streamed to every base replica that it’s detected a mismatch 
>>>>>>>>>>> against. So in the example I gave with the 300 node cluster, you’re 
>>>>>>>>>>> looking at reading and transmitting the same partition at least 100 
>>>>>>>>>>> times in the best case, and the cost of this keeps going up as the 
>>>>>>>>>>> cluster increases in size. That's the part that doesn't scale well. 
>>>>>>>>>>> 
>>>>>>>>>>> This is also one of the benefits of the index design. Since it stores 
>>>>>>>>>>> data in segments that roughly correspond to points on the grid, 
>>>>>>>>>>> you’re not rereading the same data over and over. A repair for a 
>>>>>>>>>>> given grid point only reads an amount of data proportional to the 
>>>>>>>>>>> data in common for the base/view grid point, and it’s stored in a 
>>>>>>>>>>> small enough granularity that the base can calculate what data 
>>>>>>>>>>> needs to be sent to the view without having to read the entire view 
>>>>>>>>>>> partition.
>>>>>>>>>>> 
>>>>>>>>>>> On Sat, Jun 7, 2025, at 7:42 PM, Runtian Liu wrote:
>>>>>>>>>>>> Thanks, Blake. I’m open to iterating on the design, and hopefully 
>>>>>>>>>>>> we can come up with a solution that everyone agrees on.
>>>>>>>>>>>> 
>>>>>>>>>>>> My main concern with the index-based solution is the overhead it 
>>>>>>>>>>>> adds to the hot path, as well as having to build indexes 
>>>>>>>>>>>> periodically. As mentioned earlier, this MV repair should be an 
>>>>>>>>>>>> infrequent operation, but the index-based approach shifts some of 
>>>>>>>>>>>> the work to the hot path in order to allow repairs that touch only 
>>>>>>>>>>>> a few nodes.
>>>>>>>>>>>> 
>>>>>>>>>>>> I’m wondering if it’s possible to enable or disable index building 
>>>>>>>>>>>> dynamically so that we don’t always incur the cost for something 
>>>>>>>>>>>> that’s rarely needed.
>>>>>>>>>>>> 
>>>>>>>>>>>> > it degrades operators ability to react to data problems by 
>>>>>>>>>>>> > imposing a significant upfront processing burden on repair, and 
>>>>>>>>>>>> > that it doesn’t scale well with cluster size
>>>>>>>>>>>> 
>>>>>>>>>>>> I’m not sure what you mean by “data problems” here. Also, this 
>>>>>>>>>>>> does scale with cluster size—I’ve compared it to full repair, and 
>>>>>>>>>>>> this MV repair should behave similarly. That means as long as full 
>>>>>>>>>>>> repair works, this repair should work as well.
>>>>>>>>>>>> 
>>>>>>>>>>>> For example, regardless of how large the cluster is, you can 
>>>>>>>>>>>> always enable Merkle tree building on 10% of the nodes at a time 
>>>>>>>>>>>> until all the trees are ready.
>>>>>>>>>>>> 
>>>>>>>>>>>> I understand that coordinating this type of repair is harder than 
>>>>>>>>>>>> what we currently support, but with CEP-37, we should be able to 
>>>>>>>>>>>> handle this coordination without adding too much burden on the 
>>>>>>>>>>>> operator side.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sat, Jun 7, 2025 at 8:28 AM Blake Eggleston 
>>>>>>>>>>>> <bl...@ultrablake.com> wrote:
>>>>>>>>>>>>>> I don't see any outcome here that is good for the community 
>>>>>>>>>>>>>> though. Either Runtian caves and adopts your design that he (and 
>>>>>>>>>>>>>> I) consider inferior, or he is prevented from contributing this 
>>>>>>>>>>>>>> work.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hey Runtian, fwiw, these aren't the only 2 options. This isn’t a 
>>>>>>>>>>>>> competition. We can collaborate and figure out the best approach 
>>>>>>>>>>>>> to the problem. I’d like to keep discussing it if you’re open to 
>>>>>>>>>>>>> iterating on the design.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I’m not married to our proposal, it’s just the cleanest way we 
>>>>>>>>>>>>> could think of to address what Jon and I both see as blockers in 
>>>>>>>>>>>>> the current proposal. It’s not set in stone though.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Fri, Jun 6, 2025, at 1:32 PM, Benedict Elliott Smith wrote:
>>>>>>>>>>>>>> Hmm, I am very surprised as I helped write that and I distinctly 
>>>>>>>>>>>>>> recall a specific goal was avoiding binding vetoes as they're so 
>>>>>>>>>>>>>> toxic.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Ok, I guess you can block this work if you like.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I don't see any outcome here that is good for the community 
>>>>>>>>>>>>>> though. Either Runtian caves and adopts your design that he (and 
>>>>>>>>>>>>>> I) consider inferior, or he is prevented from contributing this 
>>>>>>>>>>>>>> work. That isn't a functioning community in my mind, so I'll be 
>>>>>>>>>>>>>> noping out for a while, as I don't see much value here right now.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 2025/06/06 18:31:08 Blake Eggleston wrote:
>>>>>>>>>>>>>> > Hi Benedict, that’s actually not true. 
>>>>>>>>>>>>>> > 
>>>>>>>>>>>>>> > Here’s a link to the project governance page: 
>>>>>>>>>>>>>> > https://cwiki.apache.org/confluence/display/CASSANDRA/Cassandra+Project+Governance
>>>>>>>>>>>>>> > 
>>>>>>>>>>>>>> > The CEP section says:
>>>>>>>>>>>>>> > 
>>>>>>>>>>>>>> > “*Once the proposal is finalized and any major committer 
>>>>>>>>>>>>>> > dissent reconciled, call a [VOTE] on the ML to have the 
>>>>>>>>>>>>>> > proposal adopted. The criteria for acceptance is consensus (3 
>>>>>>>>>>>>>> > binding +1 votes and no binding vetoes). The vote should 
>>>>>>>>>>>>>> > remain open for 72 hours.*”
>>>>>>>>>>>>>> > 
>>>>>>>>>>>>>> > So they’re definitely vetoable. 
>>>>>>>>>>>>>> > 
>>>>>>>>>>>>>> > Also note the part about “*Once the proposal is finalized and 
>>>>>>>>>>>>>> > any major committer dissent reconciled,*” being a prerequisite 
>>>>>>>>>>>>>> > for moving a CEP to [VOTE]. Given the as yet unreconciled 
>>>>>>>>>>>>>> > committer dissent, it wouldn’t even be appropriate to move to 
>>>>>>>>>>>>>> > a VOTE until we get to the bottom of this repair discussion.
>>>>>>>>>>>>>> > 
>>>>>>>>>>>>>> > On Fri, Jun 6, 2025, at 12:31 AM, Benedict Elliott Smith wrote:
>>>>>>>>>>>>>> > > > but the snapshot repair design is not a viable path 
>>>>>>>>>>>>>> > > > forward. It’s the first iteration of a repair design. 
>>>>>>>>>>>>>> > > > We’ve proposed a second iteration, and we’re open to a 
>>>>>>>>>>>>>> > > > third iteration.
>>>>>>>>>>>>>> > > 
>>>>>>>>>>>>>> > > I shan't be participating further in discussion, but I want 
>>>>>>>>>>>>>> > > to make a point of order. The CEP process has no vetoes, so 
>>>>>>>>>>>>>> > > you are not empowered to declare that a design is not viable 
>>>>>>>>>>>>>> > > without the input of the wider community.
>>>>>>>>>>>>>> > > 
>>>>>>>>>>>>>> > > 
>>>>>>>>>>>>>> > > On 2025/06/05 03:58:59 Blake Eggleston wrote:
>>>>>>>>>>>>>> > > > You can detect and fix the mismatch in a single round of 
>>>>>>>>>>>>>> > > > repair, but the amount of work needed to do it is 
>>>>>>>>>>>>>> > > > _significantly_ higher with snapshot repair. Consider a 
>>>>>>>>>>>>>> > > > case where we have a 300 node cluster w/ RF 3, where each 
>>>>>>>>>>>>>> > > > view partition contains entries mapping to every token 
>>>>>>>>>>>>>> > > > range in the cluster - so 100 ranges. If we lose a view 
>>>>>>>>>>>>>> > > > sstable, it will affect an entire row/column of the grid. 
>>>>>>>>>>>>>> > > > Repair is going to scan all data in the mismatching view 
>>>>>>>>>>>>>> > > > token ranges 100 times, and each base range once. So 
>>>>>>>>>>>>>> > > > you’re looking at 200 range scans.
>>>>>>>>>>>>>> > > > 
>>>>>>>>>>>>>> > > > Now, you may argue that you can merge the duplicate view 
>>>>>>>>>>>>>> > > > scans into a single scan while you repair all token ranges 
>>>>>>>>>>>>>> > > > in parallel. I’m skeptical that’s going to be achievable 
>>>>>>>>>>>>>> > > > in practice, but even if it is, we’re now talking about 
>>>>>>>>>>>>>> > > > the view replica hypothetically doing a pairwise repair 
>>>>>>>>>>>>>> > > > with every other replica in the cluster at the same time. 
>>>>>>>>>>>>>> > > > Neither of these options is workable.
>>>>>>>>>>>>>> > > > 
>>>>>>>>>>>>>> > > > Let’s take a step back though, because I think we’re 
>>>>>>>>>>>>>> > > > getting lost in the weeds.
>>>>>>>>>>>>>> > > > 
>>>>>>>>>>>>>> > > > The repair design in the CEP has some high level concepts 
>>>>>>>>>>>>>> > > > that make a lot of sense, the idea of repairing a grid is 
>>>>>>>>>>>>>> > > > really smart. However, it has some significant drawbacks 
>>>>>>>>>>>>>> > > > that remain unaddressed. I want this CEP to succeed, and I 
>>>>>>>>>>>>>> > > > know Jon does too, but the snapshot repair design is not a 
>>>>>>>>>>>>>> > > > viable path forward. It’s the first iteration of a repair 
>>>>>>>>>>>>>> > > > design. We’ve proposed a second iteration, and we’re open 
>>>>>>>>>>>>>> > > > to a third iteration. This part of the CEP process is 
>>>>>>>>>>>>>> > > > meant to identify and address shortcomings, I don’t think 
>>>>>>>>>>>>>> > > > that continuing to dissect the snapshot repair design is 
>>>>>>>>>>>>>> > > > making progress in that direction.
>>>>>>>>>>>>>> > > > 
>>>>>>>>>>>>>> > > > On Wed, Jun 4, 2025, at 2:04 PM, Runtian Liu wrote:
>>>>>>>>>>>>>> > > > > >  We potentially have to do it several times on each 
>>>>>>>>>>>>>> > > > > > node, depending on the size of the range. Smaller 
>>>>>>>>>>>>>> > > > > > ranges increase the size of the board exponentially, 
>>>>>>>>>>>>>> > > > > > larger ranges increase the number of SSTables that 
>>>>>>>>>>>>>> > > > > > would be involved in each compaction.
>>>>>>>>>>>>>> > > > > As described in the CEP example, this can be handled in 
>>>>>>>>>>>>>> > > > > a single round of repair. We first identify all the 
>>>>>>>>>>>>>> > > > > points in the grid that require repair, then perform 
>>>>>>>>>>>>>> > > > > anti-compaction and stream data based on a second scan 
>>>>>>>>>>>>>> > > > > over those identified points. This applies to the 
>>>>>>>>>>>>>> > > > > snapshot-based solution—without an index, repairing a 
>>>>>>>>>>>>>> > > > > single point in that grid requires scanning the entire 
>>>>>>>>>>>>>> > > > > base table partition (token range). In contrast, with 
>>>>>>>>>>>>>> > > > > the index-based solution—as in the example you 
>>>>>>>>>>>>>> > > > > referenced—if a large block of data is corrupted, even 
>>>>>>>>>>>>>> > > > > though the index is used for comparison, many key 
>>>>>>>>>>>>>> > > > > mismatches may occur. This can lead to random disk 
>>>>>>>>>>>>>> > > > > access to the original data files, which could cause 
>>>>>>>>>>>>>> > > > > performance issues. For the case you mentioned for the 
>>>>>>>>>>>>>> > > > > snapshot-based solution, it should not take months to 
>>>>>>>>>>>>>> > > > > repair all the data; one round of repair should be 
>>>>>>>>>>>>>> > > > > enough. The actual repair phase is split from the 
>>>>>>>>>>>>>> > > > > detection phase.
>>>>>>>>>>>>>> > > > > 
>>>>>>>>>>>>>> > > > > 
>>>>>>>>>>>>>> > > > > On Thu, Jun 5, 2025 at 12:12 AM Jon Haddad 
>>>>>>>>>>>>>> > > > > <j...@rustyrazorblade.com> wrote:
>>>>>>>>>>>>>> > > > >> > This isn’t really the whole story. The amount of 
>>>>>>>>>>>>>> > > > >> > wasted scans on index repairs is negligible. If a 
>>>>>>>>>>>>>> > > > >> > difference is detected with snapshot repairs though, 
>>>>>>>>>>>>>> > > > >> > you have to read the entire partition from both the 
>>>>>>>>>>>>>> > > > >> > view and base table to calculate what needs to be 
>>>>>>>>>>>>>> > > > >> > fixed.
>>>>>>>>>>>>>> > > > >> 
>>>>>>>>>>>>>> > > > >> You nailed it.
>>>>>>>>>>>>>> > > > >> 
>>>>>>>>>>>>>> > > > >> When the base table is converted to a view, and sent to 
>>>>>>>>>>>>>> > > > >> the view, the information we have is that one of the 
>>>>>>>>>>>>>> > > > >> view's partition keys needs a repair.  That's going to 
>>>>>>>>>>>>>> > > > >> be different from the partition key of the base table.  
>>>>>>>>>>>>>> > > > >> As a result, on the base table, for each affected 
>>>>>>>>>>>>>> > > > >> range, we'd have to issue another compaction across the 
>>>>>>>>>>>>>> > > > >> entire set of sstables that could have the data the 
>>>>>>>>>>>>>> > > > >> view needs (potentially many GB), in order to rebuild 
>>>>>>>>>>>>>> > > > >> the corrected version of the partition, then send it 
>>>>>>>>>>>>>> > > > >> over to the view.  Without an index in place, we have 
>>>>>>>>>>>>>> > > > >> to do yet another scan, per-affected range.  
>>>>>>>>>>>>>> > > > >> 
>>>>>>>>>>>>>> > > > >> Consider the case of a single corrupted SSTable on the 
>>>>>>>>>>>>>> > > > >> view that's removed from the filesystem, or the data is 
>>>>>>>>>>>>>> > > > >> simply missing after being restored from an 
>>>>>>>>>>>>>> > > > >> inconsistent backup.  It presumably contains lots of 
>>>>>>>>>>>>>> > > > >> partitions, which map to base partitions all over the 
>>>>>>>>>>>>>> > > > >> cluster, in a lot of different token ranges.  For every 
>>>>>>>>>>>>>> > > > >> one of those ranges (hundreds, to tens of thousands of 
>>>>>>>>>>>>>> > > > >> them given the checkerboard design), when finding the 
>>>>>>>>>>>>>> > > > >> missing data in the base, you'll have to perform a 
>>>>>>>>>>>>>> > > > >> compaction across all the SSTables that potentially 
>>>>>>>>>>>>>> > > > >> contain the missing data just to rebuild the 
>>>>>>>>>>>>>> > > > >> view-oriented partitions that need to be sent to the 
>>>>>>>>>>>>>> > > > >> view.  The complexity of this operation can be looked 
>>>>>>>>>>>>>> > > > >> at as O(N*M) where N and M are the number of ranges in 
>>>>>>>>>>>>>> > > > >> the base table and the view affected by the corruption, 
>>>>>>>>>>>>>> > > > >> respectively.  Without an index in place, finding the 
>>>>>>>>>>>>>> > > > >> missing data is very expensive.  We potentially have to 
>>>>>>>>>>>>>> > > > >> do it several times on each node, depending on the size 
>>>>>>>>>>>>>> > > > >> of the range.  Smaller ranges increase the size of the 
>>>>>>>>>>>>>> > > > >> board exponentially, larger ranges increase the number 
>>>>>>>>>>>>>> > > > >> of SSTables that would be involved in each compaction.
>>>>>>>>>>>>>> > > > >> 
>>>>>>>>>>>>>> > > > >> Then you send that data over to the view, the view does 
>>>>>>>>>>>>>> > > > >> its anti-compaction thing, again, once per affected 
>>>>>>>>>>>>>> > > > >> range.  So now the view has to do an anti-compaction 
>>>>>>>>>>>>>> > > > >> once per block on the board that's affected by the 
>>>>>>>>>>>>>> > > > >> missing data.  
>>>>>>>>>>>>>> > > > >> 
>>>>>>>>>>>>>> > > > >> Doing hundreds or thousands of these will add up pretty 
>>>>>>>>>>>>>> > > > >> quickly.
>>>>>>>>>>>>>> > > > >> 
>>>>>>>>>>>>>> > > > >> When I said that a repair could take months, this is 
>>>>>>>>>>>>>> > > > >> what I had in mind.
>>>>>>>>>>>>>> > > > >> 
>>>>>>>>>>>>>> > > > >> 
>>>>>>>>>>>>>> > > > >> 
>>>>>>>>>>>>>> > > > >> 
>>>>>>>>>>>>>> > > > >> On Tue, Jun 3, 2025 at 11:10 AM Blake Eggleston 
>>>>>>>>>>>>>> > > > >> <bl...@ultrablake.com> wrote:
>>>>>>>>>>>>>> > > > >>> > Adds overhead in the hot path due to maintaining 
>>>>>>>>>>>>>> > > > >>> > indexes. Extra memory needed during write path and 
>>>>>>>>>>>>>> > > > >>> > compaction.
>>>>>>>>>>>>>> > > > >>> 
>>>>>>>>>>>>>> > > > >>> I’d make the same argument about the overhead of 
>>>>>>>>>>>>>> > > > >>> maintaining the index that Jon just made about the 
>>>>>>>>>>>>>> > > > >>> disk space required. The relatively predictable 
>>>>>>>>>>>>>> > > > >>> overhead of maintaining the index as part of the write 
>>>>>>>>>>>>>> > > > >>> and compaction paths is a pro, not a con. Although 
>>>>>>>>>>>>>> > > > >>> you’re not always paying the cost of building a merkle 
>>>>>>>>>>>>>> > > > >>> tree with snapshot repair, it can impact the hot path 
>>>>>>>>>>>>>> > > > >>> and you do have to plan for it.
>>>>>>>>>>>>>> > > > >>> 
>>>>>>>>>>>>>> > > > >>> > Verifies index content, not actual data—may miss 
>>>>>>>>>>>>>> > > > >>> > low-probability errors like bit flips
>>>>>>>>>>>>>> > > > >>> 
>>>>>>>>>>>>>> > > > >>> Presumably this could be handled by the views 
>>>>>>>>>>>>>> > > > >>> performing repair against each other? You could also 
>>>>>>>>>>>>>> > > > >>> periodically rebuild the index or perform checksums 
>>>>>>>>>>>>>> > > > >>> against the sstable content.
>>>>>>>>>>>>>> > > > >>> 
>>>>>>>>>>>>>> > > > >>> > Extra data scan during inconsistency detection
>>>>>>>>>>>>>> > > > >>> > Index: Since the data covered by certain indexes is 
>>>>>>>>>>>>>> > > > >>> > not guaranteed to be fully contained within a single 
>>>>>>>>>>>>>> > > > >>> > node as the topology changes, some data scans may be 
>>>>>>>>>>>>>> > > > >>> > wasted.
>>>>>>>>>>>>>> > > > >>> > Snapshots: No extra data scan
>>>>>>>>>>>>>> > > > >>> 
>>>>>>>>>>>>>> > > > >>> This isn’t really the whole story. The amount of 
>>>>>>>>>>>>>> > > > >>> wasted scans on index repairs is negligible. If a 
>>>>>>>>>>>>>> > > > >>> difference is detected with snapshot repairs though, 
>>>>>>>>>>>>>> > > > >>> you have to read the entire partition from both the 
>>>>>>>>>>>>>> > > > >>> view and base table to calculate what needs to be 
>>>>>>>>>>>>>> > > > >>> fixed.
>>>>>>>>>>>>>> > > > >>> 
>>>>>>>>>>>>>> > > > >>> On Tue, Jun 3, 2025, at 10:27 AM, Jon Haddad wrote:
>>>>>>>>>>>>>> > > > >>>> One practical aspect that isn't immediately obvious 
>>>>>>>>>>>>>> > > > >>>> is the disk space consideration for snapshots.
>>>>>>>>>>>>>> > > > >>>> 
>>>>>>>>>>>>>> > > > >>>> When you have a table with a mixed workload using LCS 
>>>>>>>>>>>>>> > > > >>>> or UCS with scaling parameters like L10 and initiate 
>>>>>>>>>>>>>> > > > >>>> a repair, the disk usage will increase as long as the 
>>>>>>>>>>>>>> > > > >>>> snapshot persists and the table continues to receive 
>>>>>>>>>>>>>> > > > >>>> writes. This aspect is understood and factored into 
>>>>>>>>>>>>>> > > > >>>> the design.
>>>>>>>>>>>>>> > > > >>>> 
>>>>>>>>>>>>>> > > > >>>> However, a more nuanced point is the necessity to 
>>>>>>>>>>>>>> > > > >>>> maintain sufficient disk headroom specifically for 
>>>>>>>>>>>>>> > > > >>>> running repairs. This echoes the challenge with STCS 
>>>>>>>>>>>>>> > > > >>>> compaction, where enough space must be available to 
>>>>>>>>>>>>>> > > > >>>> accommodate the largest SSTables, even when they are 
>>>>>>>>>>>>>> > > > >>>> not being actively compacted.
>>>>>>>>>>>>>> > > > >>>> 
>>>>>>>>>>>>>> > > > >>>> For example, if a repair involves rewriting 100GB of 
>>>>>>>>>>>>>> > > > >>>> SSTable data, you'll consistently need to reserve 
>>>>>>>>>>>>>> > > > >>>> 100GB of free space to facilitate this.
>>>>>>>>>>>>>> > > > >>>> 
>>>>>>>>>>>>>> > > > >>>> Therefore, while the snapshot-based approach leads to 
>>>>>>>>>>>>>> > > > >>>> variable disk space utilization, operators must 
>>>>>>>>>>>>>> > > > >>>> provision storage as if the maximum potential space 
>>>>>>>>>>>>>> > > > >>>> will be used at all times to ensure repairs can be 
>>>>>>>>>>>>>> > > > >>>> executed.
>>>>>>>>>>>>>> > > > >>>> 
>>>>>>>>>>>>>> > > > >>>> This introduces a rate of churn dynamic, where the 
>>>>>>>>>>>>>> > > > >>>> write throughput dictates the required extra disk 
>>>>>>>>>>>>>> > > > >>>> space, rather than the existing on-disk data volume.
>>>>>>>>>>>>>> > > > >>>> 
>>>>>>>>>>>>>> > > > >>>> If 50% of your SSTables are rewritten during a 
>>>>>>>>>>>>>> > > > >>>> snapshot, you would need 50% free disk space. 
>>>>>>>>>>>>>> > > > >>>> Depending on the workload, the snapshot method could 
>>>>>>>>>>>>>> > > > >>>> consume significantly more disk space than an 
>>>>>>>>>>>>>> > > > >>>> index-based approach. Conversely, for relatively 
>>>>>>>>>>>>>> > > > >>>> static workloads, the index method might require more 
>>>>>>>>>>>>>> > > > >>>> space. It's not as straightforward as stating "No 
>>>>>>>>>>>>>> > > > >>>> extra disk space needed".
>>>>>>>>>>>>>> > > > >>>> 
>>>>>>>>>>>>>> > > > >>>> Jon
>>>>>>>>>>>>>> > > > >>>> 
>>>>>>>>>>>>>> > > > >>>> On Mon, Jun 2, 2025 at 2:49 PM Runtian Liu 
>>>>>>>>>>>>>> > > > >>>> <curly...@gmail.com> wrote:
>>>>>>>>>>>>>> > > > >>>>> > Regarding your comparison between approaches, I 
>>>>>>>>>>>>>> > > > >>>>> > think you also need to take into account the other 
>>>>>>>>>>>>>> > > > >>>>> > dimensions that have been brought up in this 
>>>>>>>>>>>>>> > > > >>>>> > thread. Things like minimum repair times and 
>>>>>>>>>>>>>> > > > >>>>> > vulnerability to outages and topology changes are 
>>>>>>>>>>>>>> > > > >>>>> > the first that come to mind.
>>>>>>>>>>>>>> > > > >>>>> 
>>>>>>>>>>>>>> > > > >>>>> Sure, I added a few more points.
>>>>>>>>>>>>>> > > > >>>>> 
>>>>>>>>>>>>>> > > > >>>>> Index-Based Solution vs. Snapshot-Based Solution,
>>>>>>>>>>>>>> > > > >>>>> by perspective:
>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>> > > > >>>>> 1. Hot path overhead
>>>>>>>>>>>>>> > > > >>>>>    - Index-based: Adds overhead in the hot path
>>>>>>>>>>>>>> > > > >>>>>      due to maintaining indexes. Extra memory
>>>>>>>>>>>>>> > > > >>>>>      needed during write path and compaction.
>>>>>>>>>>>>>> > > > >>>>>    - Snapshot-based: No impact on the hot path.
>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>> > > > >>>>> 2. Extra disk usage when repair is not running
>>>>>>>>>>>>>> > > > >>>>>    - Index-based: Requires additional disk space
>>>>>>>>>>>>>> > > > >>>>>      to store persistent indexes.
>>>>>>>>>>>>>> > > > >>>>>    - Snapshot-based: No extra disk space needed.
>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>> > > > >>>>> 3. Extra disk usage during repair
>>>>>>>>>>>>>> > > > >>>>>    - Index-based: Minimal or no additional disk
>>>>>>>>>>>>>> > > > >>>>>      usage.
>>>>>>>>>>>>>> > > > >>>>>    - Snapshot-based: Requires additional disk
>>>>>>>>>>>>>> > > > >>>>>      space for snapshots.
>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>> > > > >>>>> 4. Fine-grained repair to deal with emergency
>>>>>>>>>>>>>> > > > >>>>>    situations / topology changes
>>>>>>>>>>>>>> > > > >>>>>    - Index-based: Supports fine-grained repairs by
>>>>>>>>>>>>>> > > > >>>>>      targeting specific index ranges. This allows
>>>>>>>>>>>>>> > > > >>>>>      repair to be retried on smaller data sets,
>>>>>>>>>>>>>> > > > >>>>>      enabling incremental progress when repairing
>>>>>>>>>>>>>> > > > >>>>>      the entire table. This is especially helpful
>>>>>>>>>>>>>> > > > >>>>>      when there are down nodes or topology changes
>>>>>>>>>>>>>> > > > >>>>>      during repair, which are common in day-to-day
>>>>>>>>>>>>>> > > > >>>>>      operations.
>>>>>>>>>>>>>> > > > >>>>>    - Snapshot-based: Coordination across all nodes
>>>>>>>>>>>>>> > > > >>>>>      is required over a long period of time. For
>>>>>>>>>>>>>> > > > >>>>>      each round of repair, if all replica nodes
>>>>>>>>>>>>>> > > > >>>>>      are down or if there is a topology change,
>>>>>>>>>>>>>> > > > >>>>>      the data ranges that were not covered will
>>>>>>>>>>>>>> > > > >>>>>      need to be repaired in the next round.
>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>> > > > >>>>> 5. Validating data used in reads directly
>>>>>>>>>>>>>> > > > >>>>>    - Index-based: Verifies index content, not
>>>>>>>>>>>>>> > > > >>>>>      actual data—may miss low-probability errors
>>>>>>>>>>>>>> > > > >>>>>      like bit flips.
>>>>>>>>>>>>>> > > > >>>>>    - Snapshot-based: Verifies actual data content,
>>>>>>>>>>>>>> > > > >>>>>      providing stronger correctness guarantees.
>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>> > > > >>>>> 6. Extra data scan during inconsistency detection
>>>>>>>>>>>>>> > > > >>>>>    - Index-based: Since the data covered by
>>>>>>>>>>>>>> > > > >>>>>      certain indexes is not guaranteed to be fully
>>>>>>>>>>>>>> > > > >>>>>      contained within a single node as the
>>>>>>>>>>>>>> > > > >>>>>      topology changes, some data scans may be
>>>>>>>>>>>>>> > > > >>>>>      wasted.
>>>>>>>>>>>>>> > > > >>>>>    - Snapshot-based: No extra data scan.
>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>> > > > >>>>> 7. The overhead of actual data repair after an
>>>>>>>>>>>>>> > > > >>>>>    inconsistency is detected
>>>>>>>>>>>>>> > > > >>>>>    - Index-based: Only indexes are streamed to the
>>>>>>>>>>>>>> > > > >>>>>      base table node, and the actual data being
>>>>>>>>>>>>>> > > > >>>>>      fixed can be as accurate as the row level.
>>>>>>>>>>>>>> > > > >>>>>    - Snapshot-based: Anti-compaction is needed on
>>>>>>>>>>>>>> > > > >>>>>      the MV table, and potential over-streaming
>>>>>>>>>>>>>> > > > >>>>>      may occur due to the lack of row-level
>>>>>>>>>>>>>> > > > >>>>>      insight into data quality.
>>>>>>>>>>>>>> > > > >>>>> 
>>>>>>>>>>>>>> > > > >>>>> 
>>>>>>>>>>>>>> > > > >>>>> > one of my biggest concerns I haven't seen 
>>>>>>>>>>>>>> > > > >>>>> > discussed much is LOCAL_SERIAL/SERIAL on read
>>>>>>>>>>>>>> > > > >>>>> 
>>>>>>>>>>>>>> > > > >>>>> Paxos v2 introduces an optimization where serial 
>>>>>>>>>>>>>> > > > >>>>> reads can be completed in just one round trip, 
>>>>>>>>>>>>>> > > > >>>>> reducing latency compared to traditional Paxos which 
>>>>>>>>>>>>>> > > > >>>>> may require multiple phases.
>>>>>>>>>>>>>> > > > >>>>> 
>>>>>>>>>>>>>> > > > >>>>> > I think a refresh would be low-cost and give users 
>>>>>>>>>>>>>> > > > >>>>> > the flexibility to run them however they want.
>>>>>>>>>>>>>> > > > >>>>> 
>>>>>>>>>>>>>> > > > >>>>> I think this is an interesting idea. Does it suggest 
>>>>>>>>>>>>>> > > > >>>>> that the MV should be rebuilt on a regular schedule? 
>>>>>>>>>>>>>> > > > >>>>> It sounds like an extension of the snapshot-based 
>>>>>>>>>>>>>> > > > >>>>> approach—rather than detecting mismatches, we would 
>>>>>>>>>>>>>> > > > >>>>> periodically reconstruct a clean version of the MV 
>>>>>>>>>>>>>> > > > >>>>> based on the snapshot. This seems to diverge from 
>>>>>>>>>>>>>> > > > >>>>> the current MV model in Cassandra, where consistency 
>>>>>>>>>>>>>> > > > >>>>> between the MV and base table must be maintained 
>>>>>>>>>>>>>> > > > >>>>> continuously. This could be an extension of the 
>>>>>>>>>>>>>> > > > >>>>> CEP-48 work, where the MV is periodically rebuilt 
>>>>>>>>>>>>>> > > > >>>>> from a snapshot of the base table, assuming the user 
>>>>>>>>>>>>>> > > > >>>>> can tolerate some level of staleness in the MV data.
>>>>>>>>>>>>>> > > > >>>>> 
>>>>>>>>>>>>>> > > > >>> 
>>>>>>>>>>>>>> > > > 
>>>>>>>>>>>>>> > > 
>>>>>>>>>>>>>> > 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>> 
>> 
