> Sorry, Blake—I was traveling last week and couldn’t reply to your email sooner.

No worries, I’ll be taking some time off soon as well.

> I don’t think the first or second option is ideal. We should treat the base table as the source of truth. Modifying it—or forcing an update on it, even if it’s just a timestamp change—is not a good approach and won’t solve all problems.

I agree the first option probably isn’t the right way to go. Could you say a bit more about why the second option is not a good approach and which problems it won’t solve?

On Sun, Jun 22, 2025, at 6:09 PM, Runtian Liu wrote:
> Sorry, Blake—I was traveling last week and couldn’t reply to your email sooner.
>
> > First - we interpret view data with higher timestamps than the base table as data that’s missing from the base and replicate it into the base table. The timestamp of the missing data may be below the paxos timestamp low bound so we’d have to adjust the paxos coordination logic to allow that in this case. Depending on how the view got this way it may also tear writes to the base table, breaking the write atomicity promise.
>
> As discussed earlier, we want this MV repair mechanism to handle all edge cases. However, it would be difficult to design it in a way that detects the root cause of each mismatch and repairs it accordingly. Additionally, as you mentioned, this approach could introduce other issues, such as violating the write atomicity guarantee.
>
> > Second - If this happens it means that we’ve either lost base table data or paxos metadata. If that happened, we could force a base table update that rewrites the current base state with new timestamps making the extra view data removable. However this wouldn’t fix the case where the view cell has a timestamp from the future - although that’s not a case that C* can fix today either.
>
> I don’t think the first or second option is ideal. We should treat the base table as the source of truth. Modifying it—or forcing an update on it, even if it’s just a timestamp change—is not a good approach and won’t solve all problems.
>
> > the idea to use anti-compaction makes a lot more sense now (in principle - I don’t think it’s workable in practice)
>
> I have one question regarding anti-compaction. Is the main concern that processing too much data during anti-compaction could cause issues for the cluster?
>
> > I guess you could add some sort of assassin cell that is meant to remove a cell with a specific timestamp and value, but is otherwise invisible.
>
> The idea of the assassination cell is interesting. To prevent data from being incorrectly removed during the repair process, we need to ensure a quorum of nodes is available and agrees on the same value before repairing a materialized view (MV) row or cell. However, this could be very expensive, as it requires coordination to repair even a single row.
>
> I think there are a few key differences between MV repair and normal anti-entropy repair:
>
> 1. For normal anti-entropy repair, there is no single source of truth. All replicas act as sources of truth, and eventual consistency is achieved by merging data across them. Even if some data is lost or bugs result in future timestamps, the replicas will eventually converge on the same (possibly incorrect) value, and that value becomes the accepted truth. In contrast, MV repair relies on the base table as the source of truth and modifies the MV if inconsistencies are detected. This means that in some cases, simply merging data won’t resolve the issue, since Cassandra resolves conflicts using timestamps and there’s no guarantee the base table will always "win" unless we change the MV merging logic—which I’ve been trying to avoid (see the sketch below this list).
>
> 2. I see MV repair as more of an on-demand operation, whereas normal anti-entropy repair needs to run regularly. This means we shouldn’t treat MV repair the same way as existing repairs. When an operator initiates MV repair, they need to ensure that sufficient resources are available to support it.
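> To make the merging problem in point 1 concrete, here is a minimal sketch of timestamp-based reconciliation. The types are hypothetical, not Cassandra’s actual Cell classes:
>
>     final class MergeSketch {
>         // Illustrative only: merging is purely timestamp-based, so nothing
>         // here marks the base table as canonical.
>         record Cell(Object value, long timestamp) {}
>
>         static Cell reconcile(Cell base, Cell view) {
>             // A view cell with a higher (e.g. future) timestamp survives
>             // the merge even though the base table disagrees.
>             return view.timestamp() > base.timestamp() ? view : base;
>         }
>     }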
>
> On Thu, Jun 12, 2025 at 8:53 AM Blake Eggleston <bl...@ultrablake.com> wrote:
>> Got it, thanks for clearing that up. I’d seen the strict liveness code around but didn’t realize it was MV related and hadn’t dug into what it did or how it worked.
>>
>> I think you’re right about the row liveness update working for extra data with timestamps lower than the most recent base table update.
>>
>> I see what you mean about the timestamp from the future case. I thought of 3 options:
>>
>> First - we interpret view data with higher timestamps than the base table as data that’s missing from the base and replicate it into the base table. The timestamp of the missing data may be below the paxos timestamp low bound so we’d have to adjust the paxos coordination logic to allow that in this case. Depending on how the view got this way it may also tear writes to the base table, breaking the write atomicity promise.
>>
>> Second - If this happens it means that we’ve either lost base table data or paxos metadata. If that happened, we could force a base table update that rewrites the current base state with new timestamps making the extra view data removable. However this wouldn’t fix the case where the view cell has a timestamp from the future - although that’s not a case that C* can fix today either.
>>
>> Third - we add a new tombstone type or some mechanism to delete specific cells that doesn’t preclude correct writes with lower timestamps from being visible. I’m not sure how this would work, and the idea to use anti-compaction makes a lot more sense now (in principle - I don’t think it’s workable in practice). I guess you could add some sort of assassin cell that is meant to remove a cell with a specific timestamp and value, but is otherwise invisible. This seems dangerous though, since it’s likely there’s a replication problem that may resolve itself and the repair process would actually be removing data that the user intended to write.
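>> Roughly the matching rule I have in mind for such an assassin cell - entirely hypothetical, nothing like this exists in the storage engine today:
>>
>>     import java.util.Objects;
>>
>>     final class AssassinSketch {
>>         record Cell(Object value, long timestamp) {}
>>
>>         // Suppresses only the cell matching both the exact timestamp and
>>         // value, and is otherwise invisible, so a correct write with a
>>         // lower timestamp would still be visible.
>>         record AssassinCell(long targetTimestamp, Object targetValue) {
>>             boolean suppresses(Cell c) {
>>                 return c.timestamp() == targetTimestamp
>>                     && Objects.equals(c.value(), targetValue);
>>             }
>>         }
>>     }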
>>
>> Paulo - I don’t think storage changes are off the table, but they do expand the scope and risk of the proposal, so we should be careful.
>>
>> On Wed, Jun 11, 2025, at 4:44 PM, Paulo Motta wrote:
>>> > I’m not sure if this is the only edge case—there may be other issues as well. I’m also unsure whether we should redesign the tombstone handling for MVs, since that would involve changes to the storage engine. To minimize impact there, the original proposal was to rebuild the affected ranges using anti-compaction, just to be safe.
>>>
>>> I haven't been following the discussion but I think one of the issues with the materialized view "strict liveness" fix[1] is that we avoided making invasive changes to the storage engine at the time, but this was considered by Zhao on [1]. I think we shouldn't be trying to avoid updates to the storage format as part of the MV implementation, if this is what it takes to make MVs V2 reliable.
>>>
>>> [1] - https://issues.apache.org/jira/browse/CASSANDRA-11500?focusedCommentId=16101603&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16101603
>>>
>>> On Wed, Jun 11, 2025 at 7:02 PM Runtian Liu <curly...@gmail.com> wrote:
>>>> The current design leverages strict liveness to shadow the old view row. When the view-indexed value changes from 'a' to 'b', no tombstone is written; instead, the old row is marked as expired by updating its liveness info with the timestamp of the change. If the column is later set back to 'a', the view row is re-inserted with a new, non-expired liveness info reflecting the latest timestamp.
>>>>
>>>> To delete an extra row in the materialized view (MV), we can likely use the same approach—marking it as shadowed by updating the liveness info. However, in the case of inconsistencies where a column in the MV has a higher timestamp than the corresponding column in the base table, this row-level liveness mechanism is insufficient.
>>>>
>>>> Even for the case where we delete the row by marking its liveness info as expired during repair, there are concerns. Since this introduces a data mutation as part of the repair process, it’s unclear whether there could be edge cases we’re missing. This approach may risk unexpected side effects if the repair logic is not carefully aligned with write path semantics.
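>>>> One way to picture why the row-level mechanism falls short for the higher-timestamp case - a simplified model, not the real LivenessInfo semantics:
>>>>
>>>>     final class LivenessSketch {
>>>>         record ShadowMarker(long timestamp) {}
>>>>
>>>>         // Simplified: a shadowing marker only suppresses cells it
>>>>         // postdates, so a view cell carrying a timestamp from the future
>>>>         // stays visible unless we write an even higher (i.e. also
>>>>         // future) timestamp, which we want to avoid.
>>>>         static boolean isShadowed(long cellTimestamp, ShadowMarker marker) {
>>>>             return marker.timestamp() >= cellTimestamp;
>>>>         }
>>>>     }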
>>>>
>>>> On Thu, Jun 12, 2025 at 3:59 AM Blake Eggleston <bl...@ultrablake.com> wrote:
>>>>> That’s a good point, although as described I don’t think that could ever work properly, even in normal operation. Either we’re misunderstanding something, or this is a flaw in the current MV design.
>>>>>
>>>>> Assuming changing the view indexed column results in a tombstone being applied to the view row for the previous value, if we wrote the other base columns (the non view indexed ones) to the view with the same timestamps they have on the base, then changing the view indexed value from ‘a’ to ‘b’, then back to ‘a’ would always cause this problem. I think you’d need to always update the column timestamps on the view to be >= the view column timestamp on the base.
>>>>>
>>>>> On Tue, Jun 10, 2025, at 11:38 PM, Runtian Liu wrote:
>>>>>> > In the case of a missed update, we'll have a new value and we can send a tombstone to the view with the timestamp of the most recent update.
>>>>>>
>>>>>> > then something has gone wrong and we should issue a tombstone using the paxos repair timestamp as the tombstone timestamp.
>>>>>>
>>>>>> The current MV implementation uses “strict liveness” to determine whether a row is live. I believe that using regular tombstones during repair could cause problems. For example, consider a base table with schema (pk, ck, v1, v2) and a materialized view with schema (v1, pk, ck) -> v2. If, for some reason, we detect an extra row in the MV and delete it using a tombstone with the latest update timestamp, we may run into issues. Suppose we later update the base table’s v1 field to match the MV row we previously deleted, and the v2 value now has an older timestamp. In that case, the previously issued tombstone could still shadow the v2 column, which is unintended. That is why I was asking if we are going to introduce a new kind of tombstone. I’m not sure if this is the only edge case—there may be other issues as well. I’m also unsure whether we should redesign the tombstone handling for MVs, since that would involve changes to the storage engine. To minimize impact there, the original proposal was to rebuild the affected ranges using anti-compaction, just to be safe.
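>>>>>> A concrete replay of that sequence, with made-up timestamps:
>>>>>>
>>>>>>     final class ShadowReplay {
>>>>>>         // Illustrative timestamps only.
>>>>>>         static final long T10 = 10; // base INSERT (pk, ck, v1='a', v2='x') -> view row ('a', pk, ck), v2 written @10
>>>>>>         static final long T20 = 20; // repair deletes the "extra" view row with a tombstone @20
>>>>>>         static final long T30 = 30; // base UPDATE sets v1='a' again; v2 is untouched and still carries @10
>>>>>>
>>>>>>         // Last-write-wins: the repair tombstone @20 still shadows v2@10
>>>>>>         // on the re-inserted view row, silently dropping a legitimate value.
>>>>>>         static final boolean V2_SHADOWED = T20 >= T10; // true
>>>>>>     }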
>>>>>>
>>>>>> On Wed, Jun 11, 2025 at 1:20 AM Blake Eggleston <bl...@ultrablake.com> wrote:
>>>>>>>> Extra row in MV (assuming the tombstone is gone in the base table) — how should we fix this?
>>>>>>>
>>>>>>> This would mean that the base table had either updated or deleted a row and the view didn't receive the corresponding delete.
>>>>>>>
>>>>>>> In the case of a missed update, we'll have a new value and we can send a tombstone to the view with the timestamp of the most recent update. Since timestamps issued by paxos and accord writes are monotonically increasing and don't have collisions, this is safe.
>>>>>>>
>>>>>>> In the case of a row deletion, we'd also want to send a tombstone with the same timestamp, however since tombstones can be purged, we may not have that information and would have to treat it like the view has a higher timestamp than the base table.
>>>>>>>
>>>>>>>> Inconsistency (timestamps don’t match) — it’s easy to fix when the base table has higher timestamps, but how do we resolve it when the MV columns have higher timestamps?
>>>>>>>
>>>>>>> There are 2 ways this could happen. First is that a write failed and paxos repair hasn't completed it, which is expected, and the second is a replication bug or base table data loss. You'd need to compare the view timestamp to the paxos repair history to tell which it is. If the view timestamp is higher than the most recent paxos repair timestamp for the key, then it may just be a failed write and we should do nothing. If the view timestamp is less than the most recent paxos repair timestamp for that key and higher than the base timestamp, then something has gone wrong and we should issue a tombstone using the paxos repair timestamp as the tombstone timestamp. This is safe to do because the paxos repair timestamps act as a low bound for ballots paxos will process, so it wouldn't be possible for a legitimate write to be shadowed by this tombstone.
>>>>>>>
>>>>>>>> Do we need to introduce a new kind of tombstone to shadow the rows in the MV for cases 2 and 3? If yes, how will this tombstone work? If no, how should we fix the MV data?
>>>>>>>
>>>>>>> No, a normal tombstone would work.
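>>>>>>> That decision rule, spelled out as a sketch (hypothetical names, not an actual API):
>>>>>>>
>>>>>>>     import java.util.OptionalLong;
>>>>>>>
>>>>>>>     final class ViewTimestampRepair {
>>>>>>>         // Returns the tombstone timestamp to issue, or empty to do nothing.
>>>>>>>         static OptionalLong resolve(long viewTs, long baseTs, long lastPaxosRepairTs) {
>>>>>>>             if (viewTs > lastPaxosRepairTs)
>>>>>>>                 return OptionalLong.empty(); // may just be a failed write: do nothing
>>>>>>>             if (viewTs > baseTs)
>>>>>>>                 // Paxos repair timestamps are a low bound for future ballots,
>>>>>>>                 // so no legitimate write can be shadowed by this tombstone.
>>>>>>>                 return OptionalLong.of(lastPaxosRepairTs);
>>>>>>>             return OptionalLong.empty(); // base timestamp wins normally
>>>>>>>         }
>>>>>>>     }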
>>>>>>>
>>>>>>> On Tue, Jun 10, 2025, at 2:42 AM, Runtian Liu wrote:
>>>>>>>> Okay, let’s put the efficiency discussion on hold for now. I want to make sure the actual repair process after detecting inconsistencies will work with the index-based solution.
>>>>>>>>
>>>>>>>> When a mismatch is detected, the MV replica will need to stream its index file to the base table replica. The base table will then perform a comparison between the two files.
>>>>>>>>
>>>>>>>> There are three cases we need to handle:
>>>>>>>>
>>>>>>>> 1. Missing row in MV — this is straightforward; we can propagate the data to the MV.
>>>>>>>>
>>>>>>>> 2. Extra row in MV (assuming the tombstone is gone in the base table) — how should we fix this?
>>>>>>>>
>>>>>>>> 3. Inconsistency (timestamps don’t match) — it’s easy to fix when the base table has higher timestamps, but how do we resolve it when the MV columns have higher timestamps?
>>>>>>>>
>>>>>>>> Do we need to introduce a new kind of tombstone to shadow the rows in the MV for cases 2 and 3? If yes, how will this tombstone work? If no, how should we fix the MV data?
>>>>>>>>
>>>>>>>> On Mon, Jun 9, 2025 at 11:00 AM Blake Eggleston <bl...@ultrablake.com> wrote:
>>>>>>>>> > hopefully we can come up with a solution that everyone agrees on.
>>>>>>>>>
>>>>>>>>> I’m sure we can, I think we’ve been making good progress.
>>>>>>>>>
>>>>>>>>> > My main concern with the index-based solution is the overhead it adds to the hot path, as well as having to build indexes periodically.
>>>>>>>>>
>>>>>>>>> So the additional overhead of maintaining a storage attached index on the client write path is pretty minimal - it’s basically adding data to an in memory trie. It’s a little extra work and memory usage, but there isn’t any extra io or other blocking associated with it. I’d expect the latency impact to be negligible. (See the sketch at the end of this message.)
>>>>>>>>>
>>>>>>>>> > As mentioned earlier, this MV repair should be an infrequent operation
>>>>>>>>>
>>>>>>>>> I don’t think that’s a safe assumption. There are a lot of situations outside of data loss bugs where repair would need to be run.
>>>>>>>>>
>>>>>>>>> These use cases could probably be handled by repairing the view with other view replicas:
>>>>>>>>>
>>>>>>>>> - Scrubbing corrupt sstables
>>>>>>>>> - Node replacement via backup
>>>>>>>>>
>>>>>>>>> These use cases would need an actual MV repair to check consistency with the base table:
>>>>>>>>>
>>>>>>>>> - Restoring a cluster from a backup
>>>>>>>>> - Imported sstables via nodetool import
>>>>>>>>> - Data loss from operator error
>>>>>>>>> - Proactive consistency checks - ie preview repairs
>>>>>>>>>
>>>>>>>>> Even if it is an infrequent operation, when operators need it, it needs to be available and reliable.
>>>>>>>>>
>>>>>>>>> It’s a fact that there are clusters where non-incremental repairs are run on a cadence of a week or more to manage the overhead of validation compactions. Assuming the cluster doesn’t have any additional headroom, that would mean that any one of the above events could cause views to remain out of sync for up to a week while the full set of merkle trees is being built.
>>>>>>>>>
>>>>>>>>> This delay eliminates a lot of the value of repair as a risk mitigation tool. If I had to make a recommendation where a bad call could cost me my job, the prospect of a 7 day delay on repair would mean a strong no.
>>>>>>>>>
>>>>>>>>> Some users also run preview repair continuously to detect data consistency errors, so at least a subset of users will probably be running MV repairs continuously - at least in preview mode.
>>>>>>>>>
>>>>>>>>> That’s why I say that the replication path should be designed to never need repair, and MV repair should be designed to be prepared for the worst.
>>>>>>>>>
>>>>>>>>> > I’m wondering if it’s possible to enable or disable index building dynamically so that we don’t always incur the cost for something that’s rarely needed.
>>>>>>>>>
>>>>>>>>> I think this would be a really reasonable compromise as long as the default is on. That way it’s as safe as possible by default, but users who don’t care or have a separate system for repairing MVs can opt out.
>>>>>>>>>
>>>>>>>>> > I’m not sure what you mean by “data problems” here.
>>>>>>>>>
>>>>>>>>> I mean out of sync views - either due to bugs, operator error, corruption, etc.
>>>>>>>>>
>>>>>>>>> > Also, this does scale with cluster size—I’ve compared it to full repair, and this MV repair should behave similarly. That means as long as full repair works, this repair should work as well.
>>>>>>>>>
>>>>>>>>> You could build the merkle trees at about the same cost as a full repair, but the actual data repair path is completely different for MVs, and that’s the part that doesn’t scale well. As you know, with normal repair, we just stream data for ranges detected as out of sync. For MVs, since the data isn’t in base partition order, the view data for an out of sync view range needs to be read out and streamed to every base replica that it’s detected a mismatch against. So in the example I gave with the 300 node cluster, you’re looking at reading and transmitting the same partition at least 100 times in the best case, and the cost of this keeps going up as the cluster increases in size. That's the part that doesn't scale well.
>>>>>>>>>
>>>>>>>>> This is also one of the benefits of the index design. Since it stores data in segments that roughly correspond to points on the grid, you’re not rereading the same data over and over. A repair for a given grid point only reads an amount of data proportional to the data in common for the base/view grid point, and it’s stored in a small enough granularity that the base can calculate what data needs to be sent to the view without having to read the entire view partition.
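>>>>>>>>> A rough sketch of what “adding data to an in memory trie” amounts to on the write path - illustrative only, using a skip-list map as a stand-in for the actual SAI trie:
>>>>>>>>>
>>>>>>>>>     import java.util.concurrent.ConcurrentSkipListMap;
>>>>>>>>>
>>>>>>>>>     // Hypothetical: a per-memtable index keyed by (base token, view
>>>>>>>>>     // token), updated in memory on each write and flushed with the
>>>>>>>>>     // sstable later - no extra IO or blocking on the hot path.
>>>>>>>>>     final class GridIndexSketch {
>>>>>>>>>         private final ConcurrentSkipListMap<Long, ConcurrentSkipListMap<Long, Long>> grid = new ConcurrentSkipListMap<>();
>>>>>>>>>
>>>>>>>>>         void onWrite(long baseToken, long viewToken, long timestamp) {
>>>>>>>>>             grid.computeIfAbsent(baseToken, t -> new ConcurrentSkipListMap<>())
>>>>>>>>>                 .merge(viewToken, timestamp, Math::max); // keep the newest timestamp
>>>>>>>>>         }
>>>>>>>>>     }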
>>>>>>>>>
>>>>>>>>> On Sat, Jun 7, 2025, at 7:42 PM, Runtian Liu wrote:
>>>>>>>>>> Thanks, Blake. I’m open to iterating on the design, and hopefully we can come up with a solution that everyone agrees on.
>>>>>>>>>>
>>>>>>>>>> My main concern with the index-based solution is the overhead it adds to the hot path, as well as having to build indexes periodically. As mentioned earlier, this MV repair should be an infrequent operation, but the index-based approach shifts some of the work to the hot path in order to allow repairs that touch only a few nodes.
>>>>>>>>>>
>>>>>>>>>> I’m wondering if it’s possible to enable or disable index building dynamically so that we don’t always incur the cost for something that’s rarely needed.
>>>>>>>>>>
>>>>>>>>>> > it degrades operators’ ability to react to data problems by imposing a significant upfront processing burden on repair, and that it doesn’t scale well with cluster size
>>>>>>>>>>
>>>>>>>>>> I’m not sure what you mean by “data problems” here. Also, this does scale with cluster size—I’ve compared it to full repair, and this MV repair should behave similarly. That means as long as full repair works, this repair should work as well.
>>>>>>>>>>
>>>>>>>>>> For example, regardless of how large the cluster is, you can always enable Merkle tree building on 10% of the nodes at a time until all the trees are ready.
>>>>>>>>>>
>>>>>>>>>> I understand that coordinating this type of repair is harder than what we currently support, but with CEP-37, we should be able to handle this coordination without adding too much burden on the operator side.
>>>>>>>>>>
>>>>>>>>>> On Sat, Jun 7, 2025 at 8:28 AM Blake Eggleston <bl...@ultrablake.com> wrote:
>>>>>>>>>>>> I don't see any outcome here that is good for the community though. Either Runtian caves and adopts your design that he (and I) consider inferior, or he is prevented from contributing this work.
>>>>>>>>>>>
>>>>>>>>>>> Hey Runtian, fwiw, these aren't the only 2 options. This isn’t a competition. We can collaborate and figure out the best approach to the problem. I’d like to keep discussing it if you’re open to iterating on the design.
>>>>>>>>>>>
>>>>>>>>>>> I’m not married to our proposal, it’s just the cleanest way we could think of to address what Jon and I both see as blockers in the current proposal. It’s not set in stone though.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jun 6, 2025, at 1:32 PM, Benedict Elliott Smith wrote:
>>>>>>>>>>>> Hmm, I am very surprised as I helped write that and I distinctly recall a specific goal was avoiding binding vetoes as they're so toxic.
>>>>>>>>>>>>
>>>>>>>>>>>> Ok, I guess you can block this work if you like.
>>>>>>>>>>>>
>>>>>>>>>>>> I don't see any outcome here that is good for the community though. Either Runtian caves and adopts your design that he (and I) consider inferior, or he is prevented from contributing this work. That isn't a functioning community in my mind, so I'll be noping out for a while, as I don't see much value here right now.
>>>>>>>>>>>>
>>>>>>>>>>>> On 2025/06/06 18:31:08 Blake Eggleston wrote:
>>>>>>>>>>>> > Hi Benedict, that’s actually not true.
>>>>>>>>>>>> >
>>>>>>>>>>>> > Here’s a link to the project governance page: https://cwiki.apache.org/confluence/display/CASSANDRA/Cassandra+Project+Governance
>>>>>>>>>>>> >
>>>>>>>>>>>> > The CEP section says:
>>>>>>>>>>>> >
>>>>>>>>>>>> > “*Once the proposal is finalized and any major committer dissent reconciled, call a [VOTE] on the ML to have the proposal adopted. The criteria for acceptance is consensus (3 binding +1 votes and no binding vetoes). The vote should remain open for 72 hours.*”
>>>>>>>>>>>> >
>>>>>>>>>>>> > So they’re definitely vetoable.
>>>>>>>>>>>> >
>>>>>>>>>>>> > Also note the part about “*Once the proposal is finalized and any major committer dissent reconciled,*” being a prerequisite for moving a CEP to [VOTE]. Given the as yet unreconciled committer dissent, it wouldn’t even be appropriate to move to a VOTE until we get to the bottom of this repair discussion.
>>>>>>>>>>>> >
>>>>>>>>>>>> > On Fri, Jun 6, 2025, at 12:31 AM, Benedict Elliott Smith wrote:
>>>>>>>>>>>> > > > but the snapshot repair design is not a viable path forward. It’s the first iteration of a repair design. We’ve proposed a second iteration, and we’re open to a third iteration.
>>>>>>>>>>>> > >
>>>>>>>>>>>> > > I shan't be participating further in discussion, but I want to make a point of order. The CEP process has no vetoes, so you are not empowered to declare that a design is not viable without the input of the wider community.
>>>>>>>>>>>> > >
>>>>>>>>>>>> > > On 2025/06/05 03:58:59 Blake Eggleston wrote:
>>>>>>>>>>>> > > > You can detect and fix the mismatch in a single round of repair, but the amount of work needed to do it is _significantly_ higher with snapshot repair. Consider a case where we have a 300 node cluster w/ RF 3, where each view partition contains entries mapping to every token range in the cluster - so 100 ranges. If we lose a view sstable, it will affect an entire row/column of the grid. Repair is going to scan all data in the mismatching view token ranges 100 times, and each base range once. So you’re looking at 200 range scans.
>>>>>>>>>>>> > > >
>>>>>>>>>>>> > > > Now, you may argue that you can merge the duplicate view scans into a single scan while you repair all token ranges in parallel. I’m skeptical that’s going to be achievable in practice, but even if it is, we’re now talking about the view replica hypothetically doing a pairwise repair with every other replica in the cluster at the same time. Neither of these options is workable.
>>>>>>>>>>>> > > >
>>>>>>>>>>>> > > > Let’s take a step back though, because I think we’re getting lost in the weeds.
>>>>>>>>>>>> > > >
>>>>>>>>>>>> > > > The repair design in the CEP has some high level concepts that make a lot of sense, the idea of repairing a grid is really smart. However, it has some significant drawbacks that remain unaddressed. I want this CEP to succeed, and I know Jon does too, but the snapshot repair design is not a viable path forward. It’s the first iteration of a repair design. We’ve proposed a second iteration, and we’re open to a third iteration. This part of the CEP process is meant to identify and address shortcomings, I don’t think that continuing to dissect the snapshot repair design is making progress in that direction.
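>>>>>>>>>>>> > > > The arithmetic behind those numbers, spelled out (back-of-envelope only):
>>>>>>>>>>>> > > >
>>>>>>>>>>>> > > >     int nodes = 300, rf = 3;
>>>>>>>>>>>> > > >     int ranges = nodes / rf;                // 100 distinct token ranges
>>>>>>>>>>>> > > >     int viewScans = ranges;                 // mismatching view ranges scanned once per base range
>>>>>>>>>>>> > > >     int baseScans = ranges;                 // each base range scanned once
>>>>>>>>>>>> > > >     int totalScans = viewScans + baseScans; // 200 range scans for one lost sstable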
>>>>>>>>>>>> > > >
>>>>>>>>>>>> > > > On Wed, Jun 4, 2025, at 2:04 PM, Runtian Liu wrote:
>>>>>>>>>>>> > > > > > We potentially have to do it several times on each node, depending on the size of the range. Smaller ranges increase the size of the board exponentially, larger ranges increase the number of SSTables that would be involved in each compaction.
>>>>>>>>>>>> > > > >
>>>>>>>>>>>> > > > > As described in the CEP example, this can be handled in a single round of repair. We first identify all the points in the grid that require repair, then perform anti-compaction and stream data based on a second scan over those identified points. This applies to the snapshot-based solution—without an index, repairing a single point in that grid requires scanning the entire base table partition (token range). In contrast, with the index-based solution—as in the example you referenced—if a large block of data is corrupted, even though the index is used for comparison, many key mismatches may occur. This can lead to random disk access to the original data files, which could cause performance issues. For the case you mentioned for the snapshot-based solution, it should not take months to repair all the data; instead, one round of repair should be enough. The actual repair phase is split from the detection phase.
>>>>>>>>>>>> > > > >
>>>>>>>>>>>> > > > > On Thu, Jun 5, 2025 at 12:12 AM Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>>>>>>>>>> > > > >> > This isn’t really the whole story. The amount of wasted scans on index repairs is negligible. If a difference is detected with snapshot repairs though, you have to read the entire partition from both the view and base table to calculate what needs to be fixed.
>>>>>>>>>>>> > > > >>
>>>>>>>>>>>> > > > >> You nailed it.
>>>>>>>>>>>> > > > >>
>>>>>>>>>>>> > > > >> When the base table is converted to a view, and sent to the view, the information we have is that one of the view's partition keys needs a repair. That's going to be different from the partition key of the base table. As a result, on the base table, for each affected range, we'd have to issue another compaction across the entire set of sstables that could have the data the view needs (potentially many GB), in order to build the corrected version of the partition, then send it over to the view. Without an index in place, we have to do yet another scan, per affected range.
>>>>>>>>>>>> > > > >>
>>>>>>>>>>>> > > > >> Consider the case of a single corrupted SSTable on the view that's removed from the filesystem, or the data is simply missing after being restored from an inconsistent backup. It presumably contains lots of partitions, which map to base partitions all over the cluster, in a lot of different token ranges. For every one of those ranges (hundreds, to tens of thousands of them given the checkerboard design), when finding the missing data in the base, you'll have to perform a compaction across all the SSTables that potentially contain the missing data just to rebuild the view-oriented partitions that need to be sent to the view. The complexity of this operation can be looked at as O(N*M), where N and M are the number of ranges in the base table and the view affected by the corruption, respectively. Without an index in place, finding the missing data is very expensive. We potentially have to do it several times on each node, depending on the size of the range. Smaller ranges increase the size of the board exponentially, larger ranges increase the number of SSTables that would be involved in each compaction.
>>>>>>>>>>>> > > > >>
>>>>>>>>>>>> > > > >> Then you send that data over to the view, the view does its anti-compaction thing, again, once per affected range. So now the view has to do an anti-compaction once per block on the board that's affected by the missing data.
>>>>>>>>>>>> > > > >>
>>>>>>>>>>>> > > > >> Doing hundreds or thousands of these will add up pretty quickly.
>>>>>>>>>>>> > > > >>
>>>>>>>>>>>> > > > >> When I said that a repair could take months, this is what I had in mind.
>>>>>>>>>>>> > > > >>
>>>>>>>>>>>> > > > >> On Tue, Jun 3, 2025 at 11:10 AM Blake Eggleston <bl...@ultrablake.com> wrote:
>>>>>>>>>>>> > > > >>> > Adds overhead in the hot path due to maintaining indexes. Extra memory needed during write path and compaction.
>>>>>>>>>>>> > > > >>>
>>>>>>>>>>>> > > > >>> I’d make the same argument about the overhead of maintaining the index that Jon just made about the disk space required. The relatively predictable overhead of maintaining the index as part of the write and compaction paths is a pro, not a con. Although you’re not always paying the cost of building a merkle tree with snapshot repair, it can impact the hot path and you do have to plan for it.
>>>>>>>>>>>> > > > >>>
>>>>>>>>>>>> > > > >>> > Verifies index content, not actual data—may miss low-probability errors like bit flips
>>>>>>>>>>>> > > > >>>
>>>>>>>>>>>> > > > >>> Presumably this could be handled by the views performing repair against each other? You could also periodically rebuild the index or perform checksums against the sstable content.
>>>>>>>>>>>> > > > >>>
>>>>>>>>>>>> > > > >>> > Extra data scan during inconsistency detection
>>>>>>>>>>>> > > > >>> > Index: Since the data covered by certain indexes is not guaranteed to be fully contained within a single node as the topology changes, some data scans may be wasted.
>>>>>>>>>>>> > > > >>> > Snapshots: No extra data scan
>>>>>>>>>>>> > > > >>>
>>>>>>>>>>>> > > > >>> This isn’t really the whole story. The amount of wasted scans on index repairs is negligible. If a difference is detected with snapshot repairs though, you have to read the entire partition from both the view and base table to calculate what needs to be fixed.
>>>>>>>>>>>> > > > >>>
>>>>>>>>>>>> > > > >>> On Tue, Jun 3, 2025, at 10:27 AM, Jon Haddad wrote:
>>>>>>>>>>>> > > > >>>> One practical aspect that isn't immediately obvious is the disk space consideration for snapshots.
>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>> > > > >>>> When you have a table with a mixed workload using LCS or UCS with scaling parameters like L10 and initiate a repair, the disk usage will increase as long as the snapshot persists and the table continues to receive writes. This aspect is understood and factored into the design.
>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>> > > > >>>> However, a more nuanced point is the necessity to maintain sufficient disk headroom specifically for running repairs. This echoes the challenge with STCS compaction, where enough space must be available to accommodate the largest SSTables, even when they are not being actively compacted.
>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>> > > > >>>> For example, if a repair involves rewriting 100GB of SSTable data, you'll consistently need to reserve 100GB of free space to facilitate this.
>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>> > > > >>>> Therefore, while the snapshot-based approach leads to variable disk space utilization, operators must provision storage as if the maximum potential space will be used at all times to ensure repairs can be executed.
>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>> > > > >>>> This introduces a rate-of-churn dynamic, where the write throughput dictates the required extra disk space, rather than the existing on-disk data volume.
>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>> > > > >>>> If 50% of your SSTables are rewritten during a snapshot, you would need 50% free disk space. Depending on the workload, the snapshot method could consume significantly more disk space than an index-based approach. Conversely, for relatively static workloads, the index method might require more space. It's not as straightforward as stating "No extra disk space needed".
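>>>>>>>>>>>> > > > >>>> To put rough numbers on it (illustrative only):
>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>> > > > >>>>     // Snapshot headroom tracks churn, not data size.
>>>>>>>>>>>> > > > >>>>     double dataGb = 200.0;
>>>>>>>>>>>> > > > >>>>     double churn = 0.5;                 // fraction of sstables rewritten while the snapshot is held
>>>>>>>>>>>> > > > >>>>     double headroomGb = dataGb * churn; // 100GB must stay free for repair to run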
>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>> > > > >>>> Jon
>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>> > > > >>>> On Mon, Jun 2, 2025 at 2:49 PM Runtian Liu <curly...@gmail.com> wrote:
>>>>>>>>>>>> > > > >>>>> > Regarding your comparison between approaches, I think you also need to take into account the other dimensions that have been brought up in this thread. Things like minimum repair times and vulnerability to outages and topology changes are the first that come to mind.
>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>> > > > >>>>> Sure, I added a few more points. For each perspective: Index-Based Solution vs. Snapshot-Based Solution.
>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>> > > > >>>>> 1. Hot path overhead
>>>>>>>>>>>> > > > >>>>>    - Index: Adds overhead in the hot path due to maintaining indexes. Extra memory needed during write path and compaction.
>>>>>>>>>>>> > > > >>>>>    - Snapshot: No impact on the hot path.
>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>> > > > >>>>> 2. Extra disk usage when repair is not running
>>>>>>>>>>>> > > > >>>>>    - Index: Requires additional disk space to store persistent indexes.
>>>>>>>>>>>> > > > >>>>>    - Snapshot: No extra disk space needed.
>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>> > > > >>>>> 3. Extra disk usage during repair
>>>>>>>>>>>> > > > >>>>>    - Index: Minimal or no additional disk usage.
>>>>>>>>>>>> > > > >>>>>    - Snapshot: Requires additional disk space for snapshots.
>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>> > > > >>>>> 4. Fine-grained repair to deal with emergency situations / topology changes
>>>>>>>>>>>> > > > >>>>>    - Index: Supports fine-grained repairs by targeting specific index ranges. This allows repair to be retried on smaller data sets, enabling incremental progress when repairing the entire table. This is especially helpful when there are down nodes or topology changes during repair, which are common in day-to-day operations.
>>>>>>>>>>>> > > > >>>>>    - Snapshot: Coordination across all nodes is required over a long period of time. For each round of repair, if all replica nodes are down or if there is a topology change, the data ranges that were not covered will need to be repaired in the next round.
>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>> > > > >>>>> 5. Validating data used in reads directly
>>>>>>>>>>>> > > > >>>>>    - Index: Verifies index content, not actual data—may miss low-probability errors like bit flips.
>>>>>>>>>>>> > > > >>>>>    - Snapshot: Verifies actual data content, providing stronger correctness guarantees.
>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>> > > > >>>>> 6. Extra data scan during inconsistency detection
>>>>>>>>>>>> > > > >>>>>    - Index: Since the data covered by certain indexes is not guaranteed to be fully contained within a single node as the topology changes, some data scans may be wasted.
>>>>>>>>>>>> > > > >>>>>    - Snapshot: No extra data scan.
>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>> > > > >>>>> 7. The overhead of actual data repair after an inconsistency is detected
>>>>>>>>>>>> > > > >>>>>    - Index: Only indexes are streamed to the base table node, and the actual data being fixed can be as accurate as the row level.
>>>>>>>>>>>> > > > >>>>>    - Snapshot: Anti-compaction is needed on the MV table, and potential over-streaming may occur due to the lack of row-level insight into data quality.
>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>> > > > >>>>> > one of my biggest concerns I haven't seen discussed much is LOCAL_SERIAL/SERIAL on read
>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>> > > > >>>>> Paxos v2 introduces an optimization where serial reads can be completed in just one round trip, reducing latency compared to traditional Paxos, which may require multiple phases.
>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>> > > > >>>>> > I think a refresh would be low-cost and give users the flexibility to run them however they want.
>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>> > > > >>>>> I think this is an interesting idea. Does it suggest that the MV should be rebuilt on a regular schedule? It sounds like an extension of the snapshot-based approach—rather than detecting mismatches, we would periodically reconstruct a clean version of the MV based on the snapshot. This seems to diverge from the current MV model in Cassandra, where consistency between the MV and base table must be maintained continuously. This could be an extension of the CEP-48 work, where the MV is periodically rebuilt from a snapshot of the base table, assuming the user can tolerate some level of staleness in the MV data.