Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-04 Thread Jon Haddad
> This isn’t really the whole story. The amount of wasted scans on index
> repairs is negligible. If a difference is detected with snapshot repairs
> though, you have to read the entire partition from both the view and base
> table to calculate what needs to be fixed.

You nailed it.

When base table data is converted to view format and sent to the view, the
only information we have is that one of the view's partition keys needs a
repair.  That's going to be different from the partition key of the base
table.  As a result, on the base table, for each affected range, we'd have
to issue another compaction across the entire set of SSTables that could
have the data the view needs (potentially many GB), in order to produce the
corrected version of the partition and then send it over to the view.
Without an index in place, we have to do yet another scan, per affected
range.
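To make the key-mapping problem concrete, here is a toy Python illustration
with a hypothetical schema (base keyed by user_id, view keyed by email); it
is not the CEP's example, just a sketch of why a mismatching view key can't
be traced back to base SSTables without a scan or an index:

    # Toy illustration (hypothetical schema, not from the CEP): the view's
    # partition key is a non-key column of the base table, so a mismatching
    # view key cannot be mapped back to a base partition without scanning
    # the base data or consulting an index.

    base = {  # base table: partition key = user_id
        1: {"email": "a@example.com", "name": "Alice"},
        2: {"email": "b@example.com", "name": "Bob"},
    }

    view = {  # materialized view: partition key = email
        "a@example.com": {"user_id": 1, "name": "Alice"},
        # the "b@example.com" row was lost with the corrupted view SSTable
    }

    missing_view_key = "b@example.com"

    # Without an email -> user_id index, the only way to rebuild the lost
    # view row is to scan every base partition in the affected range(s):
    rebuilt = [
        {"user_id": uid, "name": row["name"]}
        for uid, row in base.items()          # full scan of the base data
        if row["email"] == missing_view_key
    ]
    print(rebuilt)   # [{'user_id': 2, 'name': 'Bob'}]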

Consider the case of a single corrupted SSTable on the view that's removed
from the filesystem, or whose data is simply missing after being restored
from an inconsistent backup.  It presumably contains lots of partitions,
which map to base partitions all over the cluster, in a lot of different
token ranges.  For every one of those ranges (hundreds to tens of thousands
of them, given the checkerboard design), when finding the missing data in
the base, you'll have to perform a compaction across all the SSTables that
potentially contain the missing data just to rebuild the view-oriented
partitions that need to be sent to the view.  The complexity of this
operation is roughly O(N*M), where N and M are the number of ranges
affected by the corruption in the base table and the view, respectively.
Without an index in place, finding the missing data is very expensive.  We
potentially have to do it several times on each node, depending on the size
of the range.  Smaller ranges increase the number of cells on the board
quadratically, larger ranges increase the number of SSTables that would be
involved in each compaction.

Then you send that data over to the view, and the view does its
anti-compaction thing, again once per affected range.  So the view has to
do an anti-compaction for every block on the board that's affected by the
missing data.

Doing hundreds or thousands of these will add up pretty quickly.
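To see how the terms multiply, here's a rough back-of-envelope in Python;
every number is made up purely to illustrate the shape of the cost, not
taken from the CEP or from any benchmark:

    # Back-of-envelope sketch of snapshot repair after losing a view SSTable,
    # with no index on the base.  All inputs are assumptions.
    base_ranges_affected = 1000       # N: base ranges the lost view data maps to (assumed)
    view_ranges_affected = 10         # M: view ranges covered by the lost SSTable (assumed)
    gb_compacted_per_base_range = 20  # data compacted/scanned per affected base range (assumed)
    mb_per_sec = 100                  # sustained scan/compaction throughput per node (assumed)

    cells = base_ranges_affected * view_ranges_affected   # blocks on the board, O(N*M)
    base_scan_gb = base_ranges_affected * gb_compacted_per_base_range
    scan_hours = base_scan_gb * 1024 / mb_per_sec / 3600

    print(f"affected cells on the board: {cells}")
    print(f"base data to re-compact: {base_scan_gb} GB (~{scan_hours:.0f} h of scanning)")
    print(f"view-side anti-compactions: {cells} (one per affected cell)")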

When I said that a repair could take months, this is what I had in mind.




On Tue, Jun 3, 2025 at 11:10 AM Blake Eggleston wrote:

> > Adds overhead in the hot path due to maintaining indexes. Extra memory
> needed during write path and compaction.
>
> I’d make the same argument about the overhead of maintaining the index
> that Jon just made about the disk space required. The relatively
> predictable overhead of maintaining the index as part of the write and
> compaction paths is a pro, not a con. Although you’re not always paying the
> cost of building a merkle tree with snapshot repair, it can impact the hot
> path and you do have to plan for it.
>
> > Verifies index content, not actual data—may miss low-probability errors
> like bit flips
>
> Presumably this could be handled by the views performing repair against
> each other? You could also periodically rebuild the index or perform
> checksums against the sstable content.
>
> > Extra data scan during inconsistency detection
> > Index: Since the data covered by certain indexes is not guaranteed to be
> fully contained within a single node as the topology changes, some data
> scans may be wasted.
> > Snapshots: No extra data scan
>
> This isn’t really the whole story. The amount of wasted scans on index
> repairs is negligible. If a difference is detected with snapshot repairs
> though, you have to read the entire partition from both the view and base
> table to calculate what needs to be fixed.
>
> On Tue, Jun 3, 2025, at 10:27 AM, Jon Haddad wrote:
>
> One practical aspect that isn't immediately obvious is the disk space
> consideration for snapshots.
>
> When you have a table with a mixed workload using LCS or UCS with scaling
> parameters like L10 and initiate a repair, the disk usage will increase as
> long as the snapshot persists and the table continues to receive writes.
> This aspect is understood and factored into the design.
>
> However, a more nuanced point is the necessity to maintain sufficient disk
> headroom specifically for running repairs. This echoes the challenge with
> STCS compaction, where enough space must be available to accommodate the
> largest SSTables, even when they are not being actively compacted.
>
> For example, if a repair involves rewriting 100GB of SSTable data, you'll
> consistently need to reserve 100GB of free space to facilitate this.
>
> Therefore, while the snapshot-based approach leads to variable disk space
> utilization, operators must provision storage as if the maximum potential
> space will be used at all times to ensure repairs can be executed.
>
> This introduces a rate of churn dynamic, where the write throughput
> dictates the required extra disk space, rath

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-04 Thread Runtian Liu
> We potentially have to do it several times on each node, depending on
> the size of the range. Smaller ranges increase the number of cells on the
> board quadratically, larger ranges increase the number of SSTables that
> would be involved in each compaction.
As described in the CEP example, this can be handled in a single round of
repair: we first identify all the points in the grid that require repair,
then perform anti-compaction and stream data based on a second scan over
those identified points. That is how the snapshot-based solution works;
without an index, repairing a single point in that grid requires scanning
the entire base table partition (token range). In contrast, with the
index-based solution, as in the example you referenced, if a large block of
data is corrupted, many key mismatches may occur even though the index is
used for comparison. This can lead to random disk access to the original
data files, which could cause performance issues. For the snapshot-based
case you mentioned, it should not take months to repair all the data; one
round of repair should be enough, because the actual repair phase is
separate from the detection phase.



Re: [DISCUSS] CEP-48: First-Class Materialized View Support

2025-06-04 Thread Blake Eggleston
You can detect and fix the mismatch in a single round of repair, but the amount
of work needed to do it is _significantly_ higher with snapshot repair.
Consider a case where we have a 300-node cluster with RF 3, where each view
partition contains entries mapping to every token range in the cluster, so 100
ranges. If we lose a view SSTable, it will affect an entire row/column of the
grid. Repair is going to scan all the data in the mismatching view token ranges
100 times, and each base range once. So you’re looking at 200 range scans.
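Restating that arithmetic as a tiny sketch (the inputs are just the
assumptions from the example above, nothing measured):

    # Scan count for repairing one lost view SSTable under snapshot repair,
    # using the assumptions stated above (300 nodes, RF 3).
    nodes, rf = 300, 3
    ranges = nodes // rf                 # 100 token ranges in the grid

    # Losing one view SSTable dirties a full row of the grid: the affected
    # view range has to be compared against every base range.
    view_range_scans = ranges            # same view range rescanned per pairing
    base_range_scans = ranges            # each base range scanned once

    print(f"pairwise repairs: {ranges}")
    print(f"total range scans: {view_range_scans + base_range_scans}")   # 200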

Now, you may argue that you can merge the duplicate view scans into a single 
scan while you repair all token ranges in parallel. I’m skeptical that’s going 
to be achievable in practice, but even if it is, we’re now talking about the 
view replica hypothetically doing a pairwise repair with every other replica in 
the cluster at the same time. Neither of these options is workable.

Let’s take a step back though, because I think we’re getting lost in the weeds.

The repair design in the CEP has some high level concepts that make a lot of 
sense, the idea of repairing a grid is really smart. However, it has some 
significant drawbacks that remain unaddressed. I want this CEP to succeed, and 
I know Jon does too, but the snapshot repair design is not a viable path 
forward. It’s the first iteration of a repair design. We’ve proposed a second 
iteration, and we’re open to a third iteration. This part of the CEP process is 
meant to identify and address shortcomings, and I don’t think that continuing
to dissect the snapshot repair design is making progress in that direction.
