Re: [DISCUSS] Tooling to repair MV through a Spark job

2024-12-06 Thread Jaydeep Chovatia
Thanks for the information, Yifan and James! Given that, we can scope this email discussion to this specific MV repair only. Two points: 1. Can this MV repair job provide some added value? 2. If yes, does it even make sense to merge this MV repair tooling, which uses Spark as its underlying technol

Re: [DISCUSS] Tooling to repair MV through a Spark job

2024-12-06 Thread Yifan Cai
Oh, I just noticed that James already mentioned it. On Fri, Dec 6, 2024 at 3:51 PM Yifan Cai wrote: > I would like to highlight an existing tooling for "many things beyond the > MV work, such as counting rows, etc." > > The Apache Cassandra Analytics project ( > http://github.com/apache/cassandr

Re: [DISCUSS] Tooling to repair MV through a Spark job

2024-12-06 Thread Yifan Cai
I would like to highlight an existing tooling for "many things beyond the MV work, such as counting rows, etc." The Apache Cassandra Analytics project ( http://github.com/apache/cassandra-analytics/) could be a great resource for this type of task. It reads directly from the SSTables in the Spark

Re: [DISCUSS] Tooling to repair MV through a Spark job

2024-12-06 Thread Jaydeep Chovatia
There are two approaches I have been thinking about for MV. *1. Short Term (Status Quo)* Here, we do not improve the Cassandra MV architecture in a way that drastically reduces data inconsistencies; thus, we continue to mark MV as an experimental feature. In this case, we can have two suboptio

Re: [DISCUSS] Tooling to repair MV through a Spark job

2024-12-06 Thread Jeff Jirsa
It feels uncomfortable asking users to rely on a third party that’s as heavy-weight as Spark to use a built-in feature. Can we really not do this internally? I get that the obvious way with Merkle trees is hard because of the range fanout of the MV using a different partitioner, but have we tried

Re: [DISCUSS] Tooling to repair MV through a Spark job

2024-12-06 Thread James Berragan
I think this would be useful and - having never really used Materialized Views - I didn't know it was an issue for some users. I would say the Cassandra Analytics library (http://github.com/apache/cassandra-analytics/) could be utilized for much of this, with a specialized Spark job for this purpos

[DISCUSS] Tooling to repair MV through a Spark job

2024-12-06 Thread Jaydeep Chovatia
Hi, *NOTE:* This email does not promote using Cassandra's Materialized View (MV) but assists those stuck with it for various reasons. The primary issue with MV is that once it goes out of sync with the base table, no tooling is available to remediate it. This Spark job aims to fill this gap by lo