Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-27 Thread James Berragan
On the Sidecar discussion: while Sidecar is the preferred mechanism for the 
reasons described, the API is sufficiently generic to plug in a user 
implementation (essentially, provide a list of SSTables for a token range, and 
a mechanism to open an InputStream on any SSTable file component). A user could, 
for example, easily read from backup snapshots on a blob store.
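
To make that concrete, here is a rough sketch of what such a pluggable data layer could look like (the names and signatures below are illustrative only, not the actual library API):

    import java.io.IOException;
    import java.io.InputStream;
    import java.math.BigInteger;
    import java.util.List;

    // Hypothetical plugin surface: the two capabilities a custom implementation
    // would need to supply to the bulk reader.
    public interface BulkReadDataLayer
    {
        // Enumerate the SSTables covering a token range, e.g. from the Sidecar,
        // a local snapshot directory, or backup snapshots on a blob store.
        List<SSTableSource> sstablesForRange(BigInteger startToken, BigInteger endToken);

        interface SSTableSource
        {
            // Open a stream on any SSTable file component
            // (Data.db, Index.db, Summary.db, Filter.db, ...).
            InputStream open(String component) throws IOException;
        }
    }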

> On Mar 26, 2023, at 1:04 PM, Josh McKenzie  wrote:
> 
> I want to second what Yifan's spoken to, specifically in terms of resource 
> isolation and availability.
> 
> While the sidecar hasn't seen a ton of traffic and contributions since the 
> acceptance into the project and clearance of CEP-1, my intuition is that 
> that's due to the entrenched maturity of alternative sidecars out there since 
> we were slow as a project to build one, not out of a lack of demand for a 
> fully fleshed out sidecar. As functionality shows up in the ASF C* Sidecar, 
> there's going to be tension as operators are incentivized to run both the 
> bespoke sidecars they may already have and the ASF C* one. That's to be 
> expected and a necessary pain to take on during a transition that I 
> personally think is sorely needed.
> 
> Having bulk operations for analytics and for reading and writing SSTables is 
> a pretty compelling carrot, and the more folks we can get running the sidecar 
> and the more contributors active on it, the more we can expect to see 
> interest and work show up there (repair coordination, REST APIs, etc. - all 
> of which we've talked about before on ML or slack).
> 
> So I'm a strong +1 to it living in the sidecar.
> 
> On Sat, Mar 25, 2023, at 11:05 AM, Brandon Williams wrote:
>> Oh, that's significantly different and great news, please do!  Thanks
>> for the clarification, Doug!
>> 
>> Kind Regards,
>> Brandon
>> 
>> On Fri, Mar 24, 2023 at 4:42 PM Doug Rohrer wrote:
>> >
>> > I agree that the analytics library will need to support vnodes. To be 
>> > clear, there’s nothing preventing the solution from working with vnodes 
>> > right now, and no assumptions about a 1:1 topology between a token and a 
>> > node. However, we don’t, today, have the ability to test vnode support 
>> > end-to-end. We are working towards that, though, and should be able to 
>> > remove the caveat from the released analytics library once we can properly 
>> > test vnode support.
>> > If it helps, I can update the CEP to say something more like “Caveat: 
>> > currently untested with vnodes - work is ongoing to remove this 
>> > limitation”?
>> >
>> > Doug
>> >
>> > > On Mar 24, 2023, at 11:43 AM, Brandon Williams wrote:
>> > >
>> > > On Fri, Mar 24, 2023 at 10:39 AM Jeremiah D Jordan wrote:
>> > >>
>> > >> I have concerns with the majority of this being in the sidecar and not 
>> > >> in the database itself.  I think it would make sense for the server 
>> > >> side of this to be a new service exposed by the database, not in the 
>> > >> sidecar.  That way it can properly integrate with the 
>> > >> authentication and authorization APIs, and be a first-class 
>> > >> citizen in terms of having unit/integration tests in the main DB 
>> > >> ensuring no one breaks it.
>> > >
>> > > I don't think this can/should happen until it supports the database's
>> > > default configuration with vnodes.
>> >



Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-27 Thread James Berragan
Complex predicates on non-partition keys naturally require pulling the entire 
data set into the Spark DataFrame to perform the query. We have some 
optimizations around column filtering and partition key predicates, utilizing 
the Filter.db/Summary.db/Index.db files so we only read the data that is needed. 
We have chatted with Caleb about utilizing the index file for SAI, but at 
present that is purely theoretical.
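
For illustration, a minimal Spark sketch of that kind of column pruning and partition-key pushdown (the data source format name and the options here are assumptions for illustration, not a definitive API):

    import static org.apache.spark.sql.functions.col;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder().appName("bulk-read-sketch").getOrCreate();
    Dataset<Row> df = spark.read()
            .format("org.apache.cassandra.spark.sparksql.CassandraDataSource") // assumed name
            .option("keyspace", "ks")   // illustrative options
            .option("table", "tbl")
            .load();
    // Column filtering: only the selected columns need to be deserialized, and a
    // partition-key predicate can be pruned using the Summary.db/Index.db/Filter.db files.
    df.select("id", "value")
      .filter(col("id").equalTo("some-partition-key"))
      .show();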

In terms of internals, beyond some util/serializer classes, the writer part 
depends on the CQLSSTableWriter and the reader uses the SSTableSimpleIterator 
and the CompactionIterator.
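
For the write side, that is essentially standard CQLSSTableWriter usage; a simplified sketch (the schema, statement and paths are just examples):

    import org.apache.cassandra.io.sstable.CQLSSTableWriter;

    String schema = "CREATE TABLE ks.tbl (id text PRIMARY KEY, value int)";
    String insert = "INSERT INTO ks.tbl (id, value) VALUES (?, ?)";
    // SSTables are written to a local directory first, then shipped to the cluster.
    try (CQLSSTableWriter writer = CQLSSTableWriter.builder()
                                                   .inDirectory("/tmp/bulkwrite/ks/tbl")
                                                   .forTable(schema)
                                                   .using(insert)
                                                   .build())
    {
        writer.addRow("a", 1);
        writer.addRow("b", 2);
    } // closing the writer flushes and finalizes the SSTable(s); exception handling omitted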

James.

> On Mar 27, 2023, at 3:06 PM, Jeremy Hanna  wrote:
> 
> Thank you for the write-up and the efforts on CASSANDRA-16222.  It sounds 
> like you've been using this for some time.  I understand from the rejected 
> alternatives that the Spark Cassandra Connector was slower because it goes 
> through the read and write path for C* rather than this backdoor mechanism.  
> In your experience using this, under what circumstances have you seen that 
> this tool is not a good fit for analytics - such as complex predicates?  The 
> challenge with the Spark Cassandra Connector and previously the Hadoop 
> integration is that it had to do full table scans even to get small amounts 
> of data.  It sounds like this is similar in that it has to do a full table 
> scan, but with the advantage of being faster and putting less load on the cluster.  In 
> other words, I'm asking if this has been a replacement for the Spark 
> Cassandra Connector or if there are cases in your work where SCC is a better 
> fit.
> 
> Also to Benjamin's point in the comments on the CEP itself, how coupled is 
> this to internals?  Are there going to be higher level APIs or is it going to 
> call internal storage classes directly?
> 
> Thanks!
> 
> Jeremy
> 
> 
>> On Mar 23, 2023, at 12:33 PM, Doug Rohrer  wrote:
>> 
>> Hi everyone,
>> 
>> Wiki: 
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics
>> 
>> We’d like to propose this CEP for adoption by the community.
>> 
>> It is common for teams using Cassandra to find themselves looking for a way 
>> to interact with large amounts of data for analytics workloads. However, 
>> Cassandra’s standard APIs aren’t suited to large-scale data egress/ingest, 
>> as the native read/write paths weren’t designed for bulk analytics.
>> 
>> We’re proposing this CEP for this exact purpose. It enables the 
>> implementation of custom Spark (or similar) applications that can either 
>> read or write large amounts of Cassandra data at line rates, by accessing 
>> the persistent storage of nodes in the cluster via the Cassandra Sidecar.
>> 
>> This CEP proposes new APIs in the Cassandra Sidecar and a companion library 
>> that integrates deeply with Apache Spark, allowing its users to bulk 
>> import or export data from a running Cassandra cluster with minimal to no 
>> impact on read/write traffic.
>> 
>> We will shortly publish a branch with code that will accompany this CEP to 
>> help readers understand it better.
>> 
>> As a reminder, please keep the discussion here on the dev list vs. in the 
>> wiki, as we’ve found it easier to manage via email.
>> 
>> Sincerely,
>> 
>> Doug Rohrer & James Berragan
> 



Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-04-12 Thread James Berragan
Hi Stefan, CDC is something we are also thinking about, and it is worthy of a 
separate discussion. We have tested Spark Streaming for CDC and I hope we can 
bolt it on in the future, but streaming technologies also come with more caveats 
and nuances (we have found it beneficial with CDC to store a small amount of 
state, which is at odds with Spark’s more stateless architecture). From that 
perspective I think it makes sense to keep CDC technology-agnostic and let the 
user plug in whichever system they want (Spark Streaming, Flink, custom, etc.).

James.

> On Apr 11, 2023, at 1:19 PM, Miklosovic, Stefan wrote:
> 
> Doug,
> 
> thanks for the diagrams, really helpful.
> 
> Do you think there might be some extension to this CEP (it does not necessarily 
> need to be included from the very beginning, just food for thought at this 
> point) which would read data from the commit log / CDC?
> 
> The main motivation behind this is that when one looks around at what is 
> currently possible with Spark, Cassandra often exists as a sink only when it 
> comes to streaming. For example, we can use the Kafka connector (1) so data 
> comes to Kafka, is streamed to Spark as RDDs, and Spark saves it to Cassandra 
> via the Spark Cassandra Connector. Such a transformation / pipeline is indeed 
> possible.
> 
> There is also a Cassandra + Ignite integration (2, 3), so Ignite can act as an 
> in-memory caching layer on top of Cassandra, which enables users to do 
> transformations over IgniteRDD and run queries that are not normally possible 
> (e.g. SQL joins in Ignite over these caches). Very handy. But there is no 
> Ignite streamer which would treat Cassandra as a realtime / near-realtime 
> source.
> 
> So there is currently no integration (correct me if I am wrong) which would 
> have Cassandra as a _real time_ source.
> 
> Looking at these diagrams, since you are able to load Cassandra data from 
> SSTables, would it be possible to continually fetch the offset in the CDC index 
> file (these changes first landed in 4.0, I think - ask Josh 
> McKenzie about the details), read those mutations and send them via Sidecar to 
> Spark?
> 
> Currently, the only solution I know of that does realtime-ish streaming 
> of mutations from CDC is the Debezium Cassandra connector (4), but it pushes 
> these mutations straight to Kafka only. I would love to have it in Spark first, 
> and then I can do whatever I want with it.
> 
> (1) https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
> (2) 
> https://ignite.apache.org/docs/latest/extensions-and-integrations/cassandra/overview
> (3) 
> https://ignite.apache.org/docs/latest/extensions-and-integrations/ignite-for-spark/ignitecontext-and-rdd
> (4) https://github.com/debezium/debezium-connector-cassandra
> 
> 
> From: Doug Rohrer
> Sent: Tuesday, April 11, 2023 0:37
> To: dev@cassandra.apache.org 
> Subject: Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark 
> Bulk Analytics
> 
> I’ve updated the CEP with two overview diagrams of the interactions between 
> Sidecar, Cassandra, and the Bulk Analytics library.  Hope this helps folks 
> better understand how things work, and thanks for the patience as it took a 
> bit longer than expected for me to find the time for this.
> 
> Doug
> 
> On Apr 5, 2023, at 11:18 AM, Doug Rohrer  wrote:
> 
> Sorry for the delay in responding here - yes, we can add some diagrams to the 
> CEP - I’ll try to get that done by end-of-week.
> 
> Thanks,
> 
> Doug
> 
> On Mar 28, 2023, at 1:14 PM, J. D. Jordan  wrote:
> 
> Maybe some data flow diagrams could be added to the cep showing some example 
> operations for read/write?
> 
> On Mar 28, 2023, at 11:35 AM, Yifan Cai  wrote:
> 
> 
> A lot of great discussions!
> 
> On the Sidecar front, especially regarding the role the Sidecar plays in this 
> CEP, I feel there might be some confusion. Once the code is published, we 
> should have clarity.
> The Sidecar does not read SSTables, nor does it do any coordination for 
> analytics queries. It is local to the companion Cassandra instance. For bulk 
> read, it takes snapshots and streams SSTables to Spark workers to read. For 
> bulk write, it imports the SSTables uploaded from Spark workers. All commands 
> are existing JMX/nodetool functionality from Cassandra; the Sidecar adds the 
> HTTP interface to them. This might be an oversimplified description. The 
> complex computation is performed in the Spark clusters only.
> 
> In the long run, Cassandra might evolve into a database that does both OLTP 
> and OLAP. (Not what this thread aims for)
> At the current stage, Spark is very suited for analytic purposes.
> 
> On Tue, Mar 28, 2023 at 9:06 A

Spark-Cassandra Bulk Reader: CASSANDRA-16222

2020-10-23 Thread James Berragan
Hi everyone,

I want to highlight to the dev community CASSANDRA-16222, a Spark library we
have been working on that can compact and read raw Cassandra SSTables into
SparkSQL.

By reading the SSTables directly from a snapshot directory we are able to
achieve high performance with minimal impact on a production cluster. As an
example, we successfully exported a ~32TB Cassandra table (~46bn CQL rows)
to HDFS in Parquet format in around 1h10m, a 20x improvement on previous
solutions.

You can find the code on GitHub:
https://github.com/jberragan/spark-cassandra-bulkreader.

We would like to contribute the code to the project and open it up to more
Cassandra users.

James.


Re: [VOTE] CEP-44: Kafka integration for Cassandra CDC using Sidecar

2024-10-21 Thread James Berragan
Vote passes with seven +1s (five binding) and no vetoes.

Ref: https://lists.apache.org/thread/5pq0js4cvnxozrs2cf63p3jf7qk0h1rc.

James.

On Fri, 18 Oct 2024 at 12:51, Dinesh Joshi  wrote:

> +1
>
> On Fri, Oct 18, 2024 at 11:09 AM Jon Haddad 
> wrote:
>
>> +1
>>
>> On Fri, Oct 18, 2024 at 10:51 AM Bernardo Botella <
>> conta...@bernardobotella.com> wrote:
>>
>>> +1 nb
>>>
>>> On Oct 17, 2024, at 5:52 PM, Josh McKenzie  wrote:
>>>
>>> +1
>>>
>>> On Thu, Oct 17, 2024, at 2:51 PM, Yifan Cai wrote:
>>>
>>> +1 nb
>>>
>>> --
>>>
>>> *From:* Brandon Williams 
>>> *Sent:* Thursday, October 17, 2024 11:47:13 AM
>>> *To:* dev@cassandra.apache.org 
>>> *Subject:* Re: [VOTE] CEP-44: Kafka integration for Cassandra CDC using
>>> Sidecar
>>>
>>> +1
>>>
>>> Kind Regards,
>>> Brandon
>>>
>>> On Thu, Oct 17, 2024 at 1:08 PM James Berragan 
>>> wrote:
>>> >
>>> > Hi everyone,
>>> >
>>> > I would like to start the voting for CEP-44 as all the feedback in the
>>> discussion thread seems to be addressed.
>>> >
>>> > Proposal: CEP-44: Kafka integration for Cassandra CDC using Sidecar
>>> > Discussion thread:
>>> https://lists.apache.org/thread/8k6njsnvdbmjb6jhyy07o1s7jz8xp1qg
>>> >
>>> > As per the CEP process documentation, this vote will be open for 72
>>> hours (longer if needed).
>>> >
>>> > Thanks!
>>> > James.
>>>
>>>
>>>


[DISCUSS] CEP-44: Kafka integration for Cassandra CDC using Sidecar

2024-09-27 Thread James Berragan
Hi everyone,

Wiki:
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-44%3A+Kafka+integration+for+Cassandra+CDC+using+Sidecar

We would like to propose this CEP for adoption by the community.

CDC is a common technique in databases but right now there is no
out-of-the-box solution to do this easily and at scale with Cassandra. Our
proposal is to build a fully-fledged solution into the Apache Cassandra
Sidecar. This comes with a number of benefits:
- Sidecar is an official part of the existing Cassandra eco-system.
- Sidecar runs co-located with Cassandra instances and so scales with the
cluster size.
- Sidecar can access the underlying Cassandra database to store CDC
configuration and the CDC state in a special table.
- Running in the Sidecar does not require additional external resources to
run.

We anticipate the core CDC module will be pluggable and re-usable; it is
available for review here:
https://github.com/apache/cassandra-analytics/pull/87. The remaining
Sidecar code will follow.

As a reminder, please keep the discussion here on the dev list vs. in the
wiki, as we’ve found it easier to manage via email.

Sincerely,
James Berragan
Bernardo Botella Corbi
Yifan Cai
Jyothsna Konisa


Re: [DISCUSS] CEP-44: Kafka integration for Cassandra CDC using Sidecar

2024-09-30 Thread James Berragan
Thanks for the discussions. I do anticipate that Accord will make things
very much better; however, I think if consumers are ultimately going to
replay the log into some other system (say Apache Iceberg), exactly-once
delivery will always be tricky, but perhaps not entirely necessary given
the linearizability guarantees Accord will bring, either to safely replay
mutations in order or to do something like "APPLY mutation IFF
current.txn_id < mutation.txn_id". This does still put some load on the
end user to verify, even if the guarantees are stronger, and it is not too far
off what we currently recommend for verifying the last-write-wins timestamps.
(My Accord insight is obviously limited, so I'm happy to be corrected if I'm
wrong anywhere here!)
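
To illustrate the "apply iff newer" idea, a simplified sketch (the types here are assumed for illustration, not part of the CEP):

    import java.util.HashMap;
    import java.util.Map;

    // The consumer tracks the highest transaction/write id applied per key and only
    // applies a replayed mutation when it is strictly newer, making replays idempotent.
    class IdempotentApplier
    {
        private final Map<String, Long> lastApplied = new HashMap<>(); // key -> last applied txn id

        void maybeApply(String key, long txnId, Runnable applyMutation)
        {
            long current = lastApplied.getOrDefault(key, Long.MIN_VALUE);
            if (current < txnId)   // APPLY mutation IFF current.txn_id < mutation.txn_id
            {
                applyMutation.run();
                lastApplied.put(key, txnId);
            }
            // otherwise: a duplicate or out-of-order replay, safe to skip
        }
    }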

On token leadership, the Sidecar comes with soft/implicit primary-range
ownership of the co-located Cassandra node(s). We implemented a basic
failover process to satisfy availability up to RF=3, but I think for strong
guarantees hooking into TCM would be the ultimate goal.

James.


Re: [DISCUSS] CEP-44: Kafka integration for Cassandra CDC using Sidecar

2024-10-01 Thread James Berragan
cular use
> case. I don't see much on how this would be handled other than "left to the
> end user to figure out."
>
> There is also little mention of where the increased resource load would be
> handled.
>
> This has been discussed many times before, but is it time to introduce the
> concept of an elected leader for a token range for this type of operation?
> It would eliminate a ton of problems that need to be managed when bridging C*
> to a system like Kafka. The last time it was discussed in earnest was for
> KIP-30:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-30+-+Allow+for+brokers+to+have+plug-able+consensus+and+meta+data+storage+sub+systems
>
>
> Patrick
>
> On Sat, Sep 28, 2024 at 11:44 AM Jon Haddad 
> wrote:
>
> Yes! I’m really looking forward to trying this out. The CEP looks really
> well thought out. I think this will make CDC a lot more useful for a lot of
> teams.
> Jon
>
>
> On Fri, Sep 27, 2024 at 4:23 PM Josh McKenzie 
> wrote:
>
>
> Really excited to see this hit the ML James.
>
> As author of the base CDC (get your stones ready for throwing :D) and
> someone moderately involved in the CEP here, I definitely welcome any
> questions. CDC is a *thorny problem* in a multi-replica distributed
> system like this.
>
> On Fri, Sep 27, 2024, at 5:40 PM, James Berragan wrote:
>
> Hi everyone,
>
> Wiki:
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-44%3A+Kafka+integration+for+Cassandra+CDC+using+Sidecar
>
> We would like to propose this CEP for adoption by the community.
>
> CDC is a common technique in databases but right now there is no
> out-of-the-box solution to do this easily and at scale with Cassandra. Our
> proposal is to build a fully-fledged solution into the Apache Cassandra
> Sidecar. This comes with a number of benefits:
> - Sidecar is an official part of the existing Cassandra eco-system.
> - Sidecar runs co-located with Cassandra instances and so scales with the
> cluster size.
> - Sidecar can access the underlying Cassandra database to store CDC
> configuration and the CDC state in a special table.
> - Running in the Sidecar does not require additional external resources to
> run.
>
> The core CDC module we anticipate will be pluggable and re-usable, it is
> available for review here:
> https://github.com/apache/cassandra-analytics/pull/87. The remaining
> Sidecar code will follow.
>
> As a reminder, please keep the discussion here on the dev list vs. in the
> wiki, as we’ve found it easier to manage via email.
>
> Sincerely,
> James Berragan
> Bernardo Botella Corbi
> Yifan Cai
> Jyothsna Konisa
>
>
>
>


[VOTE] CEP-44: Kafka integration for Cassandra CDC using Sidecar

2024-10-17 Thread James Berragan
Hi everyone,

I would like to start the voting for CEP-44 as all the feedback in the
discussion thread seems to be addressed.

Proposal: CEP-44: Kafka integration for Cassandra CDC using Sidecar

Discussion thread:
https://lists.apache.org/thread/8k6njsnvdbmjb6jhyy07o1s7jz8xp1qg

As per the CEP process documentation, this vote will be open for 72 hours
(longer if needed).

Thanks!
James.


Re: [DISCUSS] Tooling to repair MV through a Spark job

2024-12-06 Thread James Berragan
I think this would be useful and - having never really used Materialized
Views - I didn't know it was an issue for some users. I would say the
Cassandra Analytics library (http://github.com/apache/cassandra-analytics/)
could be utilized for much of this, with a specialized Spark job for this
purpose.
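
As a very rough sketch (assuming both tables are bulk-read into DataFrames keyed on the primary key; readTable below is a placeholder helper and the column names are illustrative), the comparison described in the quoted proposal below could boil down to a full outer join:

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.lit;
    import static org.apache.spark.sql.functions.when;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    Dataset<Row> base = readTable(spark, "ks", "base_table"); // placeholder: bulk-read via the analytics library
    Dataset<Row> mv   = readTable(spark, "ks", "mv_table");

    Dataset<Row> joined = base.alias("b")
                              .join(mv.alias("m"), col("b.pk").equalTo(col("m.pk")), "full_outer");

    // Categorize each primary key into the four buckets described in the proposal.
    Dataset<Row> categorized = joined.withColumn("category",
            when(col("b.pk").isNull(), lit("MissingInBaseTable"))
           .when(col("m.pk").isNull(), lit("MissingInMV"))
           .when(col("b.val").eqNullSafe(col("m.val")), lit("Consistent"))
           .otherwise(lit("Inconsistent")));

    categorized.groupBy("category").count().show();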

On Fri, 6 Dec 2024 at 08:26, Jaydeep Chovatia 
wrote:

> Hi,
>
> *NOTE:* This email does not promote using Cassandra's Materialized View
> (MV) but assists those stuck with it for various reasons.
>
> The primary issue with MV is that once it goes out of sync with the base
> table, no tooling is available to remediate it. This Spark job aims to fill
> this gap by logically comparing the MV with the base table and identifying
> inconsistencies. The job primarily does the following:
>
>    - Scans the base table (A) and the MV (B), and does an {A}-{B} analysis
>    - Categorizes each record into one of four areas: a) Consistent, b)
>    Inconsistent, c) MissingInMV, d) MissingInBaseTable
>    - Provides a detailed view of mismatches, such as the primary key, all
>    the non-primary-key fields, and the mismatched columns
>    - Dumps the detailed information to an output folder path provided to
>    the job (one can extend the interface to dump the records to some object
>    store as well)
>    - Optionally, the job fixes the MV inconsistencies
>    - Rich configuration (throttling, actionable output, capability to
>    specify the time range for the records, etc.) to run the job at scale in a
>    production environment
>
> Design doc: link
> 
> The Git Repository: link
> 
>
> *Motivation*
>
>1. This email's primary objective is to share with the community that
>something like this is available for MV (in a private repository), which
>may be helpful in emergencies to folks stuck with MV in production.
>2. If we, as a community, want to officially foster tooling using
>Spark because it can be helpful to do many things beyond the MV work, such
>as counting rows, etc., then I am happy to drive the efforts.
>
> Please let me know what you think.
>
> Jaydeep
>


Re: [DISCUSS] Snapshots outside of Cassandra data directory

2025-01-21 Thread James Berragan
I think this is an idea worth exploring. My guess is that even if the scope
is confined to just "copy if not exists", it would still largely be used as
a cloud-agnostic backup/restore solution, and so it will be shaped accordingly.

Some thoughts:

- I think it would be worth exploring more what the directory structure
looks like. You mention a flat directory hierarchy, but it seems to me it
would need to be delimited by node (or token range) in some way, as the
SSTable identifier will not be unique across the cluster. If we do need to
delimit by node, is the configuration burden then on the user to mount
individual drives on S3/Azure/wherever at unique per-node paths? What do
they do in the event of a host replacement - back up to a new empty
directory?

- The challenge with restore is often restoring from snapshots created
before a cluster topology change (node replacements, token moves,
cluster expansions/shrinks, etc). This could be solved by storing the
snapshot token information in the manifest somewhere. Ideally the user
shouldn't have to scan token information across all SSTables in the snapshot
to determine which ones to restore.

- I didn't understand the TTL mechanism. If we only copy SSTables that
haven't been seen before, some SSTables will exist indefinitely across
snapshots (i.e. L4), while others (in L0) will quickly disappear. There
needs to be a mechanism to determine if the SSTable is expirable (i.e. no
longer exists in active snapshots) by comparing the manifests at the time
of snapshot TTL.
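
As a rough sketch of that expiry check (the manifest/helper types below are assumed for illustration): an SSTable copied to the external snapshot root only becomes deletable once no still-active snapshot manifest references its unique identifier.

    import java.util.HashSet;
    import java.util.Set;

    // Collect the SSTable ids referenced by every snapshot that has not yet expired.
    Set<String> referenced = new HashSet<>();
    for (SnapshotManifest manifest : activeManifests)       // placeholder types/collections
        referenced.addAll(manifest.sstableIds());

    // Anything present under snapshot_root_dir but no longer referenced can be expired.
    for (String sstableId : copiedSSTableIds)
        if (!referenced.contains(sstableId))
            deleteFromSnapshotRoot(sstableId);               // placeholder helper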

Broadly, it sounds like we are saving the operator the burden of performing
snapshot uploads to some cloud service, but there are benefits (at least
from a backup perspective) to performing that independently - e.g. managing
bandwidth usage or adding extra security layers.

James.

On Tue, 21 Jan 2025 at 01:05, Štefan Miklošovič 
wrote:

> If you ask specifically about how TTL snapshots are handled: there is a
> thread running a task scheduled every n seconds (not sure what the
> default is) which just checks the "expired_at" field in the manifest to see
> whether it has expired. If it has, it will proceed to delete it like any other
> snapshot. Then the logic I have described above would apply.
>
> On Tue, Jan 21, 2025 at 10:01 AM Štefan Miklošovič 
> wrote:
>
>>
>>
>> On Tue, Jan 21, 2025 at 5:30 AM Francisco Guerrero 
>> wrote:
>>
>>> I think we should evaluate the benefits of the feature you are proposing
>>> independently of how it might be used by Sidecar or other tools. As it
>>> is, it already sounds like useful functionality to have in the core of the
>>> Cassandra process.
>>>
>>> Tooling around Cassandra, including Sidecar, can then leverage this
>>> functionality to create snapshots, and then add additional capabilities
>>> on top to perform backups.
>>>
>>> I've added some comments inline below:
>>>
>>> On 2025/01/12 18:25:07 Štefan Miklošovič wrote:
>>> > Hi,
>>> >
>>> > I would like to run this through the ML to gather feedback, as we are
>>> > contemplating making this happen.
>>> >
>>> > Currently, snapshots are just hard links in a snapshot directory to the
>>> > live data directory. That is super handy as it occupies virtually zero disk
>>> > space (as long as the underlying SSTables are not compacted away, at which
>>> > point their size would "materialize").
>>> >
>>> > On the other hand, because it is a hard link, it is not possible to make
>>> > hard links across block devices (the infamous "Invalid cross-device link"
>>> > error). That means that snapshots can only ever be located on the very same
>>> > disk Cassandra has its data dirs on.
>>> >
>>> > Imagine there is a company ABC which has a 10 TiB disk (or NFS share)
>>> > mounted to a Cassandra node and they would like to use that as cheap / cold
>>> > storage for snapshots. They do not care about the speed of such storage,
>>> > nor do they care about how much space it occupies, when it comes to
>>> > snapshots.
>>> > On the other hand, they do not want to have snapshots occupying disk
>>> > space where Cassandra has its data, because they consider it to be a waste
>>> > of space. They would like to utilize the fast disk and its space for
>>> > production data to the max, and snapshots might eat a lot of that space
>>> > unnecessarily.
>>> >
>>> > There might be a configuration property like "snapshot_root_dir:
>>> > /mnt/nfs/cassandra" and if a snapshot is taken, it would just copy SSTables
>>> > there, but we need to be a little bit smart here. (By default, it would all
>>> > work as it does now - hard links to snapshot directories located under
>>> > Cassandra's data_file_directories.)
>>> >
>>> > Because it is a copy, it occupies disk space. But if we took 100 snapshots
>>> > on the same SSTables, we would not want to copy the same files 100 times.
>>> > There is a very handy way to prevent this - unique SSTable identifiers
>>> > (under the already existing uuid_sstable_identifiers_enabled property) so we
>>> >