Hi,

I think this might be a great contribution in the light of removed Hadoop 
integration recently (CASSANDRA-18323) as it will not be in 5.0 anymore. If 
this CEP is adopted and delivered, I can see how it might be a logical 
replacement of that.

Regards

________________________________________
From: Doug Rohrer <droh...@apple.com>
Sent: Thursday, March 23, 2023 18:33
To: dev@cassandra.apache.org
Cc: James Berragan
Subject: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk 
Analytics

NetApp Security WARNING: This is an external email. Do not click links or open 
attachments unless you recognize the sender and know the content is safe.




Hi everyone,

Wiki: 
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics

We’d like to propose this CEP for adoption by the community.

It is common for teams using Cassandra to find themselves looking for a way to 
interact with large amounts of data for analytics workloads. However, 
Cassandra’s standard APIs aren’t designed for large scale data egress/ingest as 
the native read/write paths weren’t designed for bulk analytics.

We’re proposing this CEP for this exact purpose. It enables the implementation 
of custom Spark (or similar) applications that can either read or write large 
amounts of Cassandra data at line rates, by accessing the persistent storage of 
nodes in the cluster via the Cassandra Sidecar.

This CEP proposes new APIs in the Cassandra Sidecar and a companion library 
that allows deep integration into Apache Spark that allows its users to bulk 
import or export data from a running Cassandra cluster with minimal to no 
impact to the read/write traffic.

We will shortly publish a branch with code that will accompany this CEP to help 
readers understand it better.

As a reminder, please keep the discussion here on the dev list vs. in the wiki, 
as we’ve found it easier to manage via email.

Sincerely,

Doug Rohrer & James Berragan

Reply via email to