Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

Miklosovic, Stefan Fri, 24 Mar 2023 03:17:29 -0700

Good point, Benjamin.

You wrote: "library could be offered as an open source project outside of the 
Cassandra project itself".


If Cassandra's code makes integrations like these possible (which I guess is 
part of the CEP), is there any reason this has to live under the Cassandra 
project umbrella instead of being hosted in a separate repository?

We could certainly advertise and promote it on the website, on the ecosystem 
page (1).

The logical successor of the Hadoop integration (which we had in the repository 
until recently) does not have to be in the repository again. We might expose 
ourselves unnecessarily to the same risk we had with Hadoop if the code is no 
longer maintained for whatever reason, be it technological obsolescence or a 
shortage of maintainers.

(1) https://cassandra.apache.org/_/ecosystem.html

________________________________________
From: Benjamin Lerer <ble...@apache.org>
Sent: Friday, March 24, 2023 10:35
To: dev@cassandra.apache.org
Subject: Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark 
Bulk Analytics

Hi Doug,

Outside of the changes to the Cassandra Sidecar that are mentioned, what the 
CEP proposes is the donation of a library for Spark integration. It seems to me 
that this library could be offered as an open source project outside of the 
Cassandra project itself. If we accept Spark Bulk Analytics as part of the 
Cassandra project, it means that the community will commit to maintaining it 
and ensuring that it is fully compatible with each Cassandra release. 
Considering our history with the Hadoop integration, which has basically been 
unmaintained for years, I am not convinced that this is what we should do.
We only started to expand the scope of the project recently, and I would 
personally prefer that we do so slowly, starting with the drivers that are 
critical for C*. That said, this is only my personal opinion, and other people 
might have a different view on those things.

On Thu, 23 Mar 2023 at 23:29, Miklosovic, Stefan 
<stefan.mikloso...@netapp.com> wrote:
Hi,

I think this might be a great contribution in light of the recent removal of 
the Hadoop integration (CASSANDRA-18323), which will no longer be in 5.0. If 
this CEP is adopted and delivered, I can see how it might be a logical 
replacement for it.

Regards

________________________________________
From: Doug Rohrer <droh...@apple.com>
Sent: Thursday, March 23, 2023 18:33
To: dev@cassandra.apache.org
Cc: James Berragan
Subject: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk 
Analytics

Hi everyone,

Wiki: 
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics

We’d like to propose this CEP for adoption by the community.

It is common for teams using Cassandra to find themselves looking for a way to 
interact with large amounts of data for analytics workloads. However, 
Cassandra’s standard APIs aren’t suited to large-scale data egress/ingest, as 
the native read/write paths weren’t designed for bulk analytics.

We’re proposing this CEP for this exact purpose. It enables the implementation 
of custom Spark (or similar) applications that can either read or write large 
amounts of Cassandra data at line rates, by accessing the persistent storage of 
nodes in the cluster via the Cassandra Sidecar.

This CEP proposes new APIs in the Cassandra Sidecar and a companion library 
that integrates deeply with Apache Spark, allowing its users to bulk import or 
export data from a running Cassandra cluster with minimal to no impact on the 
read/write traffic.

We will shortly publish a branch with code that will accompany this CEP to help 
readers understand it better.

As a reminder, please keep the discussion here on the dev list rather than in 
the wiki, as we’ve found it easier to manage via email.

Sincerely,

Doug Rohrer & James Berragan
