Hi, I think this might be a great contribution in light of the recent removal of the Hadoop integration (CASSANDRA-18323), which will no longer be in 5.0. If this CEP is adopted and delivered, I can see it becoming a logical replacement for that.

Regards

________________________________________
From: Doug Rohrer <droh...@apple.com>
Sent: Thursday, March 23, 2023 18:33
To: dev@cassandra.apache.org
Cc: James Berragan
Subject: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

Hi everyone,

Wiki: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics

We’d like to propose this CEP for adoption by the community.

It is common for teams using Cassandra to find themselves looking for a way to interact with large amounts of data for analytics workloads. However, Cassandra’s standard APIs aren’t designed for large scale data egress/ingest as the native read/write paths weren’t designed for bulk analytics.

We’re proposing this CEP for this exact purpose. It enables the implementation of custom Spark (or similar) applications that can either read or write large amounts of Cassandra data at line rates, by accessing the persistent storage of nodes in the cluster via the Cassandra Sidecar.

This CEP proposes new APIs in the Cassandra Sidecar and a companion library that allows deep integration into Apache Spark that allows its users to bulk import or export data from a running Cassandra cluster with minimal to no impact to the read/write traffic.

We will shortly publish a branch with code that will accompany this CEP to help readers understand it better.

As a reminder, please keep the discussion here on the dev list vs. in the wiki, as we’ve found it easier to manage via email.

Sincerely,

Doug Rohrer & James Berragan