Re: [EXTERNAL] [DISCUSS] Next release date

2023-03-27 Thread Henrik Ingo
Not so fast...

There's certainly value in spending that time stabilizing the already done
features. It's valuable triaging information to say this used to work
before CEP-21 and only broke after it.

That said, having a very long freeze of trunk, or alternatively having a
very long-lived 5.0 branch that is waiting for Accord and diverging from a
trunk that is not frozen... are both undesirable options. (A month or two
could IMO be discussed though.) So I agree with the concern from that point
of view, I just don't agree that having one batch of big features in a
stabilization period has zero value.


henrik



On Fri, Mar 24, 2023 at 5:23 PM Jeremiah D Jordan wrote:

> Given the fundamental change to how cluster operations work coming from
> CEP-21, I’m not sure what freezing early for “extra QA time” really buys
> us?  I wouldn’t trust any multi-node QA done pre commit.
> What “stabilizing” do we expect to be doing during this time?  How much of
> it do we just have to do again after those things merge?  I for one do not
> like to have release branches cut months before their expected release.  It
> just adds extra merge forward and “where should this go”
> questions/overhead.  It could make sense to me to branch when CEP-21
> merges and only let in CEP-15 after that.  CEP-15 is mostly “net new stuff”
> and not “changes to existing stuff” from my understanding?  So no QA effort
> wasted if it is done before it merges.
>
> -Jeremiah
>
> On Mar 24, 2023, at 9:38 AM, Josh McKenzie  wrote:
>
> I would like to propose a partial freeze of 5.0 in June
>
> My .02:
> +1 to:
> * partial freeze on an agreed upon date w/agreed upon other things that
> can optionally go in after
> * setting a hard limit on when we ship from that frozen branch regardless
> of whether the features land or not
>
> -1 to:
> * ever feature freezing trunk again. :)
>
> I worry about the labor involved with having very large work like this
> target a frozen branch and then also needing to pull it up to trunk. That
> doesn't sound fun.
>
> If we resurrected the discussion about cutting alpha snapshots from trunk,
> would that change people's perspectives on the weight of this current
> decision? We'd probably also have to re-open Pandora's box talking about
> the solidity of our APIs on trunk as well if we positioned those alphas as
> being stable enough to start prototyping and/or building future
> applications against.
>
> On Fri, Mar 24, 2023, at 9:59 AM, Brandon Williams wrote:
>
> I am +1 on a 5.0 branch freeze.
>
> Kind Regards,
> Brandon
>
> On Fri, Mar 24, 2023 at 8:54 AM Benjamin Lerer  wrote:
> >>
> >> Would that be a trunk freeze, or freeze of a cassandra-5.0 branch?
> >
> >
> > I was thinking of a cassandra-5.0 branch freeze. So branching 5.0 and
> > allowing only CEP-15 and 21 + bug fixes there.
> >
> > On Fri, Mar 24, 2023 at 1:55 PM Paulo Motta wrote:
> >>
> >> >  I would like to propose a partial freeze of 5.0 in June.
> >>
> >> Would that be a trunk freeze, or freeze of a cassandra-5.0 branch? I
> >> agree with a branch freeze, but not with a trunk freeze.
> >>
> >> I might work on small features after June and would be happy to delay
> >> releasing these on 5.0+, but delaying merge to trunk until 5.0 is released
> >> could be disruptive to contributors' workflows and I would prefer to avoid
> >> that if possible.
> >>
> >> On Fri, Mar 24, 2023 at 6:37 AM Mick Semb Wever  wrote:
> >>>
> >>>
> >>>> I would like to propose a partial freeze of 5.0 in June.
> >>>>
> >>>> …
> >>>>
> >>>> This partial freeze will be valid for every new feature except CEP-21
> >>>> and CEP-15.
> >>>
> >>>
> >>>
> >>> +1
> >>>
> >>> Thanks for summarising the thread this way Benjamin. This addresses my
> >>> two main concerns: letting the branch/release date slip too much into the
> >>> unknown, squeezing GA QA efforts, while putting in place exceptional
> >>> waivers for CEP-21 and CEP-15.
> >>>
> >>> I hope that in the future we will be more willing to commit to the
> >>> release train model: less concerned about "what the next release contains";
> >>> more comfortable letting big features land where they land. But this is
> >>> opinion and discussion for another day… possibly looping back to the
> >>> discussion on preview releases…
> >>>
> >>>
> >>> Do we have yet from anyone a (rough) eta on CEP-15 (post CEP-21)
> >>> landing in trunk?
> >>>
> >>>
>
>
>



Re: [DISCUSS] cep-15-accord, cep-21-tcm, and trunk

2023-03-27 Thread Henrik Ingo
Seems like this thread is the more appropriate one for Accord/TCM discussion

IMO the priority here should be:

1. Release CEP-15 as part of 5.0, this year, with or without CEP-21.
2. Minimize work arising from porting between branches. (e.g. first onto
CEP-21, then to trunk, or vice versa. But also then between 5.0 and trunk)
3. Minimize work arising from temporary solutions that support goal #1


If CEP-21 is the major source of uncertainty right now, then we should
merge CEP-15 to trunk independently. If CEP-15 depending on CEP-21 just
means an additional month (or two) delay, then that's fine and there's no
need to do additional work just to get some preview release out earlier.

henrik

On Sat, Mar 25, 2023 at 4:17 AM Caleb Rackliffe wrote:

> I agree there’s little point in litigating right now, given test stability
> (or lack thereof) in cep-21-tcm. Eventually, though, I’m more or less
> aligned w/ David in the sense that getting ourselves planted on top of TCM
> as soon as possible is a good idea.
>
> On Mar 24, 2023, at 3:04 PM, Benedict  wrote:
>
> 
> It’s not even clear such an effort would need to be different from that
> used by cep-21. The point is that there’s not much point litigating this
> now when we can keep our options open and better decouple the projects,
> since I don’t think we lose much this way.
>
> On 24 Mar 2023, at 19:58, David Capwell  wrote:
>
> Assuming we do not release with it, then yes, as we wouldn’t need to
> maintain it.  My point for this case was that I don’t feel the time cost is
> worth it. I am not -1 if someone wants to add it; I was more saying our time
> is better spent elsewhere.
>
> We currently don’t touch Transactional Metadata, we have custom logic
> (which relies on client to tell every instance in the cluster an update
> happened) that we are using right now in tests and deployed clusters.  So
> once we can integrate with Transactional Metadata, we stop relying on
> clients to tell us about topology changes… a custom/throwaway epoch
> generator would make this current process more user friendly, but we would
> need to spend time making sure it is stable when deployed on a cluster.
>
> On Mar 24, 2023, at 12:43 PM, Josh McKenzie  wrote:
>
> If this is in a release, we then need to maintain that feature, so would
> be against it.
>
> Isn't the argument that cep-21 provides this so we could just remove the
> temporary impl and point to the new facility for this generation?
>
> On Fri, Mar 24, 2023, at 3:22 PM, David Capwell wrote:
>
> the question we want to answer is whether or not we build a throwaway
> patch for linearizable epochs
>
>
> If this is in a release, we then need to maintain that feature, so would
> be against it.
>
> If this is for testing, then I would argue the current world is “fine”…
> current world is hard to use and brittle (users need to tell accord that
> the cluster changed), but if accord is rebasing on txn metadata then this
> won’t be that way long (currently blocked from doing that due to txn
> metadata not passing all tests yet).
>
> On Mar 24, 2023, at 12:12 PM, Josh McKenzie  wrote:
>
> FWIW, I'd still rather just integrate w/ TCM ASAP, avoiding integration
> risk while accepting the possible delivery risk.
>
> What does the chain of rebases against trunk look like here? cep-21-tcm
> rebase, then cep-15 on cep-21-tcm, then cep-7 on cep-21-tcm, then a race on
> whichever of 15 or 7 merge after 21 goes into trunk? Or 7 based on 15, the
> other way around...
>
> I'm not actively working on any of these branches so take my perspective
> with that appropriate grain of salt, but the coupling of these things seems
> to have its own breed of integration pain to upkeep over time
> depending on how frequently we're rebasing against trunk.
>
> the question we want to answer is whether or not we build a throwaway
> patch for linearizable epochs
>
> Do we have an informed opinion on how long we think this would take? Seems
> like that'd help clarify whether or not there are contributors with the
> bandwidth and desire to even do that or whether everyone depending on
> cep-21 is our option.
>
> On Fri, Mar 24, 2023, at 1:30 PM, Caleb Rackliffe wrote:
>
> I actually did a dry run rebase of cep-15-accord on top of cep-21-tcm
> here: https://github.com/apache/cassandra/pull/2227
>
> It wasn't too terrible, and I was actually able to get the main CQL-based
> Accord tests working as long as I disabled automatic forwarding of CAS and
> SERIAL read operations to Accord. The bigger issue was test stability in
> cep-21-tcm. I'm sure that will mature quickly here, and I created a few
> issues to attach to the Transactional Metadata epic.
>
> In the meantime, I rebased cep-15-accord on trunk at
> commit 3eb605b4db0fa6b1ab67b85724a9cfbf00aae7de. The option to finish the
> remaining bits of CASSANDRA-18196 and merge w/o TCM
> is still availab

Re: [DISCUSS] cep-15-accord, cep-21-tcm, and trunk

2023-03-27 Thread Benedict
I don’t know that TCM is a greater source of uncertainty. There is a degree of uncertainty about that :)

I just think it’s better not to compound uncertainties, at least while it is not costly to avoid it.

On 27 Mar 2023, at 08:27, Henrik Ingo wrote:

> Seems like this thread is the more appropriate one for Accord/TCM discussion
>
> IMO the priority here should be:
>
> 1. Release CEP-15 as part of 5.0, this year, with or without CEP-21.
> 2. Minimize work arising from porting between branches. (e.g. first onto
> CEP-21, then to trunk, or vice versa. But also then between 5.0 and trunk)
> 3. Minimize work arising from temporary solutions that support goal #1
>
> If CEP-21 is the major source of uncertainty right now, then we should
> merge CEP-15 to trunk independently. If CEP-15 depending on CEP-21 just
> means an additional month (or two) delay, then that's fine and there's no
> need to do additional work just to get some preview release out earlier.
>
> henrik

Re: [DISCUSS] cep-15-accord, cep-21-tcm, and trunk

2023-03-27 Thread Caleb Rackliffe
Minimizing uncertainty is a nice abstract goal. What I worry about is that we ultimately create more of it (and more work/thrashing for ourselves) by not basing Accord on TCM at the earliest responsible moment.

Again, although I created this thread, the state of the world is telling me a decision doesn’t need to be made quite yet.

On Mar 27, 2023, at 5:29 AM, Benedict wrote:

> I don’t know that TCM is a greater source of uncertainty. There is a
> degree of uncertainty about that :)
>
> I just think it’s better not to compound uncertainties, at least while it
> is not costly to avoid it.

Re: [EXTERNAL] Re: Cassandra CI Status 2023-01-07

2023-03-27 Thread Josh McKenzie
I'll take build lead for the next 2 weeks.

On Sat, Mar 25, 2023, at 4:50 PM, Mick Semb Wever wrote:
>> Here comes Cassandra CI status for  2023-3-13 - 2023-23-179 :
>> 
>> *** CASSANDRA-18338
>> - dtest.bootstrap_test.TestBootstrap.test_cleanup on trunk
>> *** CASSANDRA-18338
>> - test: org.apache.cassandra.distributed.test.ByteBuddyExamplesTest.countTest;
>>   this failed twice with JDK 8 and JDK 11, on trunk and 4.1
>> The others are timeout exceptions.
> 
> 
> New failures from Week 12
> *** A possible regression from CASSANDRA-18328 on 2.x to 3.x dtest upgrades
> 
> otherwise all test failures are timeouts.
> 
> We need volunteers for Build Lead in the weeks ahead.
> 
> 
> 

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-27 Thread James Berragan
On the Sidecar discussion, while Sidecar is the preferred mechanism for the 
reasons described, the API is generic enough to plug in a user 
implementation (essentially provide a list of SSTables for a token range, and 
a mechanism to open an InputStream on any SSTable file component). A user could, 
for example, easily read from backup snapshots on a blob store.
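
A rough sketch of the shape such a pluggable data source might take, written in Python for brevity. Every name here (`DataSource`, `SSTableRef`, `list_sstables`, `open_component`, `InMemorySnapshotSource`) is hypothetical and not taken from the CEP or the Sidecar code; it only illustrates the two operations described above: list the SSTables covering a token range, and open a byte stream on any file component.

```python
import io
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import BinaryIO, Dict, List, Tuple


@dataclass(frozen=True)
class SSTableRef:
    """Hypothetical handle: an SSTable name plus the token range it covers."""
    name: str
    first_token: int
    last_token: int


class DataSource(ABC):
    """Hypothetical plug-in point: anything that can list and open SSTables."""

    @abstractmethod
    def list_sstables(self, start_token: int, end_token: int) -> List[SSTableRef]:
        """Return SSTables whose token range overlaps [start_token, end_token]."""

    @abstractmethod
    def open_component(self, sstable: SSTableRef, component: str) -> BinaryIO:
        """Open a readable byte stream on one file component, e.g. 'Data.db'."""


class InMemorySnapshotSource(DataSource):
    """Stand-in for a backup snapshot on a blob store, backed by a dict."""

    def __init__(self, blobs: Dict[Tuple[str, str], bytes],
                 sstables: List[SSTableRef]):
        self._blobs = blobs          # (sstable name, component) -> raw bytes
        self._sstables = sstables

    def list_sstables(self, start_token, end_token):
        # Keep only SSTables whose inclusive token range overlaps the request.
        return [s for s in self._sstables
                if s.first_token <= end_token and s.last_token >= start_token]

    def open_component(self, sstable, component):
        return io.BytesIO(self._blobs[(sstable.name, component)])
```

A reader written against `DataSource` would not need to know whether the bytes come from the Sidecar, a local directory, or a blob store; that is the only design point this sketch is meant to show.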

> On Mar 26, 2023, at 1:04 PM, Josh McKenzie  wrote:
> 
> I want to second what Yifan's spoken to, specifically in terms of resource 
> isolation and availability.
> 
> While the sidecar hasn't seen a ton of traffic and contributions since the 
> acceptance into the project and clearance of CEP-1, my intuition is that 
> that's due to the entrenched maturity of alternative sidecars out there since 
> we were slow as a project to build one, not out of a lack of demand for a 
> fully fleshed out sidecar. As functionality shows up in the ASF C* Sidecar, 
> there's going to be tension as operators are incentivized to run both their 
> bespoke sidecars and the ASF C* one alongside each other. That's to be 
> expected and a necessary pain to take on during a transition that I 
> personally think is sorely needed.
> 
> Having bulk operations for analytics and for reading and writing SSTables is 
> a pretty compelling carrot, and the more folks we can get running the sidecar 
> and the more contributors active on it, the more we can expect to see 
> interest and work show up there (repair coordination, REST APIs, etc. - all 
> of which we've talked about before on ML or slack).
> 
> So I'm a strong +1 to it living in the sidecar.
> 
> On Sat, Mar 25, 2023, at 11:05 AM, Brandon Williams wrote:
>> Oh, that's significantly different and great news, please do!  Thanks
>> for the clarification, Doug!
>> 
>> Kind Regards,
>> Brandon
>> 
>> On Fri, Mar 24, 2023 at 4:42 PM Doug Rohrer wrote:
>> >
>> > I agree that the analytics library will need to support vnodes. To be 
>> > clear, there’s nothing preventing the solution from working with vnodes 
>> > right now, and no assumptions about a 1:1 topology between a token and a 
>> > node. However, we don’t, today, have the ability to test vnode support 
>> > end-to-end. We are working towards that, however, and should be able to 
>> > remove the caveat from the released analytics library once we can properly 
>> > test vnode support.
>> > If it helps, I can update the CEP to say something more like “Caveat: 
>> > Currently untested with vnodes - work is ongoing to remove this 
>> > limitation”.
>> >
>> > Doug
>> >
>> > > On Mar 24, 2023, at 11:43 AM, Brandon Williams wrote:
>> > >
>> > > On Fri, Mar 24, 2023 at 10:39 AM Jeremiah D Jordan wrote:
>> > >>
>> > >> I have concerns with the majority of this being in the sidecar and not 
>> > >> in the database itself.  I think it would make sense for the server 
>> > >> side of this to be a new service exposed by the database, not in the 
>> > >> sidecar.  That way it can properly integrate with the 
>> > >> authentication and authorization apis, and to make it a first class 
>> > >> citizen in terms of having unit/integration tests in the main DB 
>> > >> ensuring no one breaks it.
>> > >
>> > > I don't think this can/should happen until it supports the database's
>> > > default configuration with vnodes.
>> >



Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-27 Thread Jeremy Hanna
Thank you for the write-up and the efforts on CASSANDRA-16222.  It sounds like 
you've been using this for some time.  I understand from the rejected 
alternatives that the Spark Cassandra Connector was slower because it goes 
through the read and write path for C* rather than this backdoor mechanism.  In 
your experience using this, under what circumstances have you seen that this 
tool is not a good fit for analytics - such as complex predicates?  The 
challenge with the Spark Cassandra Connector and previously the Hadoop 
integration is that it had to do full table scans even to get small amounts of 
data.  It sounds like this is similar in that it has to do a full table scan 
but with the advantage of being faster and less load on the cluster.  In other 
words, I'm asking if this has been a replacement for the Spark Cassandra 
Connector or if there are cases in your work where SCC is a better fit.

Also to Benjamin's point in the comments on the CEP itself, how coupled is this 
to internals?  Are there going to be higher level APIs or is it going to call 
internal storage classes directly?

Thanks!

Jeremy


> On Mar 23, 2023, at 12:33 PM, Doug Rohrer  wrote:
> 
> Hi everyone,
> 
> Wiki: 
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics
> 
> We’d like to propose this CEP for adoption by the community.
> 
> It is common for teams using Cassandra to find themselves looking for a way 
> to interact with large amounts of data for analytics workloads. However, 
> Cassandra’s standard APIs aren’t designed for large scale data egress/ingest 
> as the native read/write paths weren’t designed for bulk analytics.
> 
> We’re proposing this CEP for this exact purpose. It enables the 
> implementation of custom Spark (or similar) applications that can either read 
> or write large amounts of Cassandra data at line rates, by accessing the 
> persistent storage of nodes in the cluster via the Cassandra Sidecar.
> 
> This CEP proposes new APIs in the Cassandra Sidecar and a companion library 
> that allows deep integration into Apache Spark that allows its users to bulk 
> import or export data from a running Cassandra cluster with minimal to no 
> impact to the read/write traffic.
> 
> We will shortly publish a branch with code that will accompany this CEP to 
> help readers understand it better.
> 
> As a reminder, please keep the discussion here on the dev list vs. in the 
> wiki, as we’ve found it easier to manage via email.
> 
> Sincerely,
> 
> Doug Rohrer & James Berragan



Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-27 Thread James Berragan
Complex predicates on non-partition keys naturally require pulling the entire 
data set into the Spark DataFrame to perform the query. We have some 
optimizations around column filtering and partition key predicates, utilizing 
the Filter.db/Summary.db/Index.db files to only read the data it needs. We have 
chatted to Caleb about utilizing the index file for SAIs but at present it is 
purely theoretical.
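
To make the partition key optimization concrete, here is a minimal sketch in Python of the general idea: consult a cheap per-SSTable filter and index before touching the data file. It is a toy under stated assumptions, not the library's code. `ToyBloomFilter`, `ToySSTable` and `read_partition` are hypothetical names, and the structures only stand in for the roles of Filter.db, Index.db and Data.db; they do not follow Cassandra's on-disk formats.

```python
import hashlib
from typing import Dict, List, Optional, Tuple


class ToyBloomFilter:
    """Tiny Bloom filter; NOT Cassandra's Filter.db format."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key: bytes) -> List[int]:
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        return [
            int.from_bytes(hashlib.sha256(bytes([i]) + key).digest()[:8], "big")
            % self.num_bits
            for i in range(self.num_hashes)
        ]

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))


class ToySSTable:
    """Hypothetical SSTable: a filter, a key -> (offset, length) index, a blob."""

    def __init__(self, rows: Dict[bytes, bytes]):
        self.filter = ToyBloomFilter()
        self.index: Dict[bytes, Tuple[int, int]] = {}
        chunks, offset = [], 0
        for key, value in rows.items():
            self.filter.add(key)
            self.index[key] = (offset, len(value))
            chunks.append(value)
            offset += len(value)
        self.data = b"".join(chunks)
        self.data_reads = 0  # counts how often the data blob is touched

    def read(self, key: bytes) -> Optional[bytes]:
        if not self.filter.might_contain(key):
            return None              # filter says the key is definitely absent
        entry = self.index.get(key)
        if entry is None:
            return None              # filter false positive, index says no
        self.data_reads += 1
        offset, length = entry
        return self.data[offset:offset + length]


def read_partition(sstables: List[ToySSTable], key: bytes) -> List[bytes]:
    """Collect the key's value from every SSTable that actually holds it."""
    return [v for v in (t.read(key) for t in sstables) if v is not None]
```

With two toy SSTables, a lookup for a key held by only one of them touches only that one's data blob, which is the effect the filter and index files are used for here.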

In terms of internals, beyond some util/serializer classes, the writer part 
depends on the CQLSSTableWriter and the reader uses the SSTableSimpleIterator 
and the CompactionIterator.

James.

> On Mar 27, 2023, at 3:06 PM, Jeremy Hanna  wrote:
> 
> Thank you for the write-up and the efforts on CASSANDRA-16222.  It sounds 
> like you've been using this for some time.  I understand from the rejected 
> alternatives that the Spark Cassandra Connector was slower because it goes 
> through the read and write path for C* rather than this backdoor mechanism.  
> In your experience using this, under what circumstances have you seen that 
> this tool is not a good fit for analytics - such as complex predicates?  The 
> challenge with the Spark Cassandra Connector and previously the Hadoop 
> integration is that it had to do full table scans even to get small amounts 
> of data.  It sounds like this is similar in that it has to do a full table 
> scan but with the advantage of being faster and less load on the cluster.  In 
> other words, I'm asking if this has been a replacement for the Spark 
> Cassandra Connector or if there are cases in your work where SCC is a better 
> fit.
> 
> Also to Benjamin's point in the comments on the CEP itself, how coupled is 
> this to internals?  Are there going to be higher level APIs or is it going to 
> call internal storage classes directly?
> 
> Thanks!
> 
> Jeremy
> 
> 
>> On Mar 23, 2023, at 12:33 PM, Doug Rohrer  wrote:
>> 
>> Hi everyone,
>> 
>> Wiki: 
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics
>> 
>> We’d like to propose this CEP for adoption by the community.
>> 
>> It is common for teams using Cassandra to find themselves looking for a way 
>> to interact with large amounts of data for analytics workloads. However, 
>> Cassandra’s standard APIs aren’t designed for large scale data egress/ingest 
>> as the native read/write paths weren’t designed for bulk analytics.
>> 
>> We’re proposing this CEP for this exact purpose. It enables the 
>> implementation of custom Spark (or similar) applications that can either 
>> read or write large amounts of Cassandra data at line rates, by accessing 
>> the persistent storage of nodes in the cluster via the Cassandra Sidecar.
>> 
>> This CEP proposes new APIs in the Cassandra Sidecar and a companion library 
>> that allows deep integration into Apache Spark that allows its users to bulk 
>> import or export data from a running Cassandra cluster with minimal to no 
>> impact to the read/write traffic.
>> 
>> We will shortly publish a branch with code that will accompany this CEP to 
>> help readers understand it better.
>> 
>> As a reminder, please keep the discussion here on the dev list vs. in the 
>> wiki, as we’ve found it easier to manage via email.
>> 
>> Sincerely,
>> 
>> Doug Rohrer & James Berragan
>