Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

Jeff Jirsa Tue, 28 Mar 2023 07:36:42 -0700

On Tue, Mar 28, 2023 at 7:30 AM Jeremiah D Jordan <jeremiah.jor...@gmail.com>
wrote:


> - Resource isolation. Having the said service running within the same JVM
> may negatively impact Cassandra storage's performance. It could be more
> beneficial to have them in Sidecar, which offers strong resource isolation
> guarantees.
>
>
> How does having this in a side car change the impact on “storage
> performance”?  The side car reading sstables will have the same impact on
> storage IO as the main process reading sstables.
>

This is true.


>  Given the sidecar is running on the same node as the main C* process, the
> only real resource isolation you have is in heap/GC?  CPU/Memory/IO are all
> still shared between the main C* process and the side car, and coordinating
> those across processes is harder than coordinating them within a single
> process. For example if we wanted to have the compaction throughput,
> streaming throughput, and analytics read throughput all tied back to a
> single disk IO cap, that is harder with an external process.
>
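
To make the "single disk IO cap" point concrete: within one process, the
three consumers can draw from one in-memory budget. The sketch below is a
minimal token bucket under that assumption. The names SharedIoBudget and
try_acquire are hypothetical and are not Cassandra's actual rate limiter.

```python
import threading
import time


class SharedIoBudget:
    """Token bucket shared by several consumers inside one process.

    Hypothetical illustration only; not an existing Cassandra class.
    """

    def __init__(self, bytes_per_sec, clock=time.monotonic):
        self.rate = float(bytes_per_sec)
        self.capacity = float(bytes_per_sec)  # allow at most one second of burst
        self.tokens = self.capacity
        self.clock = clock
        self.last = clock()
        self.lock = threading.Lock()

    def try_acquire(self, nbytes):
        """Debit nbytes and return True if the budget allows it, else False."""
        with self.lock:
            now = self.clock()
            # Refill in proportion to elapsed time, capped at the bucket size.
            refill = (now - self.last) * self.rate
            self.tokens = min(self.capacity, self.tokens + refill)
            self.last = now
            if nbytes <= self.tokens:
                self.tokens -= nbytes
                return True
            return False
```

Compaction, streaming, and analytics reads would all call try_acquire on the
same instance. With a separate sidecar process, the equivalent state would
have to be shared through some form of IPC, which is the harder coordination
described above.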

Relatively trivial, for CPU and memory, to run them in different
containers/cgroups/etc, so you can put an exact cpu/memory limit on the
sidecar. That's different from a jmx rate limiter/throttle, but (arguably)
more precise, because it actually limits the underlying physical resource
instead of a proxy for it in a config setting.
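
As a rough illustration of the cgroup approach, the Python sketch below
computes the values one would write into the cgroup v2 cpu.max and memory.max
files for a sidecar. The function name and the example limits are
assumptions; how the cgroup is created and where the sidecar is placed
depends on the deployment.

```python
def cgroup_v2_limits(cpus, memory_gib, period_us=100_000):
    """Return the strings to write into cgroup v2 cpu.max and memory.max.

    cpu.max takes "<quota_us> <period_us>"; memory.max takes a byte count.
    The function name and defaults are illustrative assumptions.
    """
    quota_us = int(cpus * period_us)
    return {
        "cpu.max": f"{quota_us} {period_us}",
        "memory.max": str(int(memory_gib * 1024 ** 3)),
    }


if __name__ == "__main__":
    # Example: cap a hypothetical sidecar cgroup at 2 CPUs and 4 GiB.
    for name, value in cgroup_v2_limits(2, 4).items():
        print(f"{name}: {value}")
```

These limits are enforced by the kernel on the process group itself, which is
the sense in which they act on the underlying resource rather than on a
configured throttle.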



>
> - Complexity. Considering the existence of the Sidecar project, it would
> be less complex to avoid adding another (http?) service in Cassandra.
>
>
> Not sure that is really very complex; running an http service is pretty
> easy?  We already have netty in use to instantiate one from.
> I worry more about the complexity of having the matching schema for a set
> of sstables being read.  The complexity of new sstable versions/formats
> being introduced.  The complexity of having up to date data from memtables
> being considered by this API without having to flush before every query of
> it.  The complexity of dealing with the new memtable API introduced in
> CEP-11.  The complexity of coordinating compaction/streaming adding and
> removing files with these APIs reading them.  There are a lot of edge cases
> to consider for this external access to sstables that the main process
> considers itself the “owner” of.
>
> All of this is not to say that I think separating things out into other
> processes/services is bad.  But I think we need to be very careful with how
> we do it, or end users will end up running into all the sharp edges and the
> feature will fail.
>
> -Jeremiah
>
> On Mar 24, 2023, at 8:15 PM, Yifan Cai <yc25c...@gmail.com> wrote:
>
> Hi Jeremiah,
>
> There are good reasons to not have these inside Cassandra. Consider the
> following.
> - Resource isolation. Having the said service running within the same JVM
> may negatively impact Cassandra storage's performance. It could be more
> beneficial to have them in Sidecar, which offers strong resource isolation
> guarantees.
> - Availability. If the Cassandra cluster is being bounced, using sidecar
> would not affect the SBR/SBW functionality, e.g. SBR can still read
> SSTables via sidecar endpoints.
> - Compatibility. Sidecar provides stable REST-based APIs, such as the
> SSTable upload endpoint, which would remain compatible with different
> versions of Cassandra. The current implementation supports versions 3.0 and
> 4.0.
> - Complexity. Considering the existence of the Sidecar project, it would
> be less complex to avoid adding another (http?) service in Cassandra.
> - Release velocity. Sidecar, as an independent project, can have a quicker
> release cycle than Cassandra.
> - The features in sidecar are mostly implemented based on various existing
> tools/APIs exposed from Cassandra, e.g. ring, commit sstable, snapshot, etc.
>
> Regarding authentication and authorization
> - We will add it as a follow-on CEP in Sidecar, but we don't want to hold
> up this CEP. It would be a feature that benefits all Sidecar endpoints.
>
> - Yifan
>
> On Fri, Mar 24, 2023 at 2:43 PM Doug Rohrer <droh...@apple.com> wrote:
>
>> I agree that the analytics library will need to support vnodes. To be
>> clear, there’s nothing preventing the solution from working with vnodes
>> right now, and no assumptions about a 1:1 topology between a token and a
>> node. However, we don’t, today, have the ability to test vnode support
>> end-to-end. We are working towards that, and should be able to
>> remove the caveat from the released analytics library once we can properly
>> test vnode support.
>> Would it help if I updated the CEP to say something more like “Caveat:
>> Currently untested with vnodes - work is ongoing to remove this limitation”?
>>
>> Doug
>>
>> > On Mar 24, 2023, at 11:43 AM, Brandon Williams <dri...@gmail.com> wrote:
>> >
>> > On Fri, Mar 24, 2023 at 10:39 AM Jeremiah D Jordan
>> > <jeremiah.jor...@gmail.com> wrote:
>> >>
>> >> I have concerns with the majority of this being in the sidecar and not
>> >> in the database itself.  I think it would make sense for the server side
>> >> of this to be a new service exposed by the database, not in the sidecar.
>> >> That way it can properly integrate with the authentication and
>> >> authorization apis, and be a first class citizen in terms of having
>> >> unit/integration tests in the main DB ensuring no one breaks it.
>> >
>> > I don't think this can/should happen until it supports the database's
>> > default configuration with vnodes.
>>
>>
>
