Re: [DISCUSS] CEP-44: Kafka integration for Cassandra CDC using Sidecar

Mick Semb Wever Mon, 30 Sep 2024 10:59:07 -0700

Agree with Jon, Josh and Patrick here.

This is the type of hidden subproject that will get us into trouble with
the board/foundation.   I'm sure it's getting enough committer eyeballs,
and some PMC oversight, but maybe not enough.  Addressing the more material
points that Jon mentions is the best way to deal with that IMHO.




On Mon, 30 Sept 2024 at 20:37, Jon Haddad <[email protected]> wrote:

> I think it depends on what lens you're looking at the sidecar through.
>
> If you're actively working on it, and pulling it into your own infra,
> sure.  It's a thing.
>
> If you're an outsider?  I have a hard time seeing it.
>
> - No documentation as to what it does
> - No releases
> - No build instructions
> - Trying to build using standard gradle commands fails [1]
> - Included configs don't work out of the box. [2][3]
> - CEP-1 looks abandonded
> - The primary reason right now to use it looks to be analytics library,
> which doesn't work for most teams due to lack of vnode support [4]
>
> I think if you were to take a poll of 100 users outside this ML, I'd bet
> almost every one would agree the sidecar isn't a thing yet, and that's
> probably more important than if it's actually getting worked on.  I think
> it has quite a ways to go before it looks to be more than an idea that's
> incubating.
>
> [1] https://issues.apache.org/jira/browse/CASSANDRASC-120
> [2 https://issues.apache.org/jira/browse/CASSANDRASC-121
> [3] https://issues.apache.org/jira/browse/CASSANDRASC-122
> [4] https://issues.apache.org/jira/browse/CASSANDRA-19594
>
>
> On Mon, Sep 30, 2024 at 11:14 AM Josh McKenzie <[email protected]>
> wrote:
>
>> The CEP for the sidecar has stalled. The sidecar itself is very much
>> alive and a thing.
>>
>> CEP != artifact.
>>
>> We should definitely clean that up though.
>>
>> On Mon, Sep 30, 2024, at 10:59 AM, Dinesh Joshi wrote:
>>
>> Patrick, could you please elaborate? The Sidecar has been a thing for a
>> while now.
>>
>> On Mon, Sep 30, 2024 at 7:51 AM Patrick McFadin <[email protected]>
>> wrote:
>>
>> I made the mistake of asking two things in one email.
>>
>> First thing I asked. Sidecar? Stalled CEP so why is this being talked
>> about like this is a thing?
>>
>> On Mon, Sep 30, 2024 at 7:21 AM Benedict <[email protected]> wrote:
>>
>>
>> Sorry Bernardo, you may have misunderstood me. I don’t have any concerns,
>> I was suggesting a possible future scenario where CDC for Kafka via sidecar
>> is changed to use a hypothetical future topic subscription service provided
>> by C*. It was meant to show that this CEP may be easily decoupled from any
>> future evolution in this area.
>>
>>
>> On 30 Sep 2024, at 14:58, Bernardo Botella <[email protected]>
>> wrote:
>>
>> Thanks everyone for the comments.
>>
>>
>> Patrick:
>> The proposal includes a “best effort” approach for deduplication (some
>> details can be found on the Digest class comments on the PR here
>> https://github.com/apache/cassandra-analytics/pull/87/files#diff-3a09caecc1da13419d92cde56a7cfc7d253faac08182e6c2768b3d32c015de82R185-R193
>>  ).
>> That alone won’t eliminate all the duplicates, but as Josh points out, it
>> moves the line to something way easier to handle for consumers, and
>> definitely on the direction we should aim towards. Accord is definitely
>> something this contribution will benefit from, that will move that line
>> even further.
>>
>> Benedict:
>> If I understand it correctly, your concern is that Kafka is somewhat the
>> hardcoded option for a CDC stream being published? The proposal introduces
>> a concept of data sources and sinks (
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=323488575#CEP44:KafkaintegrationforCassandraCDCusingSidecar-SourcesandSinks)
>> being kafka the first implemented data sink. That means that the actual
>> Kafka output should (will) be something pluggable.
>>
>>
>>
>> On Sep 30, 2024, at 5:43 AM, Josh McKenzie <[email protected]> wrote:
>>
>> I don't see much on how this would be handled other than "left to the end
>> user to figure out."
>>
>> My immediate thought when I read that was "Yes. But it's moving where we
>> draw the line of 'left to the end user to figure out' *much further* than
>> it was before".
>>
>> This should only be necessary in edge cases w/extended severe degraded
>> availability where you can't hit QUORUM w/this design. So we go from
>> "De-dupe literally everything o ye' user" to "de-dupe a small fraction of a
>> % of the time when things really go off the rails".
>>
>> It still leaves the burden of processing potential duplicates downstream,
>> so some *complexity* burden on the users remains if they have no
>> tolerance for processing duplicate messages, however the underlying machine
>> resource utilization (from "dedupe everything" to "dedupe a small % of
>> things") is pretty massively shifted by this design change. That, and using
>> the hash of the mutation the way the extended design does is something a
>> downstream consumer could also do on their side to ensure anything that
>> came in past the drop-off window wasn't already seen. So not *too* painful;
>> certainly a vast improvement over the status quo.
>>
>> As to TCM and Accord: absolutely agree. I'd love to see a world where we
>> Accord everything and fire off CDC to subscribers from a coordinator
>> bypassing all this LSM-bastardized post-processing for CDC for instance.
>> That said, this is a functionality users needed back in... 2016? When we
>> first added CDC. So I think it's worth it to move on it now while retaining
>> architectural options to move to updated metadata and transactions as they
>> mature (obviously we'll lean on TCM since it's in 5.0 / trunk right now;
>> more applies to the accord bit).
>>
>> On Mon, Sep 30, 2024, at 3:20 AM, Benedict wrote:
>>
>>
>> Yes, with accord it should be fairly easy to have reliable no-dupe log
>> streaming without an elected leader. Given the broad set of use cases, I
>> can imagine supporting some more native topic subscription API in C* rather
>> than requiring Kafka, so perhaps any integration of Kafka with the sidecar
>> can be considered a separate parallel effort, that might eventually
>> implement itself with this C* feature whenever it materialises?
>>
>>
>> On 30 Sep 2024, at 03:42, Jeff Jirsa <[email protected]> wrote:
>>
>> 
>>
>> Transactional metadata and Accord should make it MUCH easier to do
>> duplication avoiding CDC (and I was going to note that someone should
>> ensure that the interfaces exposed to the public are stable enough not to
>> change the published api once those exist)
>>
>>
>>
>> On Sep 29, 2024, at 7:04 PM, Patrick McFadin <[email protected]> wrote:
>>
>> 
>> As I was reviewing this, it occurred to me that it was talking about
>> Sidecar like it was a thing but that CEP has been stalled for quite some
>> time:
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95652224
>>
>> If work on this is being done, should we get this official and wrapped up?
>>
>> On to the proposal...
>>
>> This has been a topic on the project for over 10 years now. I've seen
>> multiple goes at making this work and the issue that always turns out to
>> torpedo the project is handing dupes. To the point that they go from a
>> generalized Kafka producer engine to something specific to a particular use
>> case. I don't see much on how this would be handled other than "left to the
>> end user to figure out."
>>
>> There is also little mention of where the increased resource load would
>> be handled.
>>
>> This has been discussed many times before, but is it time to introduce
>> the concept of an elected leader for a token range for this type of
>> operation? It would eliminate a ton of problems that need to managed when
>> bridging c* to a system like Kafka. Last time it was discussed in earnest
>> was for KIP-30:
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-30+-+Allow+for+brokers+to+have+plug-able+consensus+and+meta+data+storage+sub+systems
>>
>>
>> Patrick
>>
>> On Sat, Sep 28, 2024 at 11:44 AM Jon Haddad <[email protected]>
>> wrote:
>>
>> Yes! I’m really looking forward to trying this out. The CEP looks really
>> well thought out. I think this will make CDC a lot more useful for a lot of
>> teams.
>> Jon
>>
>>
>> On Fri, Sep 27, 2024 at 4:23 PM Josh McKenzie <[email protected]>
>> wrote:
>>
>>
>> Really excited to see this hit the ML James.
>>
>> As author of the base CDC (get your stones ready for throwing :D) and
>> someone moderately involved in the CEP here, definitely welcome any
>> questions. CDC is a *thorny* *problem *in a multi-replica distributed
>> system like this.
>>
>> On Fri, Sep 27, 2024, at 5:40 PM, James Berragan wrote:
>>
>> Hi everyone,
>>
>> Wiki:
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-44%3A+Kafka+integration+for+Cassandra+CDC+using+Sidecar
>>
>> We would like to propose this CEP for adoption by the community.
>>
>> CDC is a common technique in databases but right now there is no
>> out-of-the-box solution to do this easily and at scale with Cassandra. Our
>> proposal is to build a fully-fledged solution into the Apache Cassandra
>> Sidecar. This comes with a number of benefits:
>> - Sidecar is an official part of the existing Cassandra eco-system.
>> - Sidecar runs co-located with Cassandra instances and so scales with the
>> cluster size.
>> - Sidecar can access the underlying Cassandra database to store CDC
>> configuration and the CDC state in a special table.
>> - Running in the Sidecar does not require additional external resources
>> to run.
>>
>> The core CDC module we anticipate will be pluggable and re-usable, it is
>> available for review here:
>> https://github.com/apache/cassandra-analytics/pull/87. The remaining
>> Sidecar code will follow.
>>
>> As a reminder, please keep the discussion here on the dev list vs. in the
>> wiki, as we’ve found it easier to manage via email.
>>
>> Sincerely,
>> James Berragan
>> Bernardo Botella Corbi
>> Yifan Cai
>> Jyothsna Konisa
>>
>>
>>

Re: [DISCUSS] CEP-44: Kafka integration for Cassandra CDC using Sidecar

Reply via email to