Sure, that works for me. From: Patrick McFadin <pmcfa...@gmail.com> Date: Wednesday, 22 September 2021 at 04:47 To: dev@cassandra.apache.org <dev@cassandra.apache.org> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions I would be happy to host a Zoom as I've done in the past. I can post a transcript and the recording after the call.
Instead of right after your talk Benedict, maybe we can set a time for next week and let everyone know the time? Patrick On Mon, Sep 20, 2021 at 11:05 AM bened...@apache.org <bened...@apache.org> wrote: > Hi Joey, > > Thanks for the feedback and suggestions. > > > I was wondering what do you think about having some extended Q&A after > your ApacheCon talk Wednesday > > I would love to do this. I’ll have to figure out how though – my > understanding is that I have a hard 40m for my talk and any Q&A, and I > expect the talk to occupy most of those 40m as I try to cover both > CEP-14 and CEP-15. I’m not sure what facilities are made available by > Hopin, but if necessary we can perhaps post some external video chat link? > > The time of day is also a question, as I think the last talk ends at > 9:20pm local time. But we can make that work if necessary. > > > It might help to have a diagram (perhaps I can collaborate with you > on this?) > > I absolutely agree. This is something I had planned to produce but it’s > been a question of time. In part I wanted to ensure we published long in > advance of ApacheCon, but now also with CEP-10, CEP-14 and CEP-15 in flight > it’s hard to get back to improving the draft. If you’d be interested in > collaborating on this that would be super appreciated, as this would > certainly help the reader. > > >I think that WAN is always paid during the Consensus Protocol, and then > in most cases execution can remain LAN except in 3+ datacenters where I > think you'd have to include at least one replica in a neighboring > datacenter… > > As designed the only WAN cost is consensus as Accord ensures every replica > receives a complete copy of every transaction, and is aware of any gaps. If > there are gaps there may be WAN delays as those are filled in.
This might > occur because of network outages, but is most likely to occur when > transactions are being actively executed by multiple DCs at once – in which > case there’ll be one further unidirectional WAN latency during execution > while the earlier transaction disseminates its result to the later > transaction(s). There are other similar scenarios we can discuss, e.g. if a > transaction takes the slow path and will execute after a transaction being > executed in another DC, that remote transaction needs to receive this > notification before executing. > > There might potentially be some interesting optimisations to make in > future, where with many queued transactions a single DC may nominate itself > to execute all outstanding queries and respond to the remote DCs that > issued them so as to eliminate the WAN latency for disseminating the result > of each transaction. But we’re getting way ahead of ourselves there 😊 > > There’s also no LAN cost on write, at least for responding to the client. > If there is a dependent transaction within the same DC then (as in the > above case) there will be a LAN penalty for the second transaction to > execute. > > > Relatedly I'm curious if there is any way that the client can > acquire the timestamp used by the transaction before sending the data > so we can make the operations idempotent and unrelated to the > coordinator that was executing them as the storage nodes are > vulnerable to disk and heap failure modes which makes them much more > likely to enter grey failure (slow). Alternatively, perhaps it would > make sense to introduce a set of optional dedicated C* nodes for > reaching consensus that do not act as storage nodes so we don't have > to worry about hanging coordinators (join_ring=false?)? > > So, in principle coordination can be performed by any node on the network > including a client – though we’d need to issue the client a unique id; this > can be done cheaply on joining.
This might be something to explore in > future, though there are downsides to having more coordinators too (more > likely to fail, and stall further transactions that depend on transactions > it is coordinating). > > However, with respect to idempotency, I expect Accord not to perpetuate > the problems of LWTs where the result of an earlier query is unknown. At > least success/fail will be maintained in a distributed fashion for some > reasonable time horizon, and there will also be protection against zombie > transactions (those proposed to a node that went into a failure spiral > before reaching healthy nodes, that somehow regurgitates it hours or days > later), so we should be able to provide practical precisely-once semantics > to clients. > > Whether this is done with a client-provided timestamp, or simply some > other arbitrary client-provided id that can be utilised to deduplicate > requests or query the status of a transaction is something we can explore > later. This is something we should explore in a dedicated discussion as > development of Accord progresses. > > > Should Algorithm 1 line 12 be PreAcceptOK from Et (not Qt) or should > line 2 read Qt instead of Et? > > So, technically as it reads today I think it’s correct. For Line 2 there > is always some Qt \subseteq Et. I think the problem here is that actually > there’s a bunch of valid things to do, including picking some arbitrary > subset of each rho in Pt so long as it contains some Qt. It’s hard to > convey the range of options precisely. Line 12 of course really wants to > execute only when some Ft has responded, but if no such response is > forthcoming it wants to execute on some Qt, but of course Ft \supseteq > Qt. Perhaps I should try to state the set inequalities here. I will think > about what I can do to improve the clarity, thanks. > > > It might make sense for participating members to wait for a minimum > detected clock skew before becoming eligible for electorate?
> > This is a great idea, thanks! > > > I don't really understand how temporarily down replicas will learn > of mutations they missed .. are we just leveraging some > external repair? > > Yes, precisely. Though in practice, any transaction they need to know about to > answer a Read etc., they can query a peer for. Beyond that, I expect to > deliver a real-time repair mechanism scoped (initially, at least) to Accord > transactions to ensure this happens promptly. > > > Relatedly since non-transactional reads wouldn't flow through > consensus (I hope) would it make sense for a restarting node to learn > the latest accepted time once and then be deprioritized for all reads > until it has accepted what it missed? Or is the idea that you would > _always_ read transactionally (and since it's a read only transaction > you can skip the WAN consensus and just go straight to fast path > reads)? > > I expect that tables will be marked transactional, and that every > operation that goes through them will be transactional. However I can > imagine offering weaker read semantics, particularly if you’re looking to > avoid paying the WAN price if you aren’t worried about consistency. I > haven’t really considered how we might marry the two within a table, and > I’m open to suggestions here. I expect that this dovetails with future > improvements to transactional cluster metadata. I think also in part this > kind of behaviour is limited today because repair is too unwieldy, and also > because we don’t have an “on but catching up” state. If we improve repair > for transactions the first part may be solved, and perhaps we can introduce > a new node state as part of improving our approach to cluster management. > > I could imagine having some bounded divergence in general, e.g.
I haven’t > corroborated my transaction history in Xms with a majority, or I haven’t > received Xms of the transaction history I’ve witnessed, so I’m going to > remove myself from the read set for non-transactional operations. But I > don’t envisage this landing in V1. > > * I know the paper says that we elide details of how the shards (aka > replica sets?) are chosen, but it seems that this system would have a > hard dependency on a strongly consistent shard selection system (aka > token metadata?) wouldn't it? In particular if the simple quorums > (which I interpreted to be replica sets in current C*, not sure if > that's correct) can change in non linearizable ways I don't think > Property 3.3 can hold. I think you hint at a solution to this in > section 5 but I'm not sure I grok it. > > Yes, it does. That’s something that’s in hand, and colleagues will be > reaching out to the list about in the next couple of months. I anticipate > this being a solved problem before Accord depends on it. There’s still a > bunch of complexity within Accord for applying topology changes safely > (which Section 5 nods to), but the membership decisions will be taken by > Cassandra – safely. > > > From: Joseph Lynch <joe.e.ly...@gmail.com> > Date: Monday, 20 September 2021 at 17:17 > To: dev@cassandra.apache.org <dev@cassandra.apache.org> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions > Benedict, > > Thank you very much for advancing this proposal, I'm extremely excited > to see flexible quorums used in this way and am looking forward to the > integration of Accord into Cassandra! I read the whitepaper and have a > few questions, but I was wondering what do you think about having some > extended Q&A after your ApacheCon talk Wednesday (maybe at the end of > the C* track)? It might be higher bandwidth than going back and forth > on email/slack (also given you're presenting on it that might be a > good time to discuss it)? 
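The practical precisely-once idea Benedict describes earlier in the thread, retaining success/fail outcomes against an arbitrary client-provided id for some reasonable time horizon so that retries can be answered without re-executing, might look roughly like the following Java sketch. None of these names exist in Accord; age-based eviction of old ids is elided.

```java
// Hypothetical sketch only -- none of these names exist in Accord.
// The client attaches an arbitrary unique id to each transaction, and
// the cluster retains completed outcomes for some time horizon, so a
// retried request returns the original outcome instead of re-executing.
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

final class TxnDeduplicator {
    enum Outcome { SUCCESS, FAILURE }

    // id -> outcome; age-based eviction is elided for brevity.
    private final ConcurrentHashMap<UUID, Outcome> completed = new ConcurrentHashMap<>();

    /** Returns the recorded outcome for a retried id, or null if unseen. */
    Outcome priorOutcome(UUID clientTxnId) {
        return completed.get(clientTxnId);
    }

    /** Records the first completion; a late duplicate cannot overwrite it. */
    void record(UUID clientTxnId, Outcome outcome) {
        completed.putIfAbsent(clientTxnId, outcome);
    }
}
```

In this sketch a retried request would first call priorOutcome and only execute the transaction if it returns null.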
> > Briefly > * It might help to have a diagram (perhaps I can collaborate with you > on this?) showing the happy path delay waiting in the reorder buffer > and the messages that are sent in a 2 and 3 datacenter deployment > during the PreAccept, Accept, Commit, Execute, Apply phases. In > particular it was hard for me to follow where exactly I was paying WAN > latency and where we could achieve progress with LAN only (I think > that WAN is always paid during the Consensus Protocol, and then in > most cases execution can remain LAN except in 3+ datacenters where I > think you'd have to include at least one replica in a neighboring > datacenter). In particular, it seems that Accord always pays clock > skew + WAN latency during the reorder buffer (as part of consensus) + > 2x LAN latency during execution (to read and then write). > * Relatedly I'm curious if there is any way that the client can > acquire the timestamp used by the transaction before sending the data > so we can make the operations idempotent and unrelated to the > coordinator that was executing them as the storage nodes are > vulnerable to disk and heap failure modes which makes them much more > likely to enter grey failure (slow). Alternatively, perhaps it would > make sense to introduce a set of optional dedicated C* nodes for > reaching consensus that do not act as storage nodes so we don't have > to worry about hanging coordinators (join_ring=false?)? > * Should Algorithm 1 line 12 be PreAcceptOK from Et (not Qt) or should > line 2 read Qt instead of Et? > * I think your claim about clock skew being <1ms in general is > accurate at least for AWS except for when machines boot for the first > time (I can send you some data shortly). It might make sense for > participating members to wait for a minimum detected clock skew before > becoming eligible for electorate?
> * I don't really understand how temporarily down replicas will learn > of mutations they missed; did I miss the part where a read replica > would recover all transactions between its last accepted time and > another replica's last accepted time? Or are we just leveraging some > external repair? > * Relatedly since non-transactional reads wouldn't flow through > consensus (I hope) would it make sense for a restarting node to learn > the latest accepted time once and then be deprioritized for all reads > until it has accepted what it missed? Or is the idea that you would > _always_ read transactionally (and since it's a read only transaction > you can skip the WAN consensus and just go straight to fast path > reads)? > * I know the paper says that we elide details of how the shards (aka > replica sets?) are chosen, but it seems that this system would have a > hard dependency on a strongly consistent shard selection system (aka > token metadata?) wouldn't it? In particular if the simple quorums > (which I interpreted to be replica sets in current C*, not sure if > that's correct) can change in non linearizable ways I don't think > Property 3.3 can hold. I think you hint at a solution to this in > section 5 but I'm not sure I grok it. > > Super interesting proposal and I am looking forward to all the > improvements this will bring to the project! > > Cheers, > -Joey > > On Mon, Sep 20, 2021 at 1:34 AM Miles Garnsey > <miles.garn...@datastax.com> wrote: > > > > If Accord can fulfil its aims, it sounds like a huge improvement to the > state of the art in distributed transaction processing. Congrats to all > involved in pulling the proposal together. > > > > I was holding off on feedback since this is quite in depth and I don’t > want to bike shed; I still haven’t spent as much time understanding this as > I’d like. > > > > Regardless, I’ll make the following notes in case they’re helpful.
My > feedback is more to satisfy my own curiosity and stimulate discussion than > to suggest that there are any flaws here. I applaud the proposed testing > approach and think it is the only way to be certain that the proposed > consistency guarantees will be upheld. > > > > General > > > > I’m curious if/how this proposal addresses issues we have seen when > scaling; I see reference to simple majorities of nodes - is there any plan > to ensure safety under scaling operations or DC (de)commissioning? > > > > What consistency levels will be supported under Accord? Will it simply > be a single CL representing a majority of nodes across the whole cluster? > (This at least would mitigate the issues I’ve seen when folks want to > switch from EACH_SERIAL to SERIAL). > > > > Accord > > > > > Accord instead assembles an inconsistent set of dependencies. > > > > > > Further explanation here would be good. Do we mean to say that the > dependencies may differ according to which transactions the coordinator has > witnessed at the time the incoming transaction is first seen? This would > make sense if some nodes had not fully committed a foregoing transaction. > > > > Is it correct to think of this step as assembling a dependency graph of > foregoing transactions which must be completed ahead of progressing the > incoming new transaction? > > > > Fast Path > > > > > A coordinator C proposes a timestamp t0 to at least a quorum of a fast > path electorate. If t0 is larger than all timestamps witnessed for all > prior conflicting transactions, t0 is accepted by a replica. If a fast path > quorum of responses accept, the transaction is agreed to execute at t0. > Replicas respond with the set of transactions they have witnessed that may > execute with a lower timestamp, i.e. those with a lower t0. > > > > What is t0 here? I’m guessing it is the Lamport clock time of the most > recent mutation to the partition?
May be worth clarifying because otherwise > the perception may be that it is the commencement time of the current > transaction which may not be the intention. > > > > Regarding the use of logical clocks in general - > > > > Do we have one clock-per-shard-per-node? Or is there a single clock for > all transactions on a node? > > What happens in network partitions? > > In a cross-shard transaction does maintaining simple majorities of > replicas protect you from potential inconsistencies arising when a > transaction W10 addressing partitions p1, p2 comes from a different > majority (potentially isolated due to a network partition) from earlier > writes W[1,9] to p1 only? > > It seems that this may cause a sudden change to the dependency graph for > partition p2 which may render it vulnerable to strange effects? > > Do we consider adversarial cases or any sort of Byzantine faults? > (That’s a bit out of left field, feel free to kick me.) > > Why do we prefer Lamport clocks to vector clocks or other types of > logical clock? > > > > Slow Path > > > > > This value is proposed to at least a simple majority of nodes, along > with the union of the dependencies received > > > > > > Related to the earlier point: when we say `union` here - what set are we > forming a union over? Is it a union of all dependencies t_n < t as seen by > all coordinators? I presume that the logic precludes the possibility that > these dependencies will conflict, since all foregoing transactions which > are in progress as dependencies must be non-conflicting with earlier > transactions in the dependency graph? > > > > In any case, further information about how the dependency graph is > computed would be interesting. > > > > > The inclusion of dependencies in the proposal is solely to facilitate > Recovery of other transactions that may be incomplete - these are stored on > each replica to facilitate decisions at recovery. > > > > > > Every replica? Or only those participating in the transaction?
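The fast-path rule quoted above under "Fast Path" can be sketched in a few lines, under the simplifying assumption that a timestamp is a (logical time, node id) pair compared lexicographically; that node-id tie-break is also what makes such a clock impose a total rather than merely partial order. This is an illustrative toy, not Accord's implementation: conflict detection and quorum counting are elided, and every transaction is assumed to conflict with every other.

```java
// Illustrative toy of the replica-side PreAccept check quoted above,
// NOT Accord's implementation. A timestamp is modelled as a
// (logical time, node id) pair compared lexicographically.
import java.util.ArrayList;
import java.util.List;

final class ReplicaSketch {
    record Timestamp(long time, int node) implements Comparable<Timestamp> {
        public int compareTo(Timestamp o) {
            int c = Long.compare(time, o.time);
            return c != 0 ? c : Integer.compare(node, o.node); // tie-break yields a total order
        }
    }

    record PreAcceptReply(boolean fastPathOk, List<Timestamp> lowerDeps) {}

    // Timestamps of previously witnessed transactions that conflict with
    // the incoming one (in this toy, all transactions conflict).
    private final List<Timestamp> witnessed = new ArrayList<>();

    /**
     * Accept t0 only if it is larger than every witnessed conflicting
     * timestamp; either way, report the witnessed transactions that may
     * execute with a lower timestamp.
     */
    PreAcceptReply preAccept(Timestamp t0) {
        boolean ok = witnessed.stream().allMatch(w -> w.compareTo(t0) < 0);
        List<Timestamp> deps = witnessed.stream().filter(w -> w.compareTo(t0) < 0).toList();
        witnessed.add(t0);
        return new PreAcceptReply(ok, deps);
    }
}
```

In the protocol as quoted, a coordinator takes the fast path only if a fast-path quorum of replicas replies with the equivalent of fastPathOk.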
> > > > > If C fails to reach fast path consensus it takes the highest t it > witnessed from its responses, which constitutes a simple Lamport clock > value imposing a valid total order. This value is proposed to at least a > simple majority of nodes, > > > > > > When speaking about the simple majority of nodes to whom the max(t) > value returned will be proposed - > > It sounds like this need not be the same majority from whom the original > sets of T_n and dependencies were obtained? > > Is there a proof to show that the dependencies created from the union of > the first set of replicas resolve to an acceptable dependency graph for an > arbitrary majority of replicas? (Especially given that a majority of > replicas is not a majority of nodes, given we are in a cross-shard scenario > here). > > What happens in cases where the replica set has changed due to (a) > scaling RF in a single DC (b) adding a whole new DC? > > Wikipedia <https://en.wikipedia.org/wiki/Lamport_timestamp> tells me > that Lamport clocks only impose partial, not total order. I’m guessing > we’re thinking of a different type of logical clock when we speak of > Lamport clocks here (but my expertise is sketchy on this topic). > > > > Recovery > > > > I would be interested in further exploration of the unhappy path (where > 'a newer ballot has been issued by a recovery coordinator to take over the > transaction’). I understand that this may be partially covered in the > pseudocode for `Recovery` but I’m struggling to reconcile the ’new ballot > has been issued’ language with the ‘any R in responses had X as Applied, > Committed, or Accepted’ language. > > > > Well done again and thank you for pushing the envelope in this area > Benedict. > > > > Miles > > > > > On 15 Sep 2021, at 11:33 pm, bened...@apache.org wrote: > > > > > >> I would kind of expect this work, if it pans out, to _replace_ the > current paxos implementation > > > > > > That’s a good point.
I think the clear direction of travel would be > total replacement of Paxos, but I anticipate that this will be > feature-flagged at least initially. So for some period of time we may > maintain both options, with the advanced CQL functionality disabled if you > opt for classic Paxos. > > > > > > I think this is a necessary corollary of a requirement to support live > upgrades – something that is non-negotiable IMO, but that I have also > neglected to discuss in the CEP. I will rectify this. An open question is > if we want to support live downgrades back to Classic Paxos. I kind of > expect that we will, though that will no doubt be informed by the > difficulty of doing so. > > > > > > Either way, this means the deprecation cycle for Classic Paxos is > probably a separate and future decision for the community. We could choose > to maintain it indefinitely, but I would vote to retire it the following > major version. > > > > > > A related open question is defaults – I would probably vote for new > clusters to default to Accord, and existing clusters to need to run a > migration command after fully upgrading the cluster. > > > > > > From: Sylvain Lebresne <lebre...@gmail.com> > > > Date: Wednesday, 15 September 2021 at 14:13 > > > To: dev@cassandra.apache.org <dev@cassandra.apache.org> > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions > > > Fwiw, it makes sense to me to talk about CQL syntax evolution > separately. > > > > > > It's pretty clear to me that we _can_ extend CQL to make use of a > general > > > purpose transaction mechanism, so I don't think deciding if we want a > > > general purpose transaction mechanism has to depend on deciding on the > > > syntax. Especially since the syntax question can get pretty far on its > own > > > and could be a serious upfront distraction.
> > > > > > And as you said, there are even queries that can be expressed with the > > > current syntax that we refuse now and would be able to accept with > this, so > > > those could be "ground zero" of what this work would allow. > > > > > > But outside of pure syntax questions, one thing that I don't see > discussed > > > in the CEP (or did I miss it) is what the relationship of this new > > > mechanism with the existing paxos implementation would be? I would > kind of > > > expect this work, if it pans out, to _replace_ the current paxos > > > implementation (because 1) why not and 2) the idea of having 2 > > > serialization mechanisms that serialize separately sounds like a > nightmare > > > from the user POV) but it isn't stated clearly. If replacement is > indeed > > > the intent, then I think there needs to be a plan for the upgrade > path. If > > > that's not the intent, then what? > > > -- > > > Sylvain > > > > > > > > > On Wed, Sep 15, 2021 at 12:09 PM bened...@apache.org < > bened...@apache.org> > > > wrote: > > > > > >> Ok, so the act of typing out an example was actually a really good > > >> reminder of just how limited our functionality is today, even for > single > > >> partition operations. > > >> > > >> I don’t want to distract from any discussion around the underlying > > >> protocol, but we could kick off a separate conversation about how to > evolve > > >> CQL sooner than later if there is the appetite. There are no concrete > > >> proposals to discuss, it would be brainstorming. > > >> > > >> Do people also generally agree this work warrants a distinct CEP, or > would > > >> people prefer to see this developed under the same umbrella? 
> > >> > > >> > > >> From: bened...@apache.org <bened...@apache.org> > > >> Date: Wednesday, 15 September 2021 at 09:19 > > >> To: dev@cassandra.apache.org <dev@cassandra.apache.org> > > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions > > >>> perhaps we can prepare these as examples > > >> > > >> There are grammatically correct CQL queries today that cannot be > executed, > > >> and this work will naturally remove those restrictions. I’m > certainly > > >> happy to specify one of these for the CEP if it will help the reader. > > >> > > >> I want to exclude “new CQL commands” or any other enhancement to the > > >> grammar from the scope of the CEP, however. This work will enable a > range > > >> of improvements to the UX, but I think this work is a separate, > long-term > > >> project of evolution that deserves its own CEPs, and will likely > involve > > >> input from a wider range of contributors and users. If nobody else > starts > > >> such CEPs, I will do so in due course (much further down the line). > > >> > > >> Assuming there is not significant dissent on this point I will update > the > > >> CEP to reflect this non-goal. > > >> > > >> > > >> > > >> From: C. Scott Andreas <sc...@paradoxica.net> > > >> Date: Wednesday, 15 September 2021 at 00:31 > > >> To: dev@cassandra.apache.org <dev@cassandra.apache.org> > > >> Cc: dev@cassandra.apache.org <dev@cassandra.apache.org> > > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions > > >> Adding a few notes from my perspective as well – > > >> > > >> Re: the UX question, thanks for asking this. > > >> > > >> I agree that offering a set of example queries and use cases may help > make > > >> the specific use cases more understandable; perhaps we can prepare > these as > > >> examples to be included in the CEP.
> > >> > > >> I do think that all potential UX directions begin with the > specification > > >> of the protocol that will underlie them, as what can be expressed by > it may > > >> be a superset of what's immediately exposed by CQL. But at minimum > it's > > >> great to have a sense of the queries one might be able to issue to > focus a > > >> reading of the whitepaper. > > >> > > >> Re: "Can we not start using it as an external dependency, and later > > >> re-evaluate if it's necessary to bring it into the project or even > incubate > > >> it as another Apache project" > > >> > > >> I think it would be valuable to the project for the work to be > incubated > > >> in a separate repository as part of the Apache Cassandra project > itself, > > >> much like the in-JVM dtest API and Harry. This pattern worked well for > > >> those projects as they incubated as it allowed them to evolve outside > the > > >> primary codebase, but subject to the same project governance, set of > PMC > > >> members, committers, and so on. Like those libraries, it also makes > sense > > >> as the Cassandra project is the first (and, at this time, only) known > > >> intended consumer of the library, though there may be more in the > future. > > >> > > >> If the proposal is accepted, the time horizon envisioned for this > work's > > >> completion is ~9 months to a standard of production readiness. The > > >> contributors see value in the work being donated to and governed by > the > > >> contribution practices of the Foundation. Doing so ensures that it is > being > > >> developed openly and with full opportunity for review and > contribution of > > >> others, while also solidifying contribution of the IP to the project.
> > >> > > >> Spinning up a separate ASF incubation project is an interesting idea, > but > > >> I feel that doing so would introduce a far greater overhead in > process and > > >> governance, and that the most suitable governance and set of > committers/PMC > > >> members are those of the Apache Cassandra project itself. > > >> > > >> On Sep 14, 2021, at 3:53 PM, "bened...@apache.org" < > bened...@apache.org> > > >> wrote: > > >> > > >> > > >> Hi Paulo, > > >> > > >> First and foremost, I believe this proposal in its current form > focuses on > > >> the protocol details (HOW?) but lacks the bigger picture on how this > is > > >> going to be exposed to the user (WHAT)? > > >> > > >> In my opinion this CEP embodies a coherent distinct and complex piece > of > > >> work, that requires specialist expertise. You have after all just > suggested > > >> a month to read only the existing proposal 😊 > > >> > > >> UX is a whole other kind of discussion, that can be quite > opinionated, and > > >> requires different expertise. It is in my opinion helpful to break > out work > > >> that is not tightly coupled, as well as work that requires different > > >> expertise. As you point out, multi-key UX features are largely > independent > > >> of any underlying implementation, likely can be done in parallel, and > even > > >> with different contributors. > > >> > > >> Can we not start using it as an external dependency > > >> > > >> I would love to understand your rationale, as this is a surprising > > >> suggestion to me. This is just like any other subsystem, but we would > be > > >> managing it as a separate library primarily for modularity reasons. > The > > >> reality is that this option should anyway be considered unavailable. > This > > >> is a proposed contribution to the Cassandra project, which we can > either > > >> accept or reject. 
> > >> > > >> Isn't this a good chance to make the serialization protocol pluggable > > >> with clearly defined integration points > > >> > > >> It has recently been demonstrated to be possible to build a system > that > > >> can safely switch between different consensus protocols. However, > this was > > >> very sophisticated work that would require its own CEP, one that we > would > > >> be unable to resource. Even if we could, this would be insufficient. > This > > >> goal has never been achieved for a multi-shard transaction protocol > to my > > >> knowledge, and multi-shard transaction protocols are much more > divergent in > > >> implementation detail than consensus protocols. > > >> > > >> so we could easily switch implementations with different guarantees… > (ie. > > >> Apache Ratis) > > >> > > >> As far as I know, there are no other strict serializable protocols > > >> available to plug in today. Apache Ratis appears to be a > straightforward > > >> Raft implementation, and therefore it is a linearizable consensus > protocol. > > >> It is not a multi-shard transaction protocol at all, let alone strict > > >> serializable. It could be used in place of Paxos, but not Accord. > > >> > > >> > > >> > > >> From: Paulo Motta <pauloricard...@gmail.com> > > >> Date: Tuesday, 14 September 2021 at 22:55 > > >> To: Cassandra DEV <dev@cassandra.apache.org> > > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions > > >> I can start with some preliminary comments while I get more > familiarized > > >> with the proposal: > > >> > > >> - First and foremost, I believe this proposal in its current form > focuses > > >> on the protocol details (HOW?) but lacks the bigger picture on how > this is > > >> going to be exposed to the user (WHAT)? Is exposing linearizable > > >> transactions to the user not a goal of this proposal? If not, I think > the > > >> proposal is missing the UX (ie.
what CQL commands are going to be > added > > >> etc) on how these transactions are going to be exposed. > > >> > > >> - Why do we need to bring the library into the project umbrella? Can > we not > > >> start using it as an external dependency, and later re-evaluate if > it's > > >> necessary to bring it into the project or even incubate it as another > > >> Apache project? I feel we may be importing unnecessary management > overhead > > >> into the project while only a small subset of contributors will be > involved > > >> with the core protocol. > > >> > > >> - Isn't this a good chance to make the serialization protocol > pluggable > > >> with clearly defined integration points, so we could easily switch > > >> implementations with different guarantees, trade-offs and performance > > >> considerations while leaving the UX intact? This would also allow us > to > > >> easily benchmark the protocol against alternatives (ie. Apache Ratis) > and > > >> validate the performance claims. I think the best way to do that > would be > > >> to define what the feature will look like to the end user (UX), > define the > > >> integration points necessary to support this feature, and use accord > as the > > >> first implementation of these integration points. > > >> > > >> On Tue, 14 Sep 2021 at 17:57, Paulo Motta < > > >> pauloricard...@gmail.com> > > >> wrote: > > >> > > >> Given the extensiveness and complexity of the proposal I'd suggest > leaving > > >> it a little longer (perhaps 4 weeks from the publish date?) for > people to > > >> get a bit more familiarized and have the chance to comment before > casting a > > >> vote. I glanced through the proposal - and it looks outstanding, very > > >> promising work guys! - but would like a bit more time to take a > deeper look > > >> and digest it before potentially commenting on it. > > >> > > >> On Tue, 14 Sep
2021 at 17:30, bened...@apache.org < > > >> bened...@apache.org> wrote: > > >> > > >> Has anyone had a chance to read the drafts, and has any feedback or > > >> questions? Does anybody still anticipate doing so in the near future? Or > > >> shall we move to a vote? > > >> > > >> From: bened...@apache.org <bened...@apache.org> > > >> Date: Tuesday, 7 September 2021 at 21:27 > > >> To: dev@cassandra.apache.org <dev@cassandra.apache.org> > > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions > > >> Hi Jake, > > >> > > >>> What structural changes are planned to support an external dependency > > >> project like this > > >> > > >> To add to Blake’s answer, in case there’s some confusion over this, > the > > >> proposal is to include this library within the Apache Cassandra > project. So > > >> I wouldn’t think of it as an external dependency. This PMC and > community > > >> will still have the usual oversight over direction and development, > and > > >> APIs will be developed solely with the intention of their integration > with > > >> Cassandra. > > >> > > >>> Will this effort eventually replace consistency levels in C*? > > >> > > >> I hope we’ll have some very related discussions around consistency > levels > > >> in the coming months more generally, but I don’t think that is tightly > > >> coupled to this work. I agree with you both that we won’t want to > > >> perpetuate the problems you’ve highlighted though. > > >> > > >> Henrik: > > >>> I was referring to the property that Calvin transactions also need to > > >> be sent to the cluster in a single shot > > >> > > >> Ah, yes. In that case I agree, and I tried to point to this direction > in > > >> an earlier email, where I discussed the use of scripting languages > (i.e. > > >> transactionally modifying the database with some subset of arbitrary > > >> computation).
> > >> I think the JVM is particularly suited to offering quite powerful
> > >> distributed transactions in this vein, and it will be interesting to see
> > >> what we might develop in this direction in future.
> > >>
> > >> From: Jake Luciani <jak...@gmail.com>
> > >> Date: Tuesday, 7 September 2021 at 19:27
> > >> To: dev@cassandra.apache.org <dev@cassandra.apache.org>
> > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > >> Great, thanks for the information.
> > >>
> > >> On Tue, Sep 7, 2021 at 12:44 PM Blake Eggleston
> > >> <beggles...@apple.com.invalid> wrote:
> > >>
> > >>> Hi Jake,
> > >>>
> > >>>> 1. Will this effort eventually replace consistency levels in C*? I ask
> > >>>> because one of the shortcomings of our paxos today is that it can
> > >>>> easily be mixed with non-serialized consistencies, and therefore users
> > >>>> commonly break consistency by, for example, reading at CL.ONE while
> > >>>> also using LWTs.
> > >>>
> > >>> This will likely require CLs to be specified at the schema level for
> > >>> tables using multi-partition transactions. I’d expect this to be
> > >>> available for other tables, but not required.
> > >>>
> > >>>> 2. What structural changes are planned to support an external
> > >>>> dependency project like this? Are there some high-level interfaces you
> > >>>> expect the project to adhere to?
> > >>>
> > >>> There will be some interfaces that need to be implemented in C* to
> > >>> support the library. You can find the current interfaces in the
> > >>> accord.api package, but these were written to support some initial
> > >>> testing, and are not intended for integration into C* as is. Things are
> > >>> pretty fluid right now and will be rewritten / refactored multiple times
> > >>> over the next few months.
> > >>>
> > >>> Thanks,
> > >>>
> > >>> Blake
> > >>>
> > >>>> On Sun, Sep 5, 2021 at 10:33 AM bened...@apache.org
> > >>>> <bened...@apache.org> wrote:
> > >>>>
> > >>>>> Wiki:
> > >>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > >>>>> Whitepaper:
> > >>>>> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > >>>>> Prototype: https://github.com/belliottsmith/accord
> > >>>>>
> > >>>>> Hi everyone, I’d like to propose this CEP for adoption by the
> > >>>>> community.
> > >>>>>
> > >>>>> Cassandra has benefitted from LWTs for many years, but application
> > >>>>> developers that want to ensure consistency for complex operations must
> > >>>>> either accept the scalability bottleneck of serializing all related
> > >>>>> state through a single partition, or layer a complex state machine on
> > >>>>> top of the database. These are sophisticated and costly activities
> > >>>>> that our users should not be expected to undertake. Since distributed
> > >>>>> databases are beginning to offer distributed transactions with fewer
> > >>>>> caveats, it is past time for Cassandra to do so as well.
> > >>>>>
> > >>>>> This CEP proposes the use of several novel techniques that build upon
> > >>>>> research (that followed EPaxos) to deliver (non-interactive) general
> > >>>>> purpose distributed transactions. The approach is outlined in the wiki
> > >>>>> page and in more detail in the linked whitepaper.
> > >>>>> Importantly, by adopting this approach we will be the _only_
> > >>>>> distributed database to offer global, scalable, strict serializable
> > >>>>> transactions in one wide-area round-trip. This would represent a
> > >>>>> significant improvement in the state of the art, both in the academic
> > >>>>> literature and in commercial or open source offerings.
> > >>>>>
> > >>>>> This work has been partially realised in a prototype. This partial
> > >>>>> prototype has been verified against Jepsen.io’s Maelstrom library and
> > >>>>> dedicated in-tree strict serializability verification tools, but much
> > >>>>> work remains before it is production capable and integrated into
> > >>>>> Cassandra.
> > >>>>>
> > >>>>> I propose including the prototype in the project as a new source
> > >>>>> repository, to be developed as a standalone library for integration
> > >>>>> into Cassandra. I hope the community sees the important value
> > >>>>> proposition of this proposal, and will adopt the CEP after this
> > >>>>> discussion, so that the library and its integration into Cassandra can
> > >>>>> be developed in parallel and with the involvement of the wider
> > >>>>> community.
> > >>>>
> > >>>> --
> > >>>> http://twitter.com/tjake
> > >>>
> > >>> ---------------------------------------------------------------------
> > >>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> > >>> For additional commands, e-mail: dev-h...@cassandra.apache.org
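[Editor's illustration] Blake's reply above points at integration interfaces in the prototype's accord.api package, and Paulo's suggestion is to define such integration points so that alternative implementations could be benchmarked behind them. A minimal sketch of what such a pluggable transaction-protocol seam *might* look like follows; every name here (Txn, TxnProtocol, LocalProtocol) is invented for illustration and does not come from the actual accord.api package, which at the time of the thread was explicitly described as in flux.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

/** A non-interactive transaction: given a snapshot, returns the writes to apply. */
interface Txn {
    Map<String, String> execute(Map<String, String> snapshot);
}

/** Hypothetical pluggable seam: Accord, Paxos-backed LWTs, etc. would sit behind it. */
interface TxnProtocol {
    CompletableFuture<Void> coordinate(Txn txn);
}

/** Toy single-node "protocol", only so the seam can be exercised locally. */
class LocalProtocol implements TxnProtocol {
    final Map<String, String> store = new HashMap<>();

    @Override
    public synchronized CompletableFuture<Void> coordinate(Txn txn) {
        // A real implementation would run consensus and replication here;
        // this one just applies the transaction's writes to local state.
        store.putAll(txn.execute(new HashMap<>(store)));
        return CompletableFuture.completedFuture(null);
    }
}
```

Benchmarking alternatives, as Paulo proposes, would then amount to swapping the TxnProtocol binding; the difficulty Benedict identifies is that no existing drop-in implementation provides multi-shard strict serializability behind such a seam.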