Lacking the most basic support for multi-partition transactions is a serious handicap. The CEP offers a concrete solution.

It’s possible to solve multi-partition transactions in a myriad of other ways, I’m sure, but CEP-15 is what’s on offer for Cassandra at the moment, and I’m not seeing any alternative CEPs with folks lined up to implement them. The CEP is a clear and meaningful improvement over the status quo. The engineers behind it are committed to doing the implementation work and can be trusted to stick around for maintenance. It’s been a month now; please, let’s get this going.

> On 11 Oct 2021, at 13:43, bened...@apache.org wrote:
>
> For those who missed it, my talk discussing this CEP at ApacheCon is now available to view: https://www.youtube.com/watch?v=YAE7E-QEAvk
>
> From: Oleksandr Petrov <oleksandr.pet...@gmail.com>
> Date: Monday, 11 October 2021 at 10:11
> To: dev <dev@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
>
>> I support this proposal. From what I can understand, this proposal moves us towards having the building blocks we need to correctly deliver some of the most often requested features in Cassandra.
>
> Same here. I also support this proposal and believe it opens up many new opportunities (while not limiting us / not narrowing our future options), can help us implement features we've all wanted to have implemented for years, and make significant improvements in the subsystems that were a source of issues for a long time.
>
> I think it's also good to start with CAS batches: it's a great way to make the feature available and work incrementally. After this lands, people will be able to use Accord/MPT in different subsystems and get busy implementing all sorts of other features and improvements on top of it.
>
> On Sat, Oct 9, 2021 at 4:18 PM Joseph Lynch <joe.e.ly...@gmail.com> wrote:
>
>>> With the proposal hitting the one-month mark, the contributors are interested in gauging the developer community's response to the proposal.
>>
>> I support this proposal.
>> From what I can understand, this proposal moves us towards having the building blocks we need to correctly deliver some of the most often requested features in Cassandra. For example it seems to unlock: batches that actually work, registers that offer fast compare and swap, global secondary indices that can be correctly maintained, and more. Therefore, given the benefit to the community, I support working towards that foundation that will allow us to build solutions in Cassandra that pay consensus closer to mutation instead of lazily at read/repair time.
>>
>> I think the feedback in this thread around interface (what statements will this facilitate and how will the library integrate with Cassandra itself), performance (how fast will these transactions be, will we offer bounded stale reads, etc ...), and implementation (how does this compare/contrast with other consensus approaches) has been informative, but at this point I think it makes sense to start trying to make incremental progress towards a functional integration to discover any remaining areas for improvement.
>>
>> Cheers and thank you!
>> -Joey
>>
>> On Thu, Oct 7, 2021 at 10:51 AM C. Scott Andreas <sc...@paradoxica.net> wrote:
>>>
>>> Hi Jonathan,
>>>
>>> Following up on my message yesterday as it looks like our replies may have crossed en route.
>>>
>>> Thanks for bumping your message from earlier in our discussion. I believe we have addressed most of these questions on the thread, in addition to offering a presentation on this and related work at ApacheCon, a discussion hosted following that presentation at ApacheCon, and in ASF Slack. Contributors have further offered an opportunity to discuss specific questions via videoconference if it helps to speak live. I'd be happy to do so as well.
>>>
>>> Since your original message, discussion has covered a lot of ground on the related databases you've mentioned:
>>> – Henrik has shared expertise related to MongoDB and its implementation.
>>> – You've shared an overview of Calvin.
>>> – Alex Miller has helped us review the work relative to other Paxos algorithms and identified a few great enhancements to incorporate.
>>> – The paper discusses related approaches in FoundationDB, CockroachDB, and Yugabyte.
>>> – Subsequent discussion has contrasted the implementation to DynamoDB, Google Cloud BigTable, and Google Cloud Spanner (noting specifically that the protocol achieves Spanner's 1x round-trip without requiring specialized hardware).
>>>
>>> In my reply yesterday, I've attempted to crystallize what becomes possible via CQL: one-shot multi-partition transactions in the first implementation and a 4x latency reduction on writes / 2x latency reduction on reads relative to today; along with the ability to build upon this work to enable interactive transactions in the future.
>>>
>>> I believe we've exercised the questions you've raised and am grateful for the ground we've covered. If you have further questions that are difficult to exercise via email, please let me know if you'd like to arrange a call (open-invite); we'd be happy to discuss live as well.
>>>
>>> With the proposal hitting the one-month mark, the contributors are interested in gauging the developer community's response to the proposal. We warrant our ability to focus durably on the project; execute this development on ASF JIRA in collaboration with other contributors; engage with members of the developer and user community on feedback, enhancements, and bugs; and intend to deliver it to completion at a standard of readiness suitable for production transactional systems of record.
>>>
>>> Thanks,
>>>
>>> – Scott
>>>
>>> On Oct 6, 2021, at 8:25 AM, C.
>>> Scott Andreas <sc...@paradoxica.net> wrote:
>>>
>>> Hi folks,
>>>
>>> Thanks for discussion on this proposal, and also to Benedict who’s been fielding questions on the list!
>>>
>>> I’d like to restate the goals and problem statement captured by this proposal and frame context.
>>>
>>> Today, lightweight transactions limit users to transacting over a single partition. This unit of atomicity has a very low upper limit in terms of the amount of data that can be CAS’d over; and doing so leads many to design contorted data models to cram different types of data into one partition for the purposes of being able to CAS over it. We propose that Cassandra can and should be extended to remove this limit, enabling users to issue one-shot transactions that CAS over multiple keys – including CAS batches, which may modify multiple keys.
>>>
>>> To enable this, the CEP authors have designed a novel, leaderless paxos-based protocol unique to Cassandra, offered a proof of its correctness, a whitepaper outlining it in detail, along with a prototype implementation to incubate development, and integrated it with Maelstrom from jepsen.io to validate linearizability as more specific test infrastructure is developed. This rigor is remarkable, and I’m thrilled to see such a degree of investment in the area.
>>>
>>> Even users who do not require the capability to transact across partition boundaries will benefit. The protocol reduces message/WAN round-trips by 4x on writes (4 → 1) and 2x on reads (2 → 1) in the common case against today’s baseline. These latency improvements coupled with the enhanced flexibility of what can be transacted over in Cassandra enable new classes of applications to use the database.
>>>
>>> In particular, 1xRTT read/write transactions across partitions enable Cassandra to be thought of not just as a strongly consistent database, but even a transactional database - a mode many may even prefer to use by default. Given this capability, Apache Cassandra has an opportunity to become one of – or perhaps the only – database in the industry that can store multiple petabytes of data in a single database; replicate it across many regions; and allow users to transact over any subset of it. These are capabilities that can be met by no other system I’m aware of on the market. Dynamo’s transactions are single-DC. Google Cloud BigTable does not support transactions. Spanner, Aurora, CloudSQL, and RDS have far lower scalability limits or require specialized hardware, etc.
>>>
>>> This is an incredible opportunity for Apache Cassandra - to surpass the scalability and transactional capability of some of the most advanced systems in our industry - and to do so in open source, where anyone can download and deploy the software to achieve this without cost; and for students and researchers to learn from and build upon as well (a team from UT-Austin has already reached out to this effect).
>>>
>>> As Benedict and Blake noted, the scope of what’s captured in this proposal is also not terminal. While the first implementation may extend today’s CAS semantics to multiple partitions with lower latency, the foundation is suitable to build interactive transactions as well - which would be remarkable and is something that I hadn’t considered myself at the onset of this project.
>>>
>>> To that end, the CEP proposes the protocol, offers a validated implementation, and the initial capability of extending today’s single-partition transactions to multi-partition; while providing the flexibility to build upon this work further.
>>>
>>> A simple example of what becomes possible when this work lands and is integrated might be:
>>>
>>> –––
>>> BEGIN BATCH
>>> UPDATE tbl1 SET value1 = newValue1 WHERE partitionKey = k1
>>> UPDATE tbl2 SET value2 = newValue2 WHERE partitionKey = k2 AND conditionValue = someCondition
>>> APPLY BATCH
>>> –––
>>>
>>> I understand that this query is present in the CEP and my intent isn’t to recommend that folks reread it if they’ve given a careful reading already. But I do think it’s important to elaborate upon what becomes possible when this query can be issued.
>>>
>>> Users of Cassandra who have designed data models that cram many types of data into a single partition for the purposes of atomicity no longer need to. They can design their applications with appropriate schemas that wouldn’t leave Codd holding his nose. They’re no longer pushed into antipatterns that result in these partitions becoming huge and potentially unreadable. Cassandra doesn’t become fully relational in this CEP - but it becomes possible and even easy to design applications that transact across tables that mimic a large amount of relational functionality. And for users who are content to transact over a single table, they’ll find those transactions become up to 4x faster today due to the protocol’s reduction in round-trips. The library’s loose coupling to Apache Cassandra and ability to be incubated out-of-tree also enables other applications to take advantage of the protocol and is a nice step toward bringing modularity to the project. There are a lot of good things happening here.
>>>
>>> I know I’m listed as an author - but figured I should go on record to say “I support this CEP.” :)
>>>
>>> Thanks,
>>>
>>> – Scott
>>>
>>> On Oct 6, 2021, at 8:05 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>
>>> The problem that I keep pointing out is that you've created this CEP for Accord without first getting consensus that the goals and the tradeoffs it makes to achieve those goals (and that it will impose on future work around transactions) are the right ones for Cassandra long term.
>>>
>>> At this point I'm done repeating myself. For the convenience of anyone following this thread intermittently, I'll quote my first reply on this thread to illustrate the kind of discussion I'd like to have.
>>>
>>> -----
>>>
>>> The whitepaper here is a good description of the consensus algorithm itself as well as its robustness and stability characteristics, and its comparison with other state-of-the-art consensus algorithms is very useful. In the context of Cassandra, where a consensus algorithm is only part of what will be implemented, I'd like to see a more complete evaluation of the transactional side of things as well, including performance characteristics as well as the types of transactions that can be supported and at least a general idea of what it would look like applied to Cassandra. This will allow the PMC to make a more informed decision about what tradeoffs are best for the entire long-term project of first supplementing and ultimately replacing LWT.
>>>
>>> (Allowing users to mix LWT and AP Cassandra operations against the same rows was probably a mistake, so in contrast with LWT we’re not looking for something fast enough for occasional use but rather something within a reasonable factor of AP operations, appropriate to being the only way to interact with tables declared as such.)
>>>
>>> Besides Accord, this should cover
>>>
>>> - Calvin and FaunaDB
>>> - A Spanner derivative (no opinion on whether that should be Cockroach or Yugabyte, I don’t think it’s necessary to cover both)
>>> - A 2PC implementation (the Accord paper mentions DynamoDB but I suspect there is more public information about MongoDB)
>>> - RAMP
>>>
>>> Here’s an example of what I mean:
>>>
>>> =Calvin=
>>>
>>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to order transactions, then replicas execute the transactions independently with no further coordination. No SPOF. Transactions are batched by each sequencer to keep this from becoming a bottleneck.
>>>
>>> Performance: Calvin paper (published 2012) reports linear scaling of TPC-C New Order up to 500,000 transactions/s on 100 machines (EC2 XL machines with 7GB ram and 8 virtual cores). Note that TPC-C New Order is composed of four reads and four writes, so this is effectively 2M reads and 2M writes as we normally measure them in C*.
>>>
>>> Calvin supports mixed read/write transactions, but because the transaction execution logic requires knowing all partition keys in advance to ensure that all replicas can reproduce the same results with no coordination, reads against non-PK predicates must be done ahead of time (transparently, by the server) to determine the set of keys, and this must be retried if the set of rows affected is updated before the actual transaction executes.
>>>
>>> Batching and global consensus add latency -- 100ms in the Calvin paper and apparently about 50ms in FaunaDB. Glass half full: all transactions (including multi-partition updates) are equally performant in Calvin since the coordination is handled up front in the sequencing step. Glass half empty: even single-row reads and writes have to pay the full coordination cost.
>>> Fauna has optimized this away for reads but I am not aware of a description of how they changed the design to allow this.
>>>
>>> Functionality and limitations: since the entire transaction must be known in advance to allow coordination-less execution at the replicas, Calvin cannot support interactive transactions at all. FaunaDB mitigates this by allowing server-side logic to be included, but a Calvin approach will never be able to offer SQL compatibility.
>>>
>>> Guarantees: Calvin transactions are strictly serializable. There is no additional complexity or performance hit to generalizing to multiple regions, apart from the speed of light. And since Calvin is already paying a batching latency penalty, this is less painful than for other systems.
>>>
>>> Application to Cassandra: B-. Distributed transactions are handled by the sequencing and scheduling layers, which are leaderless, and Calvin’s requirements for the storage layer are easily met by C*. But Calvin also requires a global consensus protocol and LWT is almost certainly not sufficiently performant, so this would require ZK or etcd (reasonable for a library approach but not for replacing LWT in C* itself), or an implementation of Accord. I don’t believe Calvin would require additional table-level metadata in Cassandra.
>>>
>>> On Wed, Oct 6, 2021 at 9:53 AM bened...@apache.org <bened...@apache.org> wrote:
>>>
>>> The problem with dropping a patch on Jira is that there is no opportunity to point out problems, either with the fundamental approach or with the specific implementation. So please point out some problems I can engage with!
>>>
>>> From: Jonathan Ellis <jbel...@gmail.com>
>>> Date: Wednesday, 6 October 2021 at 15:48
>>> To: dev <dev@cassandra.apache.org>
>>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
>>>
>>> On Wed, Oct 6, 2021 at 9:21 AM bened...@apache.org <bened...@apache.org> wrote:
>>>
>>>> The goals of the CEP are stated clearly, and these were the goals we had going into the (multi-month) research project we undertook before proposing this CEP. These goals are necessarily value judgements, so we cannot expect that everyone will agree that they are optimal.
>>>
>>> Right, so I'm saying that this is exactly the most important thing to get consensus on, and creating a CEP for a protocol to achieve goals that you have not discussed with the community is the CEP equivalent of dropping a patch on Jira without discussing its goals either.
>>>
>>> That's why our conversations haven't gone anywhere, because I keep saying "we need to discuss the goals and tradeoffs", and I'll give an example of what I mean, and you keep addressing the examples (sometimes very shallowly, "it would be possible to X" or "Y could be done as an optimization") while ignoring the request to open a discussion around the big picture.
>>>
>>> --
>>> Jonathan Ellis
>>> co-founder, http://www.datastax.com
>>> @spyced
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: dev-h...@cassandra.apache.org
>
> --
> alex p