Re: [DISCUSS] CEP-15: General Purpose Transactions

C. Scott Andreas Thu, 07 Oct 2021 07:52:09 -0700
Hi Jonathan,

Following up on my message yesterday as it looks like our replies may have crossed en route.

Thanks for bumping your message from earlier in our discussion. I believe we have addressed most of these questions on the thread, in addition to offering a presentation on this and related work at ApacheCon, a discussion hosted following that presentation at ApacheCon, and in ASF Slack. Contributors have further offered an opportunity to discuss specific questions via videoconference if it helps to speak live. I'd be happy to do so as well.

Since your original message, discussion has covered a lot of ground on the related databases you've mentioned:

– Henrik has shared expertise related to MongoDB and its implementation.
– You've shared an overview of Calvin.
– Alex Miller has helped us review the work relative to other Paxos algorithms and identified a few great enhancements to incorporate.
– The paper discusses related approaches in FoundationDB, CockroachDB, and Yugabyte.
– Subsequent discussion has contrasted the implementation to DynamoDB, Google Cloud BigTable, and Google Cloud Spanner (noting specifically that the protocol achieves Spanner's 1x round-trip without requiring specialized hardware).

In my reply yesterday, I attempted to crystallize what becomes possible via CQL: one-shot multi-partition transactions in the first implementation and a 4x latency reduction on writes / 2x latency reduction on reads relative to today; along with the ability to build upon this work to enable interactive transactions in the future.

I believe we've exercised the questions you've raised, and I am grateful for the ground we've covered. If you have further questions that are difficult to exercise via email, please let me know if you'd like to arrange a call (open-invite); we'd be happy to discuss live as well.

With the proposal hitting the one-month mark, the contributors are interested in gauging the developer community's response to the proposal. We warrant our ability to focus durably on the project; execute this development on ASF JIRA in collaboration with other contributors; engage with members of the developer and user community on feedback, enhancements, and bugs; and intend to deliver it to completion at a standard of readiness suitable for production transactional systems of record.

Thanks,
– Scott

On Oct 6, 2021, at 8:25 AM, C. Scott Andreas <sc...@paradoxica.net> wrote:

Hi folks,

Thanks for discussion on this proposal, and also to Benedict who’s been fielding questions on the list!

I’d like to restate the goals and problem statement captured by this proposal and frame context.

Today, lightweight transactions limit users to transacting over a single partition. This unit of atomicity has a very low upper limit in terms of the amount of data that can be CAS’d over; and doing so leads many to design contorted data models to cram different types of data into one partition for the purposes of being able to CAS over it. We propose that Cassandra can and should be extended to remove this limit, enabling users to issue one-shot transactions that CAS over multiple keys – including CAS batches, which may modify multiple keys.

To enable this, the CEP authors have designed a novel, leaderless Paxos-based protocol unique to Cassandra, offered a proof of its correctness, a whitepaper outlining it in detail, along with a prototype implementation to incubate development, and integrated it with Maelstrom from jepsen.io to validate linearizability as more specific test infrastructure is developed. This rigor is remarkable, and I’m thrilled to see such a degree of investment in the area.

Even users who do not require the capability to transact across partition boundaries will benefit. The protocol reduces message/WAN round-trips by 4x on writes (4 → 1) and 2x on reads (2 → 1) in the common case against today’s baseline. These latency improvements coupled with the enhanced flexibility of what can be transacted over in Cassandra enable new classes of applications to use the database.

In particular, 1xRTT read/write transactions across partitions enable Cassandra to be thought of not just as a strongly consistent database, but even a transactional database - a mode many may even prefer to use by default. Given this capability, Apache Cassandra has an opportunity to become one of the few databases in the industry, or perhaps the only one, that can store multiple petabytes of data in a single database; replicate it across many regions; and allow users to transact over any subset of it. These are capabilities that can be met by no other system I’m aware of on the market. Dynamo’s transactions are single-DC. Google Cloud BigTable does not support transactions. Spanner, Aurora, CloudSQL, and RDS have far lower scalability limits or require specialized hardware, etc.

This is an incredible opportunity for Apache Cassandra - to surpass the scalability and transactional capability of some of the most advanced systems in our industry - and to do so in open source, where anyone can download and deploy the software to achieve this without cost; and for students and researchers to learn from and build upon as well (a team from UT-Austin has already reached out to this effect).

As Benedict and Blake noted, the scope of what’s captured in this proposal is also not terminal. While the first implementation may extend today’s CAS semantics to multiple partitions with lower latency, the foundation is suitable to build interactive transactions as well - which would be remarkable and is something that I hadn’t considered myself at the outset of this project.

To that end, the CEP proposes the protocol, offers a validated implementation, and the initial capability of extending today’s single-partition transactions to multi-partition; while providing the flexibility to build upon this work further.

A simple example of what becomes possible when this work lands and is integrated might be:

–––
BEGIN BATCH
UPDATE tbl1 SET value1 = newValue1 WHERE partitionKey = k1
UPDATE tbl2 SET value2 = newValue2 WHERE partitionKey = k2 AND conditionValue = someCondition
APPLY BATCH
–––

I understand that this query is present in the CEP and my intent isn’t to recommend that folks reread it if they’ve given a careful reading already. But I do think it’s important to elaborate upon what becomes possible when this query can be issued.

Users of Cassandra who have designed data models that cram many types of data into a single partition for the purposes of atomicity no longer need to. They can design their applications with appropriate schemas that wouldn’t leave Codd holding his nose. They’re no longer pushed into antipatterns that result in these partitions becoming huge and potentially unreadable. Cassandra doesn’t become fully relational in this CEP - but it becomes possible and even easy to design applications that transact across tables that mimic a large amount of relational functionality. And for users who are content to transact over a single table, they’ll find those transactions become up to 4x faster than today due to the protocol’s reduction in round-trips. The library’s loose coupling to Apache Cassandra and ability to be incubated out-of-tree also enables other applications to take advantage of the protocol and is a nice step toward bringing modularity to the project. There are a lot of good things happening here.

I know I’m listed as an author - but figured I should go on record to say “I support this CEP.” :)

Thanks,
– Scott

On Oct 6, 2021, at 8:05 AM, Jonathan Ellis <jbel...@gmail.com> wrote:

The problem that I keep pointing out is that you've created this CEP for Accord without first getting consensus that the goals and the tradeoffs it makes to achieve those goals (and that it will impose on future work around transactions) are the right ones for Cassandra long term.

At this point I'm done repeating myself. For the convenience of anyone following this thread intermittently, I'll quote my first reply on this thread to illustrate the kind of discussion I'd like to have.

-----

The whitepaper here is a good description of the consensus algorithm itself as well as its robustness and stability characteristics, and its comparison with other state-of-the-art consensus algorithms is very useful. In the context of Cassandra, where a consensus algorithm is only part of what will be implemented, I'd like to see a more complete evaluation of the transactional side of things as well, including performance characteristics as well as the types of transactions that can be supported and at least a general idea of what it would look like applied to Cassandra. This will allow the PMC to make a more informed decision about what tradeoffs are best for the entire long-term project of first supplementing and ultimately replacing LWT.

(Allowing users to mix LWT and AP Cassandra operations against the same rows was probably a mistake, so in contrast with LWT we’re not looking for something fast enough for occasional use but rather something within a reasonable factor of AP operations, appropriate to being the only way to interact with tables declared as such.)

Besides Accord, this should cover

- Calvin and FaunaDB
- A Spanner derivative (no opinion on whether that should be Cockroach or Yugabyte, I don’t think it’s necessary to cover both)
- A 2PC implementation (the Accord paper mentions DynamoDB but I suspect there is more public information about MongoDB)
- RAMP

Here’s an example of what I mean:

=Calvin=

Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to order transactions, then replicas execute the transactions independently with no further coordination. No SPOF. Transactions are batched by each sequencer to keep this from becoming a bottleneck.

Performance: Calvin paper (published 2012) reports linear scaling of TPC-C New Order up to 500,000 transactions/s on 100 machines (EC2 XL machines with 7GB ram and 8 virtual cores). Note that TPC-C New Order is composed of four reads and four writes, so this is effectively 2M reads and 2M writes as we normally measure them in C*.

Calvin supports mixed read/write transactions, but because the transaction execution logic requires knowing all partition keys in advance to ensure that all replicas can reproduce the same results with no coordination, reads against non-PK predicates must be done ahead of time (transparently, by the server) to determine the set of keys, and this must be retried if the set of rows affected is updated before the actual transaction executes.

Batching and global consensus add latency -- 100ms in the Calvin paper and apparently about 50ms in FaunaDB. Glass half full: all transactions (including multi-partition updates) are equally performant in Calvin since the coordination is handled up front in the sequencing step. Glass half empty: even single-row reads and writes have to pay the full coordination cost. Fauna has optimized this away for reads but I am not aware of a description of how they changed the design to allow this.

Functionality and limitations: since the entire transaction must be known in advance to allow coordination-less execution at the replicas, Calvin cannot support interactive transactions at all. FaunaDB mitigates this by allowing server-side logic to be included, but a Calvin approach will never be able to offer SQL compatibility.

Guarantees: Calvin transactions are strictly serializable. There is no additional complexity or performance hit to generalizing to multiple regions, apart from the speed of light. And since Calvin is already paying a batching latency penalty, this is less painful than for other systems.

Application to Cassandra: B-. Distributed transactions are handled by the sequencing and scheduling layers, which are leaderless, and Calvin’s requirements for the storage layer are easily met by C*. But Calvin also requires a global consensus protocol and LWT is almost certainly not sufficiently performant, so this would require ZK or etcd (reasonable for a library approach but not for replacing LWT in C* itself), or an implementation of Accord. I don’t believe Calvin would require additional table-level metadata in Cassandra.

On Wed, Oct 6, 2021 at 9:53 AM bened...@apache.org <bened...@apache.org> wrote:

The problem with dropping a patch on Jira is that there is no opportunity to point out problems, either with the fundamental approach or with the specific implementation. So please point out some problems I can engage with!

From: Jonathan Ellis <jbel...@gmail.com>
Date: Wednesday, 6 October 2021 at 15:48
To: dev <dev@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions

On Wed, Oct 6, 2021 at 9:21 AM bened...@apache.org <bened...@apache.org> wrote:

> The goals of the CEP are stated clearly, and these were the goals we had
> going into the (multi-month) research project we undertook before proposing
> this CEP. These goals are necessarily value judgements, so we cannot expect
> that everyone will agree that they are optimal.

Right, so I'm saying that this is exactly the most important thing to get consensus on, and creating a CEP for a protocol to achieve goals that you have not discussed with the community is the CEP equivalent of dropping a patch on Jira without discussing its goals either.

That's why our conversations haven't gone anywhere, because I keep saying "we need to discuss the goals and tradeoffs", and I'll give an example of what I mean, and you keep addressing the examples (sometimes very shallowly, "it would be possible to X" or "Y could be done as an optimization") while ignoring the request to open a discussion around the big picture.

--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced
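
The figures quoted in this thread are easy to sanity-check with a little arithmetic. The sketch below is illustrative only. The function names and the 80 ms WAN round-trip time are assumptions made for the example, not values from the CEP or the Calvin paper. The round-trip counts (4 → 1 on writes, 2 → 1 on reads) and the TPC-C New Order numbers (500,000 transactions/s, four reads and four writes each) are taken from the messages above. The model also assumes latency is dominated by WAN round-trips and ignores processing time.

```python
# Back-of-the-envelope check of figures quoted in the thread.
# Assumption: latency is dominated by WAN round-trips; processing time is ignored.

def latency_ms(round_trips: int, rtt_ms: float) -> float:
    """Time spent on the wire when each round-trip costs rtt_ms."""
    return round_trips * rtt_ms


def reduction_factor(round_trips_before: int, round_trips_after: int) -> float:
    """How many times fewer round-trips the new protocol needs."""
    return round_trips_before / round_trips_after


def ops_per_second(txns_per_second: int, reads_per_txn: int, writes_per_txn: int):
    """Convert a transaction rate into (reads/s, writes/s)."""
    return txns_per_second * reads_per_txn, txns_per_second * writes_per_txn


WAN_RTT_MS = 80.0  # hypothetical cross-region round-trip time, not from the thread

# Round-trip counts quoted in the thread: writes 4 -> 1, reads 2 -> 1.
write_before = latency_ms(4, WAN_RTT_MS)
write_after = latency_ms(1, WAN_RTT_MS)
read_before = latency_ms(2, WAN_RTT_MS)
read_after = latency_ms(1, WAN_RTT_MS)

# Calvin figures quoted in the thread: 500,000 New Order txn/s, 4 reads + 4 writes each.
calvin_reads, calvin_writes = ops_per_second(500_000, 4, 4)

if __name__ == "__main__":
    print(f"writes: {write_before:.0f} ms -> {write_after:.0f} ms "
          f"({reduction_factor(4, 1):.0f}x fewer round-trips)")
    print(f"reads:  {read_before:.0f} ms -> {read_after:.0f} ms "
          f"({reduction_factor(2, 1):.0f}x fewer round-trips)")
    print(f"Calvin: {calvin_reads:,} reads/s and {calvin_writes:,} writes/s")
```

Under these assumptions the reduction factors do not depend on the chosen round-trip time. Only the absolute latencies change with it.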