Re: [DISCUSS] CEP-15: General Purpose Transactions

Aleksey Yeschenko Mon, 11 Oct 2021 09:08:00 -0700

Lacking the most basic support for multi-partition transactions is a serious 
handicap. The CEP offers a concrete solution.


It’s possible to solve multi-partition transactions in a myriad of other ways, 
I’m sure, but CEP-15 is what’s on offer for Cassandra at the moment, and I’m 
not seeing any alternative CEPs with folks lined up to implement them.

The CEP is a clear and meaningful improvement over status quo. The engineers 
behind it are committed to doing the implementation work and can be trusted to 
stick around for maintenance. It’s been a month now, please, let’s get this 
going.

> On 11 Oct 2021, at 13:43, bened...@apache.org wrote:
> 
> For those who missed it, my talk discussing this CEP at ApacheCon is now 
> available to view:  https://www.youtube.com/watch?v=YAE7E-QEAvk
> 
> 
> 
> From: Oleksandr Petrov <oleksandr.pet...@gmail.com>
> Date: Monday, 11 October 2021 at 10:11
> To: dev <dev@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
>> I support this proposal. From what I can understand, this proposal moves
>> us towards having the building blocks we need to correctly deliver some of
>> the most often requested features in Cassandra.
> 
> Same here. I also support this proposal and believe it opens up many new
> opportunities (while not limiting us / not narrowing our future options),
> can help us implement features we've all wanted to have implemented for
> years, and make significant improvements in the subsystems that were a
> source of issues for a long time.
> 
> I think it's also good to start with CAS batches: it's a great way to make
> the feature available and work incrementally. After this lands, people will
> be able to use Accord/MPT in different subsystems and get busy
> implementing all sorts of other features and improvements on top of it.
> 
> 
> 
> 
> On Sat, Oct 9, 2021 at 4:18 PM Joseph Lynch <joe.e.ly...@gmail.com> wrote:
> 
>>> With the proposal hitting the one-month mark, the contributors are
>> interested in gauging the developer community's response to the proposal.
>> 
>> I support this proposal. From what I can understand, this proposal
>> moves us towards having the building blocks we need to correctly
>> deliver some of the most often requested features in Cassandra. For
>> example it seems to unlock: batches that actually work, registers that
>> offer fast compare and swap, global secondary indices that can be
>> correctly maintained, and more. Therefore, given the benefit to the
>> community, I support working towards that foundation that will allow
>> us to build solutions in Cassandra that pay consensus closer to
>> mutation instead of lazily at read/repair time.
>> 
>> I think the feedback in this thread around interface (what statements
>> will this facilitate and how will the library integrate with Cassandra
>> itself), performance (how fast will these transactions be, will we
>> offer bounded stale reads, etc ...), and implementation (how does this
>> compare/contrast with other consensus approaches) has been
>> informative, but at this point I think it makes sense to start trying
>> to make incremental progress towards a functional integration to
>> discover any remaining areas for improvement.
>> 
>> Cheers and thank you!
>> -Joey
>> 
>> 
>> 
>> On Thu, Oct 7, 2021 at 10:51 AM C. Scott Andreas <sc...@paradoxica.net>
>> wrote:
>>> 
>>> Hi Jonathan,
>>> 
>>> Following up on my message yesterday as it looks like our replies may
>> have crossed en route.
>>> 
>>> Thanks for bumping your message from earlier in our discussion. I
>> believe we have addressed most of these questions on the thread, in
>> addition to offering a presentation on this and related work at ApacheCon,
>> a discussion hosted following that presentation at ApacheCon, and in ASF
>> Slack. Contributors have further offered an opportunity to discuss
>> specific questions via videoconference if it helps to speak live. I'd be
>> happy to do so as well.
>>> 
>>> Since your original message, discussion has covered a lot of ground on
>> the related databases you've mentioned:
>>> – Henrik has shared expertise related to MongoDB and its implementation.
>>> – You've shared an overview of Calvin.
>>> – Alex Miller has helped us review the work relative to other Paxos
>> algorithms and identified a few great enhancements to incorporate.
>>> – The paper discusses related approaches in FoundationDB, CockroachDB,
>> and Yugabyte.
>>> – Subsequent discussion has contrasted the implementation to DynamoDB,
>> Google Cloud BigTable, and Google Cloud Spanner (noting specifically that
>> the protocol achieves Spanner's 1x round-trip without requiring specialized
>> hardware).
>>> 
>>> In my reply yesterday, I've attempted to crystallize what becomes
>> possible via CQL: one-shot multi-partition transactions in the first
>> implementation and a 4x latency reduction on writes / 2x latency reduction
>> on reads relative to today; along with the ability to build upon this work
>> to enable interactive transactions in the future.
>>> 
>>> I believe we've exercised the questions you've raised and am grateful
>> for the ground we've covered. If you have further questions that are
>> difficult to exercise via email, please let me know if you'd like to
>> arrange a call (open-invite); we'd be happy to discuss live as well.
>>> 
>>> With the proposal hitting the one-month mark, the contributors are
>> interested in gauging the developer community's response to the proposal.
>> We warrant our ability to focus durably on the project; execute this
>> development on ASF JIRA in collaboration with other contributors; engage
>> with members of the developer and user community on feedback, enhancements,
>> and bugs; and intend to deliver it to completion at a standard of readiness
>> suitable for production transactional systems of record.
>>> 
>>> Thanks,
>>> 
>>> – Scott
>>> 
>>> On Oct 6, 2021, at 8:25 AM, C. Scott Andreas <sc...@paradoxica.net>
>> wrote:
>>> 
>>> 
>>> 
>>> Hi folks,
>>> 
>>> Thanks for discussion on this proposal, and also to Benedict who’s been
>> fielding questions on the list!
>>> 
>>> I’d like to restate the goals and problem statement captured by this
>> proposal and frame context.
>>> 
>>> Today, lightweight transactions limit users to transacting over a single
>> partition. This unit of atomicity has a very low upper limit in terms of
>> the amount of data that can be CAS’d over; and doing so leads many to
>> design contorted data models to cram different types of data into one
>> partition for the purposes of being able to CAS over it. We propose that
>> Cassandra can and should be extended to remove this limit, enabling users
>> to issue one-shot transactions that CAS over multiple keys – including CAS
>> batches, which may modify multiple keys.
>>> 
>>> To enable this, the CEP authors have designed a novel, leaderless
>> paxos-based protocol unique to Cassandra, offered a proof of its
>> correctness, a whitepaper outlining it in detail, along with a prototype
>> implementation to incubate development, and integrated it with Maelstrom
>> from jepsen.io to validate linearizability as more specific test
>> infrastructure is developed. This rigor is remarkable, and I’m thrilled to
>> see such a degree of investment in the area.
>>> 
>>> Even users who do not require the capability to transact across
>> partition boundaries will benefit. The protocol reduces message/WAN
>> round-trips by 4x on writes (4 → 1) and 2x on reads (2 → 1) in the common
>> case against today’s baseline. These latency improvements coupled with the
>> enhanced flexibility of what can be transacted over in Cassandra enable new
>> classes of applications to use the database.
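
To make the round-trip claim above concrete, here is a minimal sketch of the arithmetic. It assumes latency is dominated by WAN round trips; the 80 ms round-trip time and the function name are illustrative assumptions, not figures from the thread or the CEP.

```python
# Minimal sketch: commit latency modelled as (round trips) x (WAN round-trip time).
# RTT_MS is an assumed value for illustration only; it does not come from the thread.

def commit_latency_ms(round_trips: int, rtt_ms: float) -> float:
    """Latency of an operation that needs `round_trips` sequential WAN round trips."""
    return round_trips * rtt_ms

RTT_MS = 80.0  # hypothetical cross-region round-trip time

# Round-trip counts quoted in the message: writes 4 -> 1, reads 2 -> 1.
baseline_write = commit_latency_ms(4, RTT_MS)
proposed_write = commit_latency_ms(1, RTT_MS)
baseline_read = commit_latency_ms(2, RTT_MS)
proposed_read = commit_latency_ms(1, RTT_MS)

write_speedup = baseline_write / proposed_write
read_speedup = baseline_read / proposed_read

print(f"write: {baseline_write:.0f} ms -> {proposed_write:.0f} ms ({write_speedup:.0f}x)")
print(f"read:  {baseline_read:.0f} ms -> {proposed_read:.0f} ms ({read_speedup:.0f}x)")
```

The ratios hold for any round-trip time under this model; only the absolute numbers depend on the assumed RTT.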
>>> 
>>> In particular, 1xRTT read/write transactions across partitions enable
>> Cassandra to be thought of not just as a strongly consistent database, but
>> even a transactional database - a mode many may even prefer to use by
>> default. Given this capability, Apache Cassandra has an opportunity to
>> become one of – or perhaps the only – database in the industry that can
>> store multiple petabytes of data in a single database; replicate it across
>> many regions; and allow users to transact over any subset of it. These are
>> capabilities that can be met by no other system I’m aware of on the market.
>> Dynamo’s transactions are single-DC. Google Cloud BigTable does not support
>> transactions. Spanner, Aurora, CloudSQL, and RDS have far lower scalability
>> limits or require specialized hardware, etc.
>>> 
>>> This is an incredible opportunity for Apache Cassandra - to surpass the
>> scalability and transactional capability of some of the most advanced
>> systems in our industry - and to do so in open source, where anyone can
>> download and deploy the software to achieve this without cost; and for
>> students and researchers to learn from and build upon as well (a team from
>> UT-Austin has already reached out to this effect).
>>> 
>>> As Benedict and Blake noted, the scope of what’s captured in this
>> proposal is also not terminal. While the first implementation may extend
>> today’s CAS semantics to multiple partitions with lower latency, the
>> foundation is suitable to build interactive transactions as well — which
>> would be remarkable and is something that I hadn’t considered myself at the
>> onset of this project.
>>> 
>>> To that end, the CEP proposes the protocol, offers a validated
>> implementation, and the initial capability of extending today’s
>> single-partition transactions to multi-partition; while providing the
>> flexibility to build upon this work further.
>>> 
>>> A simple example of what becomes possible when this work lands and is
>> integrated might be:
>>> 
>>> –––
>>> BEGIN BATCH
>>> UPDATE tbl1 SET value1 = newValue1 WHERE partitionKey = k1
>>> UPDATE tbl2 SET value2 = newValue2 WHERE partitionKey = k2 AND
>> conditionValue = someCondition
>>> APPLY BATCH
>>> –––
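
The all-or-nothing behaviour such a batch is meant to provide can be sketched with a single-process, in-memory model. This is only an illustration of the semantics, not of how the protocol achieves them across replicas; the function, type aliases, and sample values are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory model: (table, partition key) -> row of column values.
Key = Tuple[str, str]
Store = Dict[Key, Dict[str, object]]
Condition = Tuple[str, str, str, object]  # (table, key, column, expected value)
Update = Tuple[str, str, str, object]     # (table, key, column, new value)

def apply_conditional_batch(store: Store,
                            conditions: List[Condition],
                            updates: List[Update]) -> bool:
    """Apply every update only if every condition holds; otherwise apply none."""
    for table, key, column, expected in conditions:
        if store.get((table, key), {}).get(column) != expected:
            return False  # one failed condition rejects the whole batch
    for table, key, column, value in updates:
        store.setdefault((table, key), {})[column] = value
    return True

store: Store = {
    ("tbl1", "k1"): {"value1": "old1"},
    ("tbl2", "k2"): {"value2": "old2", "conditionValue": "someCondition"},
}

# Mirrors the batch above: two partitions updated, one condition on the second.
applied = apply_conditional_batch(
    store,
    conditions=[("tbl2", "k2", "conditionValue", "someCondition")],
    updates=[("tbl1", "k1", "value1", "newValue1"),
             ("tbl2", "k2", "value2", "newValue2")],
)

# A batch whose condition does not hold leaves both partitions untouched.
rejected = apply_conditional_batch(
    store,
    conditions=[("tbl2", "k2", "conditionValue", "otherCondition")],
    updates=[("tbl1", "k1", "value1", "ignored")],
)

print(applied, rejected, store[("tbl1", "k1")]["value1"])
```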
>>> 
>>> I understand that this query is present in the CEP and my intent isn’t
>> to recommend that folks reread it if they’ve given a careful reading
>> already. But I do think it’s important to elaborate upon what becomes
>> possible when this query can be issued.
>>> 
>>> Users of Cassandra who have designed data models that cram many types of
>> data into a single partition for the purposes of atomicity no longer need
>> to. They can design their applications with appropriate schemas that
>> wouldn’t leave Codd holding his nose. They’re no longer pushed into
>> antipatterns that result in these partitions becoming huge and potentially
>> unreadable. Cassandra doesn’t become fully relational in this CEP - but it
>> becomes possible and even easy to design applications that transact across
>> tables that mimic a large amount of relational functionality. And for users
>> who are content to transact over a single table, they’ll find those
>> transactions become up to 4x faster than today due to the protocol’s reduction
>> in round-trips. The library’s loose coupling to Apache Cassandra and
>> ability to be incubated out-of-tree also enables other applications to take
>> advantage of the protocol and is a nice step toward bringing modularity to
>> the project. There are a lot of good things happening here.
>>> 
>>> I know I’m listed as an author - but figured I should go on record to
>> say “I support this CEP.” :)
>>> 
>>> Thanks,
>>> 
>>> – Scott
>>> 
>>> On Oct 6, 2021, at 8:05 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>> 
>>> 
>>> The problem that I keep pointing out is that you've created this CEP for
>>> Accord without first getting consensus that the goals and the tradeoffs
>> it
>>> makes to achieve those goals (and that it will impose on future work
>> around
>>> transactions) are the right ones for Cassandra long term.
>>> 
>>> At this point I'm done repeating myself. For the convenience of anyone
>>> following this thread intermittently, I'll quote my first reply on this
>>> thread to illustrate the kind of discussion I'd like to have.
>>> 
>>> -----
>>> 
>>> The whitepaper here is a good description of the consensus algorithm
>> itself
>>> as well as its robustness and stability characteristics, and its
>> comparison
>>> with other state-of-the-art consensus algorithms is very useful. In the
>>> context of Cassandra, where a consensus algorithm is only part of what
>> will
>>> be implemented, I'd like to see a more complete evaluation of the
>>> transactional side of things as well, including performance
>> characteristics
>>> as well as the types of transactions that can be supported and at least a
>>> general idea of what it would look like applied to Cassandra. This will
>>> allow the PMC to make a more informed decision about what tradeoffs are
>>> best for the entire long-term project of first supplementing and
>> ultimately
>>> replacing LWT.
>>> 
>>> (Allowing users to mix LWT and AP Cassandra operations against the same
>>> rows was probably a mistake, so in contrast with LWT we’re not looking
>> for
>>> something fast enough for occasional use but rather something within a
>>> reasonable factor of AP operations, appropriate to being the only way to
>>> interact with tables declared as such.)
>>> 
>>> Besides Accord, this should cover
>>> 
>>> - Calvin and FaunaDB
>>> - A Spanner derivative (no opinion on whether that should be Cockroach or
>>> Yugabyte, I don’t think it’s necessary to cover both)
>>> - A 2PC implementation (the Accord paper mentions DynamoDB but I suspect
>>> there is more public information about MongoDB)
>>> - RAMP
>>> 
>>> Here’s an example of what I mean:
>>> 
>>> =Calvin=
>>> 
>>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to order
>>> transactions, then replicas execute the transactions independently with
>> no
>>> further coordination. No SPOF. Transactions are batched by each sequencer
>>> to keep this from becoming a bottleneck.
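
The "order first, then execute independently" idea described above can be sketched as follows. This is a toy model under stated assumptions: the transaction shape (add a delta to a key) and all names are hypothetical, and the consensus step that produces the ordered log is taken as given.

```python
from typing import Dict, List, Tuple

# Hypothetical transaction: add `delta` to the integer stored under `key`.
Txn = Tuple[str, int]

def execute(log: List[Txn]) -> Dict[str, int]:
    """Deterministically replay an already-ordered transaction log."""
    state: Dict[str, int] = {}
    for key, delta in log:
        state[key] = state.get(key, 0) + delta
    return state

# Assume global consensus has already agreed on this order.
ordered_log: List[Txn] = [("a", 5), ("b", 2), ("a", -3)]

# Each replica replays the same log with no further coordination.
replica_1 = execute(ordered_log)
replica_2 = execute(list(ordered_log))

print(replica_1 == replica_2, replica_1)
```

Because execution is deterministic and the order is fixed up front, the replicas converge on the same state without talking to each other, which is also why the full transaction must be known in advance.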
>>> 
>>> Performance: Calvin paper (published 2012) reports linear scaling of
>> TPC-C
>>> New Order up to 500,000 transactions/s on 100 machines (EC2 XL machines
>>> with 7GB ram and 8 virtual cores). Note that TPC-C New Order is composed
>>> of four reads and four writes, so this is effectively 2M reads and 2M
>>> writes as we normally measure them in C*.
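
The conversion from transactions to individual operations in the paragraph above works out as follows. The input figures are the ones quoted from the Calvin paper in this message; the per-machine number is derived here under the stated linear-scaling assumption.

```python
# Figures quoted in the message above.
TXN_PER_SEC = 500_000   # TPC-C New Order transactions per second
MACHINES = 100
READS_PER_TXN = 4
WRITES_PER_TXN = 4

reads_per_sec = TXN_PER_SEC * READS_PER_TXN
writes_per_sec = TXN_PER_SEC * WRITES_PER_TXN

# Derived here, assuming the linear scaling reported in the paper.
txn_per_machine = TXN_PER_SEC // MACHINES

print(f"{reads_per_sec:,} reads/s, {writes_per_sec:,} writes/s, "
      f"{txn_per_machine:,} txn/s per machine")
```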
>>> 
>>> Calvin supports mixed read/write transactions, but because the
>> transaction
>>> execution logic requires knowing all partition keys in advance to ensure
>>> that all replicas can reproduce the same results with no coordination,
>>> reads against non-PK predicates must be done ahead of time
>> (transparently,
>>> by the server) to determine the set of keys, and this must be retried if
>>> the set of rows affected is updated before the actual transaction
>> executes.
>>> 
>>> Batching and global consensus add latency -- 100ms in the Calvin paper
>> and
>>> apparently about 50ms in FaunaDB. Glass half full: all transactions
>>> (including multi-partition updates) are equally performant in Calvin
>> since
>>> the coordination is handled up front in the sequencing step. Glass half
>>> empty: even single-row reads and writes have to pay the full coordination
>>> cost. Fauna has optimized this away for reads but I am not aware of a
>>> description of how they changed the design to allow this.
>>> 
>>> Functionality and limitations: since the entire transaction must be known
>>> in advance to allow coordination-less execution at the replicas, Calvin
>>> cannot support interactive transactions at all. FaunaDB mitigates this by
>>> allowing server-side logic to be included, but a Calvin approach will
>> never
>>> be able to offer SQL compatibility.
>>> 
>>> Guarantees: Calvin transactions are strictly serializable. There is no
>>> additional complexity or performance hit to generalizing to multiple
>>> regions, apart from the speed of light. And since Calvin is already
>> paying
>>> a batching latency penalty, this is less painful than for other systems.
>>> 
>>> Application to Cassandra: B-. Distributed transactions are handled by the
>>> sequencing and scheduling layers, which are leaderless, and Calvin’s
>>> requirements for the storage layer are easily met by C*. But Calvin also
>>> requires a global consensus protocol and LWT is almost certainly not
>>> sufficiently performant, so this would require ZK or etcd (reasonable
>> for a
>>> library approach but not for replacing LWT in C* itself), or an
>>> implementation of Accord. I don’t believe Calvin would require additional
>>> table-level metadata in Cassandra.
>>> 
>>> On Wed, Oct 6, 2021 at 9:53 AM bened...@apache.org <bened...@apache.org>
>>> wrote:
>>> 
>>> The problem with dropping a patch on Jira is that there is no opportunity
>>> to point out problems, either with the fundamental approach or with the
>>> specific implementation. So please point out some problems I can engage
>>> with!
>>> 
>>> 
>>> From: Jonathan Ellis <jbel...@gmail.com>
>>> Date: Wednesday, 6 October 2021 at 15:48
>>> To: dev <dev@cassandra.apache.org>
>>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
>>> On Wed, Oct 6, 2021 at 9:21 AM bened...@apache.org <bened...@apache.org>
>>> wrote:
>>> 
>>>> The goals of the CEP are stated clearly, and these were the goals we
>> had
>>>> going into the (multi-month) research project we undertook before
>>> proposing
>>>> this CEP. These goals are necessarily value judgements, so we cannot
>>> expect
>>>> that everyone will agree that they are optimal.
>>>> 
>>> 
>>> Right, so I'm saying that this is exactly the most important thing to get
>>> consensus on, and creating a CEP for a protocol to achieve goals that you
>>> have not discussed with the community is the CEP equivalent of dropping a
>>> patch on Jira without discussing its goals either.
>>> 
>>> That's why our conversations haven't gone anywhere, because I keep saying
>>> "we need to discuss the goals and tradeoffs", and I'll give an example of
>> what
>>> I mean, and you keep addressing the examples (sometimes very shallowly,
>> "it
>>> would be possible to X" or "Y could be done as an optimization") while
>>> ignoring the request to open a discussion around the big picture.
>>> 
>>> 
>>> 
>>> --
>>> Jonathan Ellis
>>> co-founder, http://www.datastax.com
>>> @spyced
>>> 
>>> 
>>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: dev-h...@cassandra.apache.org
>> 
>> 
> 
> --
> alex p

