Re: [DISCUSS] CEP-15: General Purpose Transactions

2021-09-22 Thread bened...@apache.org
Sure, that works for me.

From: Patrick McFadin 
Date: Wednesday, 22 September 2021 at 04:47
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
I would be happy to host a Zoom as I've done in the past. I can post a
transcript and the recording after the call.

Instead of right after your talk, Benedict, maybe we can set a time for next
week and let everyone know?

Patrick

On Mon, Sep 20, 2021 at 11:05 AM bened...@apache.org 
wrote:

> Hi Joey,
>
> Thanks for the feedback and suggestions.
>
> > I was wondering what do you think about having some extended Q&A after
> your ApacheCon talk Wednesday
>
> I would love to do this. I’ll have to figure out how though – my
> understanding is that I have a hard 40m for my talk and any Q&A, and I
> expect the talk to occupy most of those 40m as I try to cover both
> CEP-14 and CEP-15. I’m not sure what facilities are made available by
> Hopin, but if necessary we can perhaps post an external video chat link?
>
> The time of day is also a question, as I think the last talk ends at
> 9:20pm local time. But we can make that work if necessary.
>
> > It might help to have a diagram (perhaps I can collaborate with you
> on this?)
>
> I absolutely agree. This is something I had planned to produce but it’s
> been a question of time. In part I wanted to ensure we published long in
> advance of ApacheCon, but now also with CEP-10, CEP-14 and CEP-15 in flight
> it’s hard to get back to improving the draft. If you’d be interested in
> collaborating on this, that would be super appreciated, as it would
> certainly help the reader.
>
> >I think that WAN is always paid during the Consensus Protocol, and then
> in most cases execution can remain LAN except in 3+ datacenters where I
> think you'd have to include at least one replica in a neighboring
> datacenter…
>
> As designed, the only WAN cost is consensus, as Accord ensures every replica
> receives a complete copy of every transaction and is aware of any gaps. If
> there are gaps there may be WAN delays as those are filled in. This might
> occur because of network outages, but is most likely to occur when
> transactions are being actively executed by multiple DCs at once – in which
> case there’ll be one further unidirectional WAN latency during execution
> while the earlier transaction disseminates its result to the later
> transaction(s). There are other similar scenarios we can discuss, e.g. if a
> transaction takes the slow path and will execute after a transaction being
> executed in another DC, that remote transaction needs to receive this
> notification before executing.
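
[Editorial illustration: a toy model of the dependency wait described above –
not Accord’s actual code, and all names here are hypothetical. A transaction
executes only once the results of all its dependencies are known locally, so a
dependency executed in a remote DC contributes the one unidirectional WAN
latency mentioned above.]

    import java.util.*;
    import java.util.concurrent.*;

    // Toy model (not Accord's implementation): a transaction executes only
    // once the results of all its dependencies have been received locally.
    public class DependencyWait {
        // Hypothetical registry: txn id -> future completed when its result arrives.
        static final Map<String, CompletableFuture<String>> results = new ConcurrentHashMap<>();

        static CompletableFuture<String> resultOf(String txnId) {
            return results.computeIfAbsent(txnId, id -> new CompletableFuture<>());
        }

        // Execute txn once every dependency's result is known locally.
        static CompletableFuture<String> execute(String txnId, List<String> deps) {
            CompletableFuture<?>[] waits = deps.stream().map(DependencyWait::resultOf)
                                               .toArray(CompletableFuture[]::new);
            return CompletableFuture.allOf(waits)
                    .thenApply(v -> txnId + " executed after " + deps);
        }

        public static void main(String[] args) throws Exception {
            CompletableFuture<String> b = execute("B", List.of("A")); // B depends on A
            // Simulate A's result arriving from a remote DC (the WAN hop above).
            resultOf("A").complete("A's result");
            System.out.println(b.get()); // B executed after [A]
        }
    }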
>
> There are potentially some interesting optimisations to make in
> future, where with many queued transactions a single DC may nominate itself
> to execute all outstanding queries and respond to the remote DCs that
> issued them, so as to eliminate the WAN latency for disseminating the result
> of each transaction. But we’re getting way ahead of ourselves there 😊
>
> There’s also no LAN cost on write, at least for responding to the client.
> If there is a dependent transaction within the same DC then (as in the
> above case) there will be a LAN penalty for the second transaction to
> execute.
>
> > Relatedly I'm curious if there is any way that the client can
> acquire the timestamp used by the transaction before sending the data
> so we can make the operations idempotent and unrelated to the
> coordinator that was executing them as the storage nodes are
> vulnerable to disk and heap failure modes which makes them much more
> likely to enter grey failure (slow). Alternatively, perhaps it would
> make sense to introduce a set of optional dedicated C* nodes for
> reaching consensus that do not act as storage nodes so we don't have
> to worry about hanging coordinators (join_ring=false?)?
>
> So, in principle coordination can be performed by any node on the network,
> including a client – though we’d need to issue the client a unique id, this
> can be done cheaply on joining. This might be something to explore in
> future, though there are downsides to having more coordinators too (they are
> more likely to fail, stalling further transactions that depend on those
> they are coordinating).
>
> However, with respect to idempotency, I expect Accord not to perpetuate
> the problems of LWTs, where the result of an earlier query is unknown. At
> least success/fail will be maintained in a distributed fashion for some
> reasonable time horizon, and there will also be protection against zombie
> transactions (those proposed to a node that went into a failure spiral
> before reaching healthy nodes, and that somehow regurgitates them hours or
> days later), so we should be able to provide practical precisely-once
> semantics to clients.
>
> Whether this is done with a client-provided timestamp, or simply some
> other arbitrary client-provided id that can be utilised to deduplicate
> requests or query the status of a transaction…
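
[Editorial illustration: a minimal sketch of the kind of client-id
deduplication being discussed – all names are hypothetical, and a real system
would bound retention to the “reasonable time horizon” above rather than use
an unbounded map.]

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Supplier;

    // Illustrative only: submission becomes idempotent by remembering the
    // outcome recorded for each client-provided transaction id.
    public class TxnDedup {
        enum Outcome { SUCCESS, FAILURE }
        private final Map<String, Outcome> outcomes = new ConcurrentHashMap<>();

        // Re-submitting the same id returns the recorded outcome instead of
        // re-executing, so a retried request is applied precisely once.
        Outcome submit(String clientTxnId, Supplier<Outcome> execute) {
            return outcomes.computeIfAbsent(clientTxnId, id -> execute.get());
        }

        public static void main(String[] args) {
            TxnDedup dedup = new TxnDedup();
            Outcome first = dedup.submit("client-42:txn-7", () -> Outcome.SUCCESS);
            Outcome retry = dedup.submit("client-42:txn-7", () -> Outcome.FAILURE); // not executed
            System.out.println(first + " / " + retry); // SUCCESS / SUCCESS
        }
    }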

Re: [DISCUSS] CEP-15: General Purpose Transactions

2021-09-22 Thread bened...@apache.org
FWIW I retract this – looking again at the blog post I don’t see adequate 
reason to infer they are using a leaderless approach. On balance I expect Fauna 
is still using a stable leader. Do you have reason to believe they are now 
leaderless?

From: bened...@apache.org 
Date: Wednesday, 22 September 2021 at 04:19
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
Demonstrating how subtle, complex and difficult to pin down this topic is,
Fauna’s recent blog post implies they may have migrated to a leaderless
sequencing protocol (an earlier blog post made clear they used a leader
process). However, Calvin still assumes a global sequencing shard, so this only
modifies latency for clients, i.e. goal (3). Whether they have also removed
Calvin’s single-shard linearization of transactions is unclear; there is no
public information to suggest that they have met goal (1). With this, the
protocol would in essence begin to look a lot like Accord, and perhaps they are
moving towards a similar approach.


From: bened...@apache.org 
Date: Wednesday, 22 September 2021 at 03:52
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
Hi Jonathan,

These other systems are incompatible with the goals of the CEP. I do discuss 
them (besides 2PC) in both the whitepaper and the CEP, and will summarise that 
discussion below. A true and accurate comparison of these other systems is 
essentially intractable, as there are complex subtleties to each flavour, and 
those who are interested would be better served by performing their own 
research.

I think it is more productive to focus on what we want to achieve as a 
community. If you believe the goals of this CEP are wrong for the project, 
let’s focus on that. If you want to compare and contrast specific facets of 
alternative systems that you consider to be preferable in some dimension, let’s 
do that here or in a Q&A as proposed by Joey.

The relevant goals are that we:

  1.  Guarantee strict serializable isolation on commodity hardware
  2.  Scale to any cluster size
  3.  Achieve optimal latency

The approach taken by Spanner derivatives is ruled out by goal (1), because they
guarantee only serializable isolation (they additionally fail goal (3)). Judging
from talks by YugaByte, and from Cockroach’s panic-cluster-death under clock
skew, this trade-off is clearly considered undesirable by everyone, but
necessary to achieve scalability.

The approach taken by FaunaDB (Calvin) is ruled out by goal (2), because its
sequencing layer requires a global leader process for the cluster, which is
incompatible with Cassandra’s scalability requirements. It additionally fails
goal (3) for global clients.

Two-phase commit fails goal (3). As an aside, AFAICT DynamoDB is today a
Spanner clone for its multi-key transaction functionality, not 2PC.

Systems such as RAMP with even weaker isolation are not considered, for the
simple reason that they do not even claim to meet goal (1).

If we want to additionally offer weaker isolation levels than Serializable, 
such as that provided by the recent RAMP-TAO paper, Cassandra is likely able to 
support multiple distinct transaction layers that operate independently. I 
would encourage you to file a CEP to explore how we can meet these distinct use 
cases, but I consider them to be niche. I expect that a majority of our user 
base desire strict serializable isolation, and certainly no less than 
serializable isolation, to augment the existing weaker isolation offered by 
quorum reads and writes.

I would tangentially note that we are not an AP database under normal 
recommended operation. A minority in any network partition cannot reach QUORUM, 
so under recommended usage we are a high-availability leaderless CP database.


From: Jonathan Ellis 
Date: Tuesday, 21 September 2021 at 23:45
To: dev 
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
Benedict, thanks for taking the lead in putting this together. Since
Cassandra is the only relevant database today designed around a leaderless
architecture, it's quite likely that we'll be better served with a custom
transaction design instead of trying to retrofit one from CP systems.

The whitepaper here is a good description of the consensus algorithm itself
as well as its robustness and stability characteristics, and its comparison
with other state-of-the-art consensus algorithms is very useful.  In the
context of Cassandra, where a consensus algorithm is only part of what will
be implemented, I'd like to see a more complete evaluation of the
transactional side of things as well, including performance characteristics
as well as the types of transactions that can be supported and at least a
general idea of what it would look like applied to Cassandra. This will
allow the PMC to make a more informed decision about what tradeoffs are
best for the entire long-term project of first supplementing and ultimately
replacing LWT.

(Allowing users to mix LWT and AP Cassandra operations against the same
rows was probably a mistake, so in contrast with LWT we’re not looking for
something fast enough for occasional use but rather something within a
reasonable factor of AP operations, appropriate to being the only way to
interact with tables declared as such.)

Besides Accord, this should cover

- Calvin and FaunaDB
- A Spanner derivative (no opinion on whether that should be Cockroach or
Yugabyte, I don’t think it’s necessary to cover both)
- A 2PC implementation (the Accord paper mentions DynamoDB but I suspect
there is more public information about MongoDB)
- RAMP

Re: [DISCUSS] CEP-15: General Purpose Transactions

2021-09-22 Thread Henrik Ingo
I feel like I should volunteer to write about MongoDB transactions.

TL;DR: Snapshot Isolation and Causal Consistency using Raft-ish replication, a
Lamport clock and 2PC. This leads to the age-old discussion of whether users
really want serializability or not.


On Wed, Sep 22, 2021 at 1:44 AM Jonathan Ellis  wrote:

> The whitepaper here is a good description of the consensus algorithm itself
> as well as its robustness and stability characteristics, and its comparison
> with other state-of-the-art consensus algorithms is very useful.  In the
> context of Cassandra, where a consensus algorithm is only part of what will
> be implemented, I'd like to see a more complete evaluation of the
> transactional side of things as well, including performance characteristics,
> the types of transactions that can be supported, and at least a general idea
> of what it would look like applied to Cassandra. This will allow the PMC to
> make a more informed decision about what tradeoffs are
> best for the entire long-term project of first supplementing and ultimately
> replacing LWT.
>
> (Allowing users to mix LWT and AP Cassandra operations against the same
> rows was probably a mistake, so in contrast with LWT we’re not looking for
> something fast enough for occasional use but rather something within a
> reasonable factor of AP operations, appropriate to being the only way to
> interact with tables declared as such.)
>
> Besides Accord, this should cover
>
> - Calvin and FaunaDB
> - A Spanner derivative (no opinion on whether that should be Cockroach or
> Yugabyte, I don’t think it’s necessary to cover both)
> - A 2PC implementation (the Accord paper mentions DynamoDB but I suspect
> there is more public information about MongoDB)
> - RAMP
>
>
=MongoDB=

References:
Presentation: https://www.youtube.com/watch?v=quFheFrLLGQ
Slides:
http://henrikingo.github.io/presentations/HighLoad%202019%20-%20Distributed%20transactions%20top%20to%20bottom/index.html#/step-1
Lamport implementation:
http://delivery.acm.org/10.1145/332/3314049/p636-tyulenev.pdf
Replication: http://www.vldb.org/pvldb/vol12/p2071-schultz.pdf and
https://www.usenix.org/system/files/nsdi21-zhou.pdf
TPC-C benchmark: http://www.vldb.org/pvldb/vol12/p2254-kamsky.pdf
(Nothing published on cross-shard transactions...)

Approach and Guarantees: Shards are independent replica sets; multi-shard
transactions and queries are handled through a coordinator, aka query router.
Replica sets are Raft-like, so leader-based. When using 2PC, the 2PC
coordinator is itself a replica set, so that the coordinator state is made
durable via majority commits. This means that a cross-shard transaction
actually needs 4 majority commits, though it would be possible to reduce the
latency to client ack to 2 commits (https://jira.mongodb.org/browse/SERVER-47130).
Because of this, the transaction coordinator is also its own recovery manager,
and it is assumed that the replica set will always be able to recover from
failures, usually quickly.
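
[Editorial illustration: a toy walk-through, in code, of one reading of those
4 majority commits as four sequential durable rounds. The contents of each
round are an assumption, not something the references spell out in this form.]

    import java.util.List;

    // Toy model of a cross-shard commit as four sequential majority-commit
    // rounds; the per-round breakdown is an assumption, not documented fact.
    public class CrossShardCommitRounds {
        static int round = 0;

        static void majorityCommitRound(String what) {
            System.out.printf("round %d (majority commit): %s%n", ++round, what);
        }

        public static void main(String[] args) {
            List<String> shards = List.of("shard-A", "shard-B");
            majorityCommitRound("coordinator replica set durably records the participant list");
            majorityCommitRound("each participant durably prepares: " + shards);
            majorityCommitRound("coordinator durably records the commit decision"
                    + " (per SERVER-47130, the client could be acked here)");
            majorityCommitRound("each participant durably commits: " + shards);
        }
    }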

Cluster time is a Lamport clock; in practice it is implemented by generating
monotonically increasing integers from a unix timestamp plus a counter. The
timestamp is passed along with each message, and each recipient updates its
own cluster time to the higher timestamp. All nodes, including clients,
participate in this. Causal consistency is basically a client asking to read
at or later than its current timestamp; a replica will block if needed to
satisfy this request. The Lamport clock is incremented by leaders to ensure
progress in the absence of write transactions.
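
[Editorial illustration: a minimal sketch of the mechanism just described –
my own simplification, not MongoDB’s implementation; e.g. packing seconds and
counter into a single long is an assumption made for brevity.]

    import java.util.concurrent.atomic.AtomicLong;

    // Sketch of a Lamport clock built from (unixSeconds, counter).
    public class ClusterTime {
        // High 32 bits: unix seconds; low 32 bits: counter.
        private final AtomicLong current = new AtomicLong();

        private static long pack(long seconds, long counter) {
            return (seconds << 32) | counter;
        }

        // Advance for a local event: take the wall clock if it moved forward,
        // otherwise bump the counter so the integer stays monotonic.
        public long tick() {
            long wall = pack(System.currentTimeMillis() / 1000, 0);
            return current.updateAndGet(prev -> Math.max(wall, prev + 1));
        }

        // On receiving a message, adopt the higher timestamp (every node,
        // including clients, does this).
        public long observe(long remote) {
            return current.updateAndGet(prev -> Math.max(prev, remote));
        }

        // Causal read: block until this replica has caught up to the
        // client's last-seen cluster time.
        public void waitUntil(long target) throws InterruptedException {
            while (current.get() < target) Thread.sleep(1);
        }

        public static void main(String[] args) {
            ClusterTime node = new ClusterTime();
            long t1 = node.tick();          // local write event
            long t2 = node.observe(t1 + 5); // message from a node that is ahead
            System.out.println(t2 > t1);    // true: we adopted the higher time
        }
    }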

The storage engine provides MVCC semantics. Extending this to the
replication system is straightforward, since replicas apply transactions
serially in the same order. For cross-shard transactions it’s the job of
the transaction coordinator to commit the transaction with the same cluster
time on all shards. If I remember correctly, in the 2PC commit phase it simply
chooses the highest timestamp returned by the shards as the global transaction
timestamp. Combined, MongoDB transactions provide snapshot isolation + causal
consistency.
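
[Editorial illustration: a sketch of that commit-timestamp selection, under
the assumption – hedged above – that the coordinator picks the highest prepare
timestamp the shards returned. Names are illustrative.]

    import java.util.List;

    // Sketch: commit with one cluster time on all shards, taken as the
    // highest timestamp returned from prepare.
    public class CommitTimestamp {
        static long chooseCommitTimestamp(List<Long> prepareTimestamps) {
            // Picking the max keeps the commit point >= every shard's prepare
            // point, so all shards expose the txn at one consistent snapshot.
            return prepareTimestamps.stream()
                    .mapToLong(Long::longValue).max().getAsLong();
        }

        public static void main(String[] args) {
            System.out.println(chooseCommitTimestamp(List.of(101L, 99L, 105L))); // 105
        }
    }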


Performance: 2PC is used only if a transaction actually has multiple
participating shards. It is possible, though not fun or realistic, to specify
partition boundaries so that related records from two collections will
always reside on the same shard. The 2PC protocol actually requires 4
majority commits, although as of MongoDB 5.0 the client only waits for 3.
Majority commit is exactly what QUORUM is in Cassandra, so in a multi-DC
cluster, commit waits for replication latency. Notably, single-shard
transactions parallelize well, because conflicting transactions can execute
on the leader even when the majority commit isn’t yet finished. (This
involves some speculative execution optimization.) I don’t believe the same
is true for cross-shard transactions using 2PC.

The paper by Asya Kamsky uses a single replica set and reports 60-70k TPM
for a non-standard TPC-C where varying the number of client threads was
allowed and the schema was modified to ta

Re: [DISCUSS] CEP-15: General Purpose Transactions

2021-09-22 Thread Henrik Ingo
On Wed, Sep 22, 2021 at 7:56 AM bened...@apache.org 
wrote:

> Could you explain why you believe this trade-off is necessary? We can
> support full SQL just fine with Accord, and I hope that we eventually do so.
>

I assume this is really referring to interactive transactions = multiple
round trips to the client within a transaction.

You mentioned previously that we could later build a more MVCC-like
transaction semantic on top of Accord. (Independent reads from a single
snapshot, followed by a commit using Accord.) In this case I think the
relevant discussion is whether Accord is still the optimal building block
performance-wise to do so, or whether users would then have a lower
consistency level but still pay the performance cost of a stricter
consistency level.

henrik
-- 

Henrik Ingo

+358 40 569 7354


Re: [DISCUSS] CEP-15: General Purpose Transactions

2021-09-22 Thread bened...@apache.org
No, I would expect to deliver strict serializable interactive transactions
using Accord. These would simply corroborate that the participating keys had
not modified their write timestamps during the final transaction. These could
even be undertaken with still only a single wide-area round-trip, using local
copies of the data to assemble the transaction (though this would marginally
increase the chance of aborts).

My goal for MVCC is parallelism, not additional isolation levels (though
snapshot isolation is useful and we’ll probably also want to offer that
eventually).
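
[Editorial illustration: one way to read “corroborate that the participating
keys had not modified their write timestamps” – a hypothetical
optimistic-validation sketch, not Accord’s actual design.]

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Objects;
    import java.util.Set;

    // Sketch: the transaction records the write timestamp of every
    // participating key while it is assembled (possibly from local copies of
    // the data), and the final commit aborts if any timestamp changed.
    public class TimestampValidation {
        // Latest write timestamp per key, as seen at the final transaction.
        private final Map<String, Long> writeTimestamps = new HashMap<>();

        // readSet: key -> timestamp observed during assembly; writes: keys to update.
        synchronized boolean tryCommit(Map<String, Long> readSet, Set<String> writes, long commitTs) {
            for (Map.Entry<String, Long> e : readSet.entrySet())
                if (!Objects.equals(writeTimestamps.get(e.getKey()), e.getValue()))
                    return false; // abort: a participating key changed since we read it
            for (String key : writes) writeTimestamps.put(key, commitTs);
            return true;
        }

        public static void main(String[] args) {
            TimestampValidation store = new TimestampValidation();
            Map<String, Long> readSet = new HashMap<>();
            readSet.put("k1", null); // k1 had never been written when we read it
            System.out.println(store.tryCommit(readSet, Set.of("k1"), 10L)); // true
            System.out.println(store.tryCommit(readSet, Set.of("k1"), 11L)); // false: k1 changed
        }
    }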


Re: [DISCUSS] CEP-15: General Purpose Transactions

2021-09-22 Thread bened...@apache.org
Hi everyone,

Joey has helpfully arranged a call for tomorrow at 8am PST / 10am CST / 4pm BST 
to discuss Accord and other things in the community. There are no plans to make 
any kind of project decisions. Everyone is welcome to drop in to discuss Accord 
or whatever else might be on your mind.

https://gather.town/app/2UKSboSjqKXIXliE/ac2021-cass-social

