Re: [DISCUSS] CEP-15: General Purpose Transactions

2021-09-20 Thread Miles Garnsey
If Accord can fulfil its aims it sounds like a huge improvement to the state of 
the art in distributed transaction processing. Congrats to all involved in 
pulling the proposal together. 

I was holding off on feedback since this is quite in-depth and I don’t want to 
bike shed; I still haven’t spent as much time understanding this as I’d like.

Regardless, I’ll make the following notes in case they’re helpful. My feedback 
is more to satisfy my own curiosity and stimulate discussion than to suggest 
that there are any flaws here. I applaud the proposed testing approach and 
think it is the only way to be certain that the proposed consistency guarantees 
will be upheld.

General 

I’m curious if/how this proposal addresses issues we have seen when scaling; I 
see reference to simple majorities of nodes - is there any plan to ensure 
safety under scaling operations or DC (de)commissioning?

What consistency levels will be supported under Accord? Will it simply be a 
single CL representing a majority of nodes across the whole cluster? (This at 
least would mitigate the issues I’ve seen when folks want to switch from 
EACH_SERIAL to SERIAL).

Accord

> Accord instead assembles an inconsistent set of dependencies. 


Further explanation here would be good. Do we mean to say that the dependencies 
may differ according to which transactions the coordinator has witnessed at the 
time the incoming transaction is first seen? This would make sense if some 
nodes had not fully committed a foregoing transaction. 

Is it correct to think of this step as assembling a dependency graph of 
foregoing transactions which must be completed ahead of progressing the 
incoming new transaction?

Fast Path

> A coordinator C proposes a timestamp t0 to at least a quorum of a fast path 
> electorate. If t0 is larger than all timestamps witnessed for all prior 
> conflicting transactions, t0 is accepted by a replica. If a fast path quorum 
> of responses accept, the transaction is agreed to execute at t0. Replicas 
> respond with the set of transactions they have witnessed that may execute 
> with a lower timestamp, i.e. those with a lower t0.
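
A minimal sketch of the replica-side check the quoted paragraph describes, 
using made-up Timestamp/TxnId types (an illustration only, not anything from 
the whitepaper or the Accord code):

    import java.util.*;

    // Sketch of a replica handling PreAccept: accept t0 only if it is greater
    // than every timestamp already witnessed for conflicting transactions, and
    // report the witnessed transactions that carry a lower t0.
    // For simplicity every witnessed transaction is treated as conflicting.
    final class FastPathReplicaSketch
    {
        record Timestamp(long millis, int logical, int nodeId) implements Comparable<Timestamp>
        {
            public int compareTo(Timestamp o)
            {
                int c = Long.compare(millis, o.millis);
                if (c == 0) c = Integer.compare(logical, o.logical);
                if (c == 0) c = Integer.compare(nodeId, o.nodeId);
                return c;
            }
        }

        record TxnId(String id) {}
        record PreAcceptReply(boolean accepted, Set<TxnId> deps) {}

        private final Map<TxnId, Timestamp> witnessed = new HashMap<>();

        PreAcceptReply preAccept(TxnId txn, Timestamp t0)
        {
            // Accept only if t0 is larger than all previously witnessed timestamps.
            boolean accepted = witnessed.values().stream().allMatch(t -> t.compareTo(t0) < 0);

            // The dependency set: witnessed transactions that may execute earlier.
            Set<TxnId> deps = new HashSet<>();
            for (Map.Entry<TxnId, Timestamp> e : witnessed.entrySet())
                if (e.getValue().compareTo(t0) < 0)
                    deps.add(e.getKey());

            witnessed.put(txn, t0);
            return new PreAcceptReply(accepted, deps);
        }
    }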

What is t0 here? I’m guessing it is the Lamport clock time of the most recent 
mutation to the partition? It may be worth clarifying, because otherwise the 
perception may be that it is the commencement time of the current transaction, 
which may not be the intention.

Regarding the use of logical clocks in general:

* Do we have one clock-per-shard-per-node? Or is there a single clock for all 
transactions on a node?
* What happens in network partitions? 
* In a cross-shard transaction does maintaining simple majorities of replicas 
protect you from potential inconsistencies arising when a transaction W10 
addressing partitions p1, p2 comes from a different majority (potentially 
isolated due to a network partition) from earlier writes W[1,9] to p1 only? 
It seems that this may cause a sudden change to the dependency graph for 
partition p2 which may render it vulnerable to strange effects?
* Do we consider adversarial cases or any sort of byzantine faults? (That’s a 
bit out of left field, feel free to kick me.)
* Why do we prefer Lamport clocks to vector clocks or other types of logical 
clock?

Slow Path

> This value is proposed to at least a simple majority of nodes, along with the 
> union of the dependencies received


Related to the earlier point: when we say `union` here - what set are we 
forming a union over? Is it a union of all dependencies t_n < t as seen by all 
coordinators? I presume that the logic precludes the possibility that these 
dependencies will conflict, since all foregoing transactions which are in 
progress as dependencies must be non-conflicting with earlier transactions in 
the dependency graph?

In any case, further information about how the dependency graph is computed 
would be interesting.

> The inclusion of dependencies in the proposal is solely to facilitate 
> Recovery of other transactions that may be incomplete - these are stored on 
> each replica to facilitate decisions at recovery.


Every replica? Or only those participating in the transaction?

> If C fails to reach fast path consensus it takes the highest t it witnessed 
> from its responses, which constitutes a simple Lamport clock value imposing a 
> valid total order. This value is proposed to at least a simple majority of 
> nodes,
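
A rough coordinator-side sketch of that fallback, again with hypothetical types 
rather than the Accord code: when the fast path fails, take the highest 
timestamp any replica proposed and carry it into the slow-path proposal.

    import java.util.*;

    // Invoked only when a fast-path quorum did not accept t0: take the highest
    // timestamp any replica proposed. Being a Lamport clock value, it still
    // imposes a valid total order, and is then proposed to a simple majority.
    final class SlowPathSketch
    {
        record Timestamp(long millis, int logical, int nodeId) {}
        record PreAcceptReply(Timestamp proposed) {}

        static final Comparator<Timestamp> ORDER =
                Comparator.comparingLong(Timestamp::millis)
                          .thenComparingInt(Timestamp::logical)
                          .thenComparingInt(Timestamp::nodeId);

        static Timestamp slowPathTimestamp(List<PreAcceptReply> replies)
        {
            return replies.stream().map(PreAcceptReply::proposed).max(ORDER).orElseThrow();
        }
    }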


When speaking about the simple majority of nodes to whom the returned max(t) 
value will be proposed:

* It sounds like this need not be the same majority from whom the original sets 
of t_n and dependencies were obtained?
* Is there a proof to show that the dependencies created from the union of the 
first set of replicas resolve to an acceptable dependency graph for an 
arbitrary majority of replicas? (Especially given that a majority of replicas 
is not a majority of nodes, since we are in a cross-shard scenario here.)
* What happens in cases where the replica set has changed?

Re: [DISCUSS] CEP-15: General Purpose Transactions

2021-09-20 Thread Joseph Lynch
Benedict,

Thank you very much for advancing this proposal, I'm extremely excited
to see flexible quorums used in this way and am looking forward to the
integration of Accord into Cassandra! I read the whitepaper and have a
few questions, but I was wondering what you think about having some
extended Q&A after your ApacheCon talk Wednesday (maybe at the end of
the C* track)? It might be higher bandwidth than going back and forth
on email/slack (also given you're presenting on it that might be a
good time to discuss it)?

Briefly
* It might help to have a diagram (perhaps I can collaborate with you
on this?) showing the happy path delay waiting in the reorder buffer
and the messages that are sent in a 2 and 3 datacenter deployment
during the PreAccept, Accept, Commit, Execute, Apply phases. In
particular it was hard for me to follow where exactly I was paying WAN
latency and where we could achieve progress with LAN only (I think
that WAN is always paid during the Consensus Protocol, and then in
most cases execution can remain LAN except in 3+ datacenters where I
think you'd have to include at least one replica in a neighboring
datacenter). In particular, it seems that Accord always pays clock
skew + WAN latency during the reorder buffer (as part of consensus) +
2x LAN latency during execution (to read and then write).
* Relatedly I'm curious if there is any way that the client can
acquire the timestamp used by the transaction before sending the data
so we can make the operations idempotent and unrelated to the
coordinator that was executing them as the storage nodes are
vulnerable to disk and heap failure modes which makes them much more
likely to enter grey failure (slow). Alternatively, perhaps it would
make sense to introduce a set of optional dedicated C* nodes for
reaching consensus that do not act as storage nodes so we don't have
to worry about hanging coordinators (join_ring=false?)?
* Should Algorithm 1 line 12 be PreAcceptOK from Et (not Qt) or should
line 2 read Qt instead of Et?
* I think your claim that clock skew is <1ms in general is accurate,
at least for AWS, except for when machines boot for the first
time (I can send you some data shortly). It might make sense for
participating members to wait for a minimum detected clock skew before
becoming eligible for electorate?
* I don't really understand how temporarily down replicas will learn
of mutations they missed. Did I miss the part where a read replica
would recover all transactions between its last accepted time and
another replica's last accepted time? Or are we just leveraging some
external repair?
* Relatedly since non-transactional reads wouldn't flow through
consensus (I hope) would it make sense for a restarting node to learn
the latest accepted time once and then be deprioritized for all reads
until it has accepted what it missed? Or is the idea that you would
_always_ read transactionally (and since it's a read only transaction
you can skip the WAN consensus and just go straight to fast path
reads)?
* I know the paper says that we elide details of how the shards (aka
replica sets?) are chosen, but it seems that this system would have a
hard dependency on a strongly consistent shard selection system (aka
token metadata?) wouldn't it? In particular if the simple quorums
(which I interpreted to be replica sets in current C*, not sure if
that's correct) can change in non-linearizable ways I don't think
Property 3.3 can hold. I think you hint at a solution to this in
section 5 but I'm not sure I grok it.

Super interesting proposal and I am looking forward to all the
improvements this will bring to the project!

Cheers,
-Joey


Re: [DISCUSS] CEP-15: General Purpose Transactions

2021-09-20 Thread bened...@apache.org
Hi Miles,

Thanks for the interest and your questions. So far as I can tell from your 
questions you are basing your understanding on the wiki page – in which case I 
would recommend reading the whitepaper, which answers most of your questions. 
Unfortunately I did not understand every question, but I have tried my best to 
respond to every point.

> is there any plan to ensure safety under scaling operations or DC 
> (de)commissioning?

This is a topic that has come up before under regular Paxos. Many of the 
topology changes will be safe for Paxos in the near future, and we certainly 
expect Accord to be safe under all topology changes. Some of this will require 
improvements to Cassandra that colleagues will be proposing more generally in 
the near future, but the whitepaper outlines Accord’s side of the equation in 
section 5.

> What consistency levels will be supported under Accord?

This is related to the above discussion that I expect to be addressed by 
colleagues in the near future. I can discuss my beliefs about how the project 
should move forwards on this, but since I’ve already done so in the past, and I 
don’t think it is core to Accord, I think it is probably best handled in a 
dedicated discussion.

> Further explanation here would be good.

I think on this topic it would be best to consult the whitepaper, as it 
describes dependencies quite precisely, better than I can here. Some 
familiarity with the literature e.g. EPaxos would also help.

In short: there is no dependency graph for any transaction, just a dependency 
set. This is a set of transactions that _may_ execute before us. It includes 
all transactions that _will_ execute before us, but each coordinator for the 
transaction (i.e. if the coordinator fails and another takes its place) may 
assemble a different super-set of the true execution dependencies (i.e. those 
that will actually execute before us).
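
To make that concrete, a toy illustration with made-up transaction names (not 
Accord code): two coordinators may compute different dependency sets, and both 
are valid provided each contains every transaction that will actually execute 
first.

    import java.util.Set;

    // Dependencies are a set of transaction ids that *may* execute before us.
    // Different coordinators may witness different supersets of the "true"
    // dependencies (those that *will* execute before us), and both are valid.
    final class DependencySetSketch
    {
        public static void main(String[] args)
        {
            Set<String> willExecuteBefore = Set.of("a", "b");               // true execution dependencies
            Set<String> computedByFirstCoordinator = Set.of("a", "b", "c"); // superset witnessed by one coordinator
            Set<String> computedByRecoveryCoordinator = Set.of("a", "b", "d");

            // Both supersets contain every transaction that will execute before us.
            System.out.println(computedByFirstCoordinator.containsAll(willExecuteBefore));    // true
            System.out.println(computedByRecoveryCoordinator.containsAll(willExecuteBefore)); // true
        }
    }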

> What is t0 here?

This is again described precisely in the whitepaper, but in essence it is any 
value the coordinator would like to propose as an execution timestamp (so long 
as it is globally unique). Normally this is simply (now, coordinator-id).

> Do we have one clock-per-shard-per-node? Or is there a single clock for all 
> transactions on a node?

One per unique global identity, so for simplicity probably per node, but it 
could easily be per shard, or more granular – another global identity would 
simply need to be issued to ensure the logical clock produces globally unique 
values.
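
To make the shape of these timestamps concrete, a minimal sketch (not the 
Accord implementation) of a node-local clock whose values are globally unique 
because the issuing identity is part of the value; t0 is then essentially 
(now, coordinator-id) as described above.

    import java.util.concurrent.atomic.AtomicLong;

    // One logical clock per unique global identity (here, per node). Values are
    // monotonic on the node and globally unique thanks to the node id component.
    final class NodeClockSketch
    {
        record Timestamp(long millis, int nodeId) {}

        private final int nodeId;                          // the clock's global identity
        private final AtomicLong lastMillis = new AtomicLong();

        NodeClockSketch(int nodeId) { this.nodeId = nodeId; }

        // t0 = (now, coordinator-id), bumped if the wall clock has not advanced.
        Timestamp next()
        {
            long now = System.currentTimeMillis();
            long millis = lastMillis.updateAndGet(prev -> Math.max(prev + 1, now));
            return new Timestamp(millis, nodeId);
        }
    }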

> What happens in network partitions?

I’m sorry, I don’t understand the question.

> In a cross-shard transaction does maintaining simple majorities of replicas 
> protect you from potential inconsistencies arising when a transaction W10 
> addressing partitions p1, p2 comes from a different majority (potentially 
> isolated due to a network partition) from earlier writes W[1,9] to p1 only?
> It seems that this may cause a sudden change to the dependency graph for 
> partition p2 which may render it vulnerable to strange effects?

I don’t really follow the question, but there are no sudden changes to any 
dependencies, so I think the answer is no.

> Do we consider adversarial cases or any sort of byzantine faults? (That’s a 
> bit out of left field, feel free to kick me.)

Specified in the paper, and the answer is “no”.

> Why do we prefer Lamport clocks to vector clocks or other types of logical 
> clock?

I’m unsure how to answer this question. Vector clocks are substantially more 
costly, and it’s unclear what they would offer us here. Perhaps you could 
explain why you would consider them, and what other logical clocks you are 
considering?

I will note I did not start from the concept of Lamport clocks; however, they 
are a widely understood concept, and really Accord boils down to using Lamport 
clocks to derive _some_ total order, and then ensuring that total order is 
durable. This seemed like a simple way to explain the protocol, since it’s only 
a sentence or two.

> Related to the earlier point: when we say `union` here - what set are we 
> forming a union over? Is it a union of all dependencies t_n < t as seen by 
> all coordinators? I presume that the logic precludes the possibility that 
> these dependencies will conflict, since all foregoing transactions which are 
> in progress as dependencies must be non-conflicting with earlier transactions 
> in the dependency graph?

I don’t understand all of this question, sorry. To the first point, the 
coordinator will receive responses from some fast-path quorum of replicas. The 
responses will each contain the dependency set computed by the replica that 
sent it, and the coordinator will use the union of these sets as the total set 
of dependencies.
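
For concreteness, the union step, with hypothetical types rather than the 
Accord code, amounts to:

    import java.util.*;
    import java.util.stream.Collectors;

    // The coordinator merges the dependency set reported by each replica in the
    // fast-path quorum into a single set of dependencies for the transaction.
    final class DependencyUnionSketch
    {
        record TxnId(String id) {}
        record PreAcceptReply(Set<TxnId> deps) {}

        static Set<TxnId> unionOfDependencies(List<PreAcceptReply> quorumReplies)
        {
            return quorumReplies.stream()
                                .flatMap(reply -> reply.deps().stream())
                                .collect(Collectors.toSet());
        }
    }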

> In any case, further information about how the dependency graph is computed 
> would be interesting.

It is defined precisely in the paper, but it is simply all those conflicting 
transactions the replica has seen that were initially

Re: [DISCUSS] CEP-15: General Purpose Transactions

2021-09-20 Thread bened...@apache.org
Hi Joey,

Thanks for the feedback and suggestions.

> I was wondering what you think about having some extended Q&A after your 
> ApacheCon talk Wednesday

I would love to do this. I’ll have to figure out how though – my understanding 
is that I have a hard 40m for my talk and any Q&A, and I expect the talk to 
occupy most of those 40m as I try to cover both CEP-14 and CEP-15. I’m not 
sure what facilities are made available by Hopin, but if necessary we can 
perhaps post some external video chat link?

The time of day is also a question, as I think the last talk ends at 9:20pm 
local time. But we can make that work if necessary.

> It might help to have a diagram (perhaps I can collaborate with you
> on this?)

I absolutely agree. This is something I had planned to produce but it’s been a 
question of time. In part I wanted to ensure we published long in advance of 
ApacheCon, but now also with CEP-10, CEP-14 and CEP-15 in flight it’s hard to 
get back to improving the draft. If you’d be interested in collaborating on 
this that would be super appreciated, as this would certainly help the reader.

> I think that WAN is always paid during the Consensus Protocol, and then in 
> most cases execution can remain LAN except in 3+ datacenters where I think 
> you'd have to include at least one replica in a neighboring datacenter…

As designed, the only WAN cost is consensus, as Accord ensures every replica 
receives a complete copy of every transaction, and is aware of any gaps. If 
there are gaps there may be WAN delays as those are filled in. This might occur 
because of network outages, but is most likely to occur when transactions are 
being actively executed by multiple DCs at once – in which case there’ll be one 
further unidirectional WAN latency during execution while the earlier 
transaction disseminates its result to the later transaction(s). There are 
other similar scenarios we can discuss, e.g. if a transaction takes the slow 
path and will execute after a transaction being executed in another DC, that 
remote transaction needs to receive this notification before executing.

There might potentially be some interesting optimisations to make in future, 
where with many queued transactions a single DC may nominate itself to execute 
all outstanding queries and respond to the remote DCs that issued them so as to 
eliminate the WAN latency for disseminating the result of each transaction. But 
we’re getting way ahead of ourselves there 😊

There’s also no LAN cost on write, at least for responding to the client. If 
there is a dependent transaction within the same DC then (as in the above case) 
there will be a LAN penalty for the second transaction to execute.
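
As a rough back-of-envelope reading of the above, with figures assumed purely 
for illustration (not taken from the whitepaper or this thread):

    // Assumed example figures: 1 ms LAN round trip, 80 ms WAN round trip.
    final class LatencySketch
    {
        public static void main(String[] args)
        {
            double lanRtt = 1.0, wanRtt = 80.0;

            // Happy path: consensus is the only WAN cost, and there is no extra
            // LAN cost before responding to the client.
            double happyPath = wanRtt;

            // A dependency executed in another DC adds roughly one unidirectional
            // WAN hop while its result is disseminated to this transaction.
            double withRemoteDependency = wanRtt + wanRtt / 2;

            // A dependent transaction in the same DC pays a LAN penalty instead.
            double withLocalDependency = wanRtt + lanRtt;

            System.out.printf("~%.0f ms / ~%.0f ms / ~%.0f ms%n",
                              happyPath, withRemoteDependency, withLocalDependency);
        }
    }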

> Relatedly I'm curious if there is any way that the client can
> acquire the timestamp used by the transaction before sending the data
> so we can make the operations idempotent and unrelated to the
> coordinator that was executing them as the storage nodes are
> vulnerable to disk and heap failure modes which makes them much more
> likely to enter grey failure (slow). Alternatively, perhaps it would
> make sense to introduce a set of optional dedicated C* nodes for
> reaching consensus that do not act as storage nodes so we don't have
> to worry about hanging coordinators (join_ring=false?)?

So, in principle coordination can be performed by any node on the network 
including a client – though we’d need to issue the client a unique id, this can 
be done cheaply on joining. This might be something to explore in future, 
though there are downsides to having more coordinators too (more likely to 
fail, and stall further transactions that depend on transactions it is 
coordinating).

However, with respect to idempotency, I expect Accord not to perpetuate the 
problems of LWTs where the result of an earlier query is unknown. At least 
success/fail will be maintained in a distributed fashion for some reasonable 
time horizon, and there will also be protection against zombie transactions 
(those proposed to a node that went into a failure spiral before reaching 
healthy nodes, and that the node somehow regurgitates hours or days later), so 
we should be able to provide practical precisely-once semantics to clients.

Whether this is done with a client-provided timestamp, or simply some other 
arbitrary client-provided id that can be utilised to deduplicate requests or 
query the status of a transaction, is something we can explore later in a 
dedicated discussion as development of Accord progresses.
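
To sketch one possible client-facing shape of that idea, purely hypothetically 
(none of these names exist in Accord or Cassandra): the client attaches its own 
id, a retry with the same id is deduplicated, and the same id can be used to 
query the outcome within some retention horizon.

    import java.util.Map;
    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch of client-provided ids for deduplication and status
    // queries; not an Accord or Cassandra API.
    final class ClientTxnIdSketch
    {
        enum Outcome { APPLIED, FAILED, UNKNOWN }

        private final Map<UUID, Outcome> outcomes = new ConcurrentHashMap<>();

        // First submission with an id records its outcome; a retry with the same
        // id returns the original outcome instead of re-executing.
        Outcome submit(UUID clientTxnId, Runnable transaction)
        {
            return outcomes.computeIfAbsent(clientTxnId, id -> {
                try { transaction.run(); return Outcome.APPLIED; }
                catch (RuntimeException e) { return Outcome.FAILED; }
            });
        }

        // Status query by the client-provided id, within the retention horizon.
        Outcome status(UUID clientTxnId)
        {
            return outcomes.getOrDefault(clientTxnId, Outcome.UNKNOWN);
        }
    }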

> Should Algorithm 1 line 12 be PreAcceptOK from Et (not Qt) or should
> line 2 read Qt instead of Et?

So, technically as it reads today I think it’s correct. For Line 2 there is 
always some Qt \subseteq Et. I think the problem here is that actually there’s 
a bunch of valid things to do, including picking some arbitrary subset of each 
rho in Pt so long as it contains some Qt. It’s hard to convey the range of 
options precisely. Line 12 of course