Re: [DISCUSS] CEP-15: General Purpose Transactions
If Accord can fulfil its aims it sounds like a huge improvement to the state of the art in distributed transaction processing. Congrats to all involved in pulling the proposal together.

I was holding off on feedback since this is quite in depth and I don't want to bike shed; I still haven't spent as much time understanding this as I'd like.

Regardless, I'll make the following notes in case they're helpful. My feedback is more to satisfy my own curiosity and stimulate discussion than to suggest that there are any flaws here. I applaud the proposed testing approach and think it is the only way to be certain that the proposed consistency guarantees will be upheld.

General

I'm curious if/how this proposal addresses issues we have seen when scaling; I see reference to simple majorities of nodes - is there any plan to ensure safety under scaling operations or DC (de)commissioning?

What consistency levels will be supported under Accord? Will it simply be a single CL representing a majority of nodes across the whole cluster? (This at least would mitigate the issues I've seen when folks want to switch from EACH_SERIAL to SERIAL.)

Accord

> Accord instead assembles an inconsistent set of dependencies.

Further explanation here would be good. Do we mean to say that the dependencies may differ according to which transactions the coordinator has witnessed at the time the incoming transaction is first seen? This would make sense if some nodes had not fully committed a foregoing transaction. Is it correct to think of this step as assembling a dependency graph of foregoing transactions which must be completed ahead of progressing the incoming new transaction?

Fast Path

> A coordinator C proposes a timestamp t0 to at least a quorum of a fast path
> electorate. If t0 is larger than all timestamps witnessed for all prior
> conflicting transactions, t0 is accepted by a replica. If a fast path quorum
> of responses accept, the transaction is agreed to execute at t0. Replicas
> respond with the set of transactions they have witnessed that may execute
> with a lower timestamp, i.e. those with a lower t0.

What is t0 here? I'm guessing it is the Lamport clock time of the most recent mutation to the partition? It may be worth clarifying, because otherwise the perception may be that it is the commencement time of the current transaction, which may not be the intention.

Regarding the use of logical clocks in general:

Do we have one clock-per-shard-per-node? Or is there a single clock for all transactions on a node?

What happens in network partitions? In a cross-shard transaction does maintaining simple majorities of replicas protect you from potential inconsistencies arising when a transaction W10 addressing partitions p1, p2 comes from a different majority (potentially isolated due to a network partition) from earlier writes W[1,9] to p1 only? It seems that this may cause a sudden change to the dependency graph for partition p2, which may render it vulnerable to strange effects?

Do we consider adversarial cases or any sort of byzantine faults? (That's a bit out of left field, feel free to kick me.)

Why do we prefer Lamport clocks to vector clocks or other types of logical clock?

Slow Path

> This value is proposed to at least a simple majority of nodes, along with the
> union of the dependencies received

Related to the earlier point: when we say `union` here - what set are we forming a union over? Is it a union of all dependencies t_n < t as seen by all coordinators?
I presume that the logic precludes the possibility that these dependencies will conflict, since all foregoing transactions which are in progress as dependencies must be non-conflicting with earlier transactions in the dependency graph? In any case, further information about how the dependency graph is computed would be interesting.

> The inclusion of dependencies in the proposal is solely to facilitate
> Recovery of other transactions that may be incomplete - these are stored on
> each replica to facilitate decisions at recovery.

Every replica? Or only those participating in the transaction?

> If C fails to reach fast path consensus it takes the highest t it witnessed
> from its responses, which constitutes a simple Lamport clock value imposing a
> valid total order. This value is proposed to at least a simple majority of
> nodes,

When speaking about the simple majority of nodes to whom the max(t) value will be proposed - it sounds like this need not be the same majority from whom the original sets of T_n and dependencies were obtained? Is there a proof to show that the dependencies created from the union of the first set of replicas resolve to an acceptable dependency graph for an arbitrary majority of replicas? (Especially given that a majority of replicas is not a majority of nodes, given we are in a cross-shard scenario here.) What happens in cases where the replica set has change
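A deliberately simplified sketch may help make the fast-path rule quoted under "Fast Path" above concrete. The Java below is illustrative only: all names are invented (this is not Accord's API), timestamps are plain longs rather than globally unique values, and the whitepaper's definitions remain the authority.

import java.util.*;

// Illustrative replica-side handling of a PreAccept(txn, t0), per the rule quoted
// above: accept t0 only if it exceeds every timestamp witnessed for conflicting
// transactions, and reply with the transactions witnessed with a lower t0.
final class FastPathReplicaSketch
{
    static final class PreAcceptReply
    {
        final boolean accepted;       // did this replica accept t0?
        final Set<String> deps;       // txn ids witnessed with a lower t0
        PreAcceptReply(boolean accepted, Set<String> deps) { this.accepted = accepted; this.deps = deps; }
    }

    // Highest timestamp witnessed per key, and every transaction witnessed so far.
    private final Map<String, Long> maxWitnessedByKey = new HashMap<>();
    private final Map<String, Long> witnessedTxns = new HashMap<>();

    PreAcceptReply preAccept(String txnId, Set<String> keys, long t0)
    {
        // Accept t0 only if it is larger than all timestamps witnessed for prior
        // conflicting transactions on the keys this transaction touches.
        boolean accepted = keys.stream()
                               .mapToLong(k -> maxWitnessedByKey.getOrDefault(k, Long.MIN_VALUE))
                               .allMatch(seen -> t0 > seen);

        // Dependencies: transactions already witnessed that may execute earlier,
        // i.e. those with a lower t0 (a real implementation would further restrict
        // this to conflicting transactions).
        Set<String> deps = new HashSet<>();
        witnessedTxns.forEach((id, t) -> { if (t < t0) deps.add(id); });

        // Record this proposal so later transactions witness it as a potential dependency.
        witnessedTxns.put(txnId, t0);
        for (String k : keys)
            maxWitnessedByKey.merge(k, t0, Math::max);

        return new PreAcceptReply(accepted, deps);
    }
}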
Re: [DISCUSS] CEP-15: General Purpose Transactions
Benedict,

Thank you very much for advancing this proposal, I'm extremely excited to see flexible quorums used in this way and am looking forward to the integration of Accord into Cassandra!

I read the whitepaper and have a few questions, but I was wondering what you think about having some extended Q&A after your ApacheCon talk Wednesday (maybe at the end of the C* track)? It might be higher bandwidth than going back and forth on email/slack (also given you're presenting on it, that might be a good time to discuss it)?

Briefly:

* It might help to have a diagram (perhaps I can collaborate with you on this?) showing the happy path delay waiting in the reorder buffer and the messages that are sent in a 2 and 3 datacenter deployment during the PreAccept, Accept, Commit, Execute, Apply phases. In particular it was hard for me to follow where exactly I was paying WAN latency and where we could achieve progress with LAN only (I think that WAN is always paid during the consensus protocol, and then in most cases execution can remain LAN, except in 3+ datacenters where I think you'd have to include at least one replica in a neighboring datacenter). In particular, it seems that Accord always pays clock skew + WAN latency during the reorder buffer (as part of consensus) + 2x LAN latency during execution (to read and then write).

* Relatedly, I'm curious if there is any way that the client can acquire the timestamp used by the transaction before sending the data, so we can make the operations idempotent and unrelated to the coordinator that was executing them, as the storage nodes are vulnerable to disk and heap failure modes which make them much more likely to enter grey failure (slow). Alternatively, perhaps it would make sense to introduce a set of optional dedicated C* nodes for reaching consensus that do not act as storage nodes, so we don't have to worry about hanging coordinators (join_ring=false?)?

* Should Algorithm 1 line 12 be PreAcceptOK from Et (not Qt), or should line 2 read Qt instead of Et?

* I think your claims about clock skew being <1ms in general are accurate, at least for AWS, except for when machines boot for the first time (I can send you some data shortly). It might make sense for participating members to wait for a minimum detected clock skew before becoming eligible for the electorate?

* I don't really understand how temporarily down replicas will learn of mutations they missed; did I miss the part where a read replica would recover all transactions between its last accepted time and another replica's last accepted time? Or are we just leveraging some external repair?

* Relatedly, since non-transactional reads wouldn't flow through consensus (I hope), would it make sense for a restarting node to learn the latest accepted time once and then be deprioritized for all reads until it has accepted what it missed? Or is the idea that you would _always_ read transactionally (and since it's a read-only transaction you can skip the WAN consensus and just go straight to fast path reads)?

* I know the paper says that we elide details of how the shards (aka replica sets?) are chosen, but it seems that this system would have a hard dependency on a strongly consistent shard selection system (aka token metadata?), wouldn't it? In particular, if the simple quorums (which I interpreted to be replica sets in current C*, not sure if that's correct) can change in non-linearizable ways, I don't think Property 3.3 can hold. I think you hint at a solution to this in section 5 but I'm not sure I grok it.
Super interesting proposal and I am looking forward to all the improvements this will bring to the project!

Cheers,
-Joey
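As a back-of-the-envelope illustration of the latency model in the first bullet above (clock skew + WAN consensus + 2x LAN execution), the Java sketch below uses figures invented purely for illustration; they are not measurements, and not claims made by the whitepaper.

// Worked example of the reading "clock skew + WAN latency during the reorder buffer
// + 2x LAN latency during execution". All numbers are invented assumptions.
final class LatencyEnvelopeSketch
{
    public static void main(String[] args)
    {
        double clockSkewMs = 1;   // assumed bound on coordinator clock skew
        double wanRttMs = 80;     // assumed inter-DC round trip for the PreAccept round
        double lanRttMs = 1;      // assumed intra-DC round trip

        // One WAN round trip for consensus, then roughly one LAN round trip to read
        // and one to write during execution, per the bullet above.
        double estimateMs = clockSkewMs + wanRttMs + 2 * lanRttMs;
        System.out.printf("Illustrative happy-path estimate: ~%.0f ms%n", estimateMs);
    }
}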
Re: [DISCUSS] CEP-15: General Purpose Transactions
Hi Miles,

Thanks for the interest and your questions. So far as I can tell from your questions you are basing your understanding on the wiki page – in which case I would recommend reading the whitepaper, which answers most of your questions. Unfortunately I did not understand every question, but I have tried my best to respond to every point.

> is there any plan to ensure safety under scaling operations or DC
> (de)commissioning?

This is a topic that has come up before under regular Paxos. Many of the topology changes will be safe for Paxos in the near future, and we certainly expect Accord to be safe under all topology changes. Some of this will require improvements to Cassandra that colleagues will be proposing more generally in the near future, but the whitepaper outlines Accord's side of the equation in section 5.

> What consistency levels will be supported under Accord?

This is related to the above discussion that I expect to be addressed by colleagues in the near future. I can discuss my beliefs about how the project should move forwards on this, but since I've already done so in the past, and I don't think it is core to Accord, I think it is probably best handled in a dedicated discussion.

> Further explanation here would be good.

I think on this topic it would be best to consult the whitepaper, as it describes dependencies quite precisely – better than I can do here. Some familiarity with the literature, e.g. EPaxos, would also help. In short: there is no dependency graph for any transaction, just a dependency set. This is a set of transactions that _may_ execute before us. It includes all transactions that _will_ execute before us, but each coordinator for the transaction (i.e. if the coordinator fails and another takes its place) may assemble a different super-set of the true execution dependencies (i.e. those that will actually execute before us).

> What is t0 here?

This is again described precisely in the whitepaper, but in essence it is any value the coordinator would like to propose as an execution timestamp (so long as it is globally unique). Normally this is simply (now, coordinator-id).

> Do we have one clock-per-shard-per-node? Or is there a single clock for all
> transactions on a node?

One per unique global identity, so for simplicity probably per node, but it could easily be per shard, or more granular – another global identity would simply need to be issued to ensure the logical clock produces globally unique values.

> What happens in network partitions?

I'm sorry, I don't understand the question.

> In a cross-shard transaction does maintaining simple majorities of replicas
> protect you from potential inconsistencies arising when a transaction W10
> addressing partitions p1, p2 comes from a different majority (potentially
> isolated due to a network partition) from earlier writes W[1,9] to p1 only?
> It seems that this may cause a sudden change to the dependency graph for
> partition p2 which may render it vulnerable to strange effects?

I don't really follow the question, but there are no sudden changes to any dependencies, so I think the answer is no.

> Do we consider adversarial cases or any sort of byzantine faults? (That's a
> bit out of left field, feel free to kick me.)

Specified in the paper, and the answer is "no".

> Why do we prefer Lamport clocks to vector clocks or other types of logical
> clock?

I'm unsure how to answer this question. Vector clocks are substantially more costly, and it's unclear what they would offer us here.
Perhaps you could explain why you would consider them, and what other logical clocks you are considering? I will note I did not start from the concept of Lamport clocks; however, they are a widely understood concept, and really Accord boils down to using Lamport clocks to derive _some_ total order, and then ensuring that total order is durable. This seemed like a simple way to explain the protocol, since it's only a sentence or two.

> Related to the earlier point: when we say `union` here - what set are we
> forming a union over? Is it a union of all dependencies t_n < t as seen by
> all coordinators? I presume that the logic precludes the possibility that
> these dependencies will conflict, since all foregoing transactions which are
> in progress as dependencies must be non-conflicting with earlier transactions
> in the dependency graph?

I don't understand all of this question, sorry. To the first point, the coordinator will receive responses from some fast-path quorum of replicas. The responses will each contain the dependency set computed by the replica that sent it, and the coordinator will use the union of these sets as the total set of dependencies.

> In any case, further information about how the dependency graph is computed
> would be interesting.

It is defined precisely in the paper, but it is simply all those conflicting transactions the replica has seen that were initially
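To make the answer above concrete, here is an illustrative coordinator-side sketch in Java of the dependency-union step, plus the slow-path fallback quoted earlier ("takes the highest t it witnessed from its responses"). Names and message shapes are invented, timestamps are plain longs rather than the globally unique (now, coordinator-id) values mentioned above, and requiring every reply to accept t0 is a simplification of the fast-path quorum condition; the whitepaper gives the precise rules.

import java.util.*;

// Illustrative coordinator handling of PreAccept replies: union the per-replica
// dependency sets, take the fast path if t0 was accepted, otherwise fall back to
// the slow path with the highest witnessed timestamp.
final class CoordinatorSketch
{
    static final class PreAcceptReply
    {
        final boolean acceptedT0;      // replica accepted the proposed t0
        final long maxWitnessedT;      // highest timestamp the replica has witnessed
        final Set<String> deps;        // txn ids the replica witnessed with a lower t0
        PreAcceptReply(boolean acceptedT0, long maxWitnessedT, Set<String> deps)
        {
            this.acceptedT0 = acceptedT0; this.maxWitnessedT = maxWitnessedT; this.deps = deps;
        }
    }

    static final class Outcome
    {
        final long executeAt;          // t0 on the fast path, max(t) on the slow path
        final boolean fastPath;
        final Set<String> deps;        // union of the per-replica dependency sets
        Outcome(long executeAt, boolean fastPath, Set<String> deps)
        {
            this.executeAt = executeAt; this.fastPath = fastPath; this.deps = deps;
        }
    }

    static Outcome onPreAcceptReplies(long t0, Collection<PreAcceptReply> fastPathQuorum)
    {
        Set<String> deps = new HashSet<>();
        long maxT = t0;
        boolean allAccepted = true;
        for (PreAcceptReply r : fastPathQuorum)
        {
            deps.addAll(r.deps);                      // union over the quorum's dependency sets
            maxT = Math.max(maxT, r.maxWitnessedT);
            allAccepted &= r.acceptedT0;
        }
        // Fast path: t0 was accepted, so the transaction is agreed to execute at t0.
        // Slow path: propose max(t) plus the same union of dependencies to at least
        // a simple majority, per the passage quoted from the wiki above.
        return allAccepted ? new Outcome(t0, true, deps)
                           : new Outcome(maxT, false, deps);
    }
}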
Re: [DISCUSS] CEP-15: General Purpose Transactions
Hi Joey,

Thanks for the feedback and suggestions.

> I was wondering what do you think about having some extended Q&A after your
> ApacheCon talk Wednesday

I would love to do this. I'll have to figure out how, though – my understanding is that I have a hard 40m for my talk and any Q&A, and I expect the talk to occupy most of those 40m as I try to cover both CEP-14 and CEP-15. I'm not sure what facilities are made available by Hopin, but if necessary we can perhaps post some external video chat link? The time of day is also a question, as I think the last talk ends at 9:20pm local time. But we can make that work if necessary.

> It might help to have a diagram (perhaps I can collaborate with you on this?)

I absolutely agree. This is something I had planned to produce but it's been a question of time. In part I wanted to ensure we published long in advance of ApacheCon, but now also with CEP-10, CEP-14 and CEP-15 in flight it's hard to get back to improving the draft. If you'd be interested in collaborating on this that would be super appreciated, as this would certainly help the reader.

> I think that WAN is always paid during the Consensus Protocol, and then in
> most cases execution can remain LAN except in 3+ datacenters where I think
> you'd have to include at least one replica in a neighboring datacenter…

As designed, the only WAN cost is consensus, as Accord ensures every replica receives a complete copy of every transaction and is aware of any gaps. If there are gaps there may be WAN delays as those are filled in. This might occur because of network outages, but is most likely to occur when transactions are being actively executed by multiple DCs at once – in which case there'll be one further unidirectional WAN latency during execution while the earlier transaction disseminates its result to the later transaction(s).

There are other similar scenarios we can discuss, e.g. if a transaction takes the slow path and will execute after a transaction being executed in another DC, that remote transaction needs to receive this notification before executing. There might potentially be some interesting optimisations to make in future, where with many queued transactions a single DC may nominate itself to execute all outstanding queries and respond to the remote DCs that issued them, so as to eliminate the WAN latency for disseminating the result of each transaction. But we're getting way ahead of ourselves there 😊

There's also no LAN cost on write, at least for responding to the client. If there is a dependent transaction within the same DC then (as in the above case) there will be a LAN penalty for the second transaction to execute.

> Relatedly I'm curious if there is any way that the client can acquire the
> timestamp used by the transaction before sending the data so we can make the
> operations idempotent and unrelated to the coordinator that was executing
> them, as the storage nodes are vulnerable to disk and heap failure modes
> which make them much more likely to enter grey failure (slow). Alternatively,
> perhaps it would make sense to introduce a set of optional dedicated C* nodes
> for reaching consensus that do not act as storage nodes so we don't have to
> worry about hanging coordinators (join_ring=false?)?

So, in principle coordination can be performed by any node on the network, including a client – though we'd need to issue the client a unique id, this can be done cheaply on joining.
This might be something to explore in future, though there are downsides to having more coordinators too (more likely to fail, and to stall further transactions that depend on transactions they are coordinating).

However, with respect to idempotency, I expect Accord not to perpetuate the problems of LWTs where the result of an earlier query is unknown. At least success/fail will be maintained in a distributed fashion for some reasonable time horizon, and there will also be protection against zombie transactions (those proposed to a node that went into a failure spiral before they reached healthy nodes, and that are somehow regurgitated hours or days later), so we should be able to provide practical precisely-once semantics to clients. Whether this is done with a client-provided timestamp, or simply some other arbitrary client-provided id that can be utilised to deduplicate requests or query the status of a transaction, is something we can explore later. This is something we should explore in a dedicated discussion as development of Accord progresses.

> Should Algorithm 1 line 12 be PreAcceptOK from Et (not Qt) or should line 2
> read Qt instead of Et?

So, technically as it reads today I think it's correct. For line 2 there is always some Qt \subseteq Et. I think the problem here is that actually there's a bunch of valid things to do, including picking some arbitrary subset of each rho in Pt so long as it contains some Qt. It's hard to convey the range of options precisely. Line 12 of course
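As a purely exploratory sketch of the client-supplied-id idea above (explicitly left for a later discussion), the Java below records the outcome of each transaction id so a retried request with the same id observes the recorded answer rather than being applied twice. Nothing here exists in Accord or Cassandra today; retention windows, distributed storage of outcomes and zombie protection are all elided.

import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

// Exploratory deduplication by client-supplied transaction id on a single node.
final class ClientIdDedupSketch
{
    enum Outcome { SUCCESS, FAILED }

    private final ConcurrentMap<UUID, Outcome> outcomes = new ConcurrentHashMap<>();

    // Executes the transaction at most once per client-supplied id on this node;
    // duplicates observe the recorded outcome instead of re-executing. A real design
    // would bound retention to "some reasonable time horizon" and share outcomes
    // across replicas rather than keeping them in local memory.
    Outcome submit(UUID clientTxnId, Supplier<Outcome> execute)
    {
        return outcomes.computeIfAbsent(clientTxnId, id -> execute.get());
    }
}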