RE: Re: [DISCUSS] CEP-21: Transactional Cluster Metadata

2022-08-31 Thread Unmesh Joshi
Hi Sam,

Great to see this CEP. I have been documenting a few common 'patterns of
distributed systems', and have documented a pattern called 'consistent
core', referring to the source code of various systems which use a
linearizable metadata store. I have also documented patterns like 'lease'
and 'state watch', which are commonly used by a consistent core. I also
recently documented how typical partition assignment and partition movement
are implemented in systems that use a consistent-core-based metadata store
(systems like YugabyteDB, CockroachDB, Kafka, etc.).
It might be of some use as a quick reference for comparing this CEP with
other systems that use a similar architecture.
A quick question about using the existing Paxos machinery. I see that
implementing a Replicated Log needs significant changes, particularly in
how the two phases of Paxos are implemented over the entire log. So would
it be better to use Raft instead?


Thanks,
Unmesh

On 2022/08/23 08:50:27 Sam Tunnicliffe wrote:
> Thanks!
> The core of the proposal is around sequencing metadata changes and
> ensuring that they're delivered to/processed by nodes in the right order
> and at the right time. The actual mechanisms for imposing that order and
> for maintaining the log are pretty simple to implement. We envision using
> the existing Paxos machinery by default, but swapping that for an
> alternative implementation would not be difficult.
>
>
> > On 22 Aug 2022, at 19:14, Derek Chen-Becker wrote:
> >
> > This looks really interesting; thanks for putting this together! Just
> > so I'm clear on CEP nomenclature, having external management of
> > metadata as a non-goal doesn't preclude some future use, correct?
> > Coincidentally, I'm working on my ApacheCon talk on improving
> > modularity in Cassandra and one of the ideas I'm discussing is
> > pluggably (?) replacing gossip with something(s) that allow us to
> > externalize some of the complexity of maintaining consistency. I need
> > to digest the proposal you've made, but I don't see the two ideas being
> > at odds on my first read.
> >
> > Cheers,
> >
> > Derek
> >
> > On Mon, Aug 22, 2022 at 6:45 AM Sam Tunnicliffe wrote:
> > Hi,
> >
> > I'd like to open discussion about this CEP:
> > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21%3A+Transactional+Cluster+Metadata
> >
> > Cluster metadata in Cassandra comprises a number of disparate elements
> > including, but not limited to, distributed schema, topology and token
> > ownership. Following the general design principles of Cassandra, the
> > mechanisms for coordinating updates to cluster state have favoured
> > eventual consistency, with probabilistic delivery via gossip being a
> > prime example. Undoubtedly, this approach has benefits, not least in
> > terms of resilience, particularly in highly fluid distributed
> > environments. However, this is not the reality of most Cassandra
> > deployments, where the total number of nodes is relatively small (i.e.
> > in the low thousands) and the rate of change tends to be low.
> >
> > Historically, a significant proportion of issues affecting operators
> > and users of Cassandra have been due, at least in part, to a lack of
> > strongly consistent cluster metadata. In response to this, we propose a
> > design which aims to provide linearizability of metadata changes whilst
> > ensuring that the effects of those changes are made visible to all
> > nodes in a strongly consistent manner. At its core, it is also
> > pluggable, enabling Cassandra-derived projects to supply their own
> > implementations if desired.
> >
> > In addition to the CEP document itself, we aim to publish a working
> > prototype of the proposed design. Obviously, this does not implement
> > the entire proposal and there are several parts which remain only
> > partially complete. It does include the core of the system, including a
> > good deal of test infrastructure, so may serve as both illustration of
> > the design and a starting point for real implementation.
> >
> >
> >
> > --
> > +---+
> > | Derek Chen-Becker |
> > | GPG Key available at https://keybase.io/dchenbecker and |
> > | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> > | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7 7F42 AFC5 AFEE 96E4 6ACC |
> > +---+
> >
>
>


Re: [DISCUSS] CEP-21: Transactional Cluster Metadata

2022-09-01 Thread Unmesh Joshi
On Thu, Sep 1, 2022 at 11:20 AM Alex Petrov wrote:

> There will be no changes required to our existing Paxos implementation. We
> can just use it. Besides, Paxos is only used as K-sequencer. There is no
> need to use Raft, and both existing LWTs (with Multi-Paxos) and Accord
> aren't tied to a single leader, which is well in the spirit of Cassandra.
>

Will the CMS log implementation be documented in another CEP? There are
subtle details, such as dealing with uncommitted incomplete writes,
propagating committed log entries to all the CMS replicas, and deciding
how to maintain the commit-index for the log, that would be good to add.
The LWT Paxos implementation does this for the per-key instance of Paxos
when a new Paxos read/write is triggered (with special handling of
committed values).
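
For readers unfamiliar with the term, here is a minimal sketch of what
maintaining a commit-index over a log could look like. It is only an
illustration under stated assumptions: the class and method names
(SketchLog, accept, commit, readable) are hypothetical and are not taken
from the CEP or from Cassandra, and the sketch ignores replication
entirely.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class SketchLog:
    """Hypothetical single-replica view of a log with a commit index."""
    entries: Dict[int, str] = field(default_factory=dict)
    committed: Set[int] = field(default_factory=set)
    # Highest slot N such that every slot in 1..N is committed.
    commit_index: int = 0

    def accept(self, slot: int, value: str) -> None:
        # An accepted entry is not yet committed and may still be replaced,
        # but a committed slot is immutable.
        if slot in self.committed:
            if self.entries[slot] != value:
                raise ValueError(f"slot {slot} is already committed")
            return
        self.entries[slot] = value

    def commit(self, slot: int) -> None:
        if slot not in self.entries:
            raise KeyError(f"slot {slot} has no accepted entry")
        self.committed.add(slot)
        # Advance only over the contiguous committed prefix, so a gap left
        # by an incomplete write blocks later entries from being applied.
        while (self.commit_index + 1) in self.committed:
            self.commit_index += 1

    def readable(self) -> List[str]:
        # Only entries at or below the commit index are safe to apply.
        return [self.entries[s] for s in range(1, self.commit_index + 1)]


log = SketchLog()
for slot, value in [(1, "a"), (2, "b"), (3, "c")]:
    log.accept(slot, value)
log.commit(1)
log.commit(3)          # slot 2 is still incomplete
print(log.commit_index, log.readable())
log.commit(2)          # the gap is repaired
print(log.commit_index, log.readable())
```

In this sketch, committing slot 3 before slot 2 leaves the commit index at
1; it moves to 3 only once the gap at slot 2 is filled.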

Thanks,
Unmesh


Re: [DISCUSS] CEP-21: Transactional Cluster Metadata

2022-09-01 Thread Unmesh Joshi
>
> I think implementation has to work according to expectations described in
> CEP, and have enough tests to prove it. You can follow the progress of the
> patch whenever CEP is accepted and code is published to learn about the
> details.
>

Thanks, will follow the implementation.

> If you'd like to learn more about incomplete Paxos writes (I'm assuming you
> mean dealing with inability of proposer to collect a second quorum), you
> can refer to Cassandra Paxos implementation. In our prototypes, we were
> able to simply use Cassandra Paxos out of the box, and everything related
> to Paxos is hidden from us behind CQL syntax.
>

Yes, it's the inability of the proposer to collect a second quorum. As I
understand it, the existing LWT Paxos implementation runs a per-key
instance of Paxos. Being a key-value setup, it can always repair incomplete
Paxos runs when the key is read. For immutable log entries in the CMS, it
needs to be different. (LWT also expects mutable operations on a key, so it
requires resetting the Paxos state on commit and handling committed values
separately as part of prepare. For immutable entries, that's not required.)
But I will wait for the Jira and PR to understand the proposed approach
better.
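
Repairing an incomplete run relies on the standard Paxos rule that a new
proposer must adopt the value with the highest accepted ballot reported by
a quorum of promises, and may propose its own value only if none was
accepted. A minimal sketch of that rule for a single write-once log slot
follows. The names (Promise, value_to_propose) are hypothetical and are not
taken from Cassandra's Paxos code or from the CEP.

```python
from typing import List, NamedTuple, Optional


class Promise(NamedTuple):
    """Hypothetical prepare-phase reply from one acceptor for one slot."""
    accepted_ballot: Optional[int]   # None if nothing was accepted yet
    accepted_value: Optional[str]


def value_to_propose(promises: List[Promise], own_value: str) -> str:
    # Keep only acceptors that have already accepted some value.
    accepted = [p for p in promises if p.accepted_ballot is not None]
    if not accepted:
        # No earlier run reached the accept phase: the slot is free.
        return own_value
    # An earlier run may be incomplete: finish it with its own value
    # instead of overwriting it, so the slot stays write-once.
    return max(accepted, key=lambda p: p.accepted_ballot).accepted_value


quorum = [Promise(None, None), Promise(3, "x"), Promise(5, "y")]
print(value_to_propose(quorum, "z"))
print(value_to_propose([Promise(None, None), Promise(None, None)], "z"))
```

With the first quorum the proposer completes the earlier run by proposing
"y"; with the second it is free to propose its own "z".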

Thanks,
Unmesh


