On Tue, Aug 23, 2022 at 2:51 AM Sam Tunnicliffe <s...@beobal.com> wrote:
> Regular read/write operations should not be halted, even by a total
> failure of the metadata service. There should be no situations where a
> previously stable database becomes entirely unavailable due to a CMS
> failure. The worst case is where there is some unavailability due to
> permanent failure of multiple nodes where those nodes happen to
> represent a majority of the CMS. In this scenario, the CMS would need
> to be recovered before the down nodes could be replaced, so it's
> possible it would extend the period of unavailability, though not
> necessarily by much.

This seems like a reasonable tradeoff. The current approach tries to
achieve better availability at the risk of loss of consistency, but even
then has failure modes that require manual intervention. What I really
like about the proposal is that it's a path toward separating the
responsibilities of the cluster into well-defined sub-services.

Cheers,

Derek

> On 23 Aug 2022, at 05:42, Jeff Jirsa <jji...@gmail.com> wrote:
>
> > “The proposed mechanism for dealing with both of these failure types
> > is to enable a manual operator override mode. This would allow
> > operators to inject metadata changes (potentially overriding the
> > complete metadata state) directly on any and all nodes in a cluster.
> > At the most extreme end of the spectrum, this could allow an
> > unrecoverably corrupt state to be rectified by composing a custom
> > snapshot of cluster metadata and uploading it to all nodes in the
> > cluster”
> >
> > What do you expect this to look like in practice? JSON representation
> > of the ring? Would reads and writes have halted? In what situations
> > would the database be entirely unavailable?
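For illustration only: the CEP does not specify a snapshot format, so the
sketch below is just one plausible reading of the override-mode passage
quoted above. Every type, field and method name in it is a hypothetical
stand-in, not something taken from the proposal or its prototype.

    import java.util.List;
    import java.util.UUID;

    // Hypothetical shapes throughout; nothing here is CEP-21 API.
    public class MetadataOverrideSketch
    {
        // A monotonically increasing epoch lets a node reject snapshots
        // older than the state it already holds.
        record ClusterMetadataSnapshot(long epoch,
                                       List<NodeState> nodes,
                                       List<TokenOwnership> ring) {}

        record NodeState(UUID hostId, String endpoint, String status) {}

        record TokenOwnership(long token, UUID ownerHostId) {}

        static ClusterMetadataSnapshot current =
            new ClusterMetadataSnapshot(0, List.of(), List.of());

        // The "override" path: force-install a complete snapshot locally,
        // bypassing the (unavailable or corrupt) CMS. A real override
        // mode would gate this behind explicit operator action and apply
        // it on every node in the cluster.
        static void forceInstall(ClusterMetadataSnapshot snapshot)
        {
            if (snapshot.epoch() <= current.epoch())
                throw new IllegalStateException(
                    "rejecting stale snapshot at epoch " + snapshot.epoch());
            current = snapshot;
        }

        public static void main(String[] args)
        {
            UUID host = UUID.randomUUID();
            forceInstall(new ClusterMetadataSnapshot(
                1,
                List.of(new NodeState(host, "10.0.0.1:7000", "NORMAL")),
                List.of(new TokenOwnership(Long.MIN_VALUE, host))));
            System.out.println("installed snapshot at epoch " + current.epoch());
        }
    }

Whether an operator composes such a snapshot as JSON or anything else is
arguably just a serialization question; the interesting properties are
that the snapshot is complete and that it can be ordered against
whatever state the nodes already hold.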
> > On Aug 22, 2022, at 11:15 AM, Derek Chen-Becker <de...@chen-becker.org> wrote:
> >
> > > This looks really interesting; thanks for putting this together!
> > > Just so I'm clear on CEP nomenclature, having external management
> > > of metadata as a non-goal doesn't preclude some future use,
> > > correct? Coincidentally, I'm working on my ApacheCon talk on
> > > improving modularity in Cassandra, and one of the ideas I'm
> > > discussing is pluggably (?) replacing gossip with something(s) that
> > > allow us to externalize some of the complexity of maintaining
> > > consistency. I need to digest the proposal you've made, but I don't
> > > see the two ideas being at odds on my first read.
> > >
> > > Cheers,
> > >
> > > Derek
> > >
> > > On Mon, Aug 22, 2022 at 6:45 AM Sam Tunnicliffe <s...@beobal.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I'd like to open discussion about this CEP:
> > > > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21%3A+Transactional+Cluster+Metadata
> > > >
> > > > Cluster metadata in Cassandra comprises a number of disparate
> > > > elements including, but not limited to, distributed schema,
> > > > topology and token ownership. Following the general design
> > > > principles of Cassandra, the mechanisms for coordinating updates
> > > > to cluster state have favoured eventual consistency, with
> > > > probabilistic delivery via gossip being a prime example.
> > > > Undoubtedly, this approach has benefits, not least in terms of
> > > > resilience, particularly in highly fluid distributed
> > > > environments. However, this is not the reality of most Cassandra
> > > > deployments, where the total number of nodes is relatively small
> > > > (i.e. in the low thousands) and the rate of change tends to be
> > > > low.
> > > >
> > > > Historically, a significant proportion of issues affecting
> > > > operators and users of Cassandra have been due, at least in part,
> > > > to a lack of strongly consistent cluster metadata. In response to
> > > > this, we propose a design which aims to provide linearizability
> > > > of metadata changes whilst ensuring that the effects of those
> > > > changes are made visible to all nodes in a strongly consistent
> > > > manner. At its core, it is also pluggable, enabling
> > > > Cassandra-derived projects to supply their own implementations if
> > > > desired.
> > > >
> > > > In addition to the CEP document itself, we aim to publish a
> > > > working prototype of the proposed design. Obviously, this does
> > > > not implement the entire proposal, and there are several parts
> > > > which remain only partially complete. It does include the core of
> > > > the system, including a good deal of test infrastructure, so it
> > > > may serve both as an illustration of the design and as a starting
> > > > point for real implementation.
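The pluggability point in the last two paragraphs suggests a fairly
small seam. As a rough sketch only, with names that are illustrative
assumptions rather than APIs from the CEP or its prototype, a
linearizable metadata service might expose something like:

    import java.util.concurrent.CompletableFuture;

    // Illustrative seam for a pluggable, linearizable metadata service;
    // none of these names come from CEP-21 or its prototype.
    public interface ClusterMetadataService
    {
        // Submit a change. Accepted changes are totally ordered, and the
        // returned future completes with the epoch at which this change
        // was committed.
        CompletableFuture<Long> commit(Transformation change);

        // Block until this node has applied the shared log up to the
        // given epoch, making the effects of earlier changes locally
        // visible before the caller proceeds.
        void awaitAtLeast(long epoch) throws InterruptedException;

        // A change expressed as a pure function of the current state, so
        // every node that applies the log in order arrives at the same
        // result.
        interface Transformation
        {
            ClusterState apply(ClusterState current);
        }

        // Opaque placeholder for schema, topology, token ownership, etc.
        interface ClusterState {}
    }

Under that shape, the main thing an alternative implementation (or an
externalized one, per the modularity idea above) would have to supply is
a durable, totally ordered log; the rest is deterministic replay.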