On Tue, Aug 23, 2022 at 2:51 AM Sam Tunnicliffe <s...@beobal.com> wrote:

>
> Regular read/write operations should not be halted, even by a total
> failure of the metadata service. There should be no situations where a
> previously stable database becomes entirely unavailable due to a CMS
> failure. The worst case is where there is some unavailability due to
> permanent failure of multiple nodes where those nodes happen to represent a
> majority of the CMS. In this scenario, the CMS would need to be recovered
> before the down nodes could be replaced, so it's possible it would extend
> the period of unavailability, though not necessarily by much.
>

This seems like a reasonable tradeoff. The current approach tries to
achieve better availability at the risk of losing consistency, yet it still
has failure modes that require manual intervention. What I really like
about the proposal is that it's a path toward separating the cluster's
responsibilities into well-defined sub-services.

Cheers,

Derek




>
>
> On 23 Aug 2022, at 05:42, Jeff Jirsa <jji...@gmail.com> wrote:
>
> “ The proposed mechanism for dealing with both of these failure types is
> to enable a manual operator override mode. This would allow operators to
> inject metadata changes (potentially overriding the complete metadata
> state) directly on any and all nodes in a cluster. At the most extreme end
> of the spectrum, this could allow an unrecoverably corrupt state to be
> rectified by composing a custom snapshot of cluster metadata and uploading
> it to all nodes in the cluster”
>
> What do you expect this to look like in practice? JSON representation of
> the ring? Would reads and writes have halted? In what situations would the
> database be entirely unavailable?
>
>
>
> On Aug 22, 2022, at 11:15 AM, Derek Chen-Becker <de...@chen-becker.org>
> wrote:
>
> 
> This looks really interesting; thanks for putting this together! Just so
> I'm clear on CEP nomenclature, having external management of metadata as a
> non-goal doesn't preclude some future use, correct? Coincidentally, I'm
> working on my ApacheCon talk on improving modularity in Cassandra and one
> of the ideas I'm discussing is pluggably (?) replacing gossip with
> something(s) that allow us to externalize some of the complexity of
> maintaining consistency. I need to digest the proposal you've made, but I
> don't see the two ideas being at odds on my first read.
>
> Cheers,
>
> Derek
>
> On Mon, Aug 22, 2022 at 6:45 AM Sam Tunnicliffe <s...@beobal.com> wrote:
>
>> Hi,
>>
>> I'd like to open discussion about this CEP:
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21%3A+Transactional+Cluster+Metadata
>>
>> Cluster metadata in Cassandra comprises a number of disparate elements
>> including, but not limited to, distributed schema, topology and token
>> ownership. Following the general design principles of Cassandra, the
>> mechanisms for coordinating updates to cluster state have favoured eventual
>> consistency, with probabilistic delivery via gossip being a prime example.
>> Undoubtedly, this approach has benefits, not least in terms of resilience,
>> particularly in highly fluid distributed environments. However, this is not
>> the reality of most Cassandra deployments, where the total number of nodes
>> is relatively small (i.e. in the low thousands) and the rate of change
>> tends to be low.
>>
>> Historically, a significant proportion of issues affecting operators and
>> users of Cassandra have been due, at least in part, to a lack of strongly
>> consistent cluster metadata. In response to this, we propose a design which
>> aims to provide linearizability of metadata changes whilst ensuring that
>> the effects of those changes are made visible to all nodes in a strongly
>> consistent manner. At its core, it is also pluggable, enabling
>> Cassandra-derived projects to supply their own implementations if desired.
>> In addition to the CEP document itself, we aim to publish a working
>> prototype of the proposed design. Obviously, this does not implement the
>> entire proposal and there are several parts which remain only partially
>> complete. It does include the core of the system, including a good deal of
>> test infrastructure, so it may serve as both an illustration of the design and a
>> starting point for real implementation.
>>
>>
>
> --
> +---------------------------------------------------------------+
> | Derek Chen-Becker                                             |
> | GPG Key available at https://keybase.io/dchenbecker and       |
> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
> +---------------------------------------------------------------+
>
>
>

-- 
+---------------------------------------------------------------+
| Derek Chen-Becker                                             |
| GPG Key available at https://keybase.io/dchenbecker and       |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
+---------------------------------------------------------------+
