Re: [DISCUSS] CEP-10: Cluster and Code Simulations

bened...@apache.org Tue, 13 Jul 2021 02:26:58 -0700

Hi Benjamin,

The concurrency constructs listed are all _blocking_ concurrency primitives, 
i.e. they put threads to sleep and wake them up. Since the goal of this work is 
pseudorandom execution of the application, trapping thread events is a central 
feature.


The ability to mock the file system is only to ensure the execution is 
_deterministic_. Otherwise a cluster running billions of simulations would be 
almost useless, as you would not readily be able to reproduce the sequence on a 
local machine. The execution order is extremely brittle, with even a different 
patch release of the JVM being able to produce a different sequence of 
execution (in some cases, at least – no doubt many patch releases do not have 
ordering impacts).
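
As a rough illustration of the idea (not code from the CEP; `FileStore` and 
`InMemoryFileStore` are hypothetical names invented for this sketch), file 
access can sit behind a small interface so that a simulation run substitutes 
an in-memory implementation. That removes disk timing from the picture, and 
also listing order, which `java.io.File.list()` does not guarantee:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory stand-in for simulation runs: no disk latency, no OS-dependent listing order. */
final class InMemoryFileStore implements FileStore {
    // TreeMap keeps paths sorted, so list() returns the same order on every machine.
    private final Map<String, byte[]> files = new TreeMap<>();

    @Override
    public void write(String path, byte[] data) {
        files.put(path, data.clone());
    }

    @Override
    public byte[] read(String path) {
        byte[] data = files.get(path);
        if (data == null) throw new IllegalArgumentException("no such file: " + path);
        return data.clone();
    }

    @Override
    public List<String> list(String prefix) {
        List<String> result = new ArrayList<>();
        for (String path : files.keySet())
            if (path.startsWith(prefix)) result.add(path);
        return result;
    }

    public static void main(String[] args) {
        FileStore store = new InMemoryFileStore();
        store.write("data/sstable-2", new byte[] {2});
        store.write("data/sstable-1", new byte[] {1});
        System.out.println(store.list("data/"));
    }
}

/** Hypothetical abstraction the application would code against instead of java.io directly. */
interface FileStore {
    void write(String path, byte[] data);
    byte[] read(String path);
    List<String> list(String prefix);
}
```

Production code would bind the same interface to the real file system; only 
the simulated runs need the deterministic variant.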

The best example of this work is the LWT linearizability verifier that will be 
included with it, which is quite a simple test to put together with the 
simulator: you simply issue some LWT reads and writes to a cluster, and the 
simulator intercepts every message and thread (and in some specific relevant 
cases, memory access) event, and executes them in pseudorandom order. Each run 
exhibits unique behaviour, exploring different edge cases in the system. If we 
were to only intercept message events, we would fail to explore a wide variety 
of potentially erroneous states in the system – including even those only 
related to message delivery (in the real world, responses can be received 
before the thread sending them completes the act of doing so, for instance).
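
To make the "intercept, then execute in pseudorandom order" idea concrete, 
here is a toy sketch. It is not the simulator and makes no claim about its 
design; `ToyScheduler`, `intercept` and `simulate` are hypothetical names. 
Intercepted events are queued rather than run, and a seeded `java.util.Random` 
picks which one runs next, so a given seed always replays the same sequence:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Toy illustration only: all names are hypothetical, not taken from CEP-10. */
final class ToyScheduler {
    private final Random random;
    private final List<Runnable> pending = new ArrayList<>();
    final List<String> trace = new ArrayList<>();

    ToyScheduler(long seed) {
        this.random = new Random(seed);
    }

    /** An intercepted event (message delivery, thread wake-up, ...) is queued instead of run. */
    void intercept(String name, Runnable action) {
        pending.add(() -> {
            trace.add(name);
            action.run();
        });
    }

    /** Runs queued events one at a time, choosing the next one pseudorandomly. */
    void runAll() {
        while (!pending.isEmpty()) {
            pending.remove(random.nextInt(pending.size())).run();
        }
    }

    /** Two clients each send a request; delivering a request queues its response. */
    static List<String> simulate(long seed) {
        ToyScheduler s = new ToyScheduler(seed);
        for (String client : new String[] {"A", "B"}) {
            s.intercept(client + ":request", () ->
                s.intercept(client + ":response", () -> {}));
        }
        s.runAll();
        return s.trace;
    }

    public static void main(String[] args) {
        // Each seed yields one interleaving; rerunning a seed reproduces it.
        for (long seed = 0; seed < 4; seed++) {
            System.out.println(seed + " -> " + simulate(seed));
        }
    }
}
```

A failing interleaving found on a large cluster run can then be replayed 
locally from its seed alone, provided nothing outside the scheduler (time, 
disk, real threads) influences the order.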


From: Benjamin Lerer <ble...@apache.org>
Date: Tuesday, 13 July 2021 at 09:50
To: dev@cassandra.apache.org <dev@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-10: Cluster and Code Simulations

Hi Benedict, Sam,

Could you describe some of the scenarios that this new framework will allow
us to test? They might help me to understand the changes required.
The need for the changes around concurrency and file access is not obvious
to me. Consequently, I am guessing that I probably do not fully
understand the goal of the proposal.

Thanks in advance

Benjamin


On Tue, 13 Jul 2021 at 10:37, Sam Tunnicliffe <s...@beobal.com> wrote:

> Spoiler alert: I am pretty familiar with the proposal and the off-list
> work that has been done toward it.
>
> From my perspective, I have no qualms about putting this CEP up for a
> vote. Having seen the potential (and to some degree, realised) benefit of
> this proposal, I am convinced of its value.
>
> Thanks,
> Sam
>
> > On 13 Jul 2021, at 09:20, bened...@apache.org wrote:
> >
> > Did anyone have any thoughts on this CEP, or shall I bring it forward
> > for a vote also?
> >
> > From: bened...@apache.org <bened...@apache.org>
> > Date: Thursday, 3 June 2021 at 20:19
> > To: dev@cassandra.apache.org <dev@cassandra.apache.org>
> > Subject: [DISCUSS] CEP-10: Cluster and Code Simulations
> >
> > Proposal for a mechanism to evaluate whole clusters, or individual
> > classes, with a deterministically pseudorandom ordering of all thread and
> > message events.
> >
> >
> > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-10%3A+Cluster+and+Code+Simulations
> >
> > Evaluating the correctness of distributed systems is hard, as I’m sure
> > every developer on this list appreciates. As the project has matured, we
> > have had to grapple more with the guarantees we provide users for features
> > we develop, and the semantics we promise, particularly around edge-cases
> > between two mechanisms or systems.
> >
> > This work aims to dramatically reduce the project overhead necessary for
> > delivering a bug-free Cassandra.
> >
> > The premise is to intercept all relevant events that could be performed
> > in a different order, i.e. primarily message delivery and thread events
> > such as executor submission, signalling of threads, lock acquisition and
> > release, and even volatile reads and writes (to a lesser extent). These
> > events are then scheduled pseudo-randomly (with various restrictions to
> > ensure a valid execution), or in some cases not evaluated at all (to
> > simulate e.g. messages being lost). The result is a repeatable sequential
> > evaluation of a multi-threaded, multi-actor system.
> >
> > This permits us to evaluate a much broader range of cluster behaviours
> > without any additional development work, so we can implement a broad
> > range of property-based and related randomized acceptance tests without
> > significant developer burden.
> >
> > The work will apply just as readily to multi-threaded single classes as
> > it will to whole clusters, and will come with a linearizability test for
> > LWTs as well as a unit test for an existing multi-threaded bug that is
> > otherwise hard to exhibit.
> >
> > To achieve this, significant modifications will be required to the
> > codebase, mostly cleaning up existing abstractions. Specifically, we will
> > need to be able to mock executors, any blocking concurrency primitives,
> > time, filesystem access and internode streaming.
> >
> > The work is – in large part – already complete, with JIRA and PRs to
> > follow in the coming weeks. Of course, the work is subject to the usual
> > community input and review, so this does not preclude changes to the work
> > (even significant ones, if they are warranted). I know a lot of incoming
> > CEPs are likely to be backed up by significant off-list development as a
> > result of the focus on a shippable 4.0. Hopefully this is just a temporary
> > growing pain, particularly as we move towards a shippable trunk.
> >
> > I hope this work will be of huge value to the project, particularly as
> > we race to catch up on years of limited feature development.
> >
> > JIRA and PRs will follow, but I wanted to kick off discussion in advance.
>
>
