Re: [DISCUSS] CASSANDRA-12126: LWTs correctness and performance

2020-11-18 Thread Benedict Elliott Smith
It doesn't seem like there's much enthusiasm for any of the options available 
here...

On 12/11/2020, 14:37, "Benedict Elliott Smith" wrote:

> Is the new implementation a separate, distinctly modularized new body of work

It’s primarily a distinct, modularised and new body of work; however, some 
shared code has been modified - namely PaxosState, in which legacy code is 
maintained but modified for compatibility, and the system.paxos table (which 
receives a new column and slightly modified serialization code).  It is 
conceptually an optimised version of the existing algorithm.

If there's a chance of being of value to 4.0, I can try to put up a patch 
next week alongside a high level description of the changes.

> But a performance regression is a regression, I'm not shrugging it off.

I don't want to give the impression I'm shrugging off the correctness issue 
either. It's a serious issue to fix, but since all successful updates to the 
database are linearizable, I think it's likely that many applications behave 
correctly with the present semantics, or at least encounter only transient 
errors. No doubt many also do not, but I have no idea of the ratio.

The regression isn't itself a simple issue either - depending on the 
topology and message latencies it is not difficult to produce inescapable 
contention, i.e. guaranteed timeouts - that might persist as long as clients 
continue to retry. It could be quite a serious degradation of service to impose 
on our users.
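
To make the contention concern concrete, here is a minimal sketch of how proposers that keep preempting one another can fail to make progress for as long as they retry. It is not the specific scenario from CASSANDRA-12126 and not Cassandra's implementation: the single idealised acceptor, the fixed `RTT`, the `run` function and the `backoff` mapping are all assumptions made purely for illustration.

```python
import heapq

RTT = 2  # assumed delay between a proposer's prepare and its accept arriving


def run(backoff, max_attempts=20):
    """Simulate two proposers competing at one idealised acceptor.

    backoff maps a proposer name to the delay before it retries.
    Returns (winner, attempts), or (None, attempts) if no proposal was
    accepted within max_attempts prepares (the observation window).
    """
    promised = 0   # highest ballot the acceptor has promised
    ballot = 0     # shared counter, so every new ballot is higher
    attempts = 0
    pending = {}   # proposer -> ballot of its in-flight proposal
    events = [(0, "prepare", "A"), (1, "prepare", "B")]  # (time, kind, who)
    heapq.heapify(events)

    while events:
        time, kind, who = heapq.heappop(events)
        if kind == "prepare":
            if attempts == max_attempts:
                return None, attempts  # stop observing: nothing succeeded
            attempts += 1
            ballot += 1
            promised = ballot          # a higher ballot always wins the promise
            pending[who] = ballot
            heapq.heappush(events, (time + RTT, "accept", who))
        elif pending[who] == promised:
            return who, attempts       # not preempted: proposal accepted
        else:
            # Preempted by a newer ballot: retry after this proposer's backoff.
            heapq.heappush(events, (time + backoff[who], "prepare", who))
    return None, attempts


print(run({"A": 0, "B": 0}))  # immediate retries on both sides
print(run({"A": 0, "B": 5}))  # B backs off long enough for A to finish
```

With both proposers retrying immediately, each prepare lands between the other's prepare and accept, so nothing is accepted within the window. Staggering the retries (a deterministic stand-in here for randomised backoff) lets one proposal complete.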

I don't pretend to know the correct way to make a decision balancing these 
considerations, but I am perhaps more concerned about imposing service outages 
than I am about temporarily maintaining semantics our users have apparently 
accepted for years - though I absolutely share your embarrassment there.


On 12/11/2020, 12:41, "Joshua McKenzie"  wrote:

Is the new implementation a separate, distinctly modularized new body of
work or does it make substantial changes to existing implementation and
subsume it?

On Thu, Nov 12, 2020 at 3:56 AM Sylvain Lebresne wrote:

> Regarding option #4, I'll remark that experience tends to suggest users
> don't consistently read the `NEWS.txt` file on upgrade, so option #4 will
> likely essentially mean "LWT has a correctness issue, but once it has broken
> your data enough that you'll notice, you'll be able to dig up the proper flag
> to fix it for next time". I guess it's better than nothing, of course, but
> I'll admit that defaulting to "opt-in correctness", especially for a
> feature (LWT) that exists uniquely to provide additional guarantees, is
> something I have a hard time rallying behind.
>
> But a performance regression is a regression, I'm not shrugging it off.
> Still, I feel we shouldn't leave LWT with a fairly serious known
> correctness bug and I frankly feel bad for "the project" that this has been
> known for so long without action, so I'm a bit biased in wanting to get it
> fixed asap.
>
> But maybe I'm overstating the urgency here, and maybe option #1 is a better
> way forward.
>
> --
> Sylvain
>



-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org







Re: [DISCUSS] CASSANDRA-12126: LWTs correctness and performance

2020-11-18 Thread Jeff Jirsa
This is complicated and relatively few people on earth understand it, so
having little feedback is mostly expected, unfortunately.

My normal emotional response is "correctness is required, opt-in to
performance improvements that sacrifice strict correctness", but I'm also
sure this is going to surprise people, and would understand / accept #4
(default to current, opt-in to correct).




Re: [DISCUSS] CASSANDRA-12126: LWTs correctness and performance

2020-11-18 Thread Benedict Elliott Smith
Perhaps there might be broader appetite to weigh in on which major releases we 
might target for work that fixes the correctness bug without serious 
performance regression?

i.e., if we were to fix the correctness bug now, introducing a serious 
performance regression (either opt-in or opt-out), but were to land work 
without this problem for 5.0, would there be appetite to backport this work to 
any of 4.0, 3.11 or 3.0? 



Re: [DISCUSS] CASSANDRA-12126: LWTs correctness and performance

2020-11-18 Thread David Capwell
I feel that #4 (fix bug and add flag to roll back to old behavior) is best.

About the alternative implementation, I am fine with adding it to 3.x and 4.0,
but we should treat it as a different path, disabled by default, that you can
opt into, with a plan to enable it by default "eventually".
