Re: [DISCUSS] Releasable trunk and quality

2021-11-03 Thread Oleksandr Petrov
I'll merge 16262 and the Harry blog-post that accompanies it shortly.
Having 16262 merged will significantly reduce the amount of resistance one
has to overcome in order to write a fuzz test. But this, of course, only
covers short/small/unit-test-like tests.

For longer-running tests, I guess for now we will have to rely on folks
(hopefully) running long fuzz tests and reporting issues. But eventually
it'd be great to have enough automation around it so that anyone could do
that and so that test results are public.

As for long-running tests, with Harry we can currently run three kinds:
1. Stress-like concurrent write workload, followed by periods of quiescence
and then validation
2. Writes with injected faults, followed by repair and validation
3. Stress-like concurrent read/write workload with fault injection without
validation, for finding rare edge conditions / triggering possible
exceptions

This means that quorum read and write paths (for all kinds of schemas,
including all possible kinds of read and write queries), compactions,
repairs, read-repairs and hints are covered fairly well. However, things
like bootstrap and other kinds of range movements aren't. It'd be great to
expand this, but it's been somewhat difficult to do, since the last time a
bootstrap test was attempted, it immediately uncovered enough issues to
keep us busy fixing them for quite some time. Maybe it's about time to try
that again.
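
As a rough, non-Harry illustration of workload kind 1 above (concurrent
writes, a period of quiescence, then validation), a toy harness might be
structured like the sketch below. The SystemUnderTest interface, key
partitioning and sizes are placeholders; in practice the system under test
would be a real cluster (e.g. via the in-jvm dtest API) and the model and
validation logic would be Harry's own:

import java.util.*;
import java.util.concurrent.*;

// Schematic only: not Harry's API, just the shape of the workload.
interface SystemUnderTest {
    void write(long key, long value);
    Long read(long key);
}

public class WriteThenValidate {
    static final int WRITERS = 4, OPS_PER_WRITER = 10_000;

    public static void run(long seed, SystemUnderTest sut) throws Exception {
        ConcurrentMap<Long, Long> model = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(WRITERS);

        // Phase 1: concurrent write workload. Each writer owns a disjoint key
        // range and generates its operations deterministically from the seed.
        for (int w = 0; w < WRITERS; w++) {
            final int writer = w;
            pool.submit(() -> {
                Random rng = new Random(seed + writer);
                for (int i = 0; i < OPS_PER_WRITER; i++) {
                    long key = (long) writer * OPS_PER_WRITER + i;
                    long value = rng.nextLong();
                    model.put(key, value);
                    sut.write(key, value);
                }
            });
        }

        // Phase 2: quiescence, waiting for all in-flight writes to finish.
        pool.shutdown();
        if (!pool.awaitTermination(10, TimeUnit.MINUTES))
            throw new IllegalStateException("workload did not quiesce in time");

        // Phase 3: validation. Every key the model knows must read back correctly.
        for (Map.Entry<Long, Long> e : model.entrySet())
            if (!e.getValue().equals(sut.read(e.getKey())))
                throw new AssertionError("mismatch at key " + e.getKey() + " (seed " + seed + ")");
    }
}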

For short tests, you can think of Harry as a tool to save you time and
allow focusing on higher-level test meaning rather than creating schema and
coming up with specific values to insert/select.

Thanks
--Alex



On Tue, Nov 2, 2021 at 5:30 PM Ekaterina Dimitrova 
wrote:

> Did I hear my name? 😁
> Sorry Josh, you are wrong :-) 2 out of 30 in two months were real bugs
> discovered by flaky tests and one of them was very hard to hit. So 6-7%. I
> think that report I sent back then didn’t come through so the topic was
> cleared in a follow up mail by Benjamin; with a lot of sweat but we kept to
> the promised 4.0 standard.
>
> Now back to this topic:
> - unfortunately, to me green CI without enough test coverage is nothing more
> than green CI.  I know this is an elephant in the room, but I won’t sleep well
> tonight if I don’t mention it.
> - I believe the looping of tests mentioned by Berenguer can help for
> verifying no new weird flakiness is introduced by new tests added. And of
> course it helps a lot during fixing flaky tests, I think that’s clear.
>
>  I think that it would be great if each such test
> > > (or
> > > > test group) was guaranteed to have a ticket and some preliminary
> > analysis
> > > > was done to confirm it is just a test problem before releasing the
> new
> > > > version
>
> Probably not a bad idea: preliminary analysis. But we need to get into the
> cadence of regularly checking our CI; divide and conquer on a regular basis
> between all of us. Not to mention it is way easier to follow up on recently
> introduced issues with the people who worked on them than to try to find
> out what happened a year ago in a rush before a release. I agree it is not
> about the number but about what stands behind it.
>
> Requiring all tests to run before every merge: we can easily add this in
> Circle, but there are many people who don’t have access to high resources, so
> again they won’t be able to run absolutely everything. In the end
> everything is up to the diligence of the reviewers/committers. Plus, the
> official CI is Jenkins, and we know there are different infra-related
> failures in the different CIs. Not an easy topic, indeed. I support running
> all tests, just keeping in mind all the related issues/complications.
>
> I would say upgrade tests in particular are important to be green
> before a release, too.
>
> Seems to me we have the tools, but now it is time to organize the rhythm in
> an efficient manner.
>
> Best regards,
> Ekaterina
>
>
> On Tue, 2 Nov 2021 at 11:06, Joshua McKenzie  wrote:
>
> > To your point Jacek, I believe in the run up to 4.0 Ekaterina did some
> > analysis and something like 18% (correct me if I'm wrong here) of the
> test
> > failures we were considering "flaky tests" were actual product defects in
> > the database. With that in mind, we should be uncomfortable cutting a
> > release if we have 6 test failures since there's every likelihood one of
> > them is a surfaced bug.
> >
> > ensuring our best practices are followed for every merge
> >
> > I totally agree but I also don't think we have this codified (unless I'm
> > just completely missing something - very possible! ;)) Seems like we have
> > different circle configs, different sets of jobs being run, Harry /
> Hunter
> > (maybe?) / ?? run on some but not all commits and/or all branches,
> > manual performance testing on specific releases but nothing surfaced
> > formally to the project as a reproducible suite like we used to have
> years
> > ago (primitive though it was at the time with what it c

Re: [DISCUSS] CEP-18: Improving Modularity

2021-11-03 Thread Jeremiah D Jordan
It seems like there are many people in this thread who would rather we not 
make a “grouping” CEP for these ideas, but rather consider each one 
individually.  I will close out this CEP thread then, and discussions can 
continue on individual tickets.

I think we got some nice discussion on this thread on what people would like to 
see from these types of refactoring tickets, so that will be good information to 
take forward as we work on individual tickets.

Thanks for the discussion everyone.

-Jeremiah Jordan

> On Oct 27, 2021, at 1:28 AM, Dinesh Joshi  wrote:
> 
>> On Oct 25, 2021, at 1:22 PM, Jeremiah D Jordan  wrote:
>> 
>> The currently proposed changes in CEP-18 should all include improved test 
>> coverage of the modules in question.  We have been developing them all with 
>> a requirement that all changes have at least 80% code coverage from sonar 
>> cloud jacoco reports.  We have also found and fixed some bugs in the 
>> existing code during this development work.
> 
> This is great! We, as a project, should encourage improved test code 
> coverage. So I welcome this change.
> 
>> So do people feel we should re-propose these as multiple CEPs or just 
>> tickets?  Or do people prefer to have a discussion/vote on the idea of 
>> improving the modularity of the code base in general?
> 
> My personal preference would be to see this work appear as individual CEPs or 
> even JIRA tickets with discussions but definitely not one giant CEP that is 
> pulling together a lot of different changes.
> 
> I really like the idea of building pluggable modular components. However, I 
> am concerned about a few things.
> 
> 1. Performance regression.
> 2. Breaking backward compatibility for our users & tools.
> 3. Interfaces with single implementation.
> 
> I would like to ensure that we are mindful of these concerns while making big 
> refactors.
> 
> Thanks,
> 
> Dinesh


-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: [DISCUSS] Releasable trunk and quality

2021-11-03 Thread Joshua McKenzie
>
> It'd be great to
> expand this, but it's been somewhat difficult to do, since last time a
> bootstrap test was attempted, it has immediately uncovered enough issues to
> keep us busy fixing them for quite some time. Maybe it's about time to try
> that again.

I'm going to go with a "yes please". :)

On Wed, Nov 3, 2021 at 9:27 AM Oleksandr Petrov 
wrote:

> I'll merge 16262 and the Harry blog-post that accompanies it shortly.
> Having 16262 merged will significantly reduce the amount of resistance one
> has to overcome in order to write a fuzz test. But this, of course, only
> covers short/small/unit-test-like tests.
>
> For longer running tests, I guess for now we will have to rely on folks
> (hopefully) running long fuzz tests and reporting issues. But eventually
> it'd be great to have enough automation around it so that anyone could do
> that and where test results are public.
>
> In regard to long-running tests, currently with Harry we can run three
> kinds of long-running tests:
> 1. Stress-like concurrent write workload, followed by periods of quiescence
> and then validation
> 2. Writes with injected faults, followed by repair and validation
> 3. Stress-like concurrent read/write workload with fault injection without
> validation, for finding rare edge conditions / triggering possible
> exceptions
>
> Which means that quorum read and write paths (for all kinds of schemas,
> including all possible kinds of read and write queries), compactions,
> repairs, read-repairs and hints are covered fairly well. However things
> like bootstrap and other kinds of range movements aren't. It'd be great to
> expand this, but it's been somewhat difficult to do, since last time a
> bootstrap test was attempted, it has immediately uncovered enough issues to
> keep us busy fixing them for quite some time. Maybe it's about time to try
> that again.
>
> For short tests, you can think of Harry as a tool to save you time and
> allow focusing on higher-level test meaning rather than creating schema and
> coming up with specific values to insert/select.
>
> Thanks
> --Alex
>
>
>
> On Tue, Nov 2, 2021 at 5:30 PM Ekaterina Dimitrova 
> wrote:
>
> > Did I hear my name? 😁
> > Sorry Josh, you are wrong :-) 2 out of 30 in two months were real bugs
> > discovered by flaky tests and one of them was very hard to hit. So
> 6-7%. I
> > think that report I sent back then didn’t come through so the topic was
> > cleared in a follow up mail by Benjamin; with a lot of sweat but we kept
> to
> > the promised 4.0 standard.
> >
> > Now back to this topic:
> > - green CI without enough test coverage is nothing more than green CI
> > unfortunately to me.  I know this is an elephant but I won’t sleep well
> > tonight if I don’t mention it.
> > - I believe the looping of tests mentioned by Berenguer can help for
> > verifying no new weird flakiness is introduced by new tests added. And of
> > course it helps a lot during fixing flaky tests, I think that’s clear.
> >
> >  I think that it would be great if each such test
> > > > (or
> > > > > test group) was guaranteed to have a ticket and some preliminary
> > > analysis
> > > > > was done to confirm it is just a test problem before releasing the
> > new
> > > > > version
> >
> > Probably not bad idea. Preliminary analysis. But we need to get into the
> > cadence of regular checking our CI; divide and conquer on regular basis
> > between all of us. Not to mention it is way easier to follow up recently
> > introduced issues with the people who worked on stuff then trying to find
> > out what happened a year ago in a rush before a release. I agree it is
> not
> > about the number but what stays behind it.
> >
> > Requiring all tests to run pre every merge, easily we can add this in
> > circle but there are many people who don’t have access to high resources
> so
> > again they won’t be able to run absolutely everything. At the end
> > everything is up to the diligence of the reviewers/committers. Plus
> > official CI is Jenkins and we know there are different infra related
> > failures in the different CIs. Not an easy topic, indeed. I support
> running
> > all tests, just having in mind all the related issues/complications.
> >
> > I would say in my mind upgrade tests are particularly important to be
> green
> > before a release, too.
> >
> > Seems to me we have the tools, but now it is time to organize the rhythm
> in
> > an efficient manner.
> >
> > Best regards,
> > Ekaterina
> >
> >
> > On Tue, 2 Nov 2021 at 11:06, Joshua McKenzie 
> wrote:
> >
> > > To your point Jacek, I believe in the run up to 4.0 Ekaterina did some
> > > analysis and something like 18% (correct me if I'm wrong here) of the
> > test
> > > failures we were considering "flaky tests" were actual product defects
> in
> > > the database. With that in mind, we should be uncomfortable cutting a
> > > release if we have 6 test failures since there's every likelihood one
> of
> > > them is a surfaced bug.
> > >
> > > ensuring o

Re: [DISCUSS] Releasable trunk and quality

2021-11-03 Thread Brandon Williams
On Mon, Nov 1, 2021 at 5:03 PM David Capwell  wrote:
>
> > How do we define what "releasable trunk" means?
>
> One thing I would love is for us to adopt a “run all tests needed to release 
> before commit” mentality, and to link a successful run in JIRA when closing 
> (we talked about this once in slack).  If we look at CircleCI we currently do 
> not run all the tests needed to sign off; below are the tests disabled in the 
> “pre-commit” workflows (see 
> https://github.com/apache/cassandra/blob/trunk/.circleci/config-2_1.yml#L381):

A good first step toward this would be for us to treat our binding +1s
more judiciously, and not grant any without at least a pre-commit CI
run linked in the ticket.  You don't have to look very hard to find a
lot of these today (I know I'm guilty), and it's possible we wouldn't
have the current CI mess now if we had been a little bit more
diligent.

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: [DISCUSS] Releasable trunk and quality

2021-11-03 Thread bened...@apache.org
The largest number of test failures turn out (as pointed out by David) to be 
due to how arcane it was to trigger the full test suite. Hopefully we can get 
on top of that, but I think a significant remaining issue is a lack of trust in 
the output of CI. It’s hard to gate commit on a clean CI run when there’s flaky 
tests, and it doesn’t take much to misattribute one failing test to the 
existing flakiness (I tend to compare to a run of the trunk baseline for 
comparison, but this is burdensome and still error prone). The more flaky tests 
there are the more likely this is.

This is in my opinion the real cost of flaky tests, and it’s probably worth 
trying to crack down on them hard if we can. It’s possible the Simulator may 
help here, when I finally finish it up, as we can port flaky tests to run with 
the Simulator and the failing seed can then be explored deterministically (all 
being well).

From: Brandon Williams 
Date: Wednesday, 3 November 2021 at 17:07
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Releasable trunk and quality
On Mon, Nov 1, 2021 at 5:03 PM David Capwell  wrote:
>
> > How do we define what "releasable trunk" means?
>
> One thing I would love is for us to adopt a “run all tests needed to release 
> before commit” mentality, and to link a successful run in JIRA when closing 
> (we talked about this once in slack).  If we look at CircleCI we currently do 
> not run all the tests needed to sign off; below are the tests disabled in the 
> “pre-commit” workflows (see 
> https://github.com/apache/cassandra/blob/trunk/.circleci/config-2_1.yml#L381):

A good first step toward this would be for us to treat our binding +1s
more judiciously, and not grant any without at least a pre-commit CI
run linked in the ticket.  You don't have to look very hard to find a
lot of these today (I know I'm guilty), and it's possible we wouldn't
have the current CI mess now if we had been a little bit more
diligent.

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org


Re: [DISCUSS] Releasable trunk and quality

2021-11-03 Thread Brandon Williams
On Wed, Nov 3, 2021 at 12:35 PM bened...@apache.org  wrote:
>
> The largest number of test failures turn out (as pointed out by David) to be 
> due to how arcane it was to trigger the full test suite. Hopefully we can get 
> on top of that, but I think a significant remaining issue is a lack of trust 
> in the output of CI. It’s hard to gate commit on a clean CI run when there’s 
> flaky tests, and it doesn’t take much to misattribute one failing test to the 
> existing flakiness (I tend to compare to a run of the trunk baseline for 
> comparison, but this is burdensome and still error prone). The more flaky 
> tests there are the more likely this is.
>
> This is in my opinion the real cost of flaky tests, and it’s probably worth 
> trying to crack down on them hard if we can. It’s possible the Simulator may 
> help here, when I finally finish it up, as we can port flaky tests to run 
> with the Simulator and the failing seed can then be explored 
> deterministically (all being well).

I totally agree that the lack of trust is a driving problem here, even
in knowing which CI system to rely on. When Jenkins broke but Circle
was fine, we all assumed it was a problem with Jenkins, right up until
Circle also broke.

In testing a distributed system like this I think we're always going
to have failures, even on non-flaky tests, simply because the
underlying infrastructure is variable with transient failures of its
own (the network is reliable!)  We can fix the flakies where the fault
is in the code (and we've done this to many already) but to get more
trustworthy output, I think we're going to need a system that
understands the difference between success, failure, and timeouts, and
in the latter case knows how to at least mark them differently.
Simulator may help, as do the in-jvm dtests, but there is ultimately
no way to cover everything without doing some things the hard, more
realistic way where sometimes shit happens, marring the almost-perfect
runs with noisy doubt, which then has to be sifted through to
determine if there was a real issue.

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: [DISCUSS] Releasable trunk and quality

2021-11-03 Thread David Capwell
> It’s hard to gate commit on a clean CI run when there’s flaky tests

I agree, this is also why so much effort was put into the 4.0 release to remove as 
much as possible.  Just over 1 month ago we were not really having a flaky test 
issue (outside of the sporadic timeout issues; my CircleCI runs were green 
constantly), and now the “flaky tests” I see are all actual bugs (I've been root 
causing 2 out of the 3 I reported) and some (not all) of the flakiness was 
triggered by recent changes in the past month.

Right now people do not believe the failing test is caused by their patch and 
attribute it to flakiness, which then causes the builds to start being flaky, 
which then leads to a different author coming to fix the issue; this behavior 
is what I would love to see go away.  If we find a flaky test, we should do the 
following:

1) Has it already been reported, and who is working to fix it?  Can we block this 
patch on the test being fixed?  Flaky tests due to timing issues are normally 
resolved very quickly; real bugs take longer.
2) If not reported, why?  If you are the first to see this issue then there is a 
good chance the patch caused it, so you should root-cause it.  If you are not the 
first to see it, why did others not report it (we tend to be good about this, 
even to the point Brandon has to mark the new tickets as dups…)?

I have committed when there was flakiness, and I have caused flakiness; I'm not 
saying I am perfect or that I do the above, just saying that if we all moved to 
the above model we could start relying on CI.  The biggest impact on our 
stability comes from people actually root-causing flaky tests.

>  I think we're going to need a system that
> understands the difference between success, failure, and timeouts


I am curious how this system can know that the timeout is not an actual 
failure.  There was a bug in 4.0 with time serialization in message, which 
would cause the message to get dropped; this presented itself as a timeout if I 
remember properly (Jon Meredith or Yifan Cai fixed this bug I believe).

> On Nov 3, 2021, at 10:56 AM, Brandon Williams  wrote:
> 
> On Wed, Nov 3, 2021 at 12:35 PM bened...@apache.org  
> wrote:
>> 
>> The largest number of test failures turn out (as pointed out by David) to be 
>> due to how arcane it was to trigger the full test suite. Hopefully we can 
>> get on top of that, but I think a significant remaining issue is a lack of 
>> trust in the output of CI. It’s hard to gate commit on a clean CI run when 
>> there’s flaky tests, and it doesn’t take much to misattribute one failing 
>> test to the existing flakiness (I tend to compare to a run of the trunk 
>> baseline for comparison, but this is burdensome and still error prone). The 
>> more flaky tests there are the more likely this is.
>> 
>> This is in my opinion the real cost of flaky tests, and it’s probably worth 
>> trying to crack down on them hard if we can. It’s possible the Simulator may 
>> help here, when I finally finish it up, as we can port flaky tests to run 
>> with the Simulator and the failing seed can then be explored 
>> deterministically (all being well).
> 
> I totally agree that the lack of trust is a driving problem here, even
> in knowing which CI system to rely on. When Jenkins broke but Circle
> was fine, we all assumed it was a problem with Jenkins, right up until
> Circle also broke.
> 
> In testing a distributed system like this I think we're always going
> to have failures, even on non-flaky tests, simply because the
> underlying infrastructure is variable with transient failures of its
> own (the network is reliable!)  We can fix the flakies where the fault
> is in the code (and we've done this to many already) but to get more
> trustworthy output, I think we're going to need a system that
> understands the difference between success, failure, and timeouts, and
> in the latter case knows how to at least mark them differently.
> Simulator may help, as do the in-jvm dtests, but there is ultimately
> no way to cover everything without doing some things the hard, more
> realistic way where sometimes shit happens, marring the almost-perfect
> runs with noisy doubt, which then has to be sifted through to
> determine if there was a real issue.
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
> 


-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: [DISCUSS] Releasable trunk and quality

2021-11-03 Thread Brandon Williams
On Wed, Nov 3, 2021 at 1:26 PM David Capwell  wrote:

> >  I think we're going to need a system that
> > understands the difference between success, failure, and timeouts
>
>
> I am curious how this system can know that the timeout is not an actual 
> failure.  There was a bug in 4.0 with time serialization in message, which 
> would cause the message to get dropped; this presented itself as a timeout if 
> I remember properly (Jon Meredith or Yifan Cai fixed this bug I believe).

I don't think it needs to understand the cause of the timeout, just be
able to differentiate.  Of course some bugs present as timeouts so an
eye will need to be kept on that, but test history can make that
simple.
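
As a minimal sketch of that kind of differentiation (the result shape,
timeout threshold and history heuristic below are assumptions, not the
format of any existing CI tool), the classification could look roughly
like this:

import java.time.Duration;
import java.util.*;

enum Outcome { SUCCESS, FAILURE, TIMEOUT }

// Hypothetical per-test result; real input would come from JUnit/CI reports.
class TestResult {
    final String name; final boolean passed; final Duration elapsed; final String errorMessage;
    TestResult(String name, boolean passed, Duration elapsed, String errorMessage) {
        this.name = name; this.passed = passed; this.elapsed = elapsed; this.errorMessage = errorMessage;
    }
}

public class OutcomeClassifier {
    private final Duration timeoutThreshold;

    public OutcomeClassifier(Duration timeoutThreshold) { this.timeoutThreshold = timeoutThreshold; }

    public Outcome classify(TestResult r) {
        if (r.passed) return Outcome.SUCCESS;
        // Treat a failure as a TIMEOUT if it ran up against the time limit or the
        // error message says so; everything else is a plain FAILURE.
        boolean looksLikeTimeout = r.elapsed.compareTo(timeoutThreshold) >= 0
            || (r.errorMessage != null && r.errorMessage.toLowerCase().contains("timeout"));
        return looksLikeTimeout ? Outcome.TIMEOUT : Outcome.FAILURE;
    }

    /** Tests whose recent failures are only timeouts: candidates for "likely infra noise, keep an eye on it". */
    public Set<String> timeoutOnlyFailures(List<TestResult> history) {
        Map<String, EnumSet<Outcome>> seen = new HashMap<>();
        for (TestResult r : history)
            seen.computeIfAbsent(r.name, k -> EnumSet.noneOf(Outcome.class)).add(classify(r));
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, EnumSet<Outcome>> e : seen.entrySet())
            if (e.getValue().contains(Outcome.TIMEOUT) && !e.getValue().contains(Outcome.FAILURE))
                result.add(e.getKey());
        return result;
    }
}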

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



The most reliable way to determine the last time node was up

2021-11-03 Thread Stefan Miklosovic
Hi,

We see a lot of cases out there when a node was down for longer than
the GC grace period, and once that node is back up there are a lot of
zombie data issues ... you know the story.

We would like to implement some kind of check which would detect this
so that the node would not start in the first place, no issues would
arise at all, and it would be up to operators to figure out what to do
with it first.

There are a couple of ideas we were exploring with various pros and
cons and I would like to know what you think about them.

1) Register a shutdown hook on "drain". This is already there (1).
The "drain" method does quite a lot of stuff and is called on
shutdown, so our idea is to write a timestamp to system.local into a
new column like "lastly_drained" or something like that, and it would
be read on startup.

The disadvantage of this approach, or of all approaches via shutdown
hooks, is that it will only react to SIGTERM and SIGINT. If the
node is killed via SIGKILL, the JVM just stops and there is basically
nothing we can guarantee would leave any traces behind.

If it is killed and that value is not overwritten, on the next startup
the value might be older than 10 days, so the node would falsely
conclude that it should not be started.

2) Do this on startup: check how old all your sstables and commit
logs are, and if no file was modified less than 10 days ago, abort the
start. There is a pretty big chance that your node did at least
something in 10 days, nothing needs to be added to system tables or
similar, and it would be just another StartupCheck.

The disadvantage of this is that some dev clusters, for example, may
run for more than 10 days while just sitting there doing absolutely
nothing at all: nobody interacts with them, nobody is repairing them.
So when nobody talks to these nodes, no files are modified, right?

It seems like there is not a silver bullet here, what is your opinion on this?

Regards

(1) 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L786-L799
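
A rough, self-contained sketch of the check in option 2 might look like the
following; the class name, directory list and threshold are placeholders
rather than Cassandra's actual StartupCheck API:

import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.FileTime;
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Stream;

public class LastActivityCheck {
    /**
     * Abort startup if nothing under the data/commitlog directories was modified
     * within maxDowntime (e.g. the gc_grace window). Throws if the node looks "too old".
     */
    public static void assertRecentActivity(List<Path> dirs, Duration maxDowntime) throws IOException {
        Instant newest = Instant.MIN;
        for (Path dir : dirs) {
            try (Stream<Path> files = Files.walk(dir)) {
                Iterable<Path> regular = files.filter(Files::isRegularFile)::iterator;
                for (Path p : regular) {
                    FileTime t = Files.getLastModifiedTime(p);
                    if (t.toInstant().isAfter(newest))
                        newest = t.toInstant();
                }
            }
        }
        if (newest.equals(Instant.MIN))
            return; // fresh node, nothing to compare against
        if (Duration.between(newest, Instant.now()).compareTo(maxDowntime) > 0)
            throw new IllegalStateException(
                "Newest data/commitlog file is from " + newest + ", older than " + maxDowntime +
                "; refusing to start to avoid resurrecting deleted data.");
    }
}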

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: The most reliable way to determine the last time node was up

2021-11-03 Thread Stefan Miklosovic
The third option would be to have some thread running in the
background "touching" some (empty) marker file. It is the simplest
solution, but I do not like the idea of this marker file; it feels
dirty. But hey, since it would be an opt-in feature for people who
know what they want, why not, right ...
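
A minimal sketch of that background "touch" thread, assuming a configurable
marker path and interval (all names here are placeholders):

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.nio.file.attribute.FileTime;
import java.time.Instant;
import java.util.concurrent.*;

public class LivenessMarker {
    private final Path marker;
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "liveness-marker");
            t.setDaemon(true);
            return t;
        });

    public LivenessMarker(Path marker) { this.marker = marker; }

    /** Touch the marker file periodically; its mtime becomes "the last time this node was alive". */
    public void start(long intervalSeconds) {
        scheduler.scheduleAtFixedRate(this::touch, 0, intervalSeconds, TimeUnit.SECONDS);
    }

    private void touch() {
        try {
            if (Files.notExists(marker))
                Files.createFile(marker);
            Files.setLastModifiedTime(marker, FileTime.from(Instant.now()));
        } catch (IOException e) {
            throw new UncheckedIOException(e); // a real implementation would just log and carry on
        }
    }
}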

On Wed, 3 Nov 2021 at 21:53, Stefan Miklosovic
 wrote:
>
> Hi,
>
> We see a lot of cases out there when a node was down for longer than
> the GC period and once that node is up there are a lot of zombie data
> issues ... you know the story.
>
> We would like to implement some kind of a check which would detect
> this so that node would not start in the first place so no issues
> would be there at all and it would be up to operators to figure out
> first what to do with it.
>
> There are a couple of ideas we were exploring with various pros and
> cons and I would like to know what you think about them.
>
> 1) Register a shutdown hook on "drain". This is already there (1).
> "drain" method is doing quite a lot of stuff and this is called on
> shutdown so our idea is to write a timestamp to system.local into a
> new column like "lastly_drained" or something like that and it would
> be read on startup.
>
> The disadvantage of this approach, or all approaches via shutdown
> hooks, is that it will only react only on SIGTERM and SIGINT. If that
> node is killed via SIGKILL, JVM just stops and there is basically
> nothing we have any guarantee of that would leave some traces behind.
>
> If it is killed and that value is not overwritten, on the next startup
> it might happen that it would be older than 10 days so it will falsely
> evaluate it should not be started.
>
> 2) Doing this on startup, you would check how old all your sstables
> and commit logs are, if no file was modified less than 10 days ago you
> would abort start, there is pretty big chance that your node did at
> least something in 10 days, there does not need to be anything added
> to system tables or similar and it would be just another StartupCheck.
>
> The disadvantage of this is that some dev clusters, for example, may
> run more than 10 days and they are just sitting there doing absolutely
> nothing at all, nobody interacts with them, nobody is repairing them,
> they are just sitting there. So when nobody talks to these nodes, no
> files are modified, right?
>
> It seems like there is not a silver bullet here, what is your opinion on this?
>
> Regards
>
> (1) 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L786-L799

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: The most reliable way to determine the last time node was up

2021-11-03 Thread Paulo Motta
How about a last_checkpoint (or better name) system.local column that is
updated periodically (i.e. every minute) + on drain? This would give a lower
time bound on when the node was last live without requiring an external
marker file.
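
A minimal sketch of that idea, assuming a stand-in hook for the node's
internal CQL execution path (the real wiring is not shown) and the proposed,
not yet existing, last_checkpoint column:

import java.util.concurrent.*;
import java.util.function.Consumer;

public class CheckpointWriter {
    // Proposed (not existing) column; 'local' is the single partition key of system.local.
    private static final String CHECKPOINT_CQL =
        "UPDATE system.local SET last_checkpoint = toTimestamp(now()) WHERE key = 'local'";

    private final Consumer<String> execute; // stand-in for the node's internal CQL execution path
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public CheckpointWriter(Consumer<String> execute) { this.execute = execute; }

    /** Persist a liveness timestamp every minute, and once more on drain/shutdown. */
    public void start() {
        scheduler.scheduleAtFixedRate(() -> execute.accept(CHECKPOINT_CQL), 0, 1, TimeUnit.MINUTES);
    }

    public void onDrain() {
        execute.accept(CHECKPOINT_CQL); // final checkpoint for a graceful shutdown
        scheduler.shutdown();
    }
}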

On Wed, 3 Nov 2021 at 18:03 Stefan Miklosovic <
stefan.mikloso...@instaclustr.com> wrote:

> The third option would be to have some thread running in the
> background "touching" some (empty) marker file, it is the most simple
> solution but I do not like the idea of this marker file, it feels
> dirty, but hey, while it would be opt-in feature for people knowing
> what they want, why not right ...
>
> On Wed, 3 Nov 2021 at 21:53, Stefan Miklosovic
>  wrote:
> >
> > Hi,
> >
> > We see a lot of cases out there when a node was down for longer than
> > the GC period and once that node is up there are a lot of zombie data
> > issues ... you know the story.
> >
> > We would like to implement some kind of a check which would detect
> > this so that node would not start in the first place so no issues
> > would be there at all and it would be up to operators to figure out
> > first what to do with it.
> >
> > There are a couple of ideas we were exploring with various pros and
> > cons and I would like to know what you think about them.
> >
> > 1) Register a shutdown hook on "drain". This is already there (1).
> > "drain" method is doing quite a lot of stuff and this is called on
> > shutdown so our idea is to write a timestamp to system.local into a
> > new column like "lastly_drained" or something like that and it would
> > be read on startup.
> >
> > The disadvantage of this approach, or all approaches via shutdown
> > hooks, is that it will only react only on SIGTERM and SIGINT. If that
> > node is killed via SIGKILL, JVM just stops and there is basically
> > nothing we have any guarantee of that would leave some traces behind.
> >
> > If it is killed and that value is not overwritten, on the next startup
> > it might happen that it would be older than 10 days so it will falsely
> > evaluate it should not be started.
> >
> > 2) Doing this on startup, you would check how old all your sstables
> > and commit logs are, if no file was modified less than 10 days ago you
> > would abort start, there is pretty big chance that your node did at
> > least something in 10 days, there does not need to be anything added
> > to system tables or similar and it would be just another StartupCheck.
> >
> > The disadvantage of this is that some dev clusters, for example, may
> > run more than 10 days and they are just sitting there doing absolutely
> > nothing at all, nobody interacts with them, nobody is repairing them,
> > they are just sitting there. So when nobody talks to these nodes, no
> > files are modified, right?
> >
> > It seems like there is not a silver bullet here, what is your opinion on
> this?
> >
> > Regards
> >
> > (1)
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L786-L799
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
>
>


Re: The most reliable way to determine the last time node was up

2021-11-03 Thread Stefan Miklosovic
Yes, this is the combination of the system.local and "marker file"
approaches, basically updating that field periodically.

However, when there is a mutation done against the system table (in
this example), it goes to the commit log and then it will be propagated
to an sstable on disk, no? So in our hypothetical scenario, if a node is
not touched by anybody, it would still behave like it _does_
something. I would expect that if nobody talks to a node and no
operation is running, it does not produce any "side effects".

I just do not want to generate any unnecessary noise. A node which
does not do anything should not change its data. I am not sure if that
is already the case, or if an inactive node still writes new
sstables after some time; I doubt that.

On Wed, 3 Nov 2021 at 22:58, Paulo Motta  wrote:
>
> How about a last_checkpoint (or better name) system.local column that is
> updated periodically (ie. every minute) + on drain? This would give a lower
> time bound on when the node was last live without requiring an external
> marker file.
>
> On Wed, 3 Nov 2021 at 18:03 Stefan Miklosovic <
> stefan.mikloso...@instaclustr.com> wrote:
>
> > The third option would be to have some thread running in the
> > background "touching" some (empty) marker file, it is the most simple
> > solution but I do not like the idea of this marker file, it feels
> > dirty, but hey, while it would be opt-in feature for people knowing
> > what they want, why not right ...
> >
> > On Wed, 3 Nov 2021 at 21:53, Stefan Miklosovic
> >  wrote:
> > >
> > > Hi,
> > >
> > > We see a lot of cases out there when a node was down for longer than
> > > the GC period and once that node is up there are a lot of zombie data
> > > issues ... you know the story.
> > >
> > > We would like to implement some kind of a check which would detect
> > > this so that node would not start in the first place so no issues
> > > would be there at all and it would be up to operators to figure out
> > > first what to do with it.
> > >
> > > There are a couple of ideas we were exploring with various pros and
> > > cons and I would like to know what you think about them.
> > >
> > > 1) Register a shutdown hook on "drain". This is already there (1).
> > > "drain" method is doing quite a lot of stuff and this is called on
> > > shutdown so our idea is to write a timestamp to system.local into a
> > > new column like "lastly_drained" or something like that and it would
> > > be read on startup.
> > >
> > > The disadvantage of this approach, or all approaches via shutdown
> > > hooks, is that it will only react only on SIGTERM and SIGINT. If that
> > > node is killed via SIGKILL, JVM just stops and there is basically
> > > nothing we have any guarantee of that would leave some traces behind.
> > >
> > > If it is killed and that value is not overwritten, on the next startup
> > > it might happen that it would be older than 10 days so it will falsely
> > > evaluate it should not be started.
> > >
> > > 2) Doing this on startup, you would check how old all your sstables
> > > and commit logs are, if no file was modified less than 10 days ago you
> > > would abort start, there is pretty big chance that your node did at
> > > least something in 10 days, there does not need to be anything added
> > > to system tables or similar and it would be just another StartupCheck.
> > >
> > > The disadvantage of this is that some dev clusters, for example, may
> > > run more than 10 days and they are just sitting there doing absolutely
> > > nothing at all, nobody interacts with them, nobody is repairing them,
> > > they are just sitting there. So when nobody talks to these nodes, no
> > > files are modified, right?
> > >
> > > It seems like there is not a silver bullet here, what is your opinion on
> > this?
> > >
> > > Regards
> > >
> > > (1)
> > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L786-L799
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: dev-h...@cassandra.apache.org
> >
> >

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: The most reliable way to determine the last time node was up

2021-11-03 Thread Paulo Motta
> I would expect that if nobody talks to a node and no operation is
> running, it does not produce any "side effects".

In order to track the last checkpoint timestamp you need to persist it
periodically to protect against losing state during an ungraceful shutdown
(i.e. kill -9).

However, you're right that this may generate tons of sstables if we're persisting
it periodically to a system table, even if we skip the commit log. We could
tune system.local compaction to use LCS, but it would still generate
periodic compaction activity.  In this case an external marker file sounds
much simpler and cleaner.

The downsides I see to the marker file approach are:
a) External clients cannot query last checkpoint time easily
b) The state is lost if the marker file is removed.

However we could solve these issues with:
a) exposing the info via a system table
b) fallback to min(last commitlog/sstable timestamp)

I prefer an explicit mechanism to track the last checkpoint (i.e. marker file)
vs implicit min(last commitlog/sstable timestamp) so we don't create
unnecessary coupling between different subsystems.

Cheers,

Paulo
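
A small sketch of reading that estimate back, checking the marker file first
and falling back to the newest commitlog/sstable timestamp (paths and names
are placeholders):

import java.io.IOException;
import java.nio.file.*;
import java.time.Instant;
import java.util.List;
import java.util.Optional;
import java.util.stream.Stream;

public class LastKnownUp {
    /** Marker file mtime if present, otherwise the newest commitlog/sstable mtime, otherwise empty. */
    public static Optional<Instant> estimate(Path marker, List<Path> dataDirs) throws IOException {
        if (Files.exists(marker))
            return Optional.of(Files.getLastModifiedTime(marker).toInstant());
        Instant newest = null; // fallback: implicit evidence from data files
        for (Path dir : dataDirs) {
            try (Stream<Path> files = Files.walk(dir)) {
                Iterable<Path> regular = files.filter(Files::isRegularFile)::iterator;
                for (Path p : regular) {
                    Instant t = Files.getLastModifiedTime(p).toInstant();
                    if (newest == null || t.isAfter(newest))
                        newest = t;
                }
            }
        }
        return Optional.ofNullable(newest);
    }
}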

On Wed, 3 Nov 2021 at 19:29, Stefan Miklosovic <
stefan.mikloso...@instaclustr.com> wrote:

> Yes this is the combination of system.local and "marker file"
> approach, basically updating that field periodically.
>
> However, when there is a mutation done against the system table (in
> this example), it goes to a commit log and then it will be propagated
> to sstable on disk, no? So in our hypothetical scenario, if a node is
> not touched by anybody, it would still behave like it _does_
> something. I would expect that if nobody talks to a node and no
> operation is running, it does not produce any "side effects".
>
> I just do not want to generate any unnecessary noise. A node which
> does not do anything should not change its data. I am not sure if it
> is like that already or if an inactive node still does writes new
> sstables after some time, I doubt that.
>
> On Wed, 3 Nov 2021 at 22:58, Paulo Motta  wrote:
> >
> > How about a last_checkpoint (or better name) system.local column that is
> > updated periodically (ie. every minute) + on drain? This would give a
> lower
> > time bound on when the node was last live without requiring an external
> > marker file.
> >
> > On Wed, 3 Nov 2021 at 18:03 Stefan Miklosovic <
> > stefan.mikloso...@instaclustr.com> wrote:
> >
> > > The third option would be to have some thread running in the
> > > background "touching" some (empty) marker file, it is the most simple
> > > solution but I do not like the idea of this marker file, it feels
> > > dirty, but hey, while it would be opt-in feature for people knowing
> > > what they want, why not right ...
> > >
> > > On Wed, 3 Nov 2021 at 21:53, Stefan Miklosovic
> > >  wrote:
> > > >
> > > > Hi,
> > > >
> > > > We see a lot of cases out there when a node was down for longer than
> > > > the GC period and once that node is up there are a lot of zombie data
> > > > issues ... you know the story.
> > > >
> > > > We would like to implement some kind of a check which would detect
> > > > this so that node would not start in the first place so no issues
> > > > would be there at all and it would be up to operators to figure out
> > > > first what to do with it.
> > > >
> > > > There are a couple of ideas we were exploring with various pros and
> > > > cons and I would like to know what you think about them.
> > > >
> > > > 1) Register a shutdown hook on "drain". This is already there (1).
> > > > "drain" method is doing quite a lot of stuff and this is called on
> > > > shutdown so our idea is to write a timestamp to system.local into a
> > > > new column like "lastly_drained" or something like that and it would
> > > > be read on startup.
> > > >
> > > > The disadvantage of this approach, or all approaches via shutdown
> > > > hooks, is that it will only react only on SIGTERM and SIGINT. If that
> > > > node is killed via SIGKILL, JVM just stops and there is basically
> > > > nothing we have any guarantee of that would leave some traces behind.
> > > >
> > > > If it is killed and that value is not overwritten, on the next
> startup
> > > > it might happen that it would be older than 10 days so it will
> falsely
> > > > evaluate it should not be started.
> > > >
> > > > 2) Doing this on startup, you would check how old all your sstables
> > > > and commit logs are, if no file was modified less than 10 days ago
> you
> > > > would abort start, there is pretty big chance that your node did at
> > > > least something in 10 days, there does not need to be anything added
> > > > to system tables or similar and it would be just another
> StartupCheck.
> > > >
> > > > The disadvantage of this is that some dev clusters, for example, may
> > > > run more than 10 days and they are just sitting there doing
> absolutely
> > > > nothing at all, nobody interacts with them, nobody is repairing them,
> > > > they are just sitting there. 

Re: [DISCUSS] Releasable trunk and quality

2021-11-03 Thread Berenguer Blasi
I agree with David. CI has been pretty reliable besides the random
Jenkins outages or timeouts. The same 3 or 4 tests were the only flaky
ones in Jenkins, and Circle was very green. I bisected a couple of failures
to legit code errors, David is fixing some more, others have as well, etc.

It is good news imo, as we're just learning that our CI post-4.0 is
reliable and we need to start treating it as such and paying attention to
its reports. Not perfect, but reliable enough that it would have prevented
those bugs from getting merged.

In fact we're having this conversation because we noticed CI going from a
steady 3-ish failures to many, and it's getting fixed. So we're moving in
the right direction imo.

On 3/11/21 19:25, David Capwell wrote:
>> It’s hard to gate commit on a clean CI run when there’s flaky tests
> I agree, this is also why so much effort was done in 4.0 release to remove as 
> much as possible.  Just over 1 month ago we were not really having a flaky 
> test issue (outside of the sporadic timeout issues; my circle ci runs were 
> green constantly), and now the “flaky tests” I see are all actual bugs (been 
> root causing 2 out of the 3 I reported) and some (not all) of the flakyness 
> was triggered by recent changes in the past month.
>
> Right now people do not believe the failing test is caused by their patch and 
> attribute to flakiness, which then causes the builds to start being flaky, 
> which then leads to a different author coming to fix the issue; this behavior 
> is what I would love to see go away.  If we find a flaky test, we should do 
> the following
>
> 1) has it already been reported and who is working to fix?  Can we block this 
> patch on the test being fixed?  Flaky tests due to timing issues normally are 
> resolved very quickly, real bugs take longer.
> 2) if not reported, why?  If you are the first to see this issue than good 
> chance the patch caused the issue so should root cause.  If you are not the 
> first to see it, why did others not report it (we tend to be good about this, 
> even to the point Brandon has to mark the new tickets as dups…)?
>
> I have committed when there were flakiness, and I have caused flakiness; not 
> saying I am perfect or that I do the above, just saying that if we all moved 
> to the above model we could start relying on CI.  The biggest impact to our 
> stability is people actually root causing flaky tests.
>
>>  I think we're going to need a system that
>> understands the difference between success, failure, and timeouts
>
> I am curious how this system can know that the timeout is not an actual 
> failure.  There was a bug in 4.0 with time serialization in message, which 
> would cause the message to get dropped; this presented itself as a timeout if 
> I remember properly (Jon Meredith or Yifan Cai fixed this bug I believe).
>
>> On Nov 3, 2021, at 10:56 AM, Brandon Williams  wrote:
>>
>> On Wed, Nov 3, 2021 at 12:35 PM bened...@apache.org  
>> wrote:
>>> The largest number of test failures turn out (as pointed out by David) to 
>>> be due to how arcane it was to trigger the full test suite. Hopefully we 
>>> can get on top of that, but I think a significant remaining issue is a lack 
>>> of trust in the output of CI. It’s hard to gate commit on a clean CI run 
>>> when there’s flaky tests, and it doesn’t take much to misattribute one 
>>> failing test to the existing flakiness (I tend to compare to a run of the 
>>> trunk baseline for comparison, but this is burdensome and still error 
>>> prone). The more flaky tests there are the more likely this is.
>>>
>>> This is in my opinion the real cost of flaky tests, and it’s probably worth 
>>> trying to crack down on them hard if we can. It’s possible the Simulator 
>>> may help here, when I finally finish it up, as we can port flaky tests to 
>>> run with the Simulator and the failing seed can then be explored 
>>> deterministically (all being well).
>> I totally agree that the lack of trust is a driving problem here, even
>> in knowing which CI system to rely on. When Jenkins broke but Circle
>> was fine, we all assumed it was a problem with Jenkins, right up until
>> Circle also broke.
>>
>> In testing a distributed system like this I think we're always going
>> to have failures, even on non-flaky tests, simply because the
>> underlying infrastructure is variable with transient failures of its
>> own (the network is reliable!)  We can fix the flakies where the fault
>> is in the code (and we've done this to many already) but to get more
>> trustworthy output, I think we're going to need a system that
>> understands the difference between success, failure, and timeouts, and
>> in the latter case knows how to at least mark them differently.
>> Simulator may help, as do the in-jvm dtests, but there is ultimately
>> no way to cover everything without doing some things the hard, more
>> realistic way where sometimes shit happens, marring the almost-perfect
>> runs with noisy doubt, which then has to be

Re: The most reliable way to determine the last time node was up

2021-11-03 Thread Berenguer Blasi
What about an hourly heartbeat 'lastSeenAlive' timestamp? my 2cts.

On 3/11/21 21:53, Stefan Miklosovic wrote:
> Hi,
>
> We see a lot of cases out there when a node was down for longer than
> the GC period and once that node is up there are a lot of zombie data
> issues ... you know the story.
>
> We would like to implement some kind of a check which would detect
> this so that node would not start in the first place so no issues
> would be there at all and it would be up to operators to figure out
> first what to do with it.
>
> There are a couple of ideas we were exploring with various pros and
> cons and I would like to know what you think about them.
>
> 1) Register a shutdown hook on "drain". This is already there (1).
> "drain" method is doing quite a lot of stuff and this is called on
> shutdown so our idea is to write a timestamp to system.local into a
> new column like "lastly_drained" or something like that and it would
> be read on startup.
>
> The disadvantage of this approach, or all approaches via shutdown
> hooks, is that it will only react only on SIGTERM and SIGINT. If that
> node is killed via SIGKILL, JVM just stops and there is basically
> nothing we have any guarantee of that would leave some traces behind.
>
> If it is killed and that value is not overwritten, on the next startup
> it might happen that it would be older than 10 days so it will falsely
> evaluate it should not be started.
>
> 2) Doing this on startup, you would check how old all your sstables
> and commit logs are, if no file was modified less than 10 days ago you
> would abort start, there is pretty big chance that your node did at
> least something in 10 days, there does not need to be anything added
> to system tables or similar and it would be just another StartupCheck.
>
> The disadvantage of this is that some dev clusters, for example, may
> run more than 10 days and they are just sitting there doing absolutely
> nothing at all, nobody interacts with them, nobody is repairing them,
> they are just sitting there. So when nobody talks to these nodes, no
> files are modified, right?
>
> It seems like there is not a silver bullet here, what is your opinion on this?
>
> Regards
>
> (1) 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L786-L799
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
>
> .

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: The most reliable way to determine the last time node was up

2021-11-03 Thread Berenguer Blasi
Apologies, I missed Paulo's reply due to my email client's threading funnies...

On 4/11/21 7:50, Berenguer Blasi wrote:
> What about an hourly heartbeat 'lastSeenAlive' timestamp? my 2cts.
>
> On 3/11/21 21:53, Stefan Miklosovic wrote:
>> Hi,
>>
>> We see a lot of cases out there when a node was down for longer than
>> the GC period and once that node is up there are a lot of zombie data
>> issues ... you know the story.
>>
>> We would like to implement some kind of a check which would detect
>> this so that node would not start in the first place so no issues
>> would be there at all and it would be up to operators to figure out
>> first what to do with it.
>>
>> There are a couple of ideas we were exploring with various pros and
>> cons and I would like to know what you think about them.
>>
>> 1) Register a shutdown hook on "drain". This is already there (1).
>> "drain" method is doing quite a lot of stuff and this is called on
>> shutdown so our idea is to write a timestamp to system.local into a
>> new column like "lastly_drained" or something like that and it would
>> be read on startup.
>>
>> The disadvantage of this approach, or all approaches via shutdown
>> hooks, is that it will only react only on SIGTERM and SIGINT. If that
>> node is killed via SIGKILL, JVM just stops and there is basically
>> nothing we have any guarantee of that would leave some traces behind.
>>
>> If it is killed and that value is not overwritten, on the next startup
>> it might happen that it would be older than 10 days so it will falsely
>> evaluate it should not be started.
>>
>> 2) Doing this on startup, you would check how old all your sstables
>> and commit logs are, if no file was modified less than 10 days ago you
>> would abort start, there is pretty big chance that your node did at
>> least something in 10 days, there does not need to be anything added
>> to system tables or similar and it would be just another StartupCheck.
>>
>> The disadvantage of this is that some dev clusters, for example, may
>> run more than 10 days and they are just sitting there doing absolutely
>> nothing at all, nobody interacts with them, nobody is repairing them,
>> they are just sitting there. So when nobody talks to these nodes, no
>> files are modified, right?
>>
>> It seems like there is not a silver bullet here, what is your opinion on 
>> this?
>>
>> Regards
>>
>> (1) 
>> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L786-L799
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: dev-h...@cassandra.apache.org
>>
>> .

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org