Re: [DISCUSS] Releasable trunk and quality
I'll merge 16262 and the Harry blog-post that accompanies it shortly. Having 16262 merged will significantly reduce the amount of resistance one has to overcome in order to write a fuzz test. But this, of course, only covers short/small/unit-test-like tests. For longer running tests, I guess for now we will have to rely on folks (hopefully) running long fuzz tests and reporting issues. But eventually it'd be great to have enough automation around it so that anyone could do that and where test results are public. In regard to long-running tests, currently with Harry we can run three kinds of long-running tests: 1. Stress-like concurrent write workload, followed by periods of quiescence and then validation 2. Writes with injected faults, followed by repair and validation 3. Stress-like concurrent read/write workload with fault injection without validation, for finding rare edge conditions / triggering possible exceptions Which means that quorum read and write paths (for all kinds of schemas, including all possible kinds of read and write queries), compactions, repairs, read-repairs and hints are covered fairly well. However, things like bootstrap and other kinds of range movements aren't. It'd be great to expand this, but it's been somewhat difficult to do, since last time a bootstrap test was attempted, it immediately uncovered enough issues to keep us busy fixing them for quite some time. Maybe it's about time to try that again. For short tests, you can think of Harry as a tool to save you time and allow focusing on higher-level test meaning rather than creating schema and coming up with specific values to insert/select. Thanks --Alex On Tue, Nov 2, 2021 at 5:30 PM Ekaterina Dimitrova wrote: > Did I hear my name? > Sorry Josh, you are wrong :-) 2 out of 30 in two months were real bugs > discovered by flaky tests and one of them was very hard to hit. So 6-7%. I > think that report I sent back then didn't come through so the topic was > cleared in a follow-up mail by Benjamin; with a lot of sweat but we kept to > the promised 4.0 standard. > > Now back to this topic: > - green CI without enough test coverage is nothing more than green CI > unfortunately to me. I know this is an elephant but I won't sleep well > tonight if I don't mention it. > - I believe the looping of tests mentioned by Berenguer can help for > verifying no new weird flakiness is introduced by new tests added. And of > course it helps a lot during fixing flaky tests, I think that's clear. > > I think that it would be great if each such test > > > (or > > > > test group) was guaranteed to have a ticket and some preliminary > > analysis > > > > was done to confirm it is just a test problem before releasing the > new > > > > version > > Probably not a bad idea. Preliminary analysis. But we need to get into the > cadence of regularly checking our CI; divide and conquer on a regular basis > between all of us. Not to mention it is way easier to follow up recently > introduced issues with the people who worked on stuff than trying to find > out what happened a year ago in a rush before a release. I agree it is not > about the number but what stays behind it. > > Requiring all tests to run before every merge: we can easily add this in > Circle, but there are many people who don't have access to high resources so > again they won't be able to run absolutely everything. In the end, > everything is up to the diligence of the reviewers/committers. 
Plus > official CI is Jenkins and we know there are different infra related > failures in the different CIs. Not an easy topic, indeed. I support running > all tests, just having in mind all the related issues/complications. > > I would say in my mind upgrade tests are particularly important to be green > before a release, too. > > Seems to me we have the tools, but now it is time to organize the rhythm in > an efficient manner. > > Best regards, > Ekaterina > > > On Tue, 2 Nov 2021 at 11:06, Joshua McKenzie wrote: > > > To your point Jacek, I believe in the run up to 4.0 Ekaterina did some > > analysis and something like 18% (correct me if I'm wrong here) of the > test > > failures we were considering "flaky tests" were actual product defects in > > the database. With that in mind, we should be uncomfortable cutting a > > release if we have 6 test failures since there's every likelihood one of > > them is a surfaced bug. > > > > ensuring our best practices are followed for every merge > > > > I totally agree but I also don't think we have this codified (unless I'm > > just completely missing something - very possible! ;)) Seems like we have > > different circle configs, different sets of jobs being run, Harry / > Hunter > > (maybe?) / ?? run on some but not all commits and/or all branches, > > manual performance testing on specific releases but nothing surfaced > > formally to the project as a reproducible suite like we used to have > years > > ago (primitive though it was at the time with what it c
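The three long-running shapes Alex lists at the top of the previous message share one skeleton: drive writes concurrently while tracking what the cluster should contain, let things quiesce, then read back and compare. A minimal sketch of the first shape is below; it is deliberately generic: the Store interface, the per-thread key ranges and the thread counts are all invented for illustration, and none of this is Harry's actual API.

    import java.util.Map;
    import java.util.Random;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Hypothetical skeleton of shape 1: concurrent writes -> quiescence -> validation.
    // "Store" stands in for whatever executes the real queries (a cluster session in
    // an actual harness); the model map records what we expect to read back.
    public class WriteThenValidate
    {
        interface Store { void write(long pk, long value); long read(long pk); }

        public static void run(Store store, int threads, int opsPerThread) throws Exception
        {
            Map<Long, Long> model = new ConcurrentHashMap<>();    // expected end state
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (int t = 0; t < threads; t++)
            {
                final int id = t;
                pool.submit(() -> {
                    Random rng = new Random(id);                  // deterministic per thread
                    for (int i = 0; i < opsPerThread; i++)
                    {
                        long pk = id * 1000L + rng.nextInt(1000); // disjoint key range per thread
                        long value = rng.nextLong();              // keeps the model exact
                        store.write(pk, value);
                        model.put(pk, value);
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);             // quiescence: all writes drained
            for (Map.Entry<Long, Long> e : model.entrySet())      // validation pass
                if (store.read(e.getKey()) != e.getValue())
                    throw new AssertionError("mismatch for pk " + e.getKey());
        }
    }

Shape 2 would add fault injection and a repair before the validation pass; shape 3 drops the model entirely and only watches for exceptions.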
Re: [DISCUSS] CEP-18: Improving Modularity
It seems like there are many people in this thread that would rather we not make a "grouping" CEP for these ideas, but rather consider each one individually. I will close out this CEP thread then, and discussions can continue on individual tickets. I think we got some nice discussion on this thread on what people would like to see from these types of refactoring tickets, so that will be good information to take forward as we work on individual tickets. Thanks for the discussion everyone. -Jeremiah Jordan > On Oct 27, 2021, at 1:28 AM, Dinesh Joshi wrote: > >> On Oct 25, 2021, at 1:22 PM, Jeremiah D Jordan wrote: >> >> The currently proposed changes in CEP-18 should all include improved test >> coverage of the modules in question. We have been developing them all with >> a requirement that all changes have at least 80% code coverage from SonarCloud >> JaCoCo reports. We have also found and fixed some bugs in the >> existing code during this development work. > > This is great! We, as a project, should encourage improved test code > coverage. So I welcome this change. > >> So do people feel we should re-propose these as multiple CEPs or just >> tickets? Or do people prefer to have a discussion/vote on the idea of >> improving the modularity of the code base in general? > > My personal preference would be to see this work appear as individual CEPs or > even JIRA tickets with discussions but definitely not one giant CEP that is > pulling together a lot of different changes. > > I really like the idea of building pluggable modular components. However, I > am concerned about a few things. > > 1. Performance regression. > 2. Breaking backward compatibility for our users & tools. > 3. Interfaces with single implementation. > > I would like to ensure that we are mindful of these concerns while making big > refactors. > > Thanks, > > Dinesh - To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org For additional commands, e-mail: dev-h...@cassandra.apache.org
Re: [DISCUSS] Releasable trunk and quality
> > It'd be great to > expand this, but it's been somewhat difficult to do, since last time a > bootstrap test was attempted, it immediately uncovered enough issues to > keep us busy fixing them for quite some time. Maybe it's about time to try > that again. I'm going to go with a "yes please". :) On Wed, Nov 3, 2021 at 9:27 AM Oleksandr Petrov wrote: > I'll merge 16262 and the Harry blog-post that accompanies it shortly. > Having 16262 merged will significantly reduce the amount of resistance one > has to overcome in order to write a fuzz test. But this, of course, only > covers short/small/unit-test-like tests. > > For longer running tests, I guess for now we will have to rely on folks > (hopefully) running long fuzz tests and reporting issues. But eventually > it'd be great to have enough automation around it so that anyone could do > that and where test results are public. > > In regard to long-running tests, currently with Harry we can run three > kinds of long-running tests: > 1. Stress-like concurrent write workload, followed by periods of quiescence > and then validation > 2. Writes with injected faults, followed by repair and validation > 3. Stress-like concurrent read/write workload with fault injection without > validation, for finding rare edge conditions / triggering possible > exceptions > > Which means that quorum read and write paths (for all kinds of schemas, > including all possible kinds of read and write queries), compactions, > repairs, read-repairs and hints are covered fairly well. However, things > like bootstrap and other kinds of range movements aren't. It'd be great to > expand this, but it's been somewhat difficult to do, since last time a > bootstrap test was attempted, it immediately uncovered enough issues to > keep us busy fixing them for quite some time. Maybe it's about time to try > that again. > > For short tests, you can think of Harry as a tool to save you time and > allow focusing on higher-level test meaning rather than creating schema and > coming up with specific values to insert/select. > > Thanks > --Alex > > > > On Tue, Nov 2, 2021 at 5:30 PM Ekaterina Dimitrova > wrote: > > > Did I hear my name? > > Sorry Josh, you are wrong :-) 2 out of 30 in two months were real bugs > > discovered by flaky tests and one of them was very hard to hit. So > 6-7%. I > > think that report I sent back then didn't come through so the topic was > > cleared in a follow-up mail by Benjamin; with a lot of sweat but we kept > to > > the promised 4.0 standard. > > > > Now back to this topic: > > - green CI without enough test coverage is nothing more than green CI > > unfortunately to me. I know this is an elephant but I won't sleep well > > tonight if I don't mention it. > > - I believe the looping of tests mentioned by Berenguer can help for > > verifying no new weird flakiness is introduced by new tests added. And of > > course it helps a lot during fixing flaky tests, I think that's clear. > > > > I think that it would be great if each such test > > > > (or > > > > > test group) was guaranteed to have a ticket and some preliminary > > > analysis > > > > > was done to confirm it is just a test problem before releasing the > > new > > > > > version > > > > Probably not a bad idea. Preliminary analysis. But we need to get into the > > cadence of regularly checking our CI; divide and conquer on a regular basis > > between all of us. 
Not to mention it is way easier to follow up recently > > introduced issues with the people who worked on stuff than trying to find > > out what happened a year ago in a rush before a release. I agree it is > not > > about the number but what stays behind it. > > > > Requiring all tests to run before every merge: we can easily add this in > > Circle, but there are many people who don't have access to high resources > so > > again they won't be able to run absolutely everything. In the end, > > everything is up to the diligence of the reviewers/committers. Plus > > official CI is Jenkins and we know there are different infra related > > failures in the different CIs. Not an easy topic, indeed. I support > running > > all tests, just having in mind all the related issues/complications. > > > > I would say in my mind upgrade tests are particularly important to be > green > > before a release, too. > > > > Seems to me we have the tools, but now it is time to organize the rhythm > in > > an efficient manner. > > > > Best regards, > > Ekaterina > > > > > > On Tue, 2 Nov 2021 at 11:06, Joshua McKenzie > wrote: > > > > > To your point Jacek, I believe in the run up to 4.0 Ekaterina did some > > > analysis and something like 18% (correct me if I'm wrong here) of the > > test > > > failures we were considering "flaky tests" were actual product defects > in > > > the database. With that in mind, we should be uncomfortable cutting a > > > release if we have 6 test failures since there's every likelihood one > of > > > them is a surfaced bug. > > > > > > ensuring o
Re: [DISCUSS] Releasable trunk and quality
On Mon, Nov 1, 2021 at 5:03 PM David Capwell wrote: > > > How do we define what "releasable trunk" means? > > One thing I would love is for us to adopt a "run all tests needed to release > before commit" mentality, and to link a successful run in JIRA when closing > (we talked about this once in slack). If we look at CircleCI we currently do > not run all the tests needed to sign off; below are the tests disabled in the > "pre-commit" workflows (see > https://github.com/apache/cassandra/blob/trunk/.circleci/config-2_1.yml#L381): A good first step toward this would be for us to treat our binding +1s more judiciously, and not grant any without at least a pre-commit CI run linked in the ticket. You don't have to look very hard to find a lot of these today (I know I'm guilty), and it's possible we wouldn't have the current CI mess now if we had been a little bit more diligent. - To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org For additional commands, e-mail: dev-h...@cassandra.apache.org
Re: [DISCUSS] Releasable trunk and quality
The largest number of test failures turn out (as pointed out by David) to be due to how arcane it was to trigger the full test suite. Hopefully we can get on top of that, but I think a significant remaining issue is a lack of trust in the output of CI. It's hard to gate commit on a clean CI run when there's flaky tests, and it doesn't take much to misattribute one failing test to the existing flakiness (I tend to compare to a run of the trunk baseline for comparison, but this is burdensome and still error prone). The more flaky tests there are the more likely this is. This is in my opinion the real cost of flaky tests, and it's probably worth trying to crack down on them hard if we can. It's possible the Simulator may help here, when I finally finish it up, as we can port flaky tests to run with the Simulator and the failing seed can then be explored deterministically (all being well). From: Brandon Williams Date: Wednesday, 3 November 2021 at 17:07 To: dev@cassandra.apache.org Subject: Re: [DISCUSS] Releasable trunk and quality On Mon, Nov 1, 2021 at 5:03 PM David Capwell wrote: > > > How do we define what "releasable trunk" means? > > One thing I would love is for us to adopt a "run all tests needed to release > before commit" mentality, and to link a successful run in JIRA when closing > (we talked about this once in slack). If we look at CircleCI we currently do > not run all the tests needed to sign off; below are the tests disabled in the > "pre-commit" workflows (see > https://github.com/apache/cassandra/blob/trunk/.circleci/config-2_1.yml#L381): A good first step toward this would be for us to treat our binding +1s more judiciously, and not grant any without at least a pre-commit CI run linked in the ticket. You don't have to look very hard to find a lot of these today (I know I'm guilty), and it's possible we wouldn't have the current CI mess now if we had been a little bit more diligent. - To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org For additional commands, e-mail: dev-h...@cassandra.apache.org
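The trunk-baseline comparison mentioned above can at least be partially mechanised: export the set of failing test names from the branch run and from a trunk run of the same suites, and only the difference needs human attention. A trivial illustration follows (the sets would come from whatever report your CI exports; this is not existing project tooling):

    import java.util.HashSet;
    import java.util.Set;

    public class NewFailures
    {
        // Failures seen on the branch that the trunk baseline did not show.
        // Anything left over is either a regression or a brand-new flaky test,
        // and either way deserves a look before merging.
        static Set<String> newFailures(Set<String> branchFailures, Set<String> trunkBaselineFailures)
        {
            Set<String> diff = new HashSet<>(branchFailures);
            diff.removeAll(trunkBaselineFailures);
            return diff;
        }
    }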
Re: [DISCUSS] Releasable trunk and quality
On Wed, Nov 3, 2021 at 12:35 PM bened...@apache.org wrote: > > The largest number of test failures turn out (as pointed out by David) to be > due to how arcane it was to trigger the full test suite. Hopefully we can get > on top of that, but I think a significant remaining issue is a lack of trust > in the output of CI. It's hard to gate commit on a clean CI run when there's > flaky tests, and it doesn't take much to misattribute one failing test to the > existing flakiness (I tend to compare to a run of the trunk baseline for > comparison, but this is burdensome and still error prone). The more flaky > tests there are the more likely this is. > > This is in my opinion the real cost of flaky tests, and it's probably worth > trying to crack down on them hard if we can. It's possible the Simulator may > help here, when I finally finish it up, as we can port flaky tests to run > with the Simulator and the failing seed can then be explored > deterministically (all being well). I totally agree that the lack of trust is a driving problem here, even in knowing which CI system to rely on. When Jenkins broke but Circle was fine, we all assumed it was a problem with Jenkins, right up until Circle also broke. In testing a distributed system like this I think we're always going to have failures, even on non-flaky tests, simply because the underlying infrastructure is variable with transient failures of its own (the network is reliable!) We can fix the flakies where the fault is in the code (and we've done this to many already) but to get more trustworthy output, I think we're going to need a system that understands the difference between success, failure, and timeouts, and in the latter case knows how to at least mark them differently. Simulator may help, as do the in-jvm dtests, but there is ultimately no way to cover everything without doing some things the hard, more realistic way where sometimes shit happens, marring the almost-perfect runs with noisy doubt, which then has to be sifted through to determine if there was a real issue. - To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org For additional commands, e-mail: dev-h...@cassandra.apache.org
Re: [DISCUSS] Releasable trunk and quality
> It's hard to gate commit on a clean CI run when there's flaky tests I agree, this is also why so much effort was done in the 4.0 release to remove as much as possible. Just over 1 month ago we were not really having a flaky test issue (outside of the sporadic timeout issues; my CircleCI runs were green constantly), and now the "flaky tests" I see are all actual bugs (been root causing 2 out of the 3 I reported) and some (not all) of the flakiness was triggered by recent changes in the past month. Right now people do not believe the failing test is caused by their patch and attribute it to flakiness, which then causes the builds to start being flaky, which then leads to a different author coming to fix the issue; this behavior is what I would love to see go away. If we find a flaky test, we should do the following 1) has it already been reported and who is working to fix? Can we block this patch on the test being fixed? Flaky tests due to timing issues normally are resolved very quickly, real bugs take longer. 2) if not reported, why? If you are the first to see this issue then there's a good chance the patch caused the issue, so you should root cause it. If you are not the first to see it, why did others not report it (we tend to be good about this, even to the point Brandon has to mark the new tickets as dups...)? I have committed when there was flakiness, and I have caused flakiness; not saying I am perfect or that I do the above, just saying that if we all moved to the above model we could start relying on CI. The biggest impact to our stability is people actually root causing flaky tests. > I think we're going to need a system that > understands the difference between success, failure, and timeouts I am curious how this system can know that the timeout is not an actual failure. There was a bug in 4.0 with time serialization in message, which would cause the message to get dropped; this presented itself as a timeout if I remember properly (Jon Meredith or Yifan Cai fixed this bug I believe). > On Nov 3, 2021, at 10:56 AM, Brandon Williams wrote: > > On Wed, Nov 3, 2021 at 12:35 PM bened...@apache.org > wrote: >> >> The largest number of test failures turn out (as pointed out by David) to be >> due to how arcane it was to trigger the full test suite. Hopefully we can >> get on top of that, but I think a significant remaining issue is a lack of >> trust in the output of CI. It's hard to gate commit on a clean CI run when >> there's flaky tests, and it doesn't take much to misattribute one failing >> test to the existing flakiness (I tend to compare to a run of the trunk >> baseline for comparison, but this is burdensome and still error prone). The >> more flaky tests there are the more likely this is. >> >> This is in my opinion the real cost of flaky tests, and it's probably worth >> trying to crack down on them hard if we can. It's possible the Simulator may >> help here, when I finally finish it up, as we can port flaky tests to run >> with the Simulator and the failing seed can then be explored >> deterministically (all being well). > > I totally agree that the lack of trust is a driving problem here, even > in knowing which CI system to rely on. When Jenkins broke but Circle > was fine, we all assumed it was a problem with Jenkins, right up until > Circle also broke. > > In testing a distributed system like this I think we're always going > to have failures, even on non-flaky tests, simply because the > underlying infrastructure is variable with transient failures of its > own (the network is reliable!) 
We can fix the flakies where the fault > is in the code (and we've done this to many already) but to get more > trustworthy output, I think we're going to need a system that > understands the difference between success, failure, and timeouts, and > in the latter case knows how to at least mark them differently. > Simulator may help, as do the in-jvm dtests, but there is ultimately > no way to cover everything without doing some things the hard, more > realistic way where sometimes shit happens, marring the almost-perfect > runs with noisy doubt, which then has to be sifted through to > determine if there was a real issue. > > - > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org > For additional commands, e-mail: dev-h...@cassandra.apache.org > - To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org For additional commands, e-mail: dev-h...@cassandra.apache.org
Re: [DISCUSS] Releasable trunk and quality
On Wed, Nov 3, 2021 at 1:26 PM David Capwell wrote: > > I think we're going to need a system that > > understands the difference between success, failure, and timeouts > > > I am curious how this system can know that the timeout is not an actual > failure. There was a bug in 4.0 with time serialization in message, which > would cause the message to get dropped; this presented itself as a timeout if > I remember properly (Jon Meredith or Yifan Cai fixed this bug I believe). I don't think it needs to understand the cause of the timeout, just be able to differentiate. Of course some bugs present as timeouts so an eye will need to be kept on that, but test history can make that simple. - To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org For additional commands, e-mail: dev-h...@cassandra.apache.org
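One way to differentiate without understanding root cause is to bucket results when post-processing the JUnit-style XML most CI runs already produce, and then let per-test history decide whether a timeout for a given test is unusual. The message-matching heuristic below is an assumption made for the sketch, not something the build currently does:

    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class ResultBuckets
    {
        enum Outcome { PASS, FAIL, TIMEOUT }

        // A <failure>/<error> child whose message mentions a timeout is bucketed
        // separately so it can be tracked over time as "timeout" rather than "failure".
        static Outcome classify(Element testcase)
        {
            for (String tag : new String[] { "failure", "error" })
            {
                NodeList problems = testcase.getElementsByTagName(tag);
                if (problems.getLength() == 0)
                    continue;
                String msg = ((Element) problems.item(0)).getAttribute("message").toLowerCase();
                return msg.contains("timed out") || msg.contains("timeout") ? Outcome.TIMEOUT : Outcome.FAIL;
            }
            return Outcome.PASS;
        }

        public static void main(String[] args) throws Exception
        {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File(args[0]));
            NodeList cases = doc.getElementsByTagName("testcase");
            for (int i = 0; i < cases.getLength(); i++)
            {
                Element tc = (Element) cases.item(i);
                System.out.println(classify(tc) + " " + tc.getAttribute("classname") + "." + tc.getAttribute("name"));
            }
        }
    }

Keeping these buckets per test name across successive runs is the "test history" part: a test that has only ever failed by timeout is a very different signal from one that just started doing so.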
The most reliable way to determine the last time node was up
Hi, We see a lot of cases out there when a node was down for longer than the GC period and once that node is up there are a lot of zombie data issues ... you know the story. We would like to implement some kind of a check which would detect this so that node would not start in the first place so no issues would be there at all and it would be up to operators to figure out first what to do with it. There are a couple of ideas we were exploring with various pros and cons and I would like to know what you think about them. 1) Register a shutdown hook on "drain". This is already there (1). "drain" method is doing quite a lot of stuff and this is called on shutdown so our idea is to write a timestamp to system.local into a new column like "lastly_drained" or something like that and it would be read on startup. The disadvantage of this approach, or all approaches via shutdown hooks, is that it will react only on SIGTERM and SIGINT. If that node is killed via SIGKILL, the JVM just stops and there is basically nothing we can guarantee would leave any traces behind. If it is killed and that value is not overwritten, on the next startup it might happen that it would be older than 10 days so it will falsely evaluate it should not be started. 2) Doing this on startup, you would check how old all your sstables and commit logs are, if no file was modified less than 10 days ago you would abort start, there is a pretty big chance that your node did at least something in 10 days, there does not need to be anything added to system tables or similar and it would be just another StartupCheck. The disadvantage of this is that some dev clusters, for example, may run more than 10 days and they are just sitting there doing absolutely nothing at all, nobody interacts with them, nobody is repairing them, they are just sitting there. So when nobody talks to these nodes, no files are modified, right? It seems like there is not a silver bullet here, what is your opinion on this? Regards (1) https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L786-L799 - To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org For additional commands, e-mail: dev-h...@cassandra.apache.org
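Option 2 is cheap to prototype because it only needs file metadata. A rough standalone version of such a check follows; the directory layout, the 10-day threshold and the class name are placeholders, it is not wired into Cassandra's actual startup-check machinery, and it deliberately ignores the idle-dev-cluster caveat raised above:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.time.Duration;
    import java.time.Instant;
    import java.util.stream.Stream;

    public class LastActivityCheck
    {
        // Newest modification time of any regular file under the given directories,
        // e.g. the data file directories and the commitlog directory.
        static Instant newestMtime(Path... dirs) throws IOException
        {
            Instant newest = Instant.EPOCH;
            for (Path dir : dirs)
            {
                if (!Files.isDirectory(dir))
                    continue;
                try (Stream<Path> files = Files.walk(dir))
                {
                    Instant dirNewest = files.filter(Files::isRegularFile)
                                             .map(LastActivityCheck::mtime)
                                             .max(Instant::compareTo)
                                             .orElse(Instant.EPOCH);
                    if (dirNewest.isAfter(newest))
                        newest = dirNewest;
                }
            }
            return newest;
        }

        private static Instant mtime(Path p)
        {
            try { return Files.getLastModifiedTime(p).toInstant(); }
            catch (IOException e) { return Instant.EPOCH; }
        }

        // Refuse startup if nothing on disk has changed within the tombstone GC window.
        static void check(Duration gcGrace, Path... dirs) throws IOException
        {
            Instant newest = newestMtime(dirs);
            if (Instant.now().minus(gcGrace).isAfter(newest))
                throw new IllegalStateException("no data/commitlog file modified since " + newest +
                                                "; node may have been down longer than gc_grace_seconds");
        }

        public static void main(String[] args) throws IOException
        {
            check(Duration.ofDays(10), Paths.get("data"), Paths.get("commitlog"));
        }
    }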
Re: The most reliable way to determine the last time node was up
The third option would be to have some thread running in the background "touching" some (empty) marker file, it is the most simple solution but I do not like the idea of this marker file, it feels dirty, but hey, while it would be opt-in feature for people knowing what they want, why not right ... On Wed, 3 Nov 2021 at 21:53, Stefan Miklosovic wrote: > > Hi, > > We see a lot of cases out there when a node was down for longer than > the GC period and once that node is up there are a lot of zombie data > issues ... you know the story. > > We would like to implement some kind of a check which would detect > this so that node would not start in the first place so no issues > would be there at all and it would be up to operators to figure out > first what to do with it. > > There are a couple of ideas we were exploring with various pros and > cons and I would like to know what you think about them. > > 1) Register a shutdown hook on "drain". This is already there (1). > "drain" method is doing quite a lot of stuff and this is called on > shutdown so our idea is to write a timestamp to system.local into a > new column like "lastly_drained" or something like that and it would > be read on startup. > > The disadvantage of this approach, or all approaches via shutdown > hooks, is that it will only react only on SIGTERM and SIGINT. If that > node is killed via SIGKILL, JVM just stops and there is basically > nothing we have any guarantee of that would leave some traces behind. > > If it is killed and that value is not overwritten, on the next startup > it might happen that it would be older than 10 days so it will falsely > evaluate it should not be started. > > 2) Doing this on startup, you would check how old all your sstables > and commit logs are, if no file was modified less than 10 days ago you > would abort start, there is pretty big chance that your node did at > least something in 10 days, there does not need to be anything added > to system tables or similar and it would be just another StartupCheck. > > The disadvantage of this is that some dev clusters, for example, may > run more than 10 days and they are just sitting there doing absolutely > nothing at all, nobody interacts with them, nobody is repairing them, > they are just sitting there. So when nobody talks to these nodes, no > files are modified, right? > > It seems like there is not a silver bullet here, what is your opinion on this? > > Regards > > (1) > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L786-L799 - To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org For additional commands, e-mail: dev-h...@cassandra.apache.org
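For comparison, the marker-file variant described above is only a scheduled task that bumps a file's mtime plus a read of that mtime at startup. Everything below (names, the one-minute cadence, the Optional return) is made up for the sketch:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.attribute.FileTime;
    import java.time.Duration;
    import java.time.Instant;
    import java.util.Optional;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class LivenessMarker
    {
        private final Path marker;
        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        LivenessMarker(Path marker) { this.marker = marker; }

        // Bump the marker's mtime every minute while the node is alive.
        void start()
        {
            scheduler.scheduleAtFixedRate(() -> {
                try
                {
                    if (!Files.exists(marker))
                        Files.createFile(marker);
                    Files.setLastModifiedTime(marker, FileTime.from(Instant.now()));
                }
                catch (IOException e)
                {
                    // best effort; a missed touch only coarsens the estimate
                }
            }, 0, 1, TimeUnit.MINUTES);
        }

        // At startup: how long has this node been down, as far as the marker knows?
        Optional<Duration> downtime() throws IOException
        {
            if (!Files.exists(marker))
                return Optional.empty();   // first boot, or someone removed the marker
            Instant lastAlive = Files.getLastModifiedTime(marker).toInstant();
            return Optional.of(Duration.between(lastAlive, Instant.now()));
        }
    }

The startup check then becomes: refuse to start if downtime() is present and exceeds gc_grace_seconds.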
Re: The most reliable way to determine the last time node was up
How about a last_checkpoint (or better name) system.local column that is updated periodically (ie. every minute) + on drain? This would give a lower time bound on when the node was last live without requiring an external marker file. On Wed, 3 Nov 2021 at 18:03 Stefan Miklosovic < stefan.mikloso...@instaclustr.com> wrote: > The third option would be to have some thread running in the > background "touching" some (empty) marker file, it is the most simple > solution but I do not like the idea of this marker file, it feels > dirty, but hey, while it would be opt-in feature for people knowing > what they want, why not right ... > > On Wed, 3 Nov 2021 at 21:53, Stefan Miklosovic > wrote: > > > > Hi, > > > > We see a lot of cases out there when a node was down for longer than > > the GC period and once that node is up there are a lot of zombie data > > issues ... you know the story. > > > > We would like to implement some kind of a check which would detect > > this so that node would not start in the first place so no issues > > would be there at all and it would be up to operators to figure out > > first what to do with it. > > > > There are a couple of ideas we were exploring with various pros and > > cons and I would like to know what you think about them. > > > > 1) Register a shutdown hook on "drain". This is already there (1). > > "drain" method is doing quite a lot of stuff and this is called on > > shutdown so our idea is to write a timestamp to system.local into a > > new column like "lastly_drained" or something like that and it would > > be read on startup. > > > > The disadvantage of this approach, or all approaches via shutdown > > hooks, is that it will only react only on SIGTERM and SIGINT. If that > > node is killed via SIGKILL, JVM just stops and there is basically > > nothing we have any guarantee of that would leave some traces behind. > > > > If it is killed and that value is not overwritten, on the next startup > > it might happen that it would be older than 10 days so it will falsely > > evaluate it should not be started. > > > > 2) Doing this on startup, you would check how old all your sstables > > and commit logs are, if no file was modified less than 10 days ago you > > would abort start, there is pretty big chance that your node did at > > least something in 10 days, there does not need to be anything added > > to system tables or similar and it would be just another StartupCheck. > > > > The disadvantage of this is that some dev clusters, for example, may > > run more than 10 days and they are just sitting there doing absolutely > > nothing at all, nobody interacts with them, nobody is repairing them, > > they are just sitting there. So when nobody talks to these nodes, no > > files are modified, right? > > > > It seems like there is not a silver bullet here, what is your opinion on > this? > > > > Regards > > > > (1) > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L786-L799 > > - > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org > For additional commands, e-mail: dev-h...@cassandra.apache.org > >
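Inside the node, Paulo's suggestion could look roughly like the following. The InternalCql interface, the last_checkpoint column and the one-minute cadence are all invented for the sketch; a real patch would use Cassandra's internal query execution and an actual schema change:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class LastCheckpointUpdater
    {
        // Stand-in for however the node executes CQL against its own system tables;
        // this is not an existing Cassandra interface.
        interface InternalCql { void execute(String cql); }

        private final InternalCql cql;
        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        LastCheckpointUpdater(InternalCql cql) { this.cql = cql; }

        // Periodic checkpoint: records "this node was alive at roughly this time".
        void start()
        {
            scheduler.scheduleAtFixedRate(this::checkpoint, 0, 1, TimeUnit.MINUTES);
        }

        // Also called from drain(), so a clean shutdown records an exact timestamp.
        void checkpoint()
        {
            cql.execute("UPDATE system.local SET last_checkpoint = toTimestamp(now()) WHERE key = 'local'");
        }
    }

The catch Stefan raises in the next message is that each such mutation goes through the commit log and eventually a memtable flush, so even a completely idle node would keep generating writes.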
Re: The most reliable way to determine the last time node was up
Yes, this is the combination of the system.local and "marker file" approaches, basically updating that field periodically. However, when there is a mutation done against the system table (in this example), it goes to the commit log and then it will be propagated to an sstable on disk, no? So in our hypothetical scenario, if a node is not touched by anybody, it would still behave like it _does_ something. I would expect that if nobody talks to a node and no operation is running, it does not produce any "side effects". I just do not want to generate any unnecessary noise. A node which does not do anything should not change its data. I am not sure if it is like that already or if an inactive node still writes new sstables after some time, I doubt that. On Wed, 3 Nov 2021 at 22:58, Paulo Motta wrote: > > How about a last_checkpoint (or better name) system.local column that is > updated periodically (ie. every minute) + on drain? This would give a lower > time bound on when the node was last live without requiring an external > marker file. > > On Wed, 3 Nov 2021 at 18:03 Stefan Miklosovic < > stefan.mikloso...@instaclustr.com> wrote: > > > The third option would be to have some thread running in the > > background "touching" some (empty) marker file, it is the most simple > > solution but I do not like the idea of this marker file, it feels > > dirty, but hey, while it would be opt-in feature for people knowing > > what they want, why not right ... > > > > On Wed, 3 Nov 2021 at 21:53, Stefan Miklosovic > > wrote: > > > > > > Hi, > > > > > > We see a lot of cases out there when a node was down for longer than > > > the GC period and once that node is up there are a lot of zombie data > > > issues ... you know the story. > > > > > > We would like to implement some kind of a check which would detect > > > this so that node would not start in the first place so no issues > > > would be there at all and it would be up to operators to figure out > > > first what to do with it. > > > > > > There are a couple of ideas we were exploring with various pros and > > > cons and I would like to know what you think about them. > > > > > > 1) Register a shutdown hook on "drain". This is already there (1). > > > "drain" method is doing quite a lot of stuff and this is called on > > > shutdown so our idea is to write a timestamp to system.local into a > > > new column like "lastly_drained" or something like that and it would > > > be read on startup. > > > > > > The disadvantage of this approach, or all approaches via shutdown > > > hooks, is that it will only react only on SIGTERM and SIGINT. If that > > > node is killed via SIGKILL, JVM just stops and there is basically > > > nothing we have any guarantee of that would leave some traces behind. > > > > > > If it is killed and that value is not overwritten, on the next startup > > > it might happen that it would be older than 10 days so it will falsely > > > evaluate it should not be started. > > > > > > 2) Doing this on startup, you would check how old all your sstables > > > and commit logs are, if no file was modified less than 10 days ago you > > > would abort start, there is pretty big chance that your node did at > > > least something in 10 days, there does not need to be anything added > > > to system tables or similar and it would be just another StartupCheck. 
> > > > > > The disadvantage of this is that some dev clusters, for example, may > > > run more than 10 days and they are just sitting there doing absolutely > > > nothing at all, nobody interacts with them, nobody is repairing them, > > > they are just sitting there. So when nobody talks to these nodes, no > > > files are modified, right? > > > > > > It seems like there is not a silver bullet here, what is your opinion on > > this? > > > > > > Regards > > > > > > (1) > > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L786-L799 > > > > - > > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org > > For additional commands, e-mail: dev-h...@cassandra.apache.org > > > > - To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org For additional commands, e-mail: dev-h...@cassandra.apache.org
Re: The most reliable way to determine the last time node was up
> I would expect that if nobody talks to a node and no operation is running, it does not produce any "side effects". In order to track the last checkpoint timestamp you need to persist it periodically to protect against losing state during an ungraceful shutdown (ie. kill -9). However you're right this may generate tons of sstables if we're persisting it periodically to a system table, even if we skip the commit log. We could tune system.local compaction to use LCS but it would still generate periodic compaction activity. In this case an external marker file sounds much simpler and cleaner. The downsides I see to the marker file approach are: a) External clients cannot query last checkpoint time easily b) The state is lost if the marker file is removed. However we could solve these issues with: a) exposing the info via a system table b) fallback to min(last commitlog/sstable timestamp) I prefer an explicit mechanism to track last checkpoint (ie. marker file) vs implicit min(last commitlog/sstable timestamp) so we don't create unnecessary coupling between different subsystems. Cheers, Paulo On Wed, 3 Nov 2021 at 19:29, Stefan Miklosovic < stefan.mikloso...@instaclustr.com> wrote: > Yes, this is the combination of the system.local and "marker file" > approaches, basically updating that field periodically. > > However, when there is a mutation done against the system table (in > this example), it goes to the commit log and then it will be propagated > to an sstable on disk, no? So in our hypothetical scenario, if a node is > not touched by anybody, it would still behave like it _does_ > something. I would expect that if nobody talks to a node and no > operation is running, it does not produce any "side effects". > > I just do not want to generate any unnecessary noise. A node which > does not do anything should not change its data. I am not sure if it > is like that already or if an inactive node still writes new > sstables after some time, I doubt that. > > On Wed, 3 Nov 2021 at 22:58, Paulo Motta wrote: > > > > How about a last_checkpoint (or better name) system.local column that is > > updated periodically (ie. every minute) + on drain? This would give a > lower > > time bound on when the node was last live without requiring an external > > marker file. > > > > On Wed, 3 Nov 2021 at 18:03 Stefan Miklosovic < > > stefan.mikloso...@instaclustr.com> wrote: > > > > > The third option would be to have some thread running in the > > > background "touching" some (empty) marker file, it is the most simple > > > solution but I do not like the idea of this marker file, it feels > > > dirty, but hey, while it would be opt-in feature for people knowing > > > what they want, why not right ... > > > > > > On Wed, 3 Nov 2021 at 21:53, Stefan Miklosovic > > > wrote: > > > > > > > > Hi, > > > > > > > > We see a lot of cases out there when a node was down for longer than > > > > the GC period and once that node is up there are a lot of zombie data > > > > issues ... you know the story. > > > > > > > > We would like to implement some kind of a check which would detect > > > > this so that node would not start in the first place so no issues > > > > would be there at all and it would be up to operators to figure out > > > > first what to do with it. > > > > > > > > There are a couple of ideas we were exploring with various pros and > > > > cons and I would like to know what you think about them. > > > > > > > > 1) Register a shutdown hook on "drain". This is already there (1). 
> > > > "drain" method is doing quite a lot of stuff and this is called on > > > > shutdown so our idea is to write a timestamp to system.local into a > > > > new column like "lastly_drained" or something like that and it would > > > > be read on startup. > > > > > > > > The disadvantage of this approach, or all approaches via shutdown > > > > hooks, is that it will only react only on SIGTERM and SIGINT. If that > > > > node is killed via SIGKILL, JVM just stops and there is basically > > > > nothing we have any guarantee of that would leave some traces behind. > > > > > > > > If it is killed and that value is not overwritten, on the next > startup > > > > it might happen that it would be older than 10 days so it will > falsely > > > > evaluate it should not be started. > > > > > > > > 2) Doing this on startup, you would check how old all your sstables > > > > and commit logs are, if no file was modified less than 10 days ago > you > > > > would abort start, there is pretty big chance that your node did at > > > > least something in 10 days, there does not need to be anything added > > > > to system tables or similar and it would be just another > StartupCheck. > > > > > > > > The disadvantage of this is that some dev clusters, for example, may > > > > run more than 10 days and they are just sitting there doing > absolutely > > > > nothing at all, nobody interacts with them, nobody is repairing them, > > > > they are just sitting there.
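Combining the two signals Paulo describes is then a small method on top of the earlier sketches: trust the explicit marker when it exists, otherwise fall back to file modification times, whose newest value gives a conservative lower bound on when the node was last up. LastActivityCheck refers to the hypothetical sketch earlier in the thread, not to real code:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.time.Instant;

    public class LastAliveEstimate
    {
        // Prefer the explicit liveness marker; fall back to data/commitlog file
        // metadata when the marker is missing (deleted, or the feature just enabled).
        static Instant lastKnownAlive(Path marker, Path... dataDirs) throws IOException
        {
            if (Files.exists(marker))
                return Files.getLastModifiedTime(marker).toInstant();
            return LastActivityCheck.newestMtime(dataDirs);
        }
    }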
Re: [DISCUSS] Releasable trunk and quality
I agree with David. CI has been pretty reliable besides random Jenkins outages or timeouts. The same 3 or 4 tests were the only flaky ones in Jenkins and Circle was very green. I bisected a couple of failures to legit code errors, David is fixing some more, others have as well, etc. It is good news imo as we're just getting to learn our CI post 4.0 is reliable and we need to start treating it as such and paying attention to its reports. Not perfect but reliable enough it would have prevented those bugs getting merged. In fact we're having this conversation because we noticed CI going from a steady 3-ish failures to many and it's getting fixed. So we're moving in the right direction imo. On 3/11/21 19:25, David Capwell wrote: >> It's hard to gate commit on a clean CI run when there's flaky tests > I agree, this is also why so much effort was done in the 4.0 release to remove as > much as possible. Just over 1 month ago we were not really having a flaky > test issue (outside of the sporadic timeout issues; my CircleCI runs were > green constantly), and now the "flaky tests" I see are all actual bugs (been > root causing 2 out of the 3 I reported) and some (not all) of the flakiness > was triggered by recent changes in the past month. > > Right now people do not believe the failing test is caused by their patch and > attribute it to flakiness, which then causes the builds to start being flaky, > which then leads to a different author coming to fix the issue; this behavior > is what I would love to see go away. If we find a flaky test, we should do > the following > > 1) has it already been reported and who is working to fix? Can we block this > patch on the test being fixed? Flaky tests due to timing issues normally are > resolved very quickly, real bugs take longer. > 2) if not reported, why? If you are the first to see this issue then there's a good > chance the patch caused the issue, so you should root cause it. If you are not the > first to see it, why did others not report it (we tend to be good about this, > even to the point Brandon has to mark the new tickets as dups...)? > > I have committed when there was flakiness, and I have caused flakiness; not > saying I am perfect or that I do the above, just saying that if we all moved > to the above model we could start relying on CI. The biggest impact to our > stability is people actually root causing flaky tests. > >> I think we're going to need a system that >> understands the difference between success, failure, and timeouts > > I am curious how this system can know that the timeout is not an actual > failure. There was a bug in 4.0 with time serialization in message, which > would cause the message to get dropped; this presented itself as a timeout if > I remember properly (Jon Meredith or Yifan Cai fixed this bug I believe). > >> On Nov 3, 2021, at 10:56 AM, Brandon Williams wrote: >> >> On Wed, Nov 3, 2021 at 12:35 PM bened...@apache.org >> wrote: >>> The largest number of test failures turn out (as pointed out by David) to >>> be due to how arcane it was to trigger the full test suite. Hopefully we >>> can get on top of that, but I think a significant remaining issue is a lack >>> of trust in the output of CI. It's hard to gate commit on a clean CI run >>> when there's flaky tests, and it doesn't take much to misattribute one >>> failing test to the existing flakiness (I tend to compare to a run of the >>> trunk baseline for comparison, but this is burdensome and still error >>> prone). The more flaky tests there are the more likely this is. 
>>> >>> This is in my opinion the real cost of flaky tests, and it's probably worth >>> trying to crack down on them hard if we can. It's possible the Simulator >>> may help here, when I finally finish it up, as we can port flaky tests to >>> run with the Simulator and the failing seed can then be explored >>> deterministically (all being well). >> I totally agree that the lack of trust is a driving problem here, even >> in knowing which CI system to rely on. When Jenkins broke but Circle >> was fine, we all assumed it was a problem with Jenkins, right up until >> Circle also broke. >> >> In testing a distributed system like this I think we're always going >> to have failures, even on non-flaky tests, simply because the >> underlying infrastructure is variable with transient failures of its >> own (the network is reliable!) We can fix the flakies where the fault >> is in the code (and we've done this to many already) but to get more >> trustworthy output, I think we're going to need a system that >> understands the difference between success, failure, and timeouts, and >> in the latter case knows how to at least mark them differently. >> Simulator may help, as do the in-jvm dtests, but there is ultimately >> no way to cover everything without doing some things the hard, more >> realistic way where sometimes shit happens, marring the almost-perfect >> runs with noisy doubt, which then has to be
Re: The most reliable way to determine the last time node was up
What about an hourly heartbeat 'lastSeenAlive' timestamp? my 2cts. On 3/11/21 21:53, Stefan Miklosovic wrote: > Hi, > > We see a lot of cases out there when a node was down for longer than > the GC period and once that node is up there are a lot of zombie data > issues ... you know the story. > > We would like to implement some kind of a check which would detect > this so that node would not start in the first place so no issues > would be there at all and it would be up to operators to figure out > first what to do with it. > > There are a couple of ideas we were exploring with various pros and > cons and I would like to know what you think about them. > > 1) Register a shutdown hook on "drain". This is already there (1). > "drain" method is doing quite a lot of stuff and this is called on > shutdown so our idea is to write a timestamp to system.local into a > new column like "lastly_drained" or something like that and it would > be read on startup. > > The disadvantage of this approach, or all approaches via shutdown > hooks, is that it will only react only on SIGTERM and SIGINT. If that > node is killed via SIGKILL, JVM just stops and there is basically > nothing we have any guarantee of that would leave some traces behind. > > If it is killed and that value is not overwritten, on the next startup > it might happen that it would be older than 10 days so it will falsely > evaluate it should not be started. > > 2) Doing this on startup, you would check how old all your sstables > and commit logs are, if no file was modified less than 10 days ago you > would abort start, there is pretty big chance that your node did at > least something in 10 days, there does not need to be anything added > to system tables or similar and it would be just another StartupCheck. > > The disadvantage of this is that some dev clusters, for example, may > run more than 10 days and they are just sitting there doing absolutely > nothing at all, nobody interacts with them, nobody is repairing them, > they are just sitting there. So when nobody talks to these nodes, no > files are modified, right? > > It seems like there is not a silver bullet here, what is your opinion on this? > > Regards > > (1) > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L786-L799 > > - > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org > For additional commands, e-mail: dev-h...@cassandra.apache.org > > . - To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org For additional commands, e-mail: dev-h...@cassandra.apache.org
Re: The most reliable way to determine the last time node was up
Apologies, I missed Paulo's reply on my email client threading funnies... On 4/11/21 7:50, Berenguer Blasi wrote: > What about an hourly heartbeat 'lastSeenAlive' timestamp? my 2cts. > > On 3/11/21 21:53, Stefan Miklosovic wrote: >> Hi, >> >> We see a lot of cases out there when a node was down for longer than >> the GC period and once that node is up there are a lot of zombie data >> issues ... you know the story. >> >> We would like to implement some kind of a check which would detect >> this so that node would not start in the first place so no issues >> would be there at all and it would be up to operators to figure out >> first what to do with it. >> >> There are a couple of ideas we were exploring with various pros and >> cons and I would like to know what you think about them. >> >> 1) Register a shutdown hook on "drain". This is already there (1). >> "drain" method is doing quite a lot of stuff and this is called on >> shutdown so our idea is to write a timestamp to system.local into a >> new column like "lastly_drained" or something like that and it would >> be read on startup. >> >> The disadvantage of this approach, or all approaches via shutdown >> hooks, is that it will only react only on SIGTERM and SIGINT. If that >> node is killed via SIGKILL, JVM just stops and there is basically >> nothing we have any guarantee of that would leave some traces behind. >> >> If it is killed and that value is not overwritten, on the next startup >> it might happen that it would be older than 10 days so it will falsely >> evaluate it should not be started. >> >> 2) Doing this on startup, you would check how old all your sstables >> and commit logs are, if no file was modified less than 10 days ago you >> would abort start, there is pretty big chance that your node did at >> least something in 10 days, there does not need to be anything added >> to system tables or similar and it would be just another StartupCheck. >> >> The disadvantage of this is that some dev clusters, for example, may >> run more than 10 days and they are just sitting there doing absolutely >> nothing at all, nobody interacts with them, nobody is repairing them, >> they are just sitting there. So when nobody talks to these nodes, no >> files are modified, right? >> >> It seems like there is not a silver bullet here, what is your opinion on >> this? >> >> Regards >> >> (1) >> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L786-L799 >> >> - >> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org >> For additional commands, e-mail: dev-h...@cassandra.apache.org >> >> . - To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org For additional commands, e-mail: dev-h...@cassandra.apache.org