Re: [DISCUSS] Releasable trunk and quality
> we noticed CI going from a steady 3-ish failures to many and it's getting fixed. So we're moving in the right direction imo.

An observation about this: there's tooling and technology widely in use to help prevent ever getting into this state (to Benedict's point: blocking merge on CI failure, or nightly tests and reverting regression commits, etc). I think there's significant time and energy savings for us in using automation to be proactive about the quality of our test boards rather than reactive.

I 100% agree that it's heartening to see that the quality of the codebase is improving, as is the discipline / attentiveness of our collective culture. That said, I believe we still have a pretty fragile system when it comes to test failure accumulation.

On Thu, Nov 4, 2021 at 2:46 AM Berenguer Blasi wrote:

> I agree with David. CI has been pretty reliable besides the random jenkins going down or timeouts. The same 3 or 4 tests were the only flaky ones in jenkins and Circle was very green. I bisected a couple of failures to legit code errors, David is fixing some more, others have as well, etc.
>
> It is good news imo, as we're just learning that our CI post 4.0 is reliable and we need to start treating it as such and paying attention to its reports. Not perfect, but reliable enough that it would have prevented those bugs getting merged.
>
> In fact we're having this conversation bc we noticed CI going from a steady 3-ish failures to many and it's getting fixed. So we're moving in the right direction imo.
>
> On 3/11/21 19:25, David Capwell wrote:
> >> It’s hard to gate commit on a clean CI run when there’s flaky tests
> > I agree, and this is also why so much effort was spent in the 4.0 release to remove as many as possible. Just over 1 month ago we were not really having a flaky test issue (outside of the sporadic timeout issues; my CircleCI runs were green constantly), and now the “flaky tests” I see are all actual bugs (I've been root causing 2 of the 3 I reported), and some (not all) of the flakiness was triggered by changes in the past month.
> >
> > Right now people do not believe the failing test is caused by their patch and attribute it to flakiness, which then causes the builds to start being flaky, which then leads to a different author coming to fix the issue; this behavior is what I would love to see go away. If we find a flaky test, we should do the following:
> >
> > 1) Has it already been reported, and who is working to fix it? Can we block this patch on the test being fixed? Flaky tests due to timing issues are normally resolved very quickly; real bugs take longer.
> > 2) If not reported, why? If you are the first to see this issue then there's a good chance the patch caused it, so you should root cause it. If you are not the first to see it, why did others not report it (we tend to be good about this, even to the point Brandon has to mark the new tickets as dups…)?
> >
> > I have committed when there was flakiness, and I have caused flakiness; I'm not saying I am perfect or that I do the above, just that if we all moved to the above model we could start relying on CI. The biggest impact to our stability is people actually root causing flaky tests.
> >
> >> I think we're going to need a system that
> >> understands the difference between success, failure, and timeouts
> >
> > I am curious how this system can know that the timeout is not an actual failure. There was a bug in 4.0 with time serialization in messages which would cause the message to get dropped; this presented itself as a timeout if I remember properly (Jon Meredith or Yifan Cai fixed this bug, I believe).
> >
> >> On Nov 3, 2021, at 10:56 AM, Brandon Williams wrote:
> >>
> >> On Wed, Nov 3, 2021 at 12:35 PM bened...@apache.org <bened...@apache.org> wrote:
> >>> The largest number of test failures turn out (as pointed out by David) to be due to how arcane it was to trigger the full test suite. Hopefully we can get on top of that, but I think a significant remaining issue is a lack of trust in the output of CI. It’s hard to gate commit on a clean CI run when there’s flaky tests, and it doesn’t take much to misattribute one failing test to the existing flakiness (I tend to compare to a run of the trunk baseline for comparison, but this is burdensome and still error prone). The more flaky tests there are, the more likely this is.
> >>>
> >>> This is in my opinion the real cost of flaky tests, and it’s probably worth trying to crack down on them hard if we can. It’s possible the Simulator may help here, when I finally finish it up, as we can port flaky tests to run with the Simulator and the failing seed can then be explored deterministically (all being well).
> >> I totally agree that the lack of trust is a driving problem here, even in knowing which CI system to rely on. When Jenkins broke but Circle was fine, we all assumed it was a problem with Jenkins
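The trunk-baseline comparison Benedict mentions is straightforward to script once failing test names can be exported as plain lists. A minimal sketch, assuming a hypothetical one-test-name-per-line export (not an artifact our CI currently produces):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: compares two plain-text lists of failing test names
// (one name per line) and prints the tests that fail only on the branch run.
// The input format is an assumption made purely for illustration.
public class BaselineDiff
{
    public static void main(String[] args) throws IOException
    {
        Set<String> trunkFailures = new HashSet<>(Files.readAllLines(Paths.get(args[0])));
        Set<String> branchFailures = new HashSet<>(Files.readAllLines(Paths.get(args[1])));

        branchFailures.removeAll(trunkFailures); // keep only failures new to the branch

        if (branchFailures.isEmpty())
        {
            System.out.println("No failures beyond the trunk baseline.");
            return;
        }
        System.out.println("Failures not present in the trunk baseline:");
        branchFailures.forEach(name -> System.out.println("  " + name));
    }
}

Something like this removes the burdensome manual diffing step, but it still depends on having a reasonably recent trunk baseline run to compare against.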
Re: [DISCUSS] Releasable trunk and quality
> Hi all, we already have a way to confirm flakiness on circle by running the test repeatedly N times. Like 100 or 500. That has proven to work very well so far, at least for me. #collaborating #justfyi

I think it would be helpful if we always ran the repeated test jobs at CircleCI when we add a new test or modify an existing one. Running those jobs, when applicable, could be a requirement before committing. This wouldn't help us when the changes affect many different tests or we are not able to identify the tests affected by our changes, but I think it could have prevented many of the recently fixed flakies.

On Thu, 4 Nov 2021 at 12:24, Joshua McKenzie wrote:

> > we noticed CI going from a steady 3-ish failures to many and it's getting fixed. So we're moving in the right direction imo.
>
> An observation about this: there's tooling and technology widely in use to help prevent ever getting into this state (to Benedict's point: blocking merge on CI failure, or nightly tests and reverting regression commits, etc). I think there's significant time and energy savings for us in using automation to be proactive about the quality of our test boards rather than reactive.
>
> I 100% agree that it's heartening to see that the quality of the codebase is improving, as is the discipline / attentiveness of our collective culture. That said, I believe we still have a pretty fragile system when it comes to test failure accumulation.
>
> On Thu, Nov 4, 2021 at 2:46 AM Berenguer Blasi wrote:
>
> > I agree with David. CI has been pretty reliable besides the random jenkins going down or timeouts. The same 3 or 4 tests were the only flaky ones in jenkins and Circle was very green. I bisected a couple of failures to legit code errors, David is fixing some more, others have as well, etc.
> >
> > It is good news imo, as we're just learning that our CI post 4.0 is reliable and we need to start treating it as such and paying attention to its reports. Not perfect, but reliable enough that it would have prevented those bugs getting merged.
> >
> > In fact we're having this conversation bc we noticed CI going from a steady 3-ish failures to many and it's getting fixed. So we're moving in the right direction imo.
> >
> > On 3/11/21 19:25, David Capwell wrote:
> > >> It’s hard to gate commit on a clean CI run when there’s flaky tests
> > > I agree, and this is also why so much effort was spent in the 4.0 release to remove as many as possible. Just over 1 month ago we were not really having a flaky test issue (outside of the sporadic timeout issues; my CircleCI runs were green constantly), and now the “flaky tests” I see are all actual bugs (I've been root causing 2 of the 3 I reported), and some (not all) of the flakiness was triggered by changes in the past month.
> > >
> > > Right now people do not believe the failing test is caused by their patch and attribute it to flakiness, which then causes the builds to start being flaky, which then leads to a different author coming to fix the issue; this behavior is what I would love to see go away. If we find a flaky test, we should do the following:
> > >
> > > 1) Has it already been reported, and who is working to fix it? Can we block this patch on the test being fixed? Flaky tests due to timing issues are normally resolved very quickly; real bugs take longer.
> > > 2) If not reported, why? If you are the first to see this issue then there's a good chance the patch caused it, so you should root cause it. If you are not the first to see it, why did others not report it (we tend to be good about this, even to the point Brandon has to mark the new tickets as dups…)?
> > >
> > > I have committed when there was flakiness, and I have caused flakiness; I'm not saying I am perfect or that I do the above, just that if we all moved to the above model we could start relying on CI. The biggest impact to our stability is people actually root causing flaky tests.
> > >
> > >> I think we're going to need a system that
> > >> understands the difference between success, failure, and timeouts
> > >
> > > I am curious how this system can know that the timeout is not an actual failure. There was a bug in 4.0 with time serialization in messages which would cause the message to get dropped; this presented itself as a timeout if I remember properly (Jon Meredith or Yifan Cai fixed this bug, I believe).
> > >
> > >> On Nov 3, 2021, at 10:56 AM, Brandon Williams wrote:
> > >>
> > >> On Wed, Nov 3, 2021 at 12:35 PM bened...@apache.org <bened...@apache.org> wrote:
> > >>> The largest number of test failures turn out (as pointed out by David) to be due to how arcane it was to trigger the full test suite. Hopefully we can get on top of that, but I think a significant remaining issue is a lack of trust in the output of CI. It’s hard to gate commit
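The "run it repeatedly N times" idea is also easy to exercise locally while root causing. Purely as an illustration of the approach (this is not how the CircleCI repeat jobs are actually implemented), a suspect test can be hammered with a few lines of plain JUnit 4; the class name and iteration count below are placeholders:

import org.junit.runner.JUnitCore;
import org.junit.runner.Result;

// Rough local illustration of "repeat the test N times to confirm flakiness".
// Pass the fully qualified test class name and, optionally, an iteration count.
public class RepeatRunner
{
    public static void main(String[] args) throws ClassNotFoundException
    {
        Class<?> testClass = Class.forName(args[0]); // e.g. a placeholder like org.apache.cassandra.SomeFlakyTest
        int iterations = args.length > 1 ? Integer.parseInt(args[1]) : 100;

        int failures = 0;
        for (int i = 0; i < iterations; i++)
        {
            Result result = JUnitCore.runClasses(testClass);
            if (!result.wasSuccessful())
                failures++;
        }
        System.out.printf("%d/%d runs failed%n", failures, iterations);
    }
}

A zero-failure count over a few hundred runs is decent evidence the test is stable; even a single failure gives a reproduction to root cause rather than a shrugged-off "flaky" label.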
Re: The most reliable way to determine the last time node was up
To deal with this, I've just made a very small Bash script that looks at commitlog age, then set the script as an "ExecStartPre=" in systemd:

if [[ -d '/opt/cassandra/data/data' && $(/usr/bin/find /opt/cassandra/data/commitlog/ -name 'CommitLog*.log' -mtime -8 | wc -l) -eq 0 ]]; then
    >&2 echo "ERROR: precheck failed, Cassandra data too old"
    exit 10
fi

The first conditional is to reduce false-positives on brand new machines with no data. I suspect it'll false-positive if your writes are extremely rare (that is, basically read-only), but at that point you may not need it at all. (Adjust as needed for your grace period and paths.)

On Thu, Nov 4, 2021 at 12:54 AM Berenguer Blasi wrote:

> Apologies, I missed Paulo's reply on my email client threading funnies...
>
> On 4/11/21 7:50, Berenguer Blasi wrote:
> > What about an hourly heartbeat 'lastSeenAlive' timestamp? my 2cts.
> >
> > On 3/11/21 21:53, Stefan Miklosovic wrote:
> >> Hi,
> >>
> >> We see a lot of cases out there when a node was down for longer than the GC period, and once that node is up there are a lot of zombie data issues ... you know the story.
> >>
> >> We would like to implement some kind of a check which would detect this, so that the node would not start in the first place, no issues would arise at all, and it would be up to operators to figure out first what to do with it.
> >>
> >> There are a couple of ideas we were exploring, with various pros and cons, and I would like to know what you think about them.
> >>
> >> 1) Register a shutdown hook on "drain". This is already there (1). The "drain" method does quite a lot of stuff and is called on shutdown, so our idea is to write a timestamp to system.local into a new column like "lastly_drained" or something like that, and it would be read on startup.
> >>
> >> The disadvantage of this approach, or all approaches via shutdown hooks, is that it will only react to SIGTERM and SIGINT. If the node is killed via SIGKILL, the JVM just stops and we have basically no guarantee that anything would leave traces behind.
> >>
> >> If it is killed and that value is not overwritten, on the next startup it might happen that the value is older than 10 days, so it will falsely evaluate that the node should not be started.
> >>
> >> 2) Doing this on startup: you would check how old all your sstables and commit logs are, and if no file was modified less than 10 days ago you would abort the start. There is a pretty big chance that your node did at least something in 10 days, nothing needs to be added to system tables or similar, and it would be just another StartupCheck.
> >>
> >> The disadvantage of this is that some dev clusters, for example, may run for more than 10 days while just sitting there doing absolutely nothing at all; nobody interacts with them, nobody repairs them, they just sit there. So when nobody talks to these nodes, no files are modified, right?
> >>
> >> It seems like there is not a silver bullet here; what is your opinion on this?
> >>
> >> Regards
> >>
> >> (1) https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L786-L799
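For the first approach Stefan describes, the core is simply "persist a timestamp on clean shutdown and compare it at startup". A minimal sketch of that logic, assuming a marker file instead of the proposed system.local column (the path and the 10-day threshold are placeholders, and this is not wired into StorageService.drain()):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.Instant;

// Sketch of option 1: record a timestamp on clean shutdown and refuse to start
// if it is older than the grace period. A marker file stands in for the proposed
// system.local column; path and threshold are illustrative assumptions.
public class LastDrainedMarker
{
    private static final Path MARKER = Paths.get("/opt/cassandra/data/last_drained"); // placeholder path
    private static final Duration MAX_DOWNTIME = Duration.ofDays(10);                 // placeholder for gc grace

    // Called from the drain / shutdown path.
    public static void recordDrain() throws IOException
    {
        Files.write(MARKER, Instant.now().toString().getBytes());
    }

    // Called as a startup check; returns false if the node appears to have been down too long.
    public static boolean startupAllowed() throws IOException
    {
        if (!Files.exists(MARKER))
            return true; // brand new node, or no drain has ever run: nothing to compare against

        Instant lastDrained = Instant.parse(new String(Files.readAllBytes(MARKER)));
        return Duration.between(lastDrained, Instant.now()).compareTo(MAX_DOWNTIME) <= 0;
    }
}

The SIGKILL caveat from the thread applies unchanged: a node killed hard keeps its old timestamp, so this check can produce the false positive Stefan describes.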
Re: The most reliable way to determine the last time node was up
If you always drain, you won't have any commit logs.

On Thu, Nov 4, 2021 at 2:57 PM Elliott Sims wrote:

> To deal with this, I've just made a very small Bash script that looks at commitlog age, then set the script as an "ExecStartPre=" in systemd:
>
> if [[ -d '/opt/cassandra/data/data' && $(/usr/bin/find /opt/cassandra/data/commitlog/ -name 'CommitLog*.log' -mtime -8 | wc -l) -eq 0 ]]; then
>     >&2 echo "ERROR: precheck failed, Cassandra data too old"
>     exit 10
> fi
>
> The first conditional is to reduce false-positives on brand new machines with no data. I suspect it'll false-positive if your writes are extremely rare (that is, basically read-only), but at that point you may not need it at all. (Adjust as needed for your grace period and paths.)
>
> On Thu, Nov 4, 2021 at 12:54 AM Berenguer Blasi wrote:
>
> > Apologies, I missed Paulo's reply on my email client threading funnies...
> >
> > On 4/11/21 7:50, Berenguer Blasi wrote:
> > > What about an hourly heartbeat 'lastSeenAlive' timestamp? my 2cts.
> > >
> > > On 3/11/21 21:53, Stefan Miklosovic wrote:
> > >> Hi,
> > >>
> > >> We see a lot of cases out there when a node was down for longer than the GC period, and once that node is up there are a lot of zombie data issues ... you know the story.
> > >>
> > >> We would like to implement some kind of a check which would detect this, so that the node would not start in the first place, no issues would arise at all, and it would be up to operators to figure out first what to do with it.
> > >>
> > >> There are a couple of ideas we were exploring, with various pros and cons, and I would like to know what you think about them.
> > >>
> > >> 1) Register a shutdown hook on "drain". This is already there (1). The "drain" method does quite a lot of stuff and is called on shutdown, so our idea is to write a timestamp to system.local into a new column like "lastly_drained" or something like that, and it would be read on startup.
> > >>
> > >> The disadvantage of this approach, or all approaches via shutdown hooks, is that it will only react to SIGTERM and SIGINT. If the node is killed via SIGKILL, the JVM just stops and we have basically no guarantee that anything would leave traces behind.
> > >>
> > >> If it is killed and that value is not overwritten, on the next startup it might happen that the value is older than 10 days, so it will falsely evaluate that the node should not be started.
> > >>
> > >> 2) Doing this on startup: you would check how old all your sstables and commit logs are, and if no file was modified less than 10 days ago you would abort the start. There is a pretty big chance that your node did at least something in 10 days, nothing needs to be added to system tables or similar, and it would be just another StartupCheck.
> > >>
> > >> The disadvantage of this is that some dev clusters, for example, may run for more than 10 days while just sitting there doing absolutely nothing at all; nobody interacts with them, nobody repairs them, they just sit there. So when nobody talks to these nodes, no files are modified, right?
> > >>
> > >> It seems like there is not a silver bullet here; what is your opinion on this?
> > >>
> > >> Regards
> > >>
> > >> (1) https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L786-L799
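Jeff's point is the main gap in a commitlog-only check: a cleanly drained node may have no commit logs at all, so the file-age approach probably has to consider sstables as well. A rough sketch of that combined check, assuming illustrative paths and a fixed 10-day threshold (it is not wired into Cassandra's real startup checks):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch of option 2: refuse to start if the newest file under the data and
// commitlog directories is older than the grace period. Scanning sstables as
// well as the commitlog avoids the "drained nodes have no commit logs" problem.
// Paths and the threshold are placeholders for illustration only.
public class DataAgeCheck
{
    private static final Duration MAX_AGE = Duration.ofDays(10); // placeholder for gc grace period

    public static boolean startupAllowed(Path... dirs) throws IOException
    {
        Instant newest = Instant.MIN;
        boolean sawAnyFile = false;

        for (Path dir : dirs)
        {
            if (!Files.isDirectory(dir))
                continue;

            List<Path> files;
            try (Stream<Path> stream = Files.walk(dir))
            {
                files = stream.filter(Files::isRegularFile).collect(Collectors.toList());
            }
            for (Path file : files)
            {
                sawAnyFile = true;
                Instant modified = Files.getLastModifiedTime(file).toInstant();
                if (modified.isAfter(newest))
                    newest = modified;
            }
        }

        // A brand new node with no data yet should always be allowed to start.
        if (!sawAnyFile)
            return true;

        return Duration.between(newest, Instant.now()).compareTo(MAX_AGE) <= 0;
    }

    public static void main(String[] args) throws IOException
    {
        boolean ok = startupAllowed(Paths.get("/opt/cassandra/data/data"),
                                    Paths.get("/opt/cassandra/data/commitlog"));
        if (!ok)
        {
            System.err.println("ERROR: newest data file is older than the grace period");
            System.exit(10);
        }
    }
}

The dev-cluster caveat from the thread still applies: an idle but healthy node modifies nothing for days, so any check like this needs an operator override rather than a hard refusal.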