Re: [DISCUSS] Releasable trunk and quality

2021-11-04 Thread Joshua McKenzie
>
> we noticed CI going from a
> steady 3-ish failures to many and it's getting fixed. So we're moving in
> the right direction imo.
>
An observation about this: there's tooling and technology widely in use to
help prevent ever getting into this state (to Benedict's point: blocking
merge on CI failure, or nightly tests and reverting regression commits,
etc). I think there's significant time and energy savings for us in using
automation to be proactive about the quality of our test boards rather than
reactive.
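
For illustration, a minimal sketch of what such a gate could look like: a
shell step that refuses to merge unless the branch's last completed Jenkins
build was green. The host and job name below are placeholders, and our
actual job layout may differ:

#!/usr/bin/env bash
# Sketch of a pre-merge gate: block the merge unless the last completed
# CI run of the target job finished SUCCESS. JENKINS_URL and JOB are
# assumptions for this example.
JENKINS_URL="https://ci-cassandra.apache.org"
JOB="Cassandra-trunk"

result=$(curl -sf "${JENKINS_URL}/job/${JOB}/lastCompletedBuild/api/json?tree=result" \
  | grep -o '"result":"[A-Z_]*"' | cut -d'"' -f4)

if [[ "${result}" != "SUCCESS" ]]; then
  >&2 echo "Last ${JOB} build was '${result:-unknown}'; fix CI before merging."
  exit 1
fi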

I 100% agree that it's heartening to see that the quality of the codebase
is improving as is the discipline / attentiveness of our collective
culture. That said, I believe we still have a pretty fragile system when it
comes to test failure accumulation.

On Thu, Nov 4, 2021 at 2:46 AM Berenguer Blasi 
wrote:

> I agree with David. CI has been pretty reliable besides the occasional
> Jenkins outage or timeout. The same 3 or 4 tests were the only flaky
> ones in Jenkins, and Circle was very green. I bisected a couple of failures
> to legit code errors, David is fixing some more, others have as well, etc.
>
> It is good news imo as we're just getting to learn that our CI post 4.0 is
> reliable, and we need to start treating it as such and paying attention to
> its reports. Not perfect, but reliable enough that it would have prevented
> those bugs getting merged.
>
> In fact we're having this conversation bc we noticed CI going from a
> steady 3-ish failures to many and it's getting fixed. So we're moving in
> the right direction imo.
>
> On 3/11/21 19:25, David Capwell wrote:
> >> It’s hard to gate commit on a clean CI run when there’s flaky tests
> > I agree, this is also why so much effort went into the 4.0 release to
> remove as many as possible.  Just over 1 month ago we were not really
> having a flaky test issue (outside of the sporadic timeout issues; my
> circle ci runs were green constantly), and now the “flaky tests” I see are
> all actual bugs (I've been root causing 2 of the 3 I reported) and some (not
> all) of the flakiness was triggered by recent changes in the past month.
> >
> > Right now people do not believe the failing test is caused by their
> patch and attribute it to flakiness, which then causes the builds to start
> being flaky, which then leads to a different author coming to fix the
> issue; this behavior is what I would love to see go away.  If we find a
> flaky test, we should do the following
> >
> > 1) Has it already been reported, and who is working to fix it?  Can we block
> this patch on the test being fixed?  Flaky tests due to timing issues
> are normally resolved very quickly; real bugs take longer.
> > 2) If not reported, why?  If you are the first to see this issue, then there's
> a good chance the patch caused it, so you should root cause it.  If you are
> not the first to see it, why did others not report it (we tend to be good
> about this, even to the point Brandon has to mark the new tickets as dups…)?
> >
> > I have committed when there was flakiness, and I have caused flakiness;
> not saying I am perfect or that I do the above, just saying that if we all
> moved to the above model we could start relying on CI.  The biggest impact
> to our stability is people actually root causing flaky tests.
> >
> >>  I think we're going to need a system that
> >> understands the difference between success, failure, and timeouts
> >
> > I am curious how this system can know that the timeout is not an actual
> failure.  There was a bug in 4.0 with time serialization in messaging, which
> would cause the message to get dropped; this presented itself as a timeout
> if I remember correctly (Jon Meredith or Yifan Cai fixed this bug, I believe).
> >
> >> On Nov 3, 2021, at 10:56 AM, Brandon Williams  wrote:
> >>
> >> On Wed, Nov 3, 2021 at 12:35 PM bened...@apache.org <
> bened...@apache.org> wrote:
> >>> The largest number of test failures turns out (as pointed out by David)
> to be due to how arcane it was to trigger the full test suite. Hopefully we
> can get on top of that, but I think a significant remaining issue is a lack
> of trust in the output of CI. It’s hard to gate commit on a clean CI run
> when there’s flaky tests, and it doesn’t take much to misattribute one
> failing test to the existing flakiness (I tend to compare against a run of the
> trunk baseline, but this is burdensome and still error
> prone). The more flaky tests there are, the more likely this is.
> >>>
> >>> This is in my opinion the real cost of flaky tests, and it’s probably
> worth trying to crack down on them hard if we can. It’s possible the
> Simulator may help here, when I finally finish it up, as we can port flaky
> tests to run with the Simulator and the failing seed can then be explored
> deterministically (all being well).
> >> I totally agree that the lack of trust is a driving problem here, even
> >> in knowing which CI system to rely on. When Jenkins broke but Circle
> >> was fine, we all assumed it was a problem with Jenk

Re: [DISCUSS] Releasable trunk and quality

2021-11-04 Thread Andrés de la Peña
Hi all,

we already have a way to confirm flakiness on circle by running the test
> repeatedly N times. Like 100 or 500. That has proven to work very well
> so far, at least for me. #collaborating #justfyi


I think it would be helpful if we always ran the repeated test jobs at
CircleCI when we add a new test or modify an existing one. Running those
jobs, when applicable, could be a requirement before committing. This
wouldn't help us when the changes affect many different tests or we are not
able to identify the tests affected by our changes, but I think it could
have prevented many of the recently fixed flakies.
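
For local use, a simple loop can approximate those repeated-run jobs; a
rough sketch below, where the ant target and the test class name are just
placeholders to adjust to the build:

# Run one suspect test class N times and stop at the first failure.
N=100
for i in $(seq 1 "${N}"); do
  echo "=== run ${i}/${N} ==="
  ant testsome -Dtest.name=org.apache.cassandra.cql3.SomeSuspectTest \
    || { echo "Failed on run ${i}"; exit 1; }
done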


On Thu, 4 Nov 2021 at 12:24, Joshua McKenzie  wrote:

> >
> > we noticed CI going from a
> > steady 3-ish failures to many and it's getting fixed. So we're moving in
> > the right direction imo.
> >
> An observation about this: there's tooling and technology widely in use to
> help prevent ever getting into this state (to Benedict's point: blocking
> merge on CI failure, or nightly tests and reverting regression commits,
> etc). I think there's significant time and energy savings for us in using
> automation to be proactive about the quality of our test boards rather than
> reactive.
>
> I 100% agree that it's heartening to see that the quality of the codebase
> is improving as is the discipline / attentiveness of our collective
> culture. That said, I believe we still have a pretty fragile system when it
> comes to test failure accumulation.
>
> On Thu, Nov 4, 2021 at 2:46 AM Berenguer Blasi 
> wrote:
>
> > I agree with David. CI has been pretty reliable besides the occasional
> > Jenkins outage or timeout. The same 3 or 4 tests were the only flaky
> > ones in Jenkins, and Circle was very green. I bisected a couple of failures
> > to legit code errors, David is fixing some more, others have as well, etc.
> >
> > It is good news imo as we're just getting to learn that our CI post 4.0 is
> > reliable, and we need to start treating it as such and paying attention to
> > its reports. Not perfect, but reliable enough that it would have prevented
> > those bugs getting merged.
> >
> > In fact we're having this conversation bc we noticed CI going from a
> > steady 3-ish failures to many and it's getting fixed. So we're moving in
> > the right direction imo.
> >
> > On 3/11/21 19:25, David Capwell wrote:
> > >> It’s hard to gate commit on a clean CI run when there’s flaky tests
> > > I agree, this is also why so much effort went into the 4.0 release to
> > remove as many as possible.  Just over 1 month ago we were not really
> > having a flaky test issue (outside of the sporadic timeout issues; my
> > circle ci runs were green constantly), and now the “flaky tests” I see are
> > all actual bugs (I've been root causing 2 of the 3 I reported) and some (not
> > all) of the flakiness was triggered by recent changes in the past month.
> > >
> > > Right now people do not believe the failing test is caused by their
> > patch and attribute it to flakiness, which then causes the builds to start
> > being flaky, which then leads to a different author coming to fix the
> > issue; this behavior is what I would love to see go away.  If we find a
> > flaky test, we should do the following
> > >
> > > 1) Has it already been reported, and who is working to fix it?  Can we
> > block this patch on the test being fixed?  Flaky tests due to timing issues
> > are normally resolved very quickly; real bugs take longer.
> > > 2) If not reported, why?  If you are the first to see this issue, then there's
> > a good chance the patch caused it, so you should root cause it.  If you are
> > not the first to see it, why did others not report it (we tend to be good
> > about this, even to the point Brandon has to mark the new tickets as dups…)?
> > >
> > > I have committed when there was flakiness, and I have caused flakiness;
> > not saying I am perfect or that I do the above, just saying that if we all
> > moved to the above model we could start relying on CI.  The biggest impact
> > to our stability is people actually root causing flaky tests.
> > >
> > >>  I think we're going to need a system that
> > >> understands the difference between success, failure, and timeouts
> > >
> > > I am curious how this system can know that the timeout is not an actual
> > failure.  There was a bug in 4.0 with time serialization in messaging, which
> > would cause the message to get dropped; this presented itself as a timeout
> > if I remember correctly (Jon Meredith or Yifan Cai fixed this bug, I believe).
> > >
> > >> On Nov 3, 2021, at 10:56 AM, Brandon Williams 
> wrote:
> > >>
> > >> On Wed, Nov 3, 2021 at 12:35 PM bened...@apache.org <
> > bened...@apache.org> wrote:
> > >>> The largest number of test failures turns out (as pointed out by David)
> > to be due to how arcane it was to trigger the full test suite. Hopefully we
> > can get on top of that, but I think a significant remaining issue is a lack
> > of trust in the output of CI. It’s hard to gate com

Re: The most reliable way to determine the last time node was up

2021-11-04 Thread Elliott Sims
To deal with this, I've just made a very small Bash script that looks at
commitlog age, then set the script as an "ExecStartPre=" in systemd:

if [[ -d '/opt/cassandra/data/data' && $(/usr/bin/find
/opt/cassandra/data/commitlog/ -name 'CommitLog*.log' -mtime -8 | wc -l)
-eq 0 ]]; then
  >&2  echo "ERROR: precheck failed, Cassandra data too old"
  exit 10
fi

The first conditional is to reduce false positives on brand-new machines with
no data.
I suspect it'll false-positive if your writes are extremely rare (that is,
basically read-only), but at that point you may not need it at all.
(adjust as needed for your grace period and paths)
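
For reference, roughly how that is wired up, assuming the script above is
saved as /usr/local/bin/cassandra-precheck.sh (a placeholder path) and added
as a drop-in override for the cassandra unit:

# /etc/systemd/system/cassandra.service.d/precheck.conf  (example drop-in)
[Service]
# A non-zero exit from ExecStartPre= keeps systemd from starting the service.
ExecStartPre=/usr/local/bin/cassandra-precheck.sh

Run "systemctl daemon-reload" afterwards so systemd picks up the drop-in.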

On Thu, Nov 4, 2021 at 12:54 AM Berenguer Blasi 
wrote:

> Apologies, I missed Paulo's reply due to my email client's threading funnies...
>
> On 4/11/21 7:50, Berenguer Blasi wrote:
> > What about an hourly heartbeat 'lastSeenAlive' timestamp? my 2cts.
> >
> > On 3/11/21 21:53, Stefan Miklosovic wrote:
> >> Hi,
> >>
> >> We see a lot of cases out there where a node was down for longer than
> >> the GC grace period, and once that node is back up there are a lot of
> >> zombie data issues ... you know the story.
> >>
> >> We would like to implement some kind of check which would detect
> >> this so that the node would not start in the first place, no issues
> >> would arise at all, and it would be up to operators to figure out
> >> what to do with it first.
> >>
> >> There are a couple of ideas we were exploring with various pros and
> >> cons and I would like to know what you think about them.
> >>
> >> 1) Register a shutdown hook on "drain". This is already there (1).
> >> The "drain" method does quite a lot of stuff and is called on
> >> shutdown, so our idea is to write a timestamp to system.local into a
> >> new column like "lastly_drained" or something like that, and it would
> >> be read on startup.
> >>
> >> The disadvantage of this approach, or of all approaches via shutdown
> >> hooks, is that it will only react to SIGTERM and SIGINT. If that
> >> node is killed via SIGKILL, the JVM just stops and there is basically
> >> nothing we can guarantee would leave any traces behind.
> >>
> >> If it is killed and that value is never overwritten, on the next startup
> >> the timestamp might be older than 10 days, so the check would falsely
> >> conclude that the node should not be started.
> >>
> >> 2) Do this on startup: you would check how old all your sstables
> >> and commit logs are, and if no file was modified less than 10 days ago
> >> you would abort the start. There is a pretty big chance that your node
> >> did at least something in 10 days, nothing needs to be added to system
> >> tables or similar, and it would be just another StartupCheck.
> >>
> >> The disadvantage of this is that some dev clusters, for example, may
> >> run for more than 10 days while just sitting there doing absolutely
> >> nothing at all: nobody interacts with them, nobody is repairing them,
> >> they are just sitting there. So when nobody talks to these nodes, no
> >> files are modified, right?
> >>
> >> It seems like there is no silver bullet here; what is your opinion
> >> on this?
> >>
> >> Regards
> >>
> >> (1)
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L786-L799
> >>


Re: The most reliable way to determine the last time node was up

2021-11-04 Thread Brandon Williams
If you always drain you won't have any commit logs.

On Thu, Nov 4, 2021 at 2:57 PM Elliott Sims  wrote:
>
> To deal with this, I've just made a very small Bash script that looks at
> commitlog age, then set the script as an "ExecStartPre=" in systemd:
>
> if [[ -d '/opt/cassandra/data/data' && $(/usr/bin/find
> /opt/cassandra/data/commitlog/ -name 'CommitLog*.log' -mtime -8 | wc -l)
> -eq 0 ]]; then
>   >&2  echo "ERROR: precheck failed, Cassandra data too old"
>   exit 10
> fi
>
> The first conditional is to reduce false positives on brand-new machines with
> no data.
> I suspect it'll false-positive if your writes are extremely rare (that is,
> basically read-only), but at that point you may not need it at all.
> (adjust as needed for your grace period and paths)
>
> On Thu, Nov 4, 2021 at 12:54 AM Berenguer Blasi 
> wrote:
>
> > Apologies, I missed Paulo's reply due to my email client's threading funnies...
> >
> > On 4/11/21 7:50, Berenguer Blasi wrote:
> > > What about an hourly heartbeat 'lastSeenAlive' timestamp? my 2cts.
> > >
> > > On 3/11/21 21:53, Stefan Miklosovic wrote:
> > >> Hi,
> > >>
> > >> We see a lot of cases out there where a node was down for longer than
> > >> the GC grace period, and once that node is back up there are a lot of
> > >> zombie data issues ... you know the story.
> > >>
> > >> We would like to implement some kind of check which would detect
> > >> this so that the node would not start in the first place, no issues
> > >> would arise at all, and it would be up to operators to figure out
> > >> what to do with it first.
> > >>
> > >> There are a couple of ideas we were exploring with various pros and
> > >> cons and I would like to know what you think about them.
> > >>
> > >> 1) Register a shutdown hook on "drain". This is already there (1).
> > >> The "drain" method does quite a lot of stuff and is called on
> > >> shutdown, so our idea is to write a timestamp to system.local into a
> > >> new column like "lastly_drained" or something like that, and it would
> > >> be read on startup.
> > >>
> > >> The disadvantage of this approach, or of all approaches via shutdown
> > >> hooks, is that it will only react to SIGTERM and SIGINT. If that
> > >> node is killed via SIGKILL, the JVM just stops and there is basically
> > >> nothing we can guarantee would leave any traces behind.
> > >>
> > >> If it is killed and that value is never overwritten, on the next startup
> > >> the timestamp might be older than 10 days, so the check would falsely
> > >> conclude that the node should not be started.
> > >>
> > >> 2) Do this on startup: you would check how old all your sstables
> > >> and commit logs are, and if no file was modified less than 10 days ago
> > >> you would abort the start. There is a pretty big chance that your node
> > >> did at least something in 10 days, nothing needs to be added to system
> > >> tables or similar, and it would be just another StartupCheck.
> > >>
> > >> The disadvantage of this is that some dev clusters, for example, may
> > >> run for more than 10 days while just sitting there doing absolutely
> > >> nothing at all: nobody interacts with them, nobody is repairing them,
> > >> they are just sitting there. So when nobody talks to these nodes, no
> > >> files are modified, right?
> > >>
> > >> It seems like there is no silver bullet here; what is your opinion
> > >> on this?
> > >>
> > >> Regards
> > >>
> > >> (1)
> > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L786-L799
> > >>
