Re: The most reliable way to determine the last time node was up

2021-11-03 Thread Berenguer Blasi
Apologies, I missed Paulo's reply due to my email client's threading funnies... On 4/11/21 7:50, Berenguer Blasi wrote: > What about an hourly heartbeat 'lastSeenAlive' timestamp? my 2cts. > > On 3/11/21 21:53, Stefan Miklosovic wrote: >> Hi, >> >> We see a lot of cases out there when a node was down fo

Re: The most reliable way to determine the last time node was up

2021-11-03 Thread Berenguer Blasi
What about an hourly heartbeat 'lastSeenAlive' timestamp? my 2cts. On 3/11/21 21:53, Stefan Miklosovic wrote: > Hi, > > We see a lot of cases out there when a node was down for longer than > the GC period and once that node is up there are a lot of zombie data > issues ... you know the story. > >
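
A minimal sketch of the hourly-heartbeat idea being floated here, assuming a plain scheduled task (class and method names such as LastSeenAliveHeartbeat are illustrative, not actual Cassandra internals; where the timestamp gets persisted is deliberately left open, since that is exactly what the rest of the thread debates):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class LastSeenAliveHeartbeat
    {
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        public void start()
        {
            // Persist a "last seen alive" timestamp once per hour; after a crash the
            // stored value is a lower bound on when the node was last up.
            scheduler.scheduleAtFixedRate(this::persistLastSeenAlive, 0, 1, TimeUnit.HOURS);
        }

        public void stop()
        {
            scheduler.shutdownNow();
        }

        private void persistLastSeenAlive()
        {
            long nowMillis = System.currentTimeMillis();
            store(nowMillis); // storage target (system table, marker file, ...) intentionally left open
        }

        private void store(long timestampMillis)
        {
            // hypothetical persistence hook
        }
    }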

Re: [DISCUSS] Releasable trunk and quality

2021-11-03 Thread Berenguer Blasi
I agree with David. CI has been pretty reliable besides the random Jenkins going down or timing out. The same 3 or 4 tests were the only flaky ones in Jenkins, and Circle was very green. I bisected a couple of failures to legit code errors, David is fixing some more, others have as well, etc. It is good n

Re: The most reliable way to determine the last time node was up

2021-11-03 Thread Paulo Motta
> I would expect that if nobody talks to a node and no operation is running, it does not produce any "side effects". In order to track the last checkpoint timestamp you need to persist it periodically to protect against losing state during an ungraceful shutdown (i.e. kill -9). However you're righ

Re: The most reliable way to determine the last time node was up

2021-11-03 Thread Stefan Miklosovic
Yes, this is a combination of the system.local and "marker file" approaches, basically updating that field periodically. However, when there is a mutation against the system table (in this example), it goes to the commit log and is then propagated to an sstable on disk, no? So in our hypothetic

Re: The most reliable way to determine the last time node was up

2021-11-03 Thread Paulo Motta
How about a last_checkpoint (or a better name) system.local column that is updated periodically (i.e. every minute) + on drain? This would give a lower time bound on when the node was last live without requiring an external marker file. On Wed, 3 Nov 2021 at 18:03 Stefan Miklosovic < stefan.mikloso..
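
A minimal sketch of this suggestion, assuming a hypothetical last_checkpoint column on system.local and a generic callback standing in for the node's internal query path (the real column name, DDL, and write path would be decided on a ticket):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.function.Consumer;

    public class LastCheckpointUpdater
    {
        // Hypothetical column; the actual name and schema change are assumptions.
        private static final String UPDATE_CQL =
            "UPDATE system.local SET last_checkpoint = toTimestamp(now()) WHERE key = 'local'";

        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        private final Consumer<String> cqlExecutor; // stand-in for the internal query execution path

        public LastCheckpointUpdater(Consumer<String> cqlExecutor)
        {
            this.cqlExecutor = cqlExecutor;
        }

        public void start()
        {
            // Periodic update (every minute, as suggested) gives a lower bound on liveness.
            scheduler.scheduleAtFixedRate(() -> cqlExecutor.accept(UPDATE_CQL), 0, 1, TimeUnit.MINUTES);
        }

        public void onDrain()
        {
            // One last update on drain so a clean shutdown records an exact timestamp.
            cqlExecutor.accept(UPDATE_CQL);
            scheduler.shutdownNow();
        }
    }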

Re: The most reliable way to determine the last time node was up

2021-11-03 Thread Stefan Miklosovic
The third option would be to have some thread running in the background "touching" some (empty) marker file. It is the simplest solution, but I do not like the idea of this marker file; it feels dirty. But hey, since it would be an opt-in feature for people who know what they want, why not, right ...
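
A minimal sketch of the marker-file variant, assuming an empty file whose modification time is bumped periodically; the path and interval are placeholders, and the startup-side check would simply read the file's mtime:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.attribute.FileTime;
    import java.time.Instant;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class MarkerFileToucher
    {
        private final Path marker;
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        public MarkerFileToucher(Path marker)
        {
            this.marker = marker;
        }

        public void start()
        {
            scheduler.scheduleAtFixedRate(this::touch, 0, 1, TimeUnit.HOURS);
        }

        private void touch()
        {
            try
            {
                if (!Files.exists(marker))
                    Files.createFile(marker);
                // Only the mtime matters; the file stays empty.
                Files.setLastModifiedTime(marker, FileTime.from(Instant.now()));
            }
            catch (IOException e)
            {
                // Log and carry on; a missed touch only widens the uncertainty window.
            }
        }
    }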

The most reliable way to determine the last time node was up

2021-11-03 Thread Stefan Miklosovic
Hi, We see a lot of cases out there where a node was down for longer than the GC grace period, and once that node is up there are a lot of zombie data issues ... you know the story. We would like to implement some kind of check which would detect this so that the node would not start in the first place so
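
A minimal sketch of the kind of startup check being asked for, independent of where the last-known-up timestamp comes from (system.local, a marker file, etc.); gcGraceSeconds here stands in for the smallest gc_grace_seconds across tables, and whether to refuse startup or merely warn is the open question:

    import java.time.Duration;
    import java.time.Instant;

    public final class StartupDowntimeCheck
    {
        /**
         * Returns true if the node appears to have been down for longer than the
         * GC grace period, i.e. long enough that deleted data may resurrect as
         * zombies once this node rejoins the cluster.
         */
        public static boolean downLongerThanGcGrace(Instant lastKnownUp, int gcGraceSeconds)
        {
            Duration downtime = Duration.between(lastKnownUp, Instant.now());
            return downtime.compareTo(Duration.ofSeconds(gcGraceSeconds)) > 0;
        }

        public static void main(String[] args)
        {
            // Hypothetical usage: lastKnownUp would come from system.local or a marker file.
            Instant lastKnownUp = Instant.now().minus(Duration.ofDays(12));
            int gcGraceSeconds = 864000; // 10 days, the default gc_grace_seconds
            if (downLongerThanGcGrace(lastKnownUp, gcGraceSeconds))
                System.err.println("Node was down longer than gc_grace_seconds; refuse to start (or require repair/replace).");
        }
    }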

Re: [DISCUSS] Releasable trunk and quality

2021-11-03 Thread Brandon Williams
On Wed, Nov 3, 2021 at 1:26 PM David Capwell wrote: > > I think we're going to need a system that > > understands the difference between success, failure, and timeouts > > > I am curious how this system can know that the timeout is not an actual > failure. There was a bug in 4.0 with time seri

Re: [DISCUSS] Releasable trunk and quality

2021-11-03 Thread David Capwell
> It’s hard to gate commit on a clean CI run when there’s flaky tests I agree, this is also why so much effort went into the 4.0 release to remove as many as possible. Just over 1 month ago we were not really having a flaky test issue (outside of the sporadic timeout issues; my CircleCI runs wer

Re: [DISCUSS] Releasable trunk and quality

2021-11-03 Thread Brandon Williams
On Wed, Nov 3, 2021 at 12:35 PM bened...@apache.org wrote: > > The largest number of test failures turns out (as pointed out by David) to be > due to how arcane it was to trigger the full test suite. Hopefully we can get > on top of that, but I think a significant remaining issue is a lack of tru

Re: [DISCUSS] Releasable trunk and quality

2021-11-03 Thread bened...@apache.org
The largest number of test failures turns out (as pointed out by David) to be due to how arcane it was to trigger the full test suite. Hopefully we can get on top of that, but I think a significant remaining issue is a lack of trust in the output of CI. It’s hard to gate commit on a clean CI run

Re: [DISCUSS] Releasable trunk and quality

2021-11-03 Thread Brandon Williams
On Mon, Nov 1, 2021 at 5:03 PM David Capwell wrote: > > > How do we define what "releasable trunk" means? > > One thing I would love is for us to adopt a “run all tests needed to release > before commit” mentality, and to link a successful run in JIRA when closing > (we talked about this once in

Re: [DISCUSS] Releasable trunk and quality

2021-11-03 Thread Joshua McKenzie
> > It'd be great to > expand this, but it's been somewhat difficult to do, since last time a > bootstrap test was attempted, it immediately uncovered enough issues to > keep us busy fixing them for quite some time. Maybe it's about time to try > that again. I'm going to go with a "yes please"

Re: [DISCUSS] CEP-18: Improving Modularity

2021-11-03 Thread Jeremiah D Jordan
It seems like there are many people in this thread who would rather we not make a "grouping" CEP for these ideas, but rather consider each one individually. I will close out this CEP thread then, and discussions can continue on individual tickets. I think we got some nice discussion on this t

Re: [DISCUSS] Releasable trunk and quality

2021-11-03 Thread Oleksandr Petrov
I'll merge 16262 and the Harry blog post that accompanies it shortly. Having 16262 merged will significantly reduce the amount of resistance one has to overcome in order to write a fuzz test. But this, of course, only covers short/small/unit-test-like tests. For longer-running tests, I guess for n