Apologies, I missed Paulo's reply due to my email client's threading funnies...
On 4/11/21 7:50, Berenguer Blasi wrote:
> What about an hourly heartbeat 'lastSeenAlive' timestamp? my 2cts.
>
> On 3/11/21 21:53, Stefan Miklosovic wrote:
>> Hi,
>>
>> We see a lot of cases out there when a node was down for longer than
>> the GC period and once that node is up there are a lot of zombie data
>> issues ... you know the story.
What about an hourly heartbeat 'lastSeenAlive' timestamp? my 2cts.
On 3/11/21 21:53, Stefan Miklosovic wrote:
> Hi,
>
> We see a lot of cases out there when a node was down for longer than
> the GC period and once that node is up there are a lot of zombie data
> issues ... you know the story.
>
>
I agree with David. CI has been pretty reliable besides the random
Jenkins going down or timing out. The same 3 or 4 tests were the only flaky
ones in Jenkins, and Circle was very green. I bisected a couple of failures
to legit code errors, David is fixing some more, others have as well, etc.
It is good n
> I would expect that if nobody talks to a node and no operation is
> running, it does not produce any "side effects".
In order to track the last checkpoint timestamp you need to persist it
periodically to protect against losing state during an ungraceful shutdown
(e.g. kill -9).
However you're right
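To make the "persist it periodically" point concrete, here is a rough plain-Java
sketch of writing the checkpoint out on every periodic update rather than only on
clean shutdown. The path and class name are made up for illustration; force(true)
is there so the value lands on stable storage rather than sitting in a buffer.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public final class CheckpointWriter
    {
        // Hypothetical location; a real implementation would put this under the data directory.
        private static final Path CHECKPOINT = Path.of("/var/lib/cassandra/last_checkpoint");

        public static void persist(long epochMillis) throws IOException
        {
            byte[] payload = Long.toString(epochMillis).getBytes(StandardCharsets.UTF_8);
            try (FileChannel ch = FileChannel.open(CHECKPOINT,
                                                   StandardOpenOption.CREATE,
                                                   StandardOpenOption.WRITE,
                                                   StandardOpenOption.TRUNCATE_EXISTING))
            {
                ch.write(ByteBuffer.wrap(payload));
                // Flush to stable storage so an abrupt stop (kill -9, power loss) still
                // leaves the most recent periodic update behind.
                ch.force(true);
            }
        }
    }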
Yes, this is the combination of the system.local and "marker file"
approaches, basically updating that field periodically.
However, when there is a mutation done against the system table (in
this example), it goes to the commit log and is then propagated
to an sstable on disk, no? So in our hypothetic
How about a last_checkpoint (or better name) system.local column that is
updated periodically (e.g. every minute) + on drain? This would give a lower
time bound on when the node was last live without requiring an external
marker file.
On Wed, 3 Nov 2021 at 18:03 Stefan Miklosovic <
stefan.mikloso..
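As a sketch only (none of this is in the thread), the scheduling side of that
last_checkpoint proposal could look roughly like the following. The actual write
to the system.local column is passed in as a Runnable, since the internal API for
that write is not settled here; class and method names are illustrative.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public final class LastCheckpointUpdater
    {
        private final ScheduledExecutorService executor =
                Executors.newSingleThreadScheduledExecutor();
        private final Runnable persistCheckpoint; // e.g. the write of the last_checkpoint column

        public LastCheckpointUpdater(Runnable persistCheckpoint)
        {
            this.persistCheckpoint = persistCheckpoint;
        }

        public void start()
        {
            // Every minute, as suggested; the stored value is then a lower bound on liveness.
            executor.scheduleAtFixedRate(persistCheckpoint, 0, 1, TimeUnit.MINUTES);
        }

        public void onDrain()
        {
            persistCheckpoint.run(); // one final update during drain
            executor.shutdown();
        }
    }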
The third option would be to have some thread running in the
background "touching" some (empty) marker file. It is the simplest
solution, but I do not like the idea of this marker file; it feels
dirty. But hey, since it would be an opt-in feature for people knowing
what they want, why not, right ...
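For what it's worth, the marker-file variant is small. A plain-Java sketch follows;
the path, the hourly interval (echoing the heartbeat suggestion earlier in the
thread) and the class name are all illustrative, and the real feature would
presumably be opt-in and configurable.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.attribute.FileTime;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public final class AliveMarker
    {
        // Hypothetical path; would be configurable in practice.
        private static final Path MARKER = Path.of("/var/lib/cassandra/alive.marker");

        public static void startTouching()
        {
            ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
            executor.scheduleAtFixedRate(AliveMarker::touch, 0, 1, TimeUnit.HOURS);
        }

        private static void touch()
        {
            try
            {
                if (!Files.exists(MARKER))
                    Files.createFile(MARKER);
                // The file stays empty; its mtime is the "last seen alive" signal.
                Files.setLastModifiedTime(MARKER, FileTime.fromMillis(System.currentTimeMillis()));
            }
            catch (IOException e)
            {
                // Best effort: a missed touch only makes the recorded liveness slightly stale.
            }
        }
    }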
Hi,
We see a lot of cases out there when a node was down for longer than
the GC period and once that node is up there are a lot of zombie data
issues ... you know the story.
We would like to implement some kind of check which would detect
this, so that the node would not start in the first place so
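Whatever mechanism ends up recording the last-alive time, the startup check itself
could be as simple as the sketch below. The names are illustrative, in practice the
grace period would be the minimum gc_grace_seconds across tables, and an explicit
override/escape hatch is only hinted at in the message.

    import java.time.Duration;
    import java.time.Instant;

    public final class ZombieDataGuard
    {
        /**
         * Refuse to start if the node has been down longer than the GC grace period,
         * since tombstones it missed may already have been purged elsewhere and
         * starting it back up could resurrect deleted data.
         */
        public static void checkSafeToStart(Instant lastSeenAlive, Duration gcGrace)
        {
            Duration downtime = Duration.between(lastSeenAlive, Instant.now());
            if (downtime.compareTo(gcGrace) > 0)
            {
                throw new IllegalStateException(
                    "Node was down for " + downtime + ", longer than gc_grace_seconds (" +
                    gcGrace + "); refusing to start. Wipe and replace/re-bootstrap the node, " +
                    "or override explicitly.");
            }
        }
    }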
On Wed, Nov 3, 2021 at 1:26 PM David Capwell wrote:
> > I think we're going to need a system that
> > understands the difference between success, failure, and timeouts
>
>
> I am curious how this system can know that the timeout is not an actual
> failure. There was a bug in 4.0 with time seri
> It’s hard to gate commit on a clean CI run when there’s flaky tests
I agree, this is also why so much effort was put into the 4.0 release to remove as
much as possible. Just over 1 month ago we were not really having a flaky test
issue (outside of the sporadic timeout issues; my circle ci runs wer
On Wed, Nov 3, 2021 at 12:35 PM bened...@apache.org wrote:
>
> The largest number of test failures turn out (as pointed out by David) to be
> due to how arcane it was to trigger the full test suite. Hopefully we can get
> on top of that, but I think a significant remaining issue is a lack of trust in
> the output of CI.
The largest number of test failures turn out (as pointed out by David) to be
due to how arcane it was to trigger the full test suite. Hopefully we can get
on top of that, but I think a significant remaining issue is a lack of trust in
the output of CI. It’s hard to gate commit on a clean CI run when there’s flaky tests
On Mon, Nov 1, 2021 at 5:03 PM David Capwell wrote:
>
> > How do we define what "releasable trunk" means?
>
> One thing I would love is for us to adopt a “run all tests needed to release
> before commit” mentality, and to link a successful run in JIRA when closing
> (we talked about this once in
>
> It'd be great to
> expand this, but it's been somewhat difficult to do, since the last time a
> bootstrap test was attempted, it immediately uncovered enough issues to
> keep us busy fixing them for quite some time. Maybe it's about time to try
> that again.
I'm going to go with a "yes please"
It seems like there are many people in this thread that would rather we not
make a “grouping” CEP for these ideas, but rather consider each one
individually. I will close out this CEP thread then, and discussions can
continue on individual tickets.
I think we got some nice discussion on this t
I'll merge 16262 and the Harry blog-post that accompanies it shortly.
Having 16262 merged will significantly reduce the amount of resistance one
has to overcome in order to write a fuzz test. But this, of course, only
covers short/small/unit-test-like tests.
For longer running tests, I guess for n