This one has been failing for a month: https://ge.apache.org/scans/tests?search.relativeStartTime=P28D&search.rootProjectNames=solr-root&search.timeZoneId=America%2FNew_York&tests.container=org.apache.solr.handler.admin.api.ClusterPropsAPITest&tests.test=testClusterPropertyOpsAllGood (and Pierre fixed it yesterday)

We have had a new mechanism for some months now: flaky test detection, thanks to a Gradle plugin and to Sanjay for enabling it. Gradle Enterprise makes the distinction. See this report for branch_9x: https://ge.apache.org/scans/tests?search.relativeStartTime=P90D&search.rootProjectNames=solr-root&search.tags=branch_9x&search.timeZoneId=America%2FNew_York and look for the red failure bars, which indicate a test failure that is repeatable for a given randomization seed. Every repeatable failure is fairly actionable. I'd love to get an email to our project for every such failure somehow (assuming a build fails with no more than a handful of such issues). Then there would be a potential email thread devoted to exactly that problem, versus yet another failing-build email covering many tests. Of course we'll need to tend to flaky tests too, but at least the repeatable (non-flaky) failures are lower-hanging fruit.

On Mon, Oct 21, 2024 at 1:35 PM Chris Hostetter <hossman_luc...@fucit.org> wrote:

> On Thu, 3 Oct 2024, Gus Heck wrote:
>
> : The failures I saw when I downloaded a couple logs centered on threads not
> : terminated. Perhaps Uwe's box is so overloaded that the shutdown process
> : for those tests takes too long and the test fails instead?
>
> They are specifically coming from Uwe's box when using the openJ9 JVM.
>
> Cross posting from another thread last week...
>
>
> Date: Mon, 14 Oct 2024 12:54:10 -0700 (MST)
> From: Chris Hostetter <hossman_luc...@fucit.org>
> To: dev@solr.apache.org
> Subject: Re: [JENKINS] Solr-main-Linux (64bit/openj9/jdk-17.0.8) - Build # 20654 - Still Unstable!
> Message-ID: <alpine.DEB.2.21.2410141247570.20696@slate>
>
> Uwe:
>
> We've been seeing an epic number of these types of TimerThread "leak"
> failures from your jenkins box in the past few week -- all that i've seen
> have a thread name "file lock watchdog" and seem to be run on "openj9"
>
> Are these possibly related to an openJ9 upgrade you made to your boxes ?
> ... maybe back in september?
>
> Some random googling suggests the sysprop below might be useful to disable
> this watchdog -- can you try setting this in your jenkins gradle
> command options?
>
> -Dcom.ibm.tools.attach.useFileLockWatchdog=false
>
> https://github.com/eclipse-openj9/openj9/commit/f40f665db811f7686dd61d32c1e7c140ab35d78a
>
>
> :
> : On Thu, Oct 3, 2024 at 3:47 PM David Smiley <dsmi...@apache.org> wrote:
> :
> : > Relying on people to go look at CI out of the goodness of our hearts is a
> : > losing strategy. Our contributors don't even know where that is! There
> : > needs to be a trigger to do so ideally something personalized -- a build
> : > failure with recent changes that *you* included. Or instead a post/comment
> : > on linked JIRA or PR -- gets contributor involvement even if the Git
> : > metadata lacks a real email address.
> : >
> : > On Thu, Oct 3, 2024 at 2:46 PM Houston Putman <hous...@apache.org> wrote:
> : >
> : > > The failures generally seem to be coming from Uwe's boxes, and I cannot
> : > > reproduce them locally. The crossDc ones do seem to be failing a lot, but
> : > > when they fail, it looks like they aren't failing alone. I will continue to
> : > > do research on it though.
> : > >
> : > > Our tests are extremely flakey right now, so it's definitely something we
> : > > need to clean up quickly. Thanks for pointing it out.
> : > >
> : > > - Houston
> : > >
> : > > On Thu, Oct 3, 2024 at 12:22 PM Gus Heck <gus.h...@gmail.com> wrote:
> : > >
> : > > > I went to the fucit jenkins reports site to check on the state of the build
> : > > > after my recent commit to make sure all was well, but when I got there I
> : > > > was greeted with several weeks of extremely frequent test failures and in
> : > > > the last 2 weeks we seem to have gained several 100% failures (that clearly
> : > > > preceded my commit).
> : > > >
> : > > > http://fucit.org/solr-jenkins-reports/failure-report.html
> : > > >
> : > > > This appears to be on fire.
> : > > >
> : > > > Clear culprits include the addition of the crossdc module and some problems
> : > > > with lucene back compatibility indexes. There also seems to be a big uptick
> : > > > in recovery related failures.
> : > > >
> : > > > It would be nice if one could filter fucit somehow to see only lucene or
> : > > > only solr, though I imagine that's not a minor undertaking
> : > > >
> : > > > -Gus
> : > > >
> : > > > --
> : > > > http://www.needhamsoftware.com (work)
> : > > > https://a.co/d/b2sZLD9 (my fantasy fiction book)
> : > > >
> : > >
> : >
> :
> :
> : --
> : http://www.needhamsoftware.com (work)
> : https://a.co/d/b2sZLD9 (my fantasy fiction book)
> :
>
> -Hoss
> http://www.lucidworks.com/