Thank you, Mark, for your thoughts. The Docker idea was very interesting and it greatly simplified the 'beast' testing, but unfortunately I was still not able to reproduce the failure.
I do however have an idea based on code review, outlined on
https://issues.apache.org/jira/browse/SOLR-16848
Would appreciate your thoughts on the analysis.

best,
alex

On Thu, Jun 15, 2023 at 4:55 PM Mark Miller <markrmil...@gmail.com> wrote:

> Oh one more good for duplicating that type of fail - run it in docker, or a
> VM, or maybe Multipass, and give it anemic resources (though enough that
> the test doesn't OOM or something)
>
> On Thu, Jun 15, 2023 at 5:34 PM Mark Miller <markrmil...@gmail.com> wrote:
>
> > Why don't you see how it can return null?
> >
> > I'm looking at an older checkout, but I see JettySolrRunner checking for
> > null core containers all over, and I see it passing back null explicitly in
> > at least one case.
> >
> > When I peek at where that core container might be coming from, I see a
> > provider and a field that looks like it's home (which I note doesn't look
> > protected by any memory barrier? e.g., volatile, lock, sync). And I see
> > that it could start as null. Get set to null on close as well?
> >
> > So I wonder about that lack of a memory barrier, but there are probably
> > plenty of cases where some random jobs/threads are still running past that
> > close as well, is another thought I have. And I bet one of them comes in
> > and looks for that core container late, and he's already clocked out.
> >
> > Older checkout, so I don't know what you are looking at, but if it hasn't
> > changed drastically recently, it seems easy to return a null.
> >
> > If you want to duplicate a situation that might hit - try running the test
> > with 10-20 instances simultaneously looped.
> >
> > Or loop one, and hammer your system with some unrelated load for a while.
> >
> > On Thu, Jun 15, 2023 at 4:49 PM Alex Deparvu <stilla...@apache.org> wrote:
> >
> >> Hi,
> >>
> >> I wanted to take a look at the flaky DeleteReplicaTest test.
> >>
> >> Some background first:
> >> - Past 7 days trend:
> >> Class: org.apache.solr.cloud.DeleteReplicaTest
> >> Method: raceConditionOnDeleteAndRegisterReplica
> >> Failures: 15.56% (63 / 405)
> >>
> >> - Test failure is caused by a NullPointerException:
> >> ERROR (coreZkRegister-772-thread-1-processing-127.0.0.1:40471_solr)
> >> [n:127.0.0.1:40471_solr c:raceDeleteReplicaCollection s:shard1 r:core_node4
> >> x:raceDeleteReplicaCollection_shard1_replica_n2] o.a.s.c.DeleteReplicaTest
> >> Failed to delete replica
> >> => java.lang.NullPointerException: Cannot invoke
> >> "org.apache.solr.core.CoreContainer.getZkController()" because the return
> >> value of "org.apache.solr.embedded.JettySolrRunner.getCoreContainer()" is
> >> null
> >>
> >> I am having some trouble reproducing on my local and I don't see how the
> >> getCoreContainer() method might return null. Could this be a timing issue
> >> somehow?
> >> If anyone has an idea on how to approach this, I would be happy to hear it.
> >>
> >> best,
> >> alex
> >>
> >
> >
> > --
> > - MRM
> >
>
>
> --
> - MRM
>
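
For anyone following the thread, the "late caller" scenario Mark describes above (a field that is set to null on close, and a background thread that looks it up afterwards) can be sketched roughly as below. This is only an illustration under assumptions: Runner, Container, getContainer(), stop() and lateLookup() are hypothetical stand-ins invented for the example, not the actual JettySolrRunner or CoreContainer code. A latch forces the ordering that the real test presumably only hits occasionally.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

public class LateCallerSketch {
    // Hypothetical stand-in for a core container; not the real Solr class.
    static class Container {
        String zkController() { return "zk"; }
    }

    // Hypothetical stand-in for a runner that drops its container on stop.
    static class Runner {
        // volatile makes the write visible to other threads, but it cannot
        // prevent a caller from arriving after stop() has cleared the field.
        private volatile Container container = new Container();

        Container getContainer() { return container; } // null after stop()
        void stop() { container = null; }
    }

    // Starts a background task that is only released after stop() has run,
    // then reports what that task saw.
    static String lateLookup(Runner runner) {
        CountDownLatch stopped = new CountDownLatch(1);
        AtomicReference<String> result = new AtomicReference<>("not run");
        Thread late = new Thread(() -> {
            try {
                stopped.await();
                Container c = runner.getContainer();
                // Null check instead of c.zkController(), which would throw an NPE here.
                result.set(c == null ? "container already closed" : c.zkController());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                result.set("interrupted");
            }
        });
        late.start();
        runner.stop();          // the runner shuts down first
        stopped.countDown();    // only now is the late task allowed to look
        try {
            late.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return result.get();
    }

    public static void main(String[] args) {
        System.out.println(lateLookup(new Runner()));
    }
}
```

The point of the sketch is that a memory barrier alone would not remove the failure mode: even with a volatile field, a thread that outlives the close still sees null, so the caller needs a null check or the shutdown needs to wait for such threads.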