Recent PRS test flakiness

Jason Gerlowski Sun, 02 Oct 2022 13:46:44 -0700

Hey all,

I noticed this week (after running into a handful of test failures locally)
that 3 of our 5 flakiest tests (according to fucit) are trying to test
Solr's "per-replica state" code.  The tests in question are:
PerReplicaStatesIntegrationTest.testRestart,
PerReplicaStatesIntegrationTest.testPerReplicaStateCollection, and
CloudSolrClientTest.testPerReplicaStateCollection.


All three of these saw a big jump in flakiness between Sept 12 and Sept
19.  I spent a bit of time debugging, but didn't get all too far.  In most
failures, the test times out waiting for a particular number of replicas to
be reported in ZooKeeper.  I suspect there's a race condition in how we're
updating our ZK state, but that's as far as I was able to get for now.

So, a few questions:

1. Does anyone know what the root cause of these failures might be, or at
least what might've caused their flakiness to spike in mid-Sept?
2. Should we temporarily @Ignore or @BadApple them to help the builds out a
bit until someone with context can attend to them?

Best,

Jason

Recent PRS test flakiness

Reply via email to