Hey all, I noticed this week (after running into a handful of test failures locally) that 3 of our 5 flakiest tests (according to fucit) are trying to test Solr's "per-replica state" code. The tests in question are: PerReplicaStatesIntegrationTest.testRestart, PerReplicaStatesIntegrationTest.testPerReplicaStateCollection, and CloudSolrClientTest.testPerReplicaStateCollection.
All three of these saw a big jump in flakiness between Sept 12 and Sept 19. I spent a bit of time debugging, but didn't get all too far. In most failures, the test times out waiting for a particular number of replicas to be reported in ZooKeeper. I suspect there's a race condition in how we're updating our ZK state, but that's as far as I was able to get for now. So, a few questions: 1. Does anyone know what the root cause of these failures might be, or at least what might've caused their flakiness to spike in mid-Sept? 2. Should we temporarily @Ignore or @BadApple them to help the builds out a bit until someone with context can attend to them? Best, Jason