[ https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris M. Hostetter updated SOLR-13486:
--------------------------------------
    Attachment: SOLR-13486__test.patch
        Status: Open  (was: Open)

Based on the renewed interest in this issue recently, I set aside some time last night to try to do 2 things:
# Fork TestCloudConsistency into a new, simplified test focused on triggering this "replica peer syncs from leader before leader does tlog replay" more reliably
** start with a simplified 1 shard, 2 replica collection
** add more docs during the partition
** use TestInjection to excessively delay the tlog replay
# Find a workaround for this problem in TestCloudConsistency so it passes more reliably

While working on #1, I realized there was a question I hadn't considered during my initial analysis of the jenkins logs showing this failure: {color:#de350b}*Why does the leader _need_ to do tlog replay in the test at all?*{color}

Even if the client doesn't explicitly commit all docs, the "Commit on Close" semantics of Solr's IndexWriter should ensure that a clean shutdown of the leader means all uncommitted docs in the tlog will be automatically committed before the Directory is closed – nothing in the test "kills" the leader before this should happen. So WTF?

I still haven't gotten to the bottom of that, but I did confirm that:
* unlike the "normal" adds for docs 1-3, the code path in TestCloudConsistency that was adding doc #4 (during the network partition) was *NOT* committing doc #4.
* in the test logs where TestCloudConsistency failed, we never see the normal "Committing on IndexWriter close." message I would expect from an orderly shutdown of the leader
** this message does appear in the expected location of the logs for a TestCloudConsistency run that passes

At first I thought the problem was some other test class running earlier in the same jenkins JVM mucking with the value of the (public static) {{DirectUpdateHandler2.commitOnClose}} prior to the test running – but even when running a single test class locally, with {{DirectUpdateHandler2.commitOnClose = true;}}, I was still able to reproduce the problem in my new test.
----
The attached patch includes:
* my new TestTlogReplayVsRecovery test, marked as {{AwaitsFix}} on this jira.
* a trivial fix/workaround for this bug in TestCloudConsistency
** it now ensures that doc #4 is committed before shutting down the leader – eliminating the risk of tlog replay affecting the leader when it restarts (a rough sketch of that change is included below)

I still want to dig into WTF is happening that prevents "commit on close" from working reliably by default, but that seems like a distinct bug/issue. In the meantime this patch should help ensure we don't get hard-to-reproduce errors in TestCloudConsistency as a result of this bug, while also providing a new test that more reliably demonstrates the bug and can be used to validate whatever fixes people suggest in the future.

[~dweiss] / [~erickerickson] does this patch resolve most of the failures you are seeing from TestCloudConsistency? If you are still seeing the "{{Doc with id=4 not found}}" failures after applying this patch, can you please post your logs? (If you are seeing other failures, can you please file new jiras and attach the logs there?)
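For reference, here is roughly what the TestCloudConsistency workaround boils down to. This is a minimal sketch using stock SolrJ calls; the helper name and variable names are mine, not copied from the attached patch:

{code:java}
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;

// Illustrative only; helper/variable names are not from the attached patch.
// Index doc #4 (the doc added while the other replica is partitioned) and then
// hard commit it, so that when the leader is stopped there is nothing left in
// its tlog to replay (and race against the other replica's recovery) on restart.
static void addAndCommitDoc4(SolrClient client, String collection) throws Exception {
  SolrInputDocument doc4 = new SolrInputDocument();
  doc4.addField("id", "4");
  client.add(collection, doc4);
  client.commit(collection); // the actual workaround: previously doc #4 only ever lived in the tlog
}
{code}

The new TestTlogReplayVsRecovery test deliberately goes the other way: it leaves uncommitted docs in the leader's tlog and uses TestInjection to stretch out the replay on restart, so the other replica's recovery/PeerSync is far more likely to land while replay is still pending.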
If there are no objections I'll check precommit (I'm sure I have some unused imports) and commit soon, then update this jira summary/description to refer to the new test.

> race condition between leader's "replay on startup" and non-leader's "recover
> from leader" can leave replicas out of sync (TestCloudConsistency)
> ------------------------------------------------------------------------------
>
>                 Key: SOLR-13486
>                 URL: https://issues.apache.org/jira/browse/SOLR-13486
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Chris M. Hostetter
>            Priority: Major
>         Attachments: SOLR-13486__test.patch,
>           apache_Lucene-Solr-BadApples-NightlyTests-master_61.log.txt.gz,
>           apache_Lucene-Solr-BadApples-Tests-8.x_102.log.txt.gz,
>           org.apache.solr.cloud.TestCloudConsistency.zip
>
> I've been investigating some jenkins failures from TestCloudConsistency, which at first glance suggest a problem w/replica(s) recovering after a network partition from the leader - but in digging into the logs the root cause actually seems to be a thread race condition when a replica (the leader) is first registered...
> * The {{ZkContainer.registerInZk(...)}} method (which is called by {{CoreContainer.registerCore(...)}} & {{CoreContainer.load()}}) is typically run in a background thread (via the {{ZkContainer.coreZkRegister}} ExecutorService)
> * {{ZkContainer.registerInZk(...)}} delegates to {{ZKController.register(...)}}, which is ultimately responsible for checking if there are any "old" tlogs on disk, and if so handling the "Replaying tlog for <URL> during startup" logic
> * Because this happens in a background thread, other logic/requests can be handled by this core/replica in the meantime - before it starts (or while in the middle of) replaying the tlogs
> ** Notably: *leaders that have not yet replayed tlogs on startup will erroneously respond to RTG / Fingerprint / PeerSync requests from other replicas w/incomplete data*
> ...In general, it seems scary / fishy to me that a replica can (apparently) become *ACTIVE* before it has finished its {{registerInZk}} + "Replaying tlog ... during startup" logic ... particularly since this can happen even for replicas that are/become leaders. It seems like this could potentially cause a whole host of problems, only one of which manifests in this particular test failure:
> * *BEFORE* replicaX's "coreZkRegister" thread reaches the "Replaying tlog ... during startup" check:
> ** replicaX can recognize (via zk terms) that it should be the leader(X)
> ** this leaderX can then instruct some other replicaY to recover from it
> ** replicaY can send RTG / PeerSync / FetchIndex requests to the leaderX (either of its own volition, or because it was instructed to by leaderX) in an attempt to recover
> *** the responses to these recovery requests will not include updates in the tlog files that existed on leaderX prior to startup that have not yet been replayed
> * *AFTER* replicaY has finished its recovery, leaderX's "Replaying tlog ... during startup" can finish
> ** replicaY now thinks it is in sync with leaderX, but leaderX has (replayed) updates the other replicas know nothing about