[ https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013614#comment-17013614 ]
Chris M. Hostetter commented on SOLR-13486: ------------------------------------------- New linked jiras: * SOLR-14183: replicas do not immediately/synchronously reflect state=RECOVERYING when recieving REQUESTRECOVERY commands * SOLR-14184: replace DirectUpdateHandler2.commitOnClose with something in TestInjection > race condition between leader's "replay on startup" and non-leader's "recover > from leader" can leave replicas out of sync (TestTlogReplayVsRecovery) > ---------------------------------------------------------------------------------------------------------------------------------------------------- > > Key: SOLR-13486 > URL: https://issues.apache.org/jira/browse/SOLR-13486 > Project: Solr > Issue Type: Bug > Reporter: Chris M. Hostetter > Priority: Major > Attachments: SOLR-13486__test.patch, > apache_Lucene-Solr-BadApples-NightlyTests-master_61.log.txt.gz, > apache_Lucene-Solr-BadApples-Tests-8.x_102.log.txt.gz, > org.apache.solr.cloud.TestCloudConsistency.zip > > > There is a bug in solr cloud that can result in replicas being out of sync > with the leader if: > * The leader has uncommitted docs (in the tlog) that didn't make it to the > replica > * The leader restarts > * The replica begins to peer sync from the leader before the leader finishes > it's own tlog replay on startup > A "rolling restart" situation is when this is most likeley to affect real > world users > This was first discovered via hard to reproduce TestCloudConsistency failures > in jenkins, but that test has since been modified to work around this bug, > and a new test "TestTlogReplayVsRecovery" has been added that more > aggressively demonstrates this error. > Original jira description below... > ---- > I've been investigating some jenkins failures from TestCloudConsistency, > which at first glance suggest a problem w/replica(s) recovering after a > network partition from the leader - but in digging into the logs the root > cause acturally seems to be a thread race conditions when a replica (the > leader) is first registered... > * The {{ZkContainer.registerInZk(...)}} method (which is called by > {{CoreContainer.registerCore(...)}} & {{CoreContainer.load()}}) is typically > run in a background thread (via the {{ZkContainer.coreZkRegister}} > ExecutorService) > * {{ZkContainer.registerInZk(...)}} delegates to > {{ZKController.register(...)}} which is ultimately responsible for checking > if there are any "old" tlogs on disk, and if so handling the "Replaying tlog > for <URL> during startup" logic > * Because this happens in a background thread, other logic/requests can be > handled by this core/replica in the meantime - before it starts (or while in > the middle of) replaying the tlogs > ** Notably: *leader's that have not yet replayed tlogs on startup will > erroneously respond to RTG / Fingerprint / PeerSync requests from other > replicas w/incomplete data* > ...In general, it seems scary / fishy to me that a replica can (aparently) > become *ACTIVE* before it's finished it's {{registerInZk}} + "Replaying tlog > ... during startup" logic ... particularly since this can happen even for > replicas that are/become leaders. It seems like this could potentially cause > a whole host of problems, only one of which manifests in this particular test > failure: > * *BEFORE* replicaX's "coreZkRegister" thread reaches the "Replaying tlog > ... during startup" check: > ** replicaX can recognize (via zk terms) that it should be the leader(X) > ** this leaderX can then instruct some other replicaY to recover from it > ** replicaY can send RTG / PeerSync / FetchIndex requests to the leaderX > (either on it's own volition, or because it was instructed to by leaderX) in > an attempt to recover > *** the responses to these recovery requests will not include updates in the > tlog files that existed on leaderX prior to startup that hvae not yet been > replayed > * *AFTER* replicaY has finished it's recovery, leaderX's "Replaying tlog ... > during startup" can finish > ** replicaY now thinks it is in sync with leaderX, but leaderX has > (replayed) updates the other replicas know nothing about -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org