[ https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris M. Hostetter updated SOLR-13486:
--------------------------------------
    Description: 
There is a bug in SolrCloud that can result in replicas being out of sync with the leader if:
* The leader has uncommitted docs (in the tlog) that didn't make it to the replica
* The leader restarts
* The replica begins to peer sync from the leader before the leader finishes its own tlog replay on startup

A "rolling restart" is the situation in which this is most likely to affect real-world users.

This was first discovered via hard-to-reproduce TestCloudConsistency failures in jenkins, but that test has since been modified to work around this bug, and a new test "TestTlogReplayVsRecovery" has been added that more aggressively demonstrates this error.

Original jira description below...
----
I've been investigating some jenkins failures from TestCloudConsistency, which at first glance suggest a problem w/ replica(s) recovering after a network partition from the leader - but in digging into the logs the root cause actually seems to be a thread race condition when a replica (the leader) is first registered...
* The {{ZkContainer.registerInZk(...)}} method (which is called by {{CoreContainer.registerCore(...)}} & {{CoreContainer.load()}}) is typically run in a background thread (via the {{ZkContainer.coreZkRegister}} ExecutorService)
* {{ZkContainer.registerInZk(...)}} delegates to {{ZKController.register(...)}}, which is ultimately responsible for checking whether there are any "old" tlogs on disk, and if so handling the "Replaying tlog for <URL> during startup" logic
* Because this happens in a background thread, other logic/requests can be handled by this core/replica in the meantime - before it starts (or while in the middle of) replaying the tlogs (see the sketch after this list)
** Notably: *leaders that have not yet replayed tlogs on startup will erroneously respond to RTG / Fingerprint / PeerSync requests from other replicas w/ incomplete data*
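A minimal, self-contained sketch of that timing problem is below. This is not Solr code: the class name {{RegisterInZkRace}}, the {{index}} map, and the simulated delay are all invented for illustration; the sketch only assumes the behaviour described above, i.e. that registration and tlog replay happen on a background executor while the core is already answering lookups.

{code:java}
// Hypothetical sketch (NOT actual Solr code): a "core" whose registration and
// tlog replay run on a background executor can answer lookups with incomplete
// data until the replay task completes. All names are invented for illustration.
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RegisterInZkRace {

    // Stand-in for the core's visible index: doc id -> value
    private static final ConcurrentHashMap<String, String> index = new ConcurrentHashMap<>();

    public static void main(String[] args) throws Exception {
        // Updates that were only in the tlog when the node was stopped
        List<String> tlog = List.of("doc1", "doc2");

        CountDownLatch replayStarted = new CountDownLatch(1);
        ExecutorService coreZkRegister = Executors.newSingleThreadExecutor();

        // Analogous to the background registration work: it runs asynchronously and
        // only *eventually* gets around to "Replaying tlog ... during startup"
        coreZkRegister.submit(() -> {
            replayStarted.countDown();
            try {
                Thread.sleep(500); // simulate the delay before/while replaying
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            tlog.forEach(id -> index.put(id, "replayed"));
        });

        replayStarted.await();
        // Meanwhile the core is already answering RTG / fingerprint style requests:
        System.out.println("doc1 visible before replay finished? " + index.containsKey("doc1")); // almost certainly false

        coreZkRegister.shutdown();
        coreZkRegister.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("doc1 visible after replay finished?  " + index.containsKey("doc1")); // true
    }
}
{code}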
...In general, it seems scary / fishy to me that a replica can (apparently) become *ACTIVE* before it has finished its {{registerInZk}} + "Replaying tlog ... during startup" logic ... particularly since this can happen even for replicas that are/become leaders. It seems like this could potentially cause a whole host of problems, only one of which manifests in this particular test failure (the interleaving is sketched after this list):
* *BEFORE* replicaX's "coreZkRegister" thread reaches the "Replaying tlog ... during startup" check:
** replicaX can recognize (via zk terms) that it should be the leader(X)
** this leaderX can then instruct some other replicaY to recover from it
** replicaY can send RTG / PeerSync / FetchIndex requests to the leaderX (either of its own volition, or because it was instructed to by leaderX) in an attempt to recover
*** the responses to these recovery requests will not include updates in the tlog files that existed on leaderX prior to startup that have not yet been replayed
* *AFTER* replicaY has finished its recovery, leaderX's "Replaying tlog ... during startup" can finish
** replicaY now thinks it is in sync with leaderX, but leaderX has (replayed) updates the other replicas know nothing about
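The resulting inconsistency can be reproduced in miniature with the following self-contained sketch (again not Solr code; {{leaderX}}, {{replicaY}}, the document sets, and the latch are stand-ins invented for illustration). It forces the ordering "replicaY recovers from leaderX before leaderX replays its old tlog" and shows replicaY ending up without the replayed update even though it believes it is in sync:

{code:java}
// Hypothetical sketch (NOT actual Solr code) of the BEFORE/AFTER interleaving above.
// A latch forces replicaY's "recovery" to complete before leaderX replays its tlog.
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

public class TlogReplayVsRecoveryRace {

    public static void main(String[] args) throws Exception {
        Set<String> leaderX  = ConcurrentHashMap.newKeySet();
        Set<String> replicaY = ConcurrentHashMap.newKeySet();

        leaderX.add("committedDoc");                       // survived the restart in the index
        Set<String> leaderTlog = Set.of("uncommittedDoc"); // only present in leaderX's old tlog

        CountDownLatch recoveryDone = new CountDownLatch(1);

        // replicaY "recovers" by copying whatever leaderX currently exposes (PeerSync-like)
        Thread recovery = new Thread(() -> {
            replicaY.addAll(leaderX);
            recoveryDone.countDown();
        });

        // leaderX replays its tlog only *after* replicaY's recovery has completed
        Thread tlogReplay = new Thread(() -> {
            try {
                recoveryDone.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            leaderX.addAll(leaderTlog);
        });

        recovery.start();
        tlogReplay.start();
        recovery.join();
        tlogReplay.join();

        // replicaY believes it is in sync, but is missing the replayed update:
        System.out.println("leaderX : " + leaderX);   // contains committedDoc and uncommittedDoc
        System.out.println("replicaY: " + replicaY);  // contains only committedDoc
    }
}
{code}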
    Summary: race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestTlogReplayVsRecovery)  (was: race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestCloudConsistency))

> race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestTlogReplayVsRecovery)
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13486
>                 URL: https://issues.apache.org/jira/browse/SOLR-13486
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Chris M. Hostetter
>            Priority: Major
>         Attachments: SOLR-13486__test.patch, apache_Lucene-Solr-BadApples-NightlyTests-master_61.log.txt.gz, apache_Lucene-Solr-BadApples-Tests-8.x_102.log.txt.gz, org.apache.solr.cloud.TestCloudConsistency.zip
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org