[ https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013597#comment-17013597 ]
Chris M. Hostetter commented on SOLR-13486:
-------------------------------------------

I've been revisiting this aspect of my earlier investigation into this bug...

{quote}{color:#de350b}*Why does the leader _need_ to do tlog replay in the test at all?*{color}

Even if the client doesn't explicitly commit all docs, the "Commit on Close" semantics of Solr's IndexWriter should ensure that a clean shutdown of the leader means all uncommitted docs in the tlog will be automatically committed before the Directory is closed – nothing in the test "kills" the leader before this should happen. So WTF?

I still haven't gotten to the bottom of that, but I did confirm that:
* unlike the "normal" adds for docs 1-3, the code path in TestCloudConsistency that was adding doc #4 (during the network partition) was *NOT* committing doc #4.
* in the test logs where TestCloudConsistency failed, we never see the normal "Committing on IndexWriter close." I would expect from an orderly shutdown of the leader
** This message does appear in the expected location of the logs for a TestCloudConsistency run that passes

At first I thought the problem was some other test class running earlier in the same jenkins JVM mucking with the value of the (public static) {{DirectUpdateHandler2.commitOnClose}} prior to the test running – but even when running a single test class locally, with {{DirectUpdateHandler2.commitOnClose = true;}} I was able to continue to reproduce the problem in my new test.
{quote}

I've been trying to get to the bottom of this by modifying {{TestTlogReplayVsRecovery}} to explicitly use {{DirectUpdateHandler2.commitOnClose = true;}} (as mentioned above), along with more detailed logging from org.apache.solr.update (particularly DUH2).

The first thing I realized is that there's a bug in the test where it expects to find {{uncommittedDocs + uncommittedDocs}} docs instead of {{committedDocs + uncommittedDocs}}, which is why it failed so easily/quickly for me before.

With that trivial test bug fixed, I have *NOT* been able to reproduce the situation that was observed in {{TestCloudConsistency}} when this jira was filed: that the leader shut down (evidently) w/o doing a commitOnClose, necessitating tlog replay on startup, which then happens after a replica did recovery. The only way I can seem to trigger this situation is when {{DirectUpdateHandler2.commitOnClose = false;}} (ie: simulating an unclean shutdown), suggesting that maybe my original guess about some other test in the same JVM borking this setting was correct ... but I still haven't been able to find a test that ran in the same JVM which might be broken in that way.

----

The only failure type I've been able to trigger is a new one AFAICT:
* (partitioned) leader successfully indexes some docs & commits on shutdown
* leader re-starts, and sends {{REQUESTRECOVERY}} to the replica
* leader marks itself as active
* test thread detects "all replicas are active" *before* the replica has a chance to actually go into recovery (a sketch of this kind of check is below)
* test thread checks the replica for docs that only the leader has, and fails

...ironically I've only been able to reproduce this using {{TestTlogReplayVsRecovery}} – I've never seen it in {{TestCloudConsistency}} even though it seems like that test establishes the same preconditions?

(Successful logs of {{TestCloudConsistency}} never show a {{REQUESTRECOVERY}} command sent to the replicas from the leader, like I see in (both success and failure) logs for {{TestTlogReplayVsRecovery}}, so I'm guessing it has to do with how many docs are out of sync and what type of recovery is done? ... not certain)
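To make the "all replicas are active" race above concrete, here is a simplified sketch of the kind of cluster-state predicate a test thread (or an end user's CollectionState watcher) ends up relying on. This is NOT the actual {{TestTlogReplayVsRecovery}} code – the class name and wiring are mine – it just uses the stock {{CollectionStatePredicate}} / {{DocCollection}} / {{Replica}} APIs:

{code:java}
// Simplified sketch, not the actual test code: the class name and wiring are
// mine.  The point is that this kind of "everything is active" check can pass
// in the window after the restarted leader marks itself ACTIVE but before the
// replica that was sent REQUESTRECOVERY has actually flipped to RECOVERING.
import java.util.Set;

import org.apache.solr.common.cloud.CollectionStatePredicate;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;

public class AllReplicasActive implements CollectionStatePredicate {
  @Override
  public boolean matches(Set<String> liveNodes, DocCollection collectionState) {
    if (collectionState == null) {
      return false; // collection not (yet) in the cluster state
    }
    for (Replica replica : collectionState.getReplicas()) {
      if (!liveNodes.contains(replica.getNodeName())
          || replica.getState() != Replica.State.ACTIVE) {
        return false; // some replica is down or not (yet) active
      }
    }
    // Everything *looks* healthy, even though a REQUESTRECOVERY may still be
    // sitting on a replica that hasn't gone into recovery yet.
    return true;
  }
}
{code}

Any waitForState-style wait built on a predicate like this returns as soon as a single cluster-state snapshot looks healthy, which is exactly why the test thread can race ahead of the replica ever entering recovery.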
My next steps are:
* Commit a fix for the {{uncommittedDocs + uncommittedDocs}} bug in {{TestTlogReplayVsRecovery}}
** This will also include some TODOs about making the test more robust with more randomized committed & uncommitted docs before/after the network partition
*** These TODOs aren't really worth pursuing until the underlying bug is fixed
* Open new jiras for:
** Replacing {{DirectUpdateHandler2.commitOnClose}} with something in {{TestInjection}} (per the comment there; a rough sketch of what I have in mind is below)
*** so we can be more confident tests aren't leaving it in a bad state
** Consider setting the replica to {{State.RECOVERING}} synchronously when processing the {{REQUESTRECOVERY}} command.
*** w/o this, even if we fix the bug tracked in this issue, it's still impossible for tests like {{TestTlogReplayVsRecovery}} – or end users – to set CollectionState watchers to know when a collection is healthy in situations like the one being tracked in this jira.

After that, I don't think there's anything else to do until someone smarter than me can chime in about fixing the underlying race condition of (leader) "tlog replay on startup" vs (replica) "recover from leader".
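For the {{TestInjection}} bullet above, the rough sketch I have in mind is something like the following. To be clear, this is hypothetical – the flag name and the wiring are mine, not an existing Solr API – the point is just that an injection point with a reset-between-tests hook can't be leaked from one test class to the next the way a bare public static field can:

{code:java}
// Hypothetical sketch only: the flag name and wiring below are mine, not an
// existing Solr API.  The idea is to replace the bare public static
// DirectUpdateHandler2.commitOnClose field with an injection point that the
// test framework resets between tests, so no test class can leak the setting
// to later tests running in the same JVM.
public class TestInjection {

  /** A test flips this to true to simulate an unclean shutdown. */
  public static volatile boolean skipIndexWriterCommitOnClose = false;

  /** Called from test framework teardown so the setting can't leak. */
  public static void reset() {
    skipIndexWriterCommitOnClose = false;
    // ... reset any other injection points here ...
  }
}

// ...and the update handler would consult the injection point when the core
// is closing, instead of its own public static field:
//
//   if (!TestInjection.skipIndexWriterCommitOnClose) {
//     log.info("Committing on IndexWriter close.");
//     writer.commit();
//   }
{code}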
> race condition between leader's "replay on startup" and non-leader's "recover
> from leader" can leave replicas out of sync (TestTlogReplayVsRecovery)
> ----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13486
>                 URL: https://issues.apache.org/jira/browse/SOLR-13486
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Chris M. Hostetter
>            Priority: Major
>         Attachments: SOLR-13486__test.patch,
>         apache_Lucene-Solr-BadApples-NightlyTests-master_61.log.txt.gz,
>         apache_Lucene-Solr-BadApples-Tests-8.x_102.log.txt.gz,
>         org.apache.solr.cloud.TestCloudConsistency.zip
>
>
> There is a bug in solr cloud that can result in replicas being out of sync with the leader if:
> * The leader has uncommitted docs (in the tlog) that didn't make it to the replica
> * The leader restarts
> * The replica begins to peer sync from the leader before the leader finishes its own tlog replay on startup
>
> A "rolling restart" situation is when this is most likely to affect real world users.
>
> This was first discovered via hard to reproduce TestCloudConsistency failures in jenkins, but that test has since been modified to work around this bug, and a new test "TestTlogReplayVsRecovery" has been added that more aggressively demonstrates this error.
>
> Original jira description below...
> ----
> I've been investigating some jenkins failures from TestCloudConsistency, which at first glance suggest a problem w/replica(s) recovering after a network partition from the leader - but in digging into the logs the root cause actually seems to be a thread race condition when a replica (the leader) is first registered...
> * The {{ZkContainer.registerInZk(...)}} method (which is called by {{CoreContainer.registerCore(...)}} & {{CoreContainer.load()}}) is typically run in a background thread (via the {{ZkContainer.coreZkRegister}} ExecutorService)
> * {{ZkContainer.registerInZk(...)}} delegates to {{ZKController.register(...)}} which is ultimately responsible for checking if there are any "old" tlogs on disk, and if so handling the "Replaying tlog for <URL> during startup" logic
> * Because this happens in a background thread, other logic/requests can be handled by this core/replica in the meantime - before it starts (or while in the middle of) replaying the tlogs
> ** Notably: *leaders that have not yet replayed tlogs on startup will erroneously respond to RTG / Fingerprint / PeerSync requests from other replicas w/incomplete data*
>
> ...In general, it seems scary / fishy to me that a replica can (apparently) become *ACTIVE* before it's finished its {{registerInZk}} + "Replaying tlog ... during startup" logic ... particularly since this can happen even for replicas that are/become leaders. It seems like this could potentially cause a whole host of problems, only one of which manifests in this particular test failure:
> * *BEFORE* replicaX's "coreZkRegister" thread reaches the "Replaying tlog ... during startup" check:
> ** replicaX can recognize (via zk terms) that it should be the leader(X)
> ** this leaderX can then instruct some other replicaY to recover from it
> ** replicaY can send RTG / PeerSync / FetchIndex requests to the leaderX (either of its own volition, or because it was instructed to by leaderX) in an attempt to recover
> *** the responses to these recovery requests will not include updates in the tlog files that existed on leaderX prior to startup that have not yet been replayed
> * *AFTER* replicaY has finished its recovery, leaderX's "Replaying tlog ... during startup" can finish
> ** replicaY now thinks it is in sync with leaderX, but leaderX has (replayed) updates the other replicas know nothing about
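For anyone skimming the quoted description above, here is a tiny self-contained illustration of the interleaving it describes. This is deliberately toy code, not Solr code: the executor merely stands in for the {{ZkContainer.coreZkRegister}} background thread, and the point is just that a node which answers "what do you have?" requests before its background replay task finishes hands out incomplete state:

{code:java}
// Deliberately abstract toy code, not Solr code: illustrates the shape of the
// registerInZk / tlog-replay race described above.
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ReplayVsRecoveryRace {
  public static void main(String[] args) throws Exception {
    List<String> visibleDocs = new CopyOnWriteArrayList<>();
    visibleDocs.add("doc1"); // committed before the "restart"

    // stand-in for the coreZkRegister background thread, which is also where
    // the "Replaying tlog ... during startup" work happens
    ExecutorService coreZkRegister = Executors.newSingleThreadExecutor();
    Future<?> replay = coreZkRegister.submit(() -> {
      sleepQuietly(500);       // simulate slow tlog replay
      visibleDocs.add("doc4"); // the uncommitted doc that only the tlog had
    });

    // meanwhile a "replica" recovers from this "leader" before replay is done
    System.out.println("replica recovered: " + visibleDocs); // missing doc4

    replay.get();
    System.out.println("leader after replay: " + visibleDocs); // has doc4
    coreZkRegister.shutdown();
  }

  private static void sleepQuietly(long ms) {
    try {
      Thread.sleep(ms);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
{code}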