[ https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris M. Hostetter updated SOLR-13486:
--------------------------------------
    Attachment: SOLR-13486__test.patch
        Status: Open  (was: Open)

Based on the renewed interest in this issue recently, I set aside some time last night to try to do 2 things:
 # Fork TestCloudConsistency into a new, simplified test focused on triggering the "replica peer syncs from leader before leader does tlog replay" situation more reliably (see the rough sketch after this list)
 ** Start with a simplified 1 shard, 2 replica collection
 ** Add more docs during the partition
 ** Use TestInjection to excessively delay the tlog replay
 # Find a workaround for this problem in TestCloudConsistency so it passes more reliably
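
To give a feel for what #1 looks like, here's a rough sketch (this is *not* the code in the attached patch; the partition helper and the TestInjection field name / value format are assumptions on my part):

{code:java}
// Rough sketch only -- NOT the code in the attached patch.  The partition
// helper and the TestInjection field name / value format are assumptions.
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
import org.apache.solr.cloud.SolrCloudTestCase;
import org.apache.solr.util.TestInjection;
import org.junit.BeforeClass;
import org.junit.Test;

public class TestTlogReplayVsRecoverySketch extends SolrCloudTestCase {

  @BeforeClass
  public static void setupCluster() throws Exception {
    // simplified 1 shard x 2 replica collection on a 2 node cluster
    configureCluster(2).addConfig("conf", configset("cloud-minimal")).configure();
    CollectionAdminRequest.createCollection("coll", "conf", 1, 2)
        .process(cluster.getSolrClient());
  }

  @Test
  public void testReplayVsPeerSyncRace() throws Exception {
    // docs added (and committed) while both replicas are healthy
    cluster.getSolrClient().add("coll", sdoc("id", "1"));
    cluster.getSolrClient().commit("coll");

    // ASSUMPTION: placeholder for whatever proxy / network-partition
    // mechanism the real test uses to cut the replica off from the leader
    partitionReplicaFromLeader();

    // docs added during the partition land only in the leader's tlog
    cluster.getSolrClient().add("coll", sdoc("id", "2"));

    // ASSUMPTION: field name and value format -- the point is simply to make
    // tlog replay on leader restart slow enough that the replica's PeerSync
    // reliably wins the race
    TestInjection.updateLogReplayRandomPause = "true:100";

    // ...restart the leader, heal the partition, then assert that every
    // replica agrees on which docs exist
  }

  private void partitionReplicaFromLeader() {
    // placeholder only; see TestCloudConsistency's proxy-based approach
  }
}
{code}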

While working on #1, I realized there was a question I hadn't considered during my initial analysis of the jenkins logs showing this failure:

{color:#de350b}*Why does the leader _need_ to do tlog replay in the test at 
all?*{color}

Even if the client doesn't explicitly commit all docs, the "Commit on Close" semantics of Solr's IndexWriter should ensure that a clean shutdown of the leader means all uncommitted docs in the tlog will be automatically committed before the Directory is closed – nothing in the test "kills" the leader before this should happen.

So WTF?

I still haven't gotten to the bottom of that, but I did confirm that:
 * unlike the "normal" adds for docs 1-3, the code path in TestCloudConsistency that was adding doc #4 (during the network partition) was *NOT* committing doc #4.
 * in the test logs where TestCloudConsistency failed, we never see the normal "Committing on IndexWriter close." message I would expect from an orderly shutdown of the leader
 ** This message does appear in the expected location of the logs for a TestCloudConsistency run that passes

At first I thought the problem was some other test class running earlier in the same jenkins JVM mucking with the value of the (public static) {{DirectUpdateHandler2.commitOnClose}} prior to the test running – but even when running a single test class locally, with {{DirectUpdateHandler2.commitOnClose = true;}} explicitly forced (sketched below), I was able to continue to reproduce the problem in my new test.
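
For the record, forcing the flag looked roughly like this (a minimal sketch assuming the usual {{@BeforeClass}}/{{@AfterClass}} pattern, not the exact code in the attached test):

{code:java}
// Minimal sketch (assumed pattern, not the exact code in the attached test):
// pin DirectUpdateHandler2.commitOnClose for the duration of the test class
// so no earlier test in the JVM can have left it disabled.
import org.apache.solr.update.DirectUpdateHandler2;
import org.junit.AfterClass;
import org.junit.BeforeClass;

public class CommitOnCloseSketch {
  @BeforeClass
  public static void forceCommitOnClose() {
    DirectUpdateHandler2.commitOnClose = true; // public static flag; true is the default
  }

  @AfterClass
  public static void restoreCommitOnClose() {
    DirectUpdateHandler2.commitOnClose = true; // restore the default in case a later test cares
  }
}
{code}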
----
The attached patch includes:
 * my new TestTlogReplayVsRecovery test, marked as {{AwaitsFix}} on this jira.
 * a trivial fix/workaround for this bug in TestCloudConsistency
 ** it now ensures that doc #4 is committed before shutting down the leader – eliminating the risk of tlog replay affecting the leader when it restarts (roughly as sketched below)
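
The workaround amounts to something like the following inside the existing test (a rough sketch with placeholder names and SolrJ imports omitted, not the exact diff in the patch):

{code:java}
// Rough sketch of the TestCloudConsistency workaround; COLLECTION and
// leaderJetty are placeholders, not the exact code in the patch.
CloudSolrClient client = cluster.getSolrClient();

// doc #4 is added while the other replicas are partitioned off...
client.add(COLLECTION, sdoc("id", "4"));
// ...but now it is also explicitly committed, so the leader's tlog has
// nothing left to replay when it is restarted later in the test
client.commit(COLLECTION);

// orderly shutdown of the leader -- with the commit above, the test no
// longer depends on "Commit on Close" behaving correctly
cluster.stopJettySolrRunner(leaderJetty);
{code}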

I still want to dig into WTF is happening that causes "commit on close" to not work reliably by default, but that seems like a distinct bug/issue.

In the meantime, this patch should help ensure we don't get hard-to-reproduce errors in TestCloudConsistency as a result of this bug, while also providing a new test that more reliably proves this bug exists and can be used to vet whatever fixes people suggest in the future.
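
(For context, the assertion both tests ultimately boil down to is a per-replica doc-count check along these lines – a rough sketch inside a SolrCloudTestCase subclass, not the exact code in either test; {{EXPECTED_DOCS}} is a placeholder:)

{code:java}
// Rough sketch of a per-replica consistency check (placeholder names, SolrJ
// imports omitted): query each core directly with distrib=false and make sure
// every replica reports the same docs as the leader.
DocCollection coll = getCollectionState("coll");
for (Replica r : coll.getReplicas()) {
  try (HttpSolrClient core = new HttpSolrClient.Builder(r.getCoreUrl()).build()) {
    SolrQuery q = new SolrQuery("*:*");
    q.set("distrib", "false"); // ask only this core, no distributed fan-out
    long found = core.query(q).getResults().getNumFound();
    assertEquals("replica " + r.getName() + " is out of sync", EXPECTED_DOCS, found);
  }
}
{code}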

[~dweiss] / [~erickerickson] does this patch resolve most of the failures you are seeing from TestCloudConsistency? If you are still seeing the "{{Doc with id=4 not found}}" failures after applying this patch, can you please post your logs? (If you are seeing other failures, can you please file new jiras and attach the logs there?)

If there are no objections I'll check precommit (I'm sure I have some unused imports) and commit soon, then update this jira's summary/description to refer to the new test.

> race condition between leader's "replay on startup" and non-leader's "recover 
> from leader" can leave replicas out of sync (TestCloudConsistency)
> ------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13486
>                 URL: https://issues.apache.org/jira/browse/SOLR-13486
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Chris M. Hostetter
>            Priority: Major
>         Attachments: SOLR-13486__test.patch, 
> apache_Lucene-Solr-BadApples-NightlyTests-master_61.log.txt.gz, 
> apache_Lucene-Solr-BadApples-Tests-8.x_102.log.txt.gz, 
> org.apache.solr.cloud.TestCloudConsistency.zip
>
>
> I've been investigating some jenkins failures from TestCloudConsistency, 
> which at first glance suggest a problem w/replica(s) recovering after a 
> network partition from the leader - but in digging into the logs the root 
> cause actually seems to be a thread race condition when a replica (the 
> leader) is first registered...
>  * The {{ZkContainer.registerInZk(...)}} method (which is called by 
> {{CoreContainer.registerCore(...)}} & {{CoreContainer.load()}}) is typically 
> run in a background thread (via the {{ZkContainer.coreZkRegister}} 
> ExecutorService)
>  * {{ZkContainer.registerInZk(...)}} delegates to 
> {{ZKController.register(...)}} which is ultimately responsible for checking 
> if there are any "old" tlogs on disk, and if so handling the "Replaying tlog 
> for <URL> during startup" logic
>  * Because this happens in a background thread, other logic/requests can be 
> handled by this core/replica in the meantime - before it starts (or while in 
> the middle of) replaying the tlogs
>  ** Notably: *leaders that have not yet replayed tlogs on startup will 
> erroneously respond to RTG / Fingerprint / PeerSync requests from other 
> replicas w/incomplete data*
> ...In general, it seems scary / fishy to me that a replica can (apparently) 
> become *ACTIVE* before it has finished its {{registerInZk}} + "Replaying tlog 
> ... during startup" logic ... particularly since this can happen even for 
> replicas that are/become leaders. It seems like this could potentially cause 
> a whole host of problems, only one of which manifests in this particular test 
> failure:
>  * *BEFORE* replicaX's "coreZkRegister" thread reaches the "Replaying tlog 
> ... during startup" check:
>  ** replicaX can recognize (via zk terms) that it should be the leader(X)
>  ** this leaderX can then instruct some other replicaY to recover from it
>  ** replicaY can send RTG / PeerSync / FetchIndex requests to the leaderX 
> (either of its own volition, or because it was instructed to by leaderX) in 
> an attempt to recover
>  *** the responses to these recovery requests will not include updates in the 
> tlog files that existed on leaderX prior to startup that have not yet been 
> replayed
>  * *AFTER* replicaY has finished its recovery, leaderX's "Replaying tlog ... 
> during startup" can finish
>  ** replicaY now thinks it is in sync with leaderX, but leaderX has 
> (replayed) updates the other replicas know nothing about


