[
https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17007744#comment-17007744
]
Erick Erickson commented on SOLR-13486:
---------------------------------------
[~hossman] What I'm seeing is different, and I haven't a clue what is
happening. On the surface it looks similar because the "assertDocExists"
method fails. However, what I can reliably reproduce in assertDocExists is:
{code}
Server refused connection at:
https://127.0.0.1:49190/solr/outOfSyncReplicasCannotBecomeLeader-false
org.apache.solr.client.solrj.SolrServerException: Server refused connection at:
https://127.0.0.1:49190/solr/outOfSyncReplicasCannotBecomeLeader-false
{code}
This problem happens on one of the _followers_, BTW, after it apparently
syncs successfully following a restart.
So that's what I'm looking at over on the linked JIRA. And I can only reliably
make it happen on my MBP; my Mac Pro doesn't seem to generate it. "Reliably"
is a bit of a misnomer; it's really between 3 and 10 failures per thousand
test runs. I can also make the "connection refused" error happen when beasting
only a single test.
So please go ahead and push anything you think will help for the commit issue.
AFAIK there are multiple issues here, and it looks like I conflated what I'm
seeing with the original problem. I'll take whatever else I find over to the
linked JIRA.
It'd be really weird if this fix also fixed my other issue...
> race condition between leader's "replay on startup" and non-leader's "recover
> from leader" can leave replicas out of sync (TestCloudConsistency)
> ------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-13486
> URL: https://issues.apache.org/jira/browse/SOLR-13486
> Project: Solr
> Issue Type: Bug
> Reporter: Chris M. Hostetter
> Priority: Major
> Attachments: SOLR-13486__test.patch,
> apache_Lucene-Solr-BadApples-NightlyTests-master_61.log.txt.gz,
> apache_Lucene-Solr-BadApples-Tests-8.x_102.log.txt.gz,
> org.apache.solr.cloud.TestCloudConsistency.zip
>
>
> I've been investigating some jenkins failures from TestCloudConsistency,
> which at first glance suggest a problem w/replica(s) recovering after a
> network partition from the leader - but in digging into the logs the root
> cause actually seems to be a thread race condition when a replica (the
> leader) is first registered...
> * The {{ZkContainer.registerInZk(...)}} method (which is called by
> {{CoreContainer.registerCore(...)}} & {{CoreContainer.load()}}) is typically
> run in a background thread (via the {{ZkContainer.coreZkRegister}}
> ExecutorService)
> * {{ZkContainer.registerInZk(...)}} delegates to
> {{ZKController.register(...)}} which is ultimately responsible for checking
> if there are any "old" tlogs on disk, and if so handling the "Replaying tlog
> for <URL> during startup" logic
> * Because this happens in a background thread, other logic/requests can be
> handled by this core/replica in the meantime - before it starts (or while in
> the middle of) replaying the tlogs
> ** Notably: *leaders that have not yet replayed tlogs on startup will
> erroneously respond to RTG / Fingerprint / PeerSync requests from other
> replicas w/incomplete data*
> ...In general, it seems scary / fishy to me that a replica can (apparently)
> become *ACTIVE* before it's finished its {{registerInZk}} + "Replaying tlog
> ... during startup" logic ... particularly since this can happen even for
> replicas that are/become leaders. It seems like this could potentially cause
> a whole host of problems, only one of which manifests in this particular test
> failure:
> * *BEFORE* replicaX's "coreZkRegister" thread reaches the "Replaying tlog
> ... during startup" check:
> ** replicaX can recognize (via zk terms) that it should be the leader(X)
> ** this leaderX can then instruct some other replicaY to recover from it
> ** replicaY can send RTG / PeerSync / FetchIndex requests to the leaderX
> (either of its own volition, or because it was instructed to by leaderX) in
> an attempt to recover
> *** the responses to these recovery requests will not include updates in the
> tlog files that existed on leaderX prior to startup that have not yet been
> replayed
> * *AFTER* replicaY has finished its recovery, leaderX's "Replaying tlog ...
> during startup" can finish
> ** replicaY now thinks it is in sync with leaderX, but leaderX has
> (replayed) updates the other replicas know nothing about