[
https://issues.apache.org/jira/browse/SOLR-14356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068520#comment-17068520
]
Cao Manh Dat edited comment on SOLR-14356 at 3/27/20, 10:27 AM:
----------------------------------------------------------------
On second thought, I think it is sufficient for now to just add
SocketTimeoutException to the list and revisit the retry problem in another
issue. The reasons are:
* We already count ConnectTimeoutException as success
* The more I think about how a replica counts on the result of SyncStrategy to
become leader, the more error-prone it seems to me. Will open another issue for this.
[~shalin] WDYT? Opened SOLR-14368
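To make "add the exception to the list" concrete, here is a minimal sketch of such a check. It is not the actual {{PeerSync}} code: the class name SkippableFailures, the method isSkippable, and the cause-chain walk are assumptions for illustration. The two HttpClient exception types are matched by class name only so that the sketch compiles without that library.

```java
import java.net.SocketException;
import java.net.SocketTimeoutException;
import java.util.Set;

// Hypothetical helper for illustration; not Solr's PeerSync implementation.
final class SkippableFailures {
    // Apache HttpClient types, matched by name so no extra dependency is needed.
    private static final Set<String> SKIPPABLE_BY_NAME = Set.of(
        "org.apache.http.conn.ConnectTimeoutException",
        "org.apache.http.NoHttpResponseException");

    // Returns true if the failure means "skip this node" rather than "sync failed".
    static boolean isSkippable(Throwable error) {
        // Walk the cause chain (assumption), bounded to guard against cyclic causes.
        Throwable t = error;
        for (int depth = 0; t != null && depth < 16; depth++, t = t.getCause()) {
            // SocketTimeoutException is the proposed addition; it is not a
            // subclass of SocketException, so it needs its own check.
            if (t instanceof SocketException || t instanceof SocketTimeoutException) {
                return true;
            }
            if (SKIPPABLE_BY_NAME.contains(t.getClass().getName())) {
                return true;
            }
        }
        return false;
    }
}
```

With a check like this, a hung node that accepts the connection but never answers is skipped the same way as a node that refuses the connection.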
> PeerSync with hanging nodes
> ---------------------------
>
> Key: SOLR-14356
> URL: https://issues.apache.org/jira/browse/SOLR-14356
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Cao Manh Dat
> Priority: Major
> Attachments: SOLR-14356.patch
>
>
> Right now in {{PeerSync}} (during leader election), if requesting versions
> from a node throws an exception, we skip that node when the exception is one
> of the following types:
> * ConnectTimeoutException
> * NoHttpResponseException
> * SocketException
> Sometimes the other node hangs but still accepts connections. In that
> case {{SocketTimeoutException}} is thrown, we consider the {{PeerSync}}
> process failed, and the whole shard stays leaderless for as long as the
> hung node is still there.
> We can't just blindly add {{SocketTimeoutException}} to the above list, since
> [~shalin] mentioned that timeouts can sometimes happen for genuine
> reasons too, e.g. a temporary GC pause.
> I think the general idea here is to obey the {{leaderVoteWait}} restriction and
> retry syncing with the others when a connection/timeout exception happens.
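The retry idea in the description could look roughly like the sketch below. It is not taken from the attached patch: BoundedSyncRetry, SyncAttempt, syncWithRetry, and the pause parameter are hypothetical names, and treating every IOException as a connection/timeout problem is an assumption made to keep the example short. The point is only that retries stop once the {{leaderVoteWait}} budget is used up.

```java
import java.io.IOException;
import java.time.Duration;

// Hypothetical sketch for illustration; not Solr's SyncStrategy or PeerSync code.
final class BoundedSyncRetry {
    // One sync attempt; returns whether the sync succeeded.
    interface SyncAttempt {
        boolean run() throws IOException;
    }

    // Retries the attempt on IOException until leaderVoteWait has elapsed,
    // then rethrows the last failure.
    static boolean syncWithRetry(SyncAttempt attempt, Duration leaderVoteWait, Duration pause)
            throws IOException, InterruptedException {
        long deadline = System.nanoTime() + leaderVoteWait.toNanos();
        while (true) {
            try {
                return attempt.run();
            } catch (IOException e) {
                // Assumption: any IOException here is a connection or timeout failure.
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    throw e; // budget exhausted, give up
                }
                // Sleep for the pause, but never past the deadline.
                long remainingMillis = Math.max(1, remainingNanos / 1_000_000);
                Thread.sleep(Math.min(pause.toMillis(), remainingMillis));
            }
        }
    }
}
```

A genuine but short stall, such as a GC pause on the peer, would then be retried instead of failing the sync, while a permanently hung node still cannot block the election beyond {{leaderVoteWait}}.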