[ https://issues.apache.org/jira/browse/SOLR-14356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069426#comment-17069426 ]
Shalin Shekhar Mangar commented on SOLR-14356: ---------------------------------------------- Okay, yes let's add the connect timeout exception and discuss a better fix in SOLR-14368 > PeerSync with hanging nodes > --------------------------- > > Key: SOLR-14356 > URL: https://issues.apache.org/jira/browse/SOLR-14356 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Reporter: Cao Manh Dat > Priority: Major > Attachments: SOLR-14356.patch > > > Right now in {{PeerSync}} (during leader election), in case of exception on > requesting versions to a node, we will skip that node if exception is one the > following type > * ConnectTimeoutException > * NoHttpResponseException > * SocketException > Sometime the other node basically hang but still accept connection. In that > case SocketTimeoutException is thrown and we consider the {{PeerSync}} > process as failed and the whole shard just basically leaderless forever (as > long as the hang node still there). > We can't just blindly adding {{SocketTimeoutException}} to above list, since > [~shalin] mentioned that sometimes timeout can happen because of genuine > reasons too e.g. temporary GC pause. > I think the general idea here is we obey {{leaderVoteWait}} restriction and > retry doing sync with others in case of connection/timeout exception happen. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org