[jira] [Commented] (SOLR-14942) Reduce leader election time on node shutdown

ASF subversion and git services (Jira) Sat, 24 Oct 2020 08:54:28 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-14942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220133#comment-17220133
 ]


ASF subversion and git services commented on SOLR-14942:
--------------------------------------------------------

Commit b6d06bb309d121901ab6ce1d1935b4067ce610fe in lucene-solr's branch 
refs/heads/branch_8x from Shalin Shekhar Mangar
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=b6d06bb ]

SOLR-14942: Reduce leader election time on node shutdown (#2004)

The shutdown process waits for all replicas/cores to be closed before removing 
the election node of the leader. This can take some time due to index flush or 
merge activities on the leader cores and delays new leaders from being elected. 
Moreover, jetty stops accepting new requests on receiving SIGTERM which means 
that even though a leader technically exists, no new indexing requests can be 
processed by the node. This commit waits for all in-flight indexing requests to 
complete, removes election nodes (thus triggering leader election) and then 
closes all replicas.

Co-authored-by: Cao Manh Dat <da...@apache.org>

(cherry picked from commit 706f284c467becb5f002c05455808ee31aee3465)
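
The ordering described in the commit message can be sketched as follows. This is a minimal illustration, not the code from the commit: the class, method names and event strings are hypothetical, the use of a Phaser to count in-flight requests is an assumption, and so is the choice to proceed with election node removal when the wait times out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Phaser;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical model of a node shutting down; none of these names are Solr APIs.
final class ShutdownOrderSketch {
    // One party is registered up front for the shutdown thread itself.
    private final Phaser inFlight = new Phaser(1);
    // Records the order of shutdown steps; written only by the thread calling shutdown().
    final List<String> events = new ArrayList<>();

    // Each indexing request registers when it starts and deregisters when it finishes.
    void requestStarted() { inFlight.register(); }
    void requestFinished() { inFlight.arriveAndDeregister(); }

    void shutdown(long timeoutMs) {
        // Step 1: wait (bounded) until every registered request has finished.
        try {
            int phase = inFlight.arrive();
            inFlight.awaitAdvanceInterruptibly(phase, timeoutMs, TimeUnit.MILLISECONDS);
            events.add("in-flight requests drained");
        } catch (TimeoutException e) {
            // Assumption: on timeout we still go on, so that election is not delayed further.
            events.add("timed out waiting for in-flight requests");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            events.add("interrupted while waiting for in-flight requests");
        }
        // Step 2: give up leadership first, so other replicas can start electing a leader.
        events.add("election nodes removed");
        // Step 3: only now do the slow work (index flush, merges) of closing cores.
        events.add("cores closed");
    }

    public static void main(String[] args) throws Exception {
        ShutdownOrderSketch node = new ShutdownOrderSketch();
        node.requestStarted();
        // Simulate one indexing request that is still running when shutdown begins.
        Thread indexer = new Thread(() -> {
            try {
                Thread.sleep(100);
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
            node.requestFinished();
        });
        indexer.start();
        node.shutdown(5_000);
        indexer.join();
        node.events.forEach(System.out::println);
    }
}
```

The point of the sketch is only the order of the three steps: the slow core close comes after the election nodes are gone, so it no longer delays the election.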


> Reduce leader election time on node shutdown
> --------------------------------------------
>
>                 Key: SOLR-14942
>                 URL: https://issues.apache.org/jira/browse/SOLR-14942
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: 7.7.3, 8.6.3
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>            Priority: Major
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> The credit for this issue and investigation belongs to [~caomanhdat]. I am 
> merely reporting the issue and creating PRs based on his work.
> The shutdown process waits for all replicas/cores to be closed before 
> removing the election node of the leader. This can take some time due to 
> index flush or merge activities on the leader cores and delays new leaders 
> from being elected.
> This process happens at CoreContainer.shutdown():
> # zkController.preClose(): removes the current node from live_nodes and sets 
> the state of every core on this node to DOWN. If the current node hosts the 
> leader of a shard, the shard becomes leaderless after this call, since the 
> leader's state is now DOWN. Leader election is not triggered for the shard 
> because the election node is still held by the current node.
> # Wait for any cores that are still loading to finish loading.
> # SolrCores.close(): close all cores.
> # zkController.close(): removes all ephemeral nodes created by this node from 
> ZK, including its election nodes. Only from this point on can the other 
> replicas in the shard take part in leader election.
> Note that CoreContainer.shutdown() is invoked when a Jetty/Solr node receives 
> a SIGTERM signal. 
> On receiving SIGTERM, Jetty also stops accepting new connections and new 
> requests. This is a very important factor: even if the leader replica is 
> ACTIVE and its node is in live_nodes, the shard is effectively leaderless if 
> no one can index to it. Shards therefore become leaderless as soon as the 
> node hosting the shard’s leader receives SIGTERM.
> Therefore, the longer steps 1, 2 and 3 take to finish, the longer shards 
> remain leaderless. The time needed for step 3 scales with the number of 
> cores, so the more cores a node has, the worse it gets. This time is spent in 
> IndexWriter.close(), where the system will 
> # Flush all pending updates to disk
> # Wait for all merges to finish (this is most likely where most of the time goes)
> The proposal is to change the shutdown process to:
> # Wait for all in-flight indexing requests and replication requests to 
> complete
> # Remove election nodes
> # Close all replicas/cores
> This ensures that index flush or merges do not block new leader elections 
> anymore.
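
To make the effect concrete, here is a small arithmetic sketch of how long a shard stays leaderless under each ordering. It assumes the old window is simply the sum of steps 1 to 3 and that, in the proposed ordering, only the wait for in-flight requests precedes election node removal. The class and method names and all durations are made up for illustration, not measurements.

```java
import java.time.Duration;

// Hypothetical timing model; names and numbers are illustrative only.
final class LeaderlessWindowSketch {
    // Old ordering: election nodes are removed only after steps 1, 2 and 3 have finished.
    static Duration oldWindow(Duration preClose, Duration awaitCoreLoad, Duration closeCores) {
        return preClose.plus(awaitCoreLoad).plus(closeCores);
    }

    // Proposed ordering (assumed): only draining in-flight requests comes before removal.
    static Duration newWindow(Duration drainInFlight) {
        return drainInFlight;
    }

    public static void main(String[] args) {
        Duration before = oldWindow(Duration.ofSeconds(1), Duration.ofSeconds(2), Duration.ofSeconds(30));
        Duration after = newWindow(Duration.ofSeconds(3));
        System.out.println("old ordering: " + before.getSeconds() + "s before election can start");
        System.out.println("new ordering: " + after.getSeconds() + "s before election can start");
    }
}
```

With these made-up inputs, the time spent closing cores dominates the old window and drops out of the new one entirely.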



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
