[
https://issues.apache.org/jira/browse/SOLR-15029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17247646#comment-17247646
]
Mike Drob commented on SOLR-15029:
----------------------------------
I think this can be done much more simply than what I was first trying to
accomplish. If we simply trigger a leader election, the current leader goes to
the end of the queue and a new leader comes in. If indexing errors continue on
the given node, the new leader will increase terms and the previous leader
will fall behind.
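
The queue behavior described above can be pictured with a toy model. This is
only a sketch: Solr's real election queue is built on ZooKeeper sequential
ephemeral nodes, and the class and method names below (ElectionQueueSketch,
rejoinAtEnd) are hypothetical, not Solr APIs.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Hypothetical in-memory stand-in for a shard's leader election queue. */
public class ElectionQueueSketch {
    // Head of the deque is the current leader; the rest wait in order.
    private final Deque<String> queue = new ArrayDeque<>();

    public ElectionQueueSketch(List<String> replicasInElectionOrder) {
        queue.addAll(replicasInElectionOrder);
    }

    /** Returns the current leader, or null if the queue is empty. */
    public String leader() {
        return queue.peekFirst();
    }

    /**
     * The current leader gives up its place and rejoins at the end of the
     * queue. Returns the replica that is now at the head (the new leader).
     */
    public String rejoinAtEnd() {
        String old = queue.pollFirst();
        if (old != null) {
            queue.addLast(old);
        }
        return queue.peekFirst();
    }
}
```

With replicas queued as a, b, c, a call to rejoinAtEnd() leaves the order as
b, c, a, so b leads and a would only lead again after b and c step aside.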
> Allow Shard Leader to give up leadership gracefully via shard terms
> -------------------------------------------------------------------
>
> Key: SOLR-15029
> URL: https://issues.apache.org/jira/browse/SOLR-15029
> Project: Solr
> Issue Type: Improvement
> Reporter: Mike Drob
> Assignee: Mike Drob
> Priority: Major
> Time Spent: 40m
> Remaining Estimate: 0h
>
> Currently (via SOLR-12412), when a leader sees an index writing error during
> an update, it gives up leadership by deleting the replica and adding a new
> replica. One stated benefit of this was that, because we are using the
> overseer and a known code path, this is done asynchronously and very
> efficiently.
> I would argue that this approach is too heavy-handed.
> In the case of a corrupt index exception, it makes some sense to completely
> delete the index dir and attempt to sync from a good peer. Even in this case,
> however, it might be better to allow fingerprinting and other index delta
> mechanisms to take over and allow for a more efficient data transfer.
> In an alternate case where the index error arises due to a disconnected file
> system (possible with shared file systems, e.g. S3, HDFS, some k8s systems)
> and the required solution is some kind of reconnect, this approach has
> several shortcomings: the core deletion and creation are going to fail,
> leaving dangling replicas. Further, the data is still present, so there is no
> need to make so many extra copies.
> I propose that we bring in a mechanism to give up leadership via the existing
> shard terms language. I believe we would be able to set all replicas, other
> than the leader, whose term currently equals the leader term T to T+1, and
> then trigger a new leader election. The current leader would know it is
> ineligible, while the other replicas that were current before the failed
> update would be eligible. This improvement would entail adding an additional
> possible operation to the terms state machine.
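
As a rough illustration of the proposed term bump, here is a minimal sketch
that models terms as a plain map from replica name to term. It is not Solr's
actual shard terms implementation; the class and method names (TermsSketch,
giveUpLeadership, eligibleForLeadership) are hypothetical, and eligibility is
assumed to mean "holds the highest term".

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical, simplified model of per-replica shard terms. */
public class TermsSketch {

    /**
     * Returns a copy of the terms in which every replica at the leader's
     * term T, except the leader itself, is moved to T+1.
     */
    public static Map<String, Long> giveUpLeadership(Map<String, Long> terms, String leader) {
        Long leaderTerm = terms.get(leader);
        if (leaderTerm == null) {
            throw new IllegalArgumentException("unknown replica: " + leader);
        }
        Map<String, Long> updated = new HashMap<>(terms);
        for (Map.Entry<String, Long> e : terms.entrySet()) {
            boolean isLeader = e.getKey().equals(leader);
            if (!isLeader && e.getValue().longValue() == leaderTerm.longValue()) {
                updated.put(e.getKey(), leaderTerm + 1);
            }
        }
        return updated;
    }

    /** Assumption: only replicas holding the highest term may become leader. */
    public static Set<String> eligibleForLeadership(Map<String, Long> terms) {
        Set<String> eligible = new TreeSet<>();
        if (terms.isEmpty()) {
            return eligible;
        }
        long max = Collections.max(terms.values());
        for (Map.Entry<String, Long> e : terms.entrySet()) {
            if (e.getValue() == max) {
                eligible.add(e.getKey());
            }
        }
        return eligible;
    }
}
```

For example, with terms {leader=5, r1=5, r2=3}, giveUpLeadership yields
{leader=5, r1=6, r2=3}, so only r1 is eligible. Note that in this sketch, if
no other replica shares the leader's term, nothing is bumped and the leader
remains the only eligible replica.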