Chris M. Hostetter created SOLR-14183:
-----------------------------------------
Summary: replicas do not immediately/synchronously reflect
state=RECOVERING when receiving REQUESTRECOVERY commands
Key: SOLR-14183
URL: https://issues.apache.org/jira/browse/SOLR-14183
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Reporter: Chris M. Hostetter
Spun off of SOLR-13486. Consider the following situation, which can occur in
{{TestTlogReplayVsRecovery}}:
* healthy cluster, healthy shard with multiple replicas
* network partition occurs, leader adds new documents
* network partition is healed, leader is restarted
* leader determines it should be leader again
** sends {{REQUESTRECOVERY}} to replicas
** leader marks itself as {{state=ACTIVE}}
* client checks cluster status and sees all replicas are {{ACTIVE}}
** client assumes all replicas are fair game for searching all documents
** *CLIENT FAILS TO FIND EXPECTED DOCUMENTS IF QUERYING NON-LEADER REPLICA*
* asynchronously, non-leader replicas get around to {{doRecovery}}
** only now are non-leader replicas marking themselves as {{state=RECOVERING}}
----
I think we need to reconsider when replicas are marked {{state=RECOVERING}},
either doing it synchronously in {{CoreAdminOperation.REQUESTRECOVERY_OP}}, or
letting the leader set it when the leader knows it needs to initiate recovery,
so that the status is updated and available to clients (and tests) immediately.
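
To make the window concrete, here is a minimal sketch in plain Java. It is a toy model and not Solr code: the class, the method, and the {{markSynchronously}} flag are all hypothetical, and a latch forces the worst-case ordering (client looks before the background recovery task runs) so the result is deterministic.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical model, not Solr code: shows why the published state can lag the request.
class RecoveryStateSketch {
    enum State { ACTIVE, RECOVERING }

    /** Returns the state a client observes right after the recovery request has returned. */
    static State stateSeenAfterRequest(boolean markSynchronously) {
        AtomicReference<State> published = new AtomicReference<>(State.ACTIVE);
        CountDownLatch clientHasLooked = new CountDownLatch(1);
        ExecutorService recoveryExecutor = Executors.newSingleThreadExecutor();
        try {
            // Assumed alternative: publish RECOVERING before the request handler returns.
            if (markSynchronously) {
                published.set(State.RECOVERING);
            }
            // Stand-in for the asynchronous recovery work: it is held back until
            // the client has already read the state.
            recoveryExecutor.submit(() -> {
                try {
                    clientHasLooked.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                published.set(State.RECOVERING);
            });
            // The client checks the replica state inside the window.
            State seen = published.get();
            clientHasLooked.countDown();
            return seen;
        } finally {
            recoveryExecutor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("async marking: " + stateSeenAfterRequest(false));
        System.out.println("sync marking:  " + stateSeenAfterRequest(true));
    }
}
```

With asynchronous marking the client still reads {{ACTIVE}}; with synchronous marking it reads {{RECOVERING}}. The real code paths involve ZooKeeper state publication and are more involved than this model.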
Alternatively: we need a more comprehensive way for clients (and tests) to know
if a shard is "healthy" than just checking the state of each replica (since
{{state=RECOVERING}} isn't set in real time).
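
As a rough illustration of what a more comprehensive check might look like, the sketch below compares a state-only check with one that also consults a pending-recovery flag. Everything here is an assumption for illustration: {{ReplicaView}}, {{recoveryPending}}, and both methods are hypothetical, and Solr does not expose such a flag today.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model, not a Solr API: recoveryPending is an assumed signal.
class ShardHealthSketch {
    enum State { ACTIVE, RECOVERING, DOWN }

    static final class ReplicaView {
        final String name;
        final State state;
        // Assumed to be set as soon as a recovery request is accepted.
        final boolean recoveryPending;

        ReplicaView(String name, State state, boolean recoveryPending) {
            this.name = name;
            this.state = state;
            this.recoveryPending = recoveryPending;
        }
    }

    /** State-only check, equivalent to what the client in the scenario relies on. */
    static boolean allActive(List<ReplicaView> replicas) {
        return replicas.stream().allMatch(r -> r.state == State.ACTIVE);
    }

    /** Stricter check: ACTIVE and no recovery request outstanding. */
    static boolean isHealthy(List<ReplicaView> replicas) {
        return replicas.stream()
                .allMatch(r -> r.state == State.ACTIVE && !r.recoveryPending);
    }

    public static void main(String[] args) {
        List<ReplicaView> shard = Arrays.asList(
                new ReplicaView("leader", State.ACTIVE, false),
                // Still reported ACTIVE, but a recovery request has been accepted.
                new ReplicaView("replica1", State.ACTIVE, true));
        System.out.println("allActive: " + allActive(shard));
        System.out.println("isHealthy: " + isHealthy(shard));
    }
}
```

In this model the state-only check passes while the stricter check fails, which is the distinction a client or test would need in order to avoid querying a stale replica.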