[ 
https://issues.apache.org/jira/browse/SOLR-13945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ishan Chattopadhyaya updated SOLR-13945:
----------------------------------------
    Attachment: SOLR-13945.patch
        Status: Open  (was: Open)

Attaching a WIP patch getting rid of the commit() on parent shard that happens 
after the state changes have already taken place. Please note that I've removed 
it because I wasn't able to understand why it was put there in the first place 
(SOLR-7673) and it just doesn't feel right.

Also, in a subsequent patch, I would like to modify the rollback thing to do 
proper state checks before setting back the state and deleting the subshards, 
such that there is no data loss.

> SPLITSHARD data loss due to "rollback"
> --------------------------------------
>
>                 Key: SOLR-13945
>                 URL: https://issues.apache.org/jira/browse/SOLR-13945
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Ishan Chattopadhyaya
>            Priority: Major
>         Attachments: SOLR-13945.patch
>
>
> # As per SOLR-7673, there is a commit on the parent shard *after state 
> changes* have happened, i.e. from active/construction/construction to 
> inactive/active/active. Please see 
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java#L586-L588
> # Due to SOLR-12509, there's now a cleanup/rollback method called 
> "cleanupAfterFailure" in the finally block that resets the state to 
> active/construction/construction. Please see: 
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java#L657
> # When 2 is entered into due to a failure in 1, we have a situation where any 
> documents that went into the subshards (because they are already active by 
> now) are now lost after the parent becomes active.
> If my above understanding is correct, I am wondering:
> # Why is a commit to parent shard needed *after* the parent shard is 
> inactive, subshards are now active and the split operation has completed?
> # This rollback looks very suspicious. If state of subshards is already 
> active and parent is inactive, then what is the need for setting them back to 
> construction? Seems like a crucial check is missing there. Also, why do we 
> reset the subshard status back to construction instead of inactive? It is 
> extremely misleading (and, frankly, ridiculous) for any external clusterstate 
> monitoring tools to see the subshards to go from CONSTRUCTION to ACTIVE to 
> CONSTRUCTION and then the subshard disappearing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to