[ https://issues.apache.org/jira/browse/SOLR-13945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ishan Chattopadhyaya updated SOLR-13945:
----------------------------------------
    Attachment: SOLR-13945.patch
        Status: Open  (was: Open)

Attaching a WIP patch getting rid of the commit() on parent shard that
happens after the state changes have already taken place. Please note that
I've removed it because I wasn't able to understand why it was put there in
the first place (SOLR-7673) and it just doesn't feel right. Also, in a
subsequent patch, I would like to modify the rollback thing to do proper
state checks before setting back the state and deleting the subshards, such
that there is no data loss.

> SPLITSHARD data loss due to "rollback"
> --------------------------------------
>
>                 Key: SOLR-13945
>                 URL: https://issues.apache.org/jira/browse/SOLR-13945
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>            Reporter: Ishan Chattopadhyaya
>            Priority: Major
>         Attachments: SOLR-13945.patch
>
>
> # As per SOLR-7673, there is a commit on the parent shard *after state
> changes* have happened, i.e. from active/construction/construction to
> inactive/active/active. Please see
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java#L586-L588
> # Due to SOLR-12509, there's now a cleanup/rollback method called
> "cleanupAfterFailure" in the finally block that resets the state to
> active/construction/construction. Please see:
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java#L657
> # When 2 is entered into due to a failure in 1, we have a situation where any
> documents that went into the subshards (because they are already active by
> now) are now lost after the parent becomes active.
> If my above understanding is correct, I am wondering:
> # Why is a commit to parent shard needed *after* the parent shard is
> inactive, subshards are now active and the split operation has completed?
> # This rollback looks very suspicious. If state of subshards is already
> active and parent is inactive, then what is the need for setting them back to
> construction? Seems like a crucial check is missing there. Also, why do we
> reset the subshard status back to construction instead of inactive? It is
> extremely misleading (and, frankly, ridiculous) for any external clusterstate
> monitoring tools to see the subshards to go from CONSTRUCTION to ACTIVE to
> CONSTRUCTION and then the subshard disappearing.
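
To make the missing check concrete, here is a minimal sketch of the kind of state guard the rollback could apply before resetting states and deleting the subshards. It is not taken from the attached patch and it does not use Solr's classes: ShardState, safeToRollback and SplitRollbackSketch are hypothetical names invented for illustration, and the rule itself (roll back only while the parent is still active and no subshard has become active) is an assumption based on the description above.

```java
import java.util.List;

class SplitRollbackSketch {
    // Stand-in for shard states; hypothetical, not Solr's own state class.
    enum ShardState { ACTIVE, INACTIVE, CONSTRUCTION }

    // Hypothetical guard: rolling back is treated as safe only if the state
    // switch has not happened yet, i.e. the parent is still ACTIVE and no
    // subshard is ACTIVE (so no subshard can have accepted new documents).
    static boolean safeToRollback(ShardState parent, List<ShardState> subShards) {
        if (parent != ShardState.ACTIVE) {
            return false;
        }
        for (ShardState sub : subShards) {
            if (sub == ShardState.ACTIVE) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Failure before the switch: active/construction/construction.
        System.out.println(safeToRollback(
                ShardState.ACTIVE,
                List.of(ShardState.CONSTRUCTION, ShardState.CONSTRUCTION)));

        // Failure after the switch (the case in this issue): inactive/active/active.
        System.out.println(safeToRollback(
                ShardState.INACTIVE,
                List.of(ShardState.ACTIVE, ShardState.ACTIVE)));
    }
}
```

With such a guard, a failure of the late commit on the parent would find the states at inactive/active/active, the check would return false, and the subshards (and any documents they have already accepted) would be left alone.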