Hi there,

We want to use Solr's Collection Distribution, and I have a question about recovering from failures of the distribution scripts. To my understanding:
* If snappuller fails on a slave, we could implement something where the master examines the status messages from all slaves and notifies them to run snapinstaller only if every status is success.
* However, if snapinstaller then fails on a slave, there is no simple way to roll back so that all slaves keep the same old index. Besides, a snapinstaller failure is usually caused by a hardware, network, or Solr problem, and that same problem may prevent a rollback from executing even if one existed.

It seems possible to implement a two-phase-commit-like protocol that provides automatic recovery and keeps all slave indexes consistent at all times. However, first, I don't see a rollback operation for snapinstaller; second, this would definitely complicate the system.

So it looks like all we can do is monitor the logs, alert people whenever a failure occurs, and have them fix the issue and rerun the scripts.

Is that the correct understanding?

Thanks,
-Hui
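
P.S. To make the first point concrete, here is a rough sketch in Python of the all-or-nothing check I have in mind on the master. The host names, the "success" status string, and both function names are my own placeholders, not anything the Solr scripts provide, and how the statuses actually get collected from the slaves is left out.

```python
from typing import Dict, List

# Hypothetical status value; the real scripts' status output may look different.
SUCCESS = "success"


def slaves_to_notify(pull_status: Dict[str, str]) -> List[str]:
    """Return the slaves that should run snapinstaller.

    All-or-nothing: if any slave's pull did not succeed (or no statuses
    were reported at all), nobody is notified and every slave keeps its
    old index.
    """
    if pull_status and all(s == SUCCESS for s in pull_status.values()):
        return sorted(pull_status)
    return []


def failed_slaves(pull_status: Dict[str, str]) -> List[str]:
    """Return the slaves whose pull did not succeed, for alerting."""
    return sorted(host for host, s in pull_status.items() if s != SUCCESS)
```

This only covers the snappuller side. It does nothing for the case where snapinstaller itself fails after the slaves have been notified, which is the part I don't see how to recover from.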