: So it looks like all we can do is monitor the logs and alert people to
: fix the issue and rerun the scripts, etc. whenever failures occur. Is that
: the correct understanding?
I have *never* seen snappuller or snapinstaller fail (except during an initial rollout of Solr when I forgot to set up the necessary ssh keys).

I suppose we could add an option to snapinstaller to support explicitly installing a snapshot by name ... then if you detect that slave Z didn't load the latest snapshot, you could always tell the other slaves to snapinstall whatever older version slave Z is still using -- but frankly that seems a little silly -- not to mention that if you couldn't load the snapshot into Z, odds are Z isn't responding to queries either.

A better course of action might just be to have an automated system which monitors the distribution status info on the master, and takes any slaves that don't update it properly out of your load balancer's rotation (and notifies people to look into it).

-Hoss
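
For what it's worth, the monitoring idea could look roughly like the sketch below. It only covers the decision step: given the name of the newest snapshot on the master and the snapshot each slave last reported installing, it lists the slaves that are behind. The function name, the host names, and the shape of the status dictionary are all made up for illustration; how you actually read the status info from the master and how you pull a host out of your load balancer depend on your setup and are not shown.

```python
def find_stale_slaves(latest_snapshot, slave_status):
    """Return the sorted names of slaves that have not installed latest_snapshot.

    latest_snapshot: name of the newest snapshot on the master (hypothetical input).
    slave_status: dict mapping slave host name to the snapshot name it last
                  reported installing, or None if it never reported (hypothetical shape).
    """
    return sorted(
        slave
        for slave, installed in slave_status.items()
        if installed != latest_snapshot
    )


if __name__ == "__main__":
    # Hypothetical status data; in practice this would be gathered from
    # the distribution status info on the master.
    status = {
        "slaveX": "snapshot.20070816120113",
        "slaveY": "snapshot.20070816120113",
        "slaveZ": "snapshot.20070815120113",
    }
    for slave in find_stale_slaves("snapshot.20070816120113", status):
        # Placeholder action: remove from rotation and notify someone.
        print(f"{slave} is behind; take it out of rotation and alert ops")
```

You would run something like this from cron shortly after each scheduled snapinstall, so that a slave gets a little time to catch up before it is flagged.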