Both are excellent points and I will look into implementing them. In particular, I wonder whether a substantial increase to the numRecordsToKeep param could solve this problem entirely.
Thanks!

> On Oct 27, 2015, at 20:50, Jeff Wartes <jwar...@whitepages.com> wrote:
>
>
> On the face of it, your scenario seems plausible. I can offer two pieces
> of info that may or may not help you:
>
> 1. A write request to Solr will not be acknowledged until an attempt has
> been made to write to all relevant replicas. So, B won’t ever be missing
> updates that were applied to A, unless communication with B was disrupted
> somehow at the time of the update request. You can add a min_rf param to
> your write request, in which case the response will tell you how many
> replicas received the update, but it’s still up to your indexer client to
> decide what to do if that’s less than your replication factor.
>
> See
> https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance
> for more info.
>
> 2. There are two forms of replication. The usual thing is for the leader
> for each shard to write an update to all replicas before acknowledging the
> write itself, as above. If a replica is less than N docs behind the
> leader, the leader can replay those docs to the replica from its
> transaction log. If a replica is more than N docs behind though, it falls
> back to the replication handler recovery mode you mention, and attempts to
> re-sync the whole shard from the leader.
> The default N for this is 100, which is pretty low for a high-update-rate
> index. It can be changed by increasing the size of the transaction log,
> (via numRecordsToKeep) but be aware that a large transaction log size can
> delay node restart.
>
> See
> https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConfig#UpdateHandlersinSolrConfig-TransactionLog
> for more info.
>
>
> Hope some of that helps, I don’t know a way to say
> delete-first-on-recovery.
>
>
>
> On 10/27/15, 5:21 PM, "Brian Scholl" <bsch...@legendary.com> wrote:
>
>> Whoops, in the description of my setup that should say 2 replicas per
>> shard. Every server has a replica.
>>
>>
>>> On Oct 27, 2015, at 20:16, Brian Scholl <bsch...@legendary.com> wrote:
>>>
>>> Hello,
>>>
>>> I am experiencing a failure mode where a replica is unable to recover
>>> and it will try to do so forever. In writing this email I want to make
>>> sure that I haven't missed anything obvious or missed a configurable
>>> option that could help. If something about this looks funny, I would
>>> really like to hear from you.
>>>
>>> Relevant details:
>>> - solr 5.3.1
>>> - java 1.8
>>> - ubuntu linux 14.04 lts
>>> - the cluster is composed of 1 SolrCloud collection with 100 shards
>>> backed by a 3 node zookeeper ensemble
>>> - there are 200 solr servers in the cluster, 1 replica per shard
>>> - a shard replica is larger than 50% of the available disk
>>> - ~40M docs added per day, total indexing time is 8-10 hours spread
>>> over the day
>>> - autoCommit is set to 15s
>>> - softCommit is not defined
>>>
>>> I think I have traced the failure to the following set of events but
>>> would appreciate feedback:
>>>
>>> 1. new documents are being indexed
>>> 2. the leader of a shard, server A, fails for any reason (java crashes,
>>> times out with zookeeper, etc)
>>> 3. zookeeper promotes the other replica of the shard, server B, to the
>>> leader position and indexing resumes
>>> 4. server A comes back online (typically 10s of seconds later) and
>>> reports to zookeeper
>>> 5. zookeeper tells server A that it is no longer the leader and to sync
>>> with server B
>>> 6. server A checks with server B but finds that server B's index
>>> version is different from its own
>>> 7. server A begins replicating a new copy of the index from server B
>>> using the (legacy?) replication handler
>>> 8. the original index on server A was not deleted so it runs out of
>>> disk space mid-replication
>>> 9. server A throws an error, deletes the partially replicated index,
>>> and then tries to replicate again
>>>
>>> At this point I think steps 6 => 9 will loop forever
>>>
>>> If the actual errors from solr.log are useful let me know, not doing
>>> that now for brevity since this email is already pretty long. In a
>>> nutshell and in order, on server A I can find the error that took it
>>> down, the post-recovery instruction from ZK to unregister itself as a
>>> leader, the corrupt index error message, and then the (start - whoops,
>>> out of disk- stop) loop of the replication messages.
>>>
>>> I first want to ask if what I described is possible or did I get lost
>>> somewhere along the way reading the docs? Is there any reason to think
>>> that solr should not do this?
>>>
>>> If my version of events is feasible I have a few other questions:
>>>
>>> 1. What happens to the docs that were indexed on server A but never
>>> replicated to server B before the failure? Assuming that the replica on
>>> server A were to complete the recovery process would those docs appear
>>> in the index or are they gone for good?
>>>
>>> 2. I am guessing that the corrupt replica on server A is not deleted
>>> because it is still viable, if server B had a catastrophic failure you
>>> could pick up the pieces from server A. If so is this a configurable
>>> option somewhere? I'd rather take my chances on server B going down
>>> before replication finishes than be stuck in this state and have to
>>> manually intervene. Besides, I have disaster recovery backups for
>>> exactly this situation.
>>>
>>> 3. Is there anything I can do to prevent this type of failure? It
>>> seems to me that if server B gets even 1 new document as a leader the
>>> shard will enter this state. My only thought right now is to try to
>>> stop sending documents for indexing the instant a leader goes down but
>>> on the surface this solution sounds tough to implement perfectly (and it
>>> would have to be perfect).
>>>
>>> If you got this far thanks for sticking with me.
>>>
>>> Cheers,
>>> Brian
>>>
>>
>
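
For anyone weighing the numRecordsToKeep change discussed in this thread, here is a rough back-of-the-envelope sketch in Python. It only models the behaviour as described above: a replica fewer than N docs behind gets a transaction log replay, anything further behind falls back to a full copy, and a full copy needs room for a second index next to the old one. The function names, the 30 second downtime, and the disk sizes are all made up for illustration, and treating "docs behind" as a plain count is a simplification; Solr's real comparison may well differ.

```python
# Rough model of the recovery behaviour described in this thread.
# Assumptions (not taken from Solr itself): "docs behind" is a plain count,
# a replica exactly N docs behind is treated as too far behind, and the
# downtime and disk figures below are illustrative.

DEFAULT_NUM_RECORDS_TO_KEEP = 100  # the default N quoted in the thread


def docs_missed(docs_per_day, indexing_hours, downtime_seconds):
    """Estimate how many docs a replica misses while it is offline."""
    docs_per_second = docs_per_day / (indexing_hours * 3600)
    return docs_per_second * downtime_seconds


def recovery_mode(docs_behind, num_records_to_keep=DEFAULT_NUM_RECORDS_TO_KEEP):
    """Return the recovery path a lagging replica would take in this model."""
    if docs_behind < 0:
        raise ValueError("docs_behind must be non-negative")
    if docs_behind < num_records_to_keep:
        return "tlog-replay"
    return "full-replication"


def full_copy_fits(index_size, disk_size):
    """True if a second full copy fits while the old index is still on disk."""
    free_space = disk_size - index_size
    return index_size <= free_space


if __name__ == "__main__":
    # ~40M docs/day over 8-10 hours, replica offline for an assumed 30 seconds.
    for hours in (8, 10):
        behind = docs_missed(40_000_000, hours, 30)
        print(hours, "h window:", round(behind), "docs behind ->",
              recovery_mode(behind))
    # An index using 60 of 100 units of disk cannot hold a second copy.
    print("second copy fits:", full_copy_fits(60, 100))
```

With the numbers from the original post, a replica that is offline for tens of seconds ends up tens of thousands of docs behind, so in this model numRecordsToKeep would have to be raised well past that to stay on the replay path, with the restart-time cost mentioned above.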