On Feb 27, 2013, at 2:32 PM, jimtronic <jimtro...@gmail.com> wrote:

> It seems odd that the write should succeed on the leader even though it didn't work on the other nodes.
Currently, a leader does an update locally before sending it in parallel to all replicas. If we can't send an update to a replica, because it crashed or for some other reason, we ask that replica to recover if we can. Either it's gone and will come back and recover, or, oddly, the request failed while it's still in normal operation, in which case we ask it to recover because something must be wrong.

So if a leader can't send to any replicas, he's going to assume they are all screwed (they are, if he can't send to them) and think he is the only live part of the cluster. It might be nice if we had a param that let you say "consider this a fail unless it reaches this many replicas" - but the leader would still have carried out the request locally.

What you need to figure out is why the leader could not talk to the replicas - it's very weird not to see log errors about that! Were the replicas responding to requests?

OOMs are bad for SolrCloud, by the way - a JVM that has hit an OOM is out of control - you really want to use the option that kills the JVM on OOMs.

- Mark
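On a HotSpot JVM, the usual way to get that kill-on-OOM behavior is the -XX:OnOutOfMemoryError hook. Roughly something like the following when starting Solr - the heap size, extra flags, and the start.jar invocation here are just example values for a stock Solr 4.x/Jetty setup, so adjust them for how you actually launch your nodes:

    java -Xmx2g \
         -XX:OnOutOfMemoryError="kill -9 %p" \
         -XX:+HeapDumpOnOutOfMemoryError \
         -jar start.jar

%p is replaced with the process id, so the JVM kills itself immediately instead of limping along after an OutOfMemoryError; the heap dump flag is optional but handy for figuring out what blew up afterwards.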