On Feb 27, 2013, at 2:32 PM, jimtronic <jimtro...@gmail.com> wrote:

> It seems odd that the write should succeed on the leader even though it
> didn't work on the other nodes.

Currently, a leader applies an update locally before sending it in parallel to 
all replicas. If we can't send the update to a replica, because it crashed or 
for some other reason, we ask that replica to recover if we can. Either the 
replica is gone and will recover when it comes back, or, oddly, the request 
failed while the replica is still in normal operation, in which case we ask it 
to recover because something must be wrong.

So if a leader can't send to any of the replicas, he's going to assume they are 
all screwed (they are, if he can't send to them) and think he is the only live 
part of the cluster.
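
Roughly, the leader-side flow I'm describing looks like the sketch below - this 
is not the actual Solr code, and Replica, sendUpdate and requestRecovery are 
made-up names just for illustration:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    // Sketch only - not the real distributed update code.
    public class LeaderUpdateSketch {
        interface Replica {
            boolean sendUpdate(String doc) throws Exception;
            void requestRecovery();
        }

        static void distribute(String doc, List<Replica> replicas)
                throws InterruptedException {
            indexLocally(doc); // the leader applies the update locally first

            ExecutorService exec =
                Executors.newFixedThreadPool(Math.max(1, replicas.size()));
            List<Future<Boolean>> sends = new ArrayList<>();
            for (Replica r : replicas) {
                Callable<Boolean> send = () -> r.sendUpdate(doc); // forward in parallel
                sends.add(exec.submit(send));
            }
            for (int i = 0; i < sends.size(); i++) {
                try {
                    sends.get(i).get();
                } catch (ExecutionException e) {
                    // Couldn't get the update to this replica: it's either gone
                    // and will recover when it comes back, or misbehaving, so
                    // ask it to recover.
                    replicas.get(i).requestRecovery();
                }
            }
            exec.shutdown();
        }

        static void indexLocally(String doc) { /* leader-local indexing elided */ }
    }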

It might be nice if we had a param that let you say: consider this update a 
failure unless it reaches at least this many replicas - but even then, the 
leader will already have carried out the request.

What you need to figure out is why the leader could not talk to the replicas - 
it's very weird not to see log errors about that!

Were the replicas responding to requests?

OOMs are bad for SolrCloud, by the way - a JVM that has hit an 
OutOfMemoryError is out of control - you really want to use the JVM option 
that kills the process on OOM.
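
For example, HotSpot's -XX:OnOutOfMemoryError hook can run a command when an 
OOM is thrown (%p is replaced with the pid; kill -9 is just one option for the 
command):

    java -XX:OnOutOfMemoryError="kill -9 %p" -jar start.jar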



- Mark
