> Currently, a leader does an update locally before sending in parallel to
> all replicas. If we can't send an update to a replica, because it crashed,
> or because of some other reason, we ask that replica to recover if we can.
> In that case, it's either gone and will come back and recover, or oddly,
> the request failed and it's still in normal operations, in which case we
> ask it to recover because something must be wrong.
>
> So if a leader can't send to any replicas, he's going to assume they are
> all screwed (they are if he can't send to them) and think he is the only
> part of the cluster.
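
If I'm reading that right, the flow is roughly the following. This is only a
sketch of how I understand it, in made-up Java -- the Replica interface and
the apply()/requestRecovery() methods are invented for illustration, not the
actual Solr update processor code:

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Invented stand-in for a replica core; not a real Solr class. */
interface Replica {
    /** Forward the update; throws if the replica can't be reached or errors out. */
    void apply(String update) throws Exception;

    /** Ask the replica to go into recovery and re-sync from the leader. */
    void requestRecovery();
}

class LeaderUpdateSketch {
    void distribute(String update, Replica self, List<Replica> replicas) throws Exception {
        // 1. The leader applies the update locally first.
        self.apply(update);

        // 2. Then it forwards the update to every replica in parallel.
        AtomicInteger failures = new AtomicInteger();
        replicas.parallelStream().forEach(replica -> {
            try {
                replica.apply(update);
            } catch (Exception e) {
                // 3. A replica that can't take the update (crashed, or failed
                //    while apparently healthy) is asked to recover.
                replica.requestRecovery();
                failures.incrementAndGet();
            }
        });

        // 4. If no replica could be reached, the leader assumes it is the only
        //    live member of the shard -- and the local update is NOT rolled back.
        if (!replicas.isEmpty() && failures.get() == replicas.size()) {
            System.out.println("leader thinks it is alone; local update stands");
        }
    }
}
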
> It might be nice if we had a param for you to say, consider this a fail
> unless it hits this many replicas - but still the leader is going to have
> carried out the request.

This seems to violate the strong consistency model, doesn't it? If a write
doesn't succeed at a replica, it shouldn't succeed anywhere. Cassandra seems
to have the same problem
(http://www.datastax.com/dev/blog/how-cassandra-deals-with-replica-failure),
except that it returns a timeout error and saves a hint for later. I was
assuming that Solr acted like CONSISTENCY ALL for writes and CONSISTENCY ANY
for reads. If that were the case, I'd want some way to ensure that my nodes
don't get out of sync when an otherwise healthy node can't perform an update,
and that the original write gets rolled back.

> What you need to figure out is why the leader could not talk to the
> replicas - very weird to not see log errors about that!
>
> Were the replicas responding to requests?
>
> OOMs are bad for SolrCloud, by the way - a JVM that has hit an OOM is out
> of control - you really want to use the option that kills the JVM on OOMs.

This does seem to be the biggest problem. The replica was responding
normally. I'll try upping the memory and getting the latest version.

> - Mark
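
P.S. In the meantime, here's how I'm checking whether the replicas have
actually diverged: query each core directly with distrib=false and compare
document counts. A rough SolrJ sketch against the 4.x-era API (HttpSolrServer);
the host and core URLs are placeholders for my own setup:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class ReplicaSyncCheck {
    public static void main(String[] args) throws SolrServerException {
        // Placeholder URLs for the leader and replica cores in my cluster.
        String[] cores = {
            "http://host1:8983/solr/collection1",
            "http://host2:8983/solr/collection1"
        };
        for (String url : cores) {
            HttpSolrServer server = new HttpSolrServer(url);
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(0);               // only the count matters
            q.set("distrib", "false");  // ask this core alone, don't fan out
            long numFound = server.query(q).getResults().getNumFound();
            // Counts are only comparable once both cores have hard committed.
            System.out.println(url + " -> numFound=" + numFound);
            server.shutdown();
        }
    }
}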