Ted: Yes, deleting and re-adding the replica will be fine. Issuing commits from the client when you _also_ have autocommits running that frequently (10 seconds and 1 second are pretty aggressive, BTW) is usually neither recommended nor necessary.
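A minimal SolrJ sketch of that advice: rely on commitWithin (or the hard autocommit in solrconfig.xml) for visibility instead of explicit commits from the indexing app. The base URL and collection name are placeholders, and the builder API assumes a reasonably recent SolrJ.

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class CommitWithinExample {
        public static void main(String[] args) throws Exception {
            // Placeholder URL and collection name.
            try (SolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycollection").build()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-1");

                // Ask Solr to make the doc searchable within 30 seconds rather
                // than issuing an explicit commit; durability is covered by the
                // hard autocommit (openSearcher=false) in solrconfig.xml.
                client.add(doc, 30_000);
                // Note: no client.commit() call here.
            }
        }
    }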
Erick: bq: if one or more replicas are down, updates presented to the leader
still succeed, right? If so, tedsolr is correct that the Solr client app needs
to re-issue update....

Absolutely not the case. When the replicas are down, they're marked as down by
Zookeeper. When they come back up they find the leader through Zookeeper magic
and ask, essentially, "Did I miss any updates?" If the replica did miss any
updates, it gets them from the leader, either through the leader replaying the
updates from its transaction log to the replica or by replicating the entire
index from the leader. Which path is followed is a function of how far behind
the replica is. In the latter case, any updates that come in to the leader
while the replication is happening are buffered and replayed on top of the
index when the full replication finishes.

The net-net here is that you should not have to track whether updates got to
all the replicas or not. One of the major advantages of SolrCloud is that it
removes that worry from the indexing client...

Best,
Erick

On Mon, Apr 25, 2016 at 11:39 AM, David Smith
<dsmiths...@yahoo.com.invalid> wrote:
> Erick,
>
> So that my understanding is correct, let me ask: if one or more replicas
> are down, updates presented to the leader still succeed, right? If so,
> tedsolr is correct that the Solr client app needs to re-issue updates, if
> it wants stronger guarantees on replica consistency than what Solr
> provides.
>
> The “Write Fault Tolerance” section of the Solr Wiki makes what I believe
> is the same point:
>
> "On the client side, if the achieved replication factor is less than the
> acceptable level, then the client application can take additional measures
> to handle the degraded state. For instance, a client application may want
> to keep a log of which update requests were sent while the state of the
> collection was degraded and then resend the updates once the problem has
> been resolved."
>
> https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance
>
> Kind Regards,
>
> David
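The wiki's suggestion quoted above might look roughly like this in SolrJ. The
min_rf request parameter and the rf value in the response header come from
that same fault-tolerance page; the collection name, target factor, and the
degradedLog structure are made-up illustrations.

    import java.util.ArrayDeque;
    import java.util.Deque;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.client.solrj.response.UpdateResponse;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.common.util.NamedList;

    public class RfAwareIndexer {
        private static final int TARGET_RF = 2; // e.g. a shard with two replicas

        // Updates sent while the collection was degraded, kept for later resend.
        private final Deque<SolrInputDocument> degradedLog = new ArrayDeque<>();

        public void index(SolrClient client, SolrInputDocument doc) throws Exception {
            UpdateRequest req = new UpdateRequest();
            req.add(doc);
            // Ask Solr to report the achieved replication factor for this update.
            req.setParam("min_rf", String.valueOf(TARGET_RF));
            UpdateResponse rsp = req.process(client, "mycollection");

            NamedList<Object> header = rsp.getResponseHeader();
            Object rf = header.get("rf");
            if (rf instanceof Number && ((Number) rf).intValue() < TARGET_RF) {
                // Fewer replicas than desired acknowledged the update: log it
                // so it can be resent once the shard is healthy again.
                degradedLog.add(doc);
            }
        }
    }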
> On 4/25/16, 11:57 AM, "Erick Erickson" <erickerick...@gmail.com> wrote:
>
>> bq: I also read that it's up to the client to keep track of updates in
>> case commits don't happen on all the replicas.
>>
>> This is not true. Or if it is, it's a bug.
>>
>> The update cycle is this:
>> 1> updates get to the leader
>> 2> updates are sent to all followers and indexed on the leader as well
>> 3> each replica writes the updates to its local transaction log
>> 4> all the replicas ack back to the leader
>> 5> the leader responds to the client.
>>
>> At this point, all the replicas for the shard have the docs locally and
>> can take over as leader.
>>
>> You may be confusing indexing in batches and having errors with updates
>> getting to replicas. When you send a batch of docs to Solr and one of
>> them fails indexing, some of the rest of the docs may not be indexed.
>> See SOLR-445 for some work on this front.
>>
>> That said, bouncing servers willy-nilly during heavy indexing may be the
>> root cause here, especially if the indexer doesn't know enough to retry
>> when an indexing attempt fails. Have you verified that your indexing
>> program retries in the event of failure?
>>
>> Best,
>> Erick
>>
>> On Mon, Apr 25, 2016 at 6:13 AM, tedsolr <tsm...@sciquest.com> wrote:
>>> I've done a bit of reading - found some other posts with similar
>>> questions. So I gather "Optimizing" a collection is rarely a good idea.
>>> It does not need to be condensed to a single segment. I also read that
>>> it's up to the client to keep track of updates in case commits don't
>>> happen on all the replicas. Solr will commit and return success as long
>>> as one replica gets the update.
>>>
>>> I have a state where the two replicas for one collection are out of
>>> sync. One has some updates that the other does not, and I don't have
>>> log data to tell me what the differences are. This happened during a
>>> maintenance window when the servers got restarted while a large index
>>> job was running. Normally this doesn't cause a problem, but it did last
>>> Thursday.
>>>
>>> What I plan to do is select the replica I believe is incomplete and
>>> delete it, then add a new one. I was just hoping Solr had a solution
>>> for this - maybe using the ZK transaction logs to replay some updates,
>>> or forcing a resync between the replicas.
>>>
>>> I will also implement a fix to prevent Solr from restarting unless one
>>> of its config files has changed. No need to bounce Solr just for kicks.
>>>
>>> --
>>> View this message in context:
>>> http://lucene.472066.n3.nabble.com/Replicas-for-same-shard-not-in-sync-tp4272236p4272602.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
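The plan tedsolr describes, and Ted confirms, maps onto two Collections API
calls: DELETEREPLICA followed by ADDREPLICA. A sketch using the static helpers
in a recent SolrJ; the node URL, collection, shard, and replica (core_node)
names are placeholders, and the real replica name is visible in the admin UI
or in CLUSTERSTATUS output.

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class ReplaceSuspectReplica {
        public static void main(String[] args) throws Exception {
            // Placeholder node URL; any node in the cluster can accept
            // Collections API requests.
            try (SolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr").build()) {
                // Drop the replica believed to be incomplete...
                CollectionAdminRequest
                        .deleteReplica("mycollection", "shard1", "core_node2")
                        .process(client);
                // ...then add a fresh one, which pulls a full copy of the
                // index from the current leader as it comes up.
                CollectionAdminRequest
                        .addReplicaToShard("mycollection", "shard1")
                        .process(client);
            }
        }
    }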
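As for Erick's question about retries, a defensive indexing loop might retry
a failed batch and then fall back to one document at a time to isolate a bad
document (the SOLR-445 situation). A rough sketch; the retry count, backoff,
and error handling are illustrative assumptions.

    import java.util.List;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ResilientBatchIndexer {
        private static final int MAX_RETRIES = 3;

        /** Index a batch with retries; on repeated failure, send docs singly. */
        static void indexBatch(SolrClient client, String collection,
                               List<SolrInputDocument> batch) throws Exception {
            for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
                try {
                    client.add(collection, batch, 30_000); // commitWithin 30s
                    return;
                } catch (Exception e) {
                    if (attempt < MAX_RETRIES) {
                        Thread.sleep(1000L * attempt); // simple backoff, then retry
                        continue;
                    }
                    // The whole batch keeps failing. One bad document can stop
                    // the rest of a batch from indexing (see SOLR-445), so send
                    // the docs individually and surface only the real failures.
                    for (SolrInputDocument doc : batch) {
                        try {
                            client.add(collection, doc, 30_000);
                        } catch (Exception perDoc) {
                            System.err.println("Failed doc "
                                    + doc.getFieldValue("id") + ": " + perDoc);
                        }
                    }
                }
            }
        }
    }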