Thanks, Jeff! A few comments:

>>
>> Although you could probably bounce a node and get your document counts back 
>> in sync (by provoking a check)
>>
 

If the check is a simple doc count, that will not work. We have found that 
replica1 and replica3, although they report the same doc count, don’t contain 
the SAME docs. Each missed at least one update, but for different docs. This 
also means none of our three replicas is complete.
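
For anyone who wants to check the same thing, here is a rough sketch of the comparison: query each replica core directly with distrib=false and diff the id sets. It assumes the unique key field is named "id" and that the collection is small enough to pull in one request (ours is under 100k docs). The fetch_ids and missing_per_replica helpers and any core URLs passed to them are hypothetical names for illustration, not part of Solr.

```python
import json
import urllib.parse
import urllib.request


def fetch_ids(core_url, rows=100000):
    """Return the set of ids held by one replica core.

    distrib=false keeps the query on this core instead of fanning it out.
    Assumes the unique key field is called "id" (an assumption).
    """
    params = urllib.parse.urlencode(
        {"q": "*:*", "fl": "id", "rows": rows, "wt": "json", "distrib": "false"}
    )
    with urllib.request.urlopen(f"{core_url}/select?{params}") as resp:
        body = json.load(resp)
    return {doc["id"] for doc in body["response"]["docs"]}


def missing_per_replica(ids_by_replica):
    """Map each replica name to the ids it lacks but another replica has."""
    all_ids = set().union(*ids_by_replica.values())
    return {name: all_ids - ids for name, ids in ids_by_replica.items()}
```

Feeding missing_per_replica a dict of replica name to fetch_ids(core_url) gives, per replica, the ids it never received. Two replicas with equal counts can still each show a non-empty set here.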

>>
>>it’s interesting that you’re in this situation. It implies to me that at some 
>>point the leader couldn’t write a doc to one of the replicas,
>>

That is our belief as well. We experienced a datacenter-wide network disruption 
of a few seconds, and user complaints started the first workday after that 
event.  

The most interesting log entry during the outage is this:

"1/19/2016, 5:08:07 PM ERROR null DistributedUpdateProcessorRequest says it is 
coming from leader, but we are the leader: 
update.distrib=FROMLEADER&distrib.from=http://dot.dot.dot.dot:8983/solr/blah_blah_shard1_replica3/&wt=javabin&version=2"

>>
>> You might watch the achieved replication factor of your updates and see if 
>> it ever changes
>>

This is a good tip. I’m not sure I like the implication that any failure to 
write to all 3 of our replicas must be retried at the app layer. Is this really 
how SolrCloud applications must be built to survive network partitions without 
data loss?
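
To make the question concrete, this is roughly what I understand the app-layer retry would look like. It is only a sketch: send_update is a hypothetical function we would have to write ourselves, which posts the update with the min_rf request parameter and returns the achieved replication factor that Solr reports back (the "rf" value in the response header, as I read the page Jeff linked). The attempt count and delay are arbitrary.

```python
import time


def update_with_min_rf(send_update, docs, min_rf=3, max_attempts=3, delay_s=1.0):
    """Resend docs until the achieved replication factor reaches min_rf.

    send_update(docs, min_rf) is a hypothetical caller-supplied function that
    posts the update with the min_rf parameter and returns the achieved
    replication factor reported by Solr.
    Returns True on success, False if every attempt fell short.
    """
    for attempt in range(1, max_attempts + 1):
        achieved = send_update(docs, min_rf)
        if achieved >= min_rf:
            return True
        if attempt < max_attempts:
            # Give a briefly partitioned replica a moment before resending.
            time.sleep(delay_s)
    return False
```

Resending should be harmless for full-document adds, since a doc with the same unique key overwrites the earlier copy. A False return would still need to be logged or queued by the application, which is exactly the burden I am asking about.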

Regards,

David


On 1/26/16, 12:20 PM, "Jeff Wartes" <jwar...@whitepages.com> wrote:

>
>My understanding is that the "version" represents the timestamp the searcher 
>was opened, so it doesn’t really offer any assurances about your data.
>
>Although you could probably bounce a node and get your document counts back in 
>sync (by provoking a check), it’s interesting that you’re in this situation. 
>It implies to me that at some point the leader couldn’t write a doc to one of 
>the replicas, but that the replica didn’t consider itself down enough to check 
>itself.
>
>You might watch the achieved replication factor of your updates and see if it 
>ever changes:
>https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance
> (See Achieved Replication Factor/min_rf)
>
>If it does, that might give you clues about how this is happening. Also, it 
>might allow you to work around the issue by trying the write again.
>
>On 1/22/16, 10:52 AM, "David Smith" <dsmiths...@yahoo.com.INVALID> wrote:
>
>>I have a SolrCloud v5.4 collection with 3 replicas that appear to have fallen 
>>permanently out of sync.  Users started to complain that the same search, 
>>executed twice, sometimes returned different result counts.  Sure enough, our 
>>replicas are not identical:
>>
>>>> shard1_replica1:  89867 documents / version 1453479763194
>>>> shard1_replica2:  89866 documents / version 1453479763194
>>>> shard1_replica3:  89867 documents / version 1453479763191
>>
>>I do not think this discrepancy is going to resolve itself.  The Solr Admin 
>>screen reports all 3 replicas as “Current”.  The last modification to this 
>>collection was 2 hours before I captured this information, and our auto 
>>commit time is 60 seconds.  
>>
>>I have a lot of concerns here, but my first question is if anyone else has 
>>had problems with out of sync replicas, and if so, what they have done to 
>>correct this?
>>
>>Kind Regards,
>>
>>David
>>
