On the surface, this appears similar to an earlier thread of mine: "Query results change"
On Tue, Jan 26, 2016 at 4:32 PM, Jeff Wartes <jwar...@whitepages.com> wrote:

> Ah, perhaps you fell into something like this then?
> https://issues.apache.org/jira/browse/SOLR-7844
>
> That says it’s fixed in 5.4, but that would be an example of a split-brain
> type incident, where different documents were accepted by different
> replicas, each of which thought it was the leader. If this is the case, and
> you actually have different data on each replica, I’m not aware of any way
> to fix the problem short of reindexing those documents. Before that, you’ll
> probably need to choose a replica and just force the others to get in sync
> with it. I’d choose the current leader, since that’s slightly easier.
>
> Typically, a leader writes an update to its transaction log, then sends
> the request to all replicas, and when those all finish it acknowledges the
> update. If a replica gets restarted and is fewer than N documents behind,
> the leader will only replay the transaction log. (N is the numRecordsToKeep
> configured in the updateLog section of solrconfig.xml.)
>
> What you want is to provoke the heavy-duty recovery normally invoked when
> a replica has missed more than N docs, which essentially does a checksum
> and file copy of all the raw index files. FetchIndex would probably work,
> but it’s a replication-handler API originally designed for master/slave
> replication, so take care:
> https://wiki.apache.org/solr/SolrReplication#HTTP_API
> Probably a lot easier would be to just delete the replica and re-create it.
> That will also trigger a full file copy of the index from the leader onto
> the new replica.
>
> I think design decisions around Solr generally use CP as a goal. (I
> sometimes wish I could get more AP behavior!) See posts like this:
> http://lucidworks.com/blog/2014/12/10/call-maybe-solrcloud-jepsen-flaky-networks/
> So the fact that you encountered this sounds like a bug to me.
> That said, another general recommendation (of mine) is that you not use
> Solr as your primary data source, so you can rebuild your index from
> scratch if you really need to.
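Both recovery paths can be driven through the HTTP APIs. A rough sketch in
Python, assuming Solr 5.x endpoints; the hosts, collection, core, and
replica ids below are placeholders for values from your own
clusterstate.json. (For reference, numRecordsToKeep defaults to 100 when
unset.)

    import requests

    # Placeholders -- substitute values from your own cluster:
    LEADER_CORE = "http://solr1:8983/solr/blah_blah_shard1_replica1"  # current leader
    BAD_CORE    = "http://solr2:8983/solr/blah_blah_shard1_replica3"  # divergent replica

    # Option 1: fetchindex via the replication handler. Pulls a full copy of
    # the leader's index into the divergent core. This is the master/slave
    # API Jeff warns about, so use it with care on a SolrCloud replica.
    r = requests.get(BAD_CORE + "/replication",
                     params={"command": "fetchindex",
                             "masterUrl": LEADER_CORE + "/replication"})
    print(r.status_code, r.text)

    # Option 2 (usually easier): delete the replica and re-create it with the
    # Collections API; the new replica does a full index copy from the leader.
    # Note DELETEREPLICA takes the core_nodeN id, not the core name.
    admin = "http://solr1:8983/solr/admin/collections"
    requests.get(admin, params={"action": "DELETEREPLICA", "collection": "blah_blah",
                                "shard": "shard1", "replica": "core_node3"})
    requests.get(admin, params={"action": "ADDREPLICA", "collection": "blah_blah",
                                "shard": "shard1", "node": "solr2:8983_solr"})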
> On 1/26/16, 1:10 PM, "David Smith" <dsmiths...@yahoo.com.INVALID> wrote:
>
>> Thanks Jeff! A few comments.
>>
>>> Although you could probably bounce a node and get your document counts
>>> back in sync (by provoking a check)
>>
>> If the check is a simple doc count, that will not work. We have found
>> that replica1 and replica3, although they contain the same doc count,
>> don’t have the SAME docs. They each missed at least one update, but of
>> different docs. This also means none of our three replicas is complete.
>>
>>> it’s interesting that you’re in this situation. It implies to me that at
>>> some point the leader couldn’t write a doc to one of the replicas,
>>
>> That is our belief as well. We experienced a datacenter-wide network
>> disruption of a few seconds, and user complaints started the first
>> workday after that event.
>>
>> The most interesting log entry during the outage is this:
>>
>> "1/19/2016, 5:08:07 PM ERROR null DistributedUpdateProcessor Request says
>> it is coming from leader, but we are the leader:
>> update.distrib=FROMLEADER&distrib.from=http://dot.dot.dot.dot:8983/solr/blah_blah_shard1_replica3/&wt=javabin&version=2"
>>
>>> You might watch the achieved replication factor of your updates and see
>>> if it ever changes
>>
>> This is a good tip. I’m not sure I like the implication that any failure
>> to write all 3 of our replicas must be retried at the app layer. Is this
>> really how SolrCloud applications must be built to survive network
>> partitions without data loss?
>>
>> Regards,
>>
>> David
>>
>> On 1/26/16, 12:20 PM, "Jeff Wartes" <jwar...@whitepages.com> wrote:
>>
>>> My understanding is that the "version" represents the timestamp at which
>>> the searcher was opened, so it doesn’t really offer any assurances about
>>> your data.
>>>
>>> Although you could probably bounce a node and get your document counts
>>> back in sync (by provoking a check), it’s interesting that you’re in
>>> this situation. It implies to me that at some point the leader couldn’t
>>> write a doc to one of the replicas, but that the replica didn’t consider
>>> itself down enough to check itself.
>>>
>>> You might watch the achieved replication factor of your updates and see
>>> if it ever changes:
>>> https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance
>>> (See Achieved Replication Factor/min_rf.)
>>>
>>> If it does, that might give you clues about how this is happening. Also,
>>> it might allow you to work around the issue by trying the write again.
>>>
>>> On 1/22/16, 10:52 AM, "David Smith" <dsmiths...@yahoo.com.INVALID> wrote:
>>>
>>>> I have a SolrCloud v5.4 collection with 3 replicas that appear to have
>>>> fallen permanently out of sync. Users started to complain that the same
>>>> search, executed twice, sometimes returned different result counts.
>>>> Sure enough, our replicas are not identical:
>>>>
>>>>   shard1_replica1: 89867 documents / version 1453479763194
>>>>   shard1_replica2: 89866 documents / version 1453479763194
>>>>   shard1_replica3: 89867 documents / version 1453479763191
>>>>
>>>> I do not think this discrepancy is going to resolve itself. The Solr
>>>> Admin screen reports all 3 replicas as "Current". The last modification
>>>> to this collection was 2 hours before I captured this information, and
>>>> our auto-commit time is 60 seconds.
>>>>
>>>> I have a lot of concerns here, but my first question is whether anyone
>>>> else has had problems with out-of-sync replicas, and if so, what they
>>>> did to correct them?
>>>>
>>>> Kind Regards,
>>>>
>>>> David
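For what it’s worth, the min_rf check is straightforward to wire up at the
client. A minimal sketch in Python, assuming the min_rf/rf parameters
documented on the cwiki page above; the host, collection name, and retry
policy here are placeholders to adapt:

    import json
    import time
    import requests

    UPDATE_URL = "http://solr1:8983/solr/blah_blah/update"  # placeholder host/collection

    def index_with_min_rf(docs, min_rf=3, retries=3):
        """Send an update with min_rf; retry while the achieved
        replication factor comes back below what we asked for."""
        params = {"min_rf": min_rf, "wt": "json"}
        achieved = None
        for attempt in range(retries):
            resp = requests.post(UPDATE_URL, params=params,
                                 data=json.dumps(docs),
                                 headers={"Content-Type": "application/json"})
            resp.raise_for_status()
            # Solr reports the achieved replication factor as "rf" in the
            # response header when min_rf was requested on the update.
            achieved = resp.json().get("responseHeader", {}).get("rf")
            if achieved is not None and achieved >= min_rf:
                return achieved  # every requested replica acknowledged the write
            # Solr does NOT reject the update when rf < min_rf; it only
            # reports the shortfall, so the retry has to live in the app.
            time.sleep(2 ** attempt)  # back off and resend
        return achieved

    index_with_min_rf([{"id": "doc-1"}])

Since updates overwrite by uniqueKey, resending the same document is safe.
That doesn’t make David’s concern go away, though: as things stand, the app
layer is indeed where the retry has to happen.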