On the surface, this appears similar to an earlier thread of mine: "Query results change"
On Tue, Jan 26, 2016 at 4:32 PM, Jeff Wartes <jwar...@whitepages.com> wrote:

> Ah, perhaps you fell into something like this then?
> https://issues.apache.org/jira/browse/SOLR-7844
>
> That says it’s fixed in 5.4, but that would be an example of a split-brain
> type incident, where different documents were accepted by different
> replicas, each of which thought it was the leader. If this is the case, and
> you actually have different data on each replica, I’m not aware of any way
> to fix the problem short of reindexing those documents. Before that, you’ll
> probably need to choose a replica and just force the others to get in sync
> with it. I’d choose the current leader, since that’s slightly easier.
>
> Typically, a leader writes an update to its transaction log, then sends
> the request to all replicas, and when those all finish it acknowledges the
> update. If a replica gets restarted and is fewer than N documents behind,
> the leader will only replay the transaction log. (N is the numRecordsToKeep
> configured in the updateLog section of solrconfig.xml.)
>
> What you want is to provoke the heavy-duty recovery normally invoked when
> a replica has missed more than N docs, which essentially does a checksum
> and file copy of all the raw index files. FetchIndex would probably work,
> but it’s a replication-handler API originally designed for master/slave
> replication, so take care:
> https://wiki.apache.org/solr/SolrReplication#HTTP_API
> Probably a lot easier would be to just delete the replica and re-create it.
> That will also trigger a full file copy of the index from the leader onto
> the new replica.
>
> I think design decisions around Solr generally use CP as a goal. (I
> sometimes wish I could get more AP behavior!) See posts like this:
> http://lucidworks.com/blog/2014/12/10/call-maybe-solrcloud-jepsen-flaky-networks/
> So the fact that you encountered this sounds like a bug to me.
> That said, another general recommendation (of mine) is that you not use
> Solr as your primary data source, so you can rebuild your index from
> scratch if you really need to.
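Both recovery paths can be driven through the HTTP APIs. A rough sketch in
Python, assuming Solr 5.x endpoints; the hosts, collection, core, and
replica ids below are placeholders for values from your own
clusterstate.json. (For reference, numRecordsToKeep defaults to 100 when
unset.)

    import requests

    # Placeholders -- substitute values from your own cluster:
    LEADER_CORE = "http://solr1:8983/solr/blah_blah_shard1_replica1"  # current leader
    BAD_CORE    = "http://solr2:8983/solr/blah_blah_shard1_replica3"  # divergent replica

    # Option 1: fetchindex via the replication handler. Pulls a full copy of
    # the leader's index into the divergent core. This is the master/slave
    # API Jeff warns about, so use it with care on a SolrCloud replica.
    r = requests.get(BAD_CORE + "/replication",
                     params={"command": "fetchindex",
                             "masterUrl": LEADER_CORE + "/replication"})
    print(r.status_code, r.text)

    # Option 2 (usually easier): delete the replica and re-create it with the
    # Collections API; the new replica does a full index copy from the leader.
    # Note DELETEREPLICA takes the core_nodeN id, not the core name.
    admin = "http://solr1:8983/solr/admin/collections"
    requests.get(admin, params={"action": "DELETEREPLICA", "collection": "blah_blah",
                                "shard": "shard1", "replica": "core_node3"})
    requests.get(admin, params={"action": "ADDREPLICA", "collection": "blah_blah",
                                "shard": "shard1", "node": "solr2:8983_solr"})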
> On 1/26/16, 1:10 PM, "David Smith" <dsmiths...@yahoo.com.INVALID> wrote:
>
>> Thanks Jeff! A few comments.
>>
>>> Although you could probably bounce a node and get your document counts
>>> back in sync (by provoking a check)
>>
>> If the check is a simple doc count, that will not work. We have found
>> that replica1 and replica3, although they contain the same doc count,
>> don’t have the SAME docs. They each missed at least one update, but of
>> different docs. This also means none of our three replicas is complete.
>>
>>> it’s interesting that you’re in this situation. It implies to me that at
>>> some point the leader couldn’t write a doc to one of the replicas,
>>
>> That is our belief as well. We experienced a datacenter-wide network
>> disruption of a few seconds, and user complaints started the first
>> workday after that event.
>>
>> The most interesting log entry during the outage is this:
>>
>> "1/19/2016, 5:08:07 PM ERROR null DistributedUpdateProcessor Request says
>> it is coming from leader, but we are the leader:
>> update.distrib=FROMLEADER&distrib.from=http://dot.dot.dot.dot:8983/solr/blah_blah_shard1_replica3/&wt=javabin&version=2"
>>
>>> You might watch the achieved replication factor of your updates and see
>>> if it ever changes
>>
>> This is a good tip. I’m not sure I like the implication that any failure
>> to write all 3 of our replicas must be retried at the app layer. Is this
>> really how SolrCloud applications must be built to survive network
>> partitions without data loss?
>>
>> Regards,
>>
>> David
>>
>> On 1/26/16, 12:20 PM, "Jeff Wartes" <jwar...@whitepages.com> wrote:
>>
>>> My understanding is that the "version" represents the timestamp at which
>>> the searcher was opened, so it doesn’t really offer any assurances about
>>> your data.
>>>
>>> Although you could probably bounce a node and get your document counts
>>> back in sync (by provoking a check), it’s interesting that you’re in
>>> this situation. It implies to me that at some point the leader couldn’t
>>> write a doc to one of the replicas, but that the replica didn’t consider
>>> itself down enough to check itself.
>>>
>>> You might watch the achieved replication factor of your updates and see
>>> if it ever changes:
>>> https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance
>>> (See Achieved Replication Factor/min_rf.)
>>>
>>> If it does, that might give you clues about how this is happening. Also,
>>> it might allow you to work around the issue by trying the write again.
>>>
>>> On 1/22/16, 10:52 AM, "David Smith" <dsmiths...@yahoo.com.INVALID> wrote:
>>>
>>>> I have a SolrCloud v5.4 collection with 3 replicas that appear to have
>>>> fallen permanently out of sync. Users started to complain that the same
>>>> search, executed twice, sometimes returned different result counts.
>>>> Sure enough, our replicas are not identical:
>>>>
>>>>   shard1_replica1: 89867 documents / version 1453479763194
>>>>   shard1_replica2: 89866 documents / version 1453479763194
>>>>   shard1_replica3: 89867 documents / version 1453479763191
>>>>
>>>> I do not think this discrepancy is going to resolve itself. The Solr
>>>> Admin screen reports all 3 replicas as "Current". The last modification
>>>> to this collection was 2 hours before I captured this information, and
>>>> our auto-commit time is 60 seconds.
>>>>
>>>> I have a lot of concerns here, but my first question is whether anyone
>>>> else has had problems with out-of-sync replicas, and if so, what they
>>>> did to correct them?
>>>>
>>>> Kind Regards,
>>>>
>>>> David
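For what it’s worth, the min_rf check is straightforward to wire up at the
client. A minimal sketch in Python, assuming the min_rf/rf parameters
documented on the cwiki page above; the host, collection name, and retry
policy here are placeholders to adapt:

    import json
    import time
    import requests

    UPDATE_URL = "http://solr1:8983/solr/blah_blah/update"  # placeholder host/collection

    def index_with_min_rf(docs, min_rf=3, retries=3):
        """Send an update with min_rf; retry while the achieved
        replication factor comes back below what we asked for."""
        params = {"min_rf": min_rf, "wt": "json"}
        achieved = None
        for attempt in range(retries):
            resp = requests.post(UPDATE_URL, params=params,
                                 data=json.dumps(docs),
                                 headers={"Content-Type": "application/json"})
            resp.raise_for_status()
            # Solr reports the achieved replication factor as "rf" in the
            # response header when min_rf was requested on the update.
            achieved = resp.json().get("responseHeader", {}).get("rf")
            if achieved is not None and achieved >= min_rf:
                return achieved  # every requested replica acknowledged the write
            # Solr does NOT reject the update when rf < min_rf; it only
            # reports the shortfall, so the retry has to live in the app.
            time.sleep(2 ** attempt)  # back off and resend
        return achieved

    index_with_min_rf([{"id": "doc-1"}])

Since updates overwrite by uniqueKey, resending the same document is safe.
That doesn’t make David’s concern go away, though: as things stand, the app
layer is indeed where the retry has to happen.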