Re: Different docs order in different replicas of the same shard

Shawn Heisey Fri, 25 May 2018 08:12:32 -0700

On 5/25/2018 7:28 AM, SOLR4189 wrote:
> I use SOLR-6.5.1 and I want to start to use replicas.
>
> For it I want to understand something:
>
> 1) Can asynchronous forwarding of documents from the leader to the replicas,
> or some other cause, mean that replica A sees update X then Y while replica B
> sees update Y then X?
> If so, a particular document on replica A might sort differently relative
> to a document on replica B when they have the same score (ties fall back to
> the order in which they were stored in the index). Is this an edge case?


I can't say for certain whether updates can be re-ordered between
replicas, though it probably is possible.  Either way, there's
absolutely no guarantee that Lucene document ordering will be identical
between NRT replicas.  NRT is the only replica type that Solr 6.x has,
and is the default type on Solr 7.x.  One replica can have a different
number of deleted documents than another replica, and may not merge
segments in exactly the same way as another replica.

Because deleted documents can affect score calculation, and one replica
may have different deleted documents than another replica, the default
sort order (relevancy ranking) can differ between replicas.
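To illustrate why index statistics matter, here is a rough sketch.  It
assumes the BM25 idf formula used by Lucene's default similarity, and the
document counts are made up for the example.  Deleted documents still
count toward these statistics until their segments are merged away, so two
replicas holding the same live documents can compute different values.

```python
import math

def bm25_idf(doc_freq, doc_count):
    """BM25-style idf: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical statistics for one term on two replicas of the same shard.
# Replica B still carries some deleted documents that replica A has
# already merged away, so its counts are higher.
idf_a = bm25_idf(doc_freq=10, doc_count=1000)
idf_b = bm25_idf(doc_freq=12, doc_count=1050)

print(idf_a, idf_b)
print("same idf on both replicas:", idf_a == idf_b)
```

A different idf means a different score for the same document and query,
which is enough to change relevancy ordering between replicas.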

A workaround to these issues is to always use an explicit field-based
sort.  Deleted documents and the Lucene document order do not affect
that kind of sort.
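
As a minimal sketch of what such a request could look like, the snippet
below builds the query string for a select request with an explicit sort.
The field names `price` and `id` are assumptions for illustration (`id`
standing in for the uniqueKey field as a final tie-breaker); substitute
fields from your own schema.

```python
from urllib.parse import urlencode

def build_select_params(query, sort_fields):
    """Return a URL-encoded parameter string with an explicit field sort.

    sort_fields is a list of (field, direction) pairs, for example
    [("price", "asc"), ("id", "asc")].
    """
    sort = ", ".join(f"{field} {direction}" for field, direction in sort_fields)
    return urlencode({"q": query, "sort": sort})

# Hypothetical fields: sort by price, then by the unique key so that
# ties are always broken the same way on every replica.
params = build_select_params("title:solr", [("price", "asc"), ("id", "asc")])
print(params)
```

Ending the sort with the unique key means no two documents can ever tie,
so the result order does not depend on which replica answers the query.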

> 2) What does "Custom update chain post-processors may never be
> invoked on a recovering replica" mean
> <https://lucene.apache.org/solr/guide/7_2/update-request-processors.html>

The name of the update chain that was originally used during the
indexing is not stored in the transaction log, so when the transaction
log is replayed, the update chain is not called.

> if all my UpdateProcessors are post-processors (i.e. are after
> DistributedUpdateProcessor)? Will all buffered update requests in recovery
> be indexed in replica without my features?

General advice: In most cases, a post-processor is NOT a good idea.

Changes made to the input document by update processors placed *before*
DistributedUpdateProcessor will be recorded in the transaction log, and
will be identical on all replicas.  Because the transaction log DOES
have the results of the processor, and all replicas are guaranteed to be
the same, this is almost always what you want.

Placing an update processor before DistributedUpdateProcessor ensures
that it is only run once for every document.  If it is placed after
DistributedUpdateProcessor, it will execute once for every replica on
every document.  That can be a big problem if the update processor runs
slowly or consumes a lot of memory/CPU resources.

Because post-processors run independently on every replica, they can
result in different data on each replica. For instance, if you use the
UUID processor after DistributedUpdateProcessor, every replica will end
up with a different UUID for the same document.  Similarly, the
timestamp processor can record a different timestamp on every replica
for the same document, because each replica might do its indexing at a
slightly different time.  Timestamps in a Solr index have millisecond
precision, so even a very small delay is enough to produce a different
value.
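
The difference can be shown with a small simulation.  This is not Solr
code, only a sketch of the idea: `add_uuid` is a hypothetical stand-in for
a UUID update processor, and the two functions model running it before or
after the point where the document is distributed to the replicas.

```python
import uuid

def add_uuid(doc):
    """Stand-in for a UUID update processor: copy the doc, add a fresh UUID."""
    return {**doc, "uuid": str(uuid.uuid4())}

def index_pre_distribution(doc, replica_count):
    # The processor runs once; the already-processed document is then
    # forwarded, so every replica stores the same field values.
    processed = add_uuid(doc)
    return [dict(processed) for _ in range(replica_count)]

def index_post_distribution(doc, replica_count):
    # The raw document is forwarded; each replica runs the processor
    # itself, so each one generates its own value.
    return [add_uuid(doc) for _ in range(replica_count)]

pre = index_pre_distribution({"id": "doc1"}, 3)
post = index_post_distribution({"id": "doc1"}, 3)

print("distinct UUIDs, processor before distribution:", len({d["uuid"] for d in pre}))
print("distinct UUIDs, processor after distribution:", len({d["uuid"] for d in post}))
```

With the processor before distribution there is one UUID shared by all
three copies; with it after distribution there are three different ones.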

If you actually do intend to have different data in a field on different
replicas, then you might want a post-processor.  But this requirement is
VERY rare.

Thanks,
Shawn
