I have an application that implements several different searches against a
SolrCloud collection.
We are using Solr 7.2 and Solr 6.1.

The collection b2b-catalog-material is created with the default Near Real
Time (NRT) replicas. The collection has 2 shards each with 2 replicas.

The application launches a search and pages through all of the results up
to a maximum, typically about 1000, and returns them to the caller. It
pages by the standard method of incrementing the start parameter by rows
until we retrieve the maximum we need or have returned all the hits.
Typically we set rows to 200.

If a search matches 2000 results, the app will call solr 10 times to
retrieve 200 results per call. This is configurable.
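For reference, the paging loop looks roughly like this (a minimal sketch in
Python; fetch_page is a placeholder for whatever client call the
application actually makes, returning one page of documents plus the total
hit count):

```python
def page_through(fetch_page, rows=200, max_results=1000):
    """Collect documents by incrementing 'start' by 'rows' each call,
    stopping at max_results or once all hits have been returned."""
    docs = []
    start = 0
    while start < max_results:
        page, num_found = fetch_page(start, rows)  # one Solr call
        docs.extend(page)
        if start + rows >= num_found:
            break  # all hits returned
        start += rows
    return docs[:max_results]
```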

The documents in the collection are product SKUs, but the searchable fields
are mostly product-oriented, and we have between 2 and 500 SKUs per
product. There are about 2,463,442 documents in the collection.

We need the results ordered by relevancy, so the application sorts the
results by score descending, with the unique id ascending as the
tie-breaker.
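In Solr query terms that amounts to something like the following (a sketch
only; "id" stands in for whatever the collection's uniqueKey field is
actually named):

```python
# Hypothetical query parameters; "id" is a stand-in for the uniqueKey field.
params = {
    "q": "some user query",
    "sort": "score desc, id asc",  # relevancy first, unique id as tie-breaker
    "start": 0,
    "rows": 200,
}
```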

We discovered that the application often returns duplicate records from a
search. I believe this is due to the NRT replicas having slightly different
index data because of commit ordering and different numbers of deleted
documents. For many queries we see about 20 to 30 duplicated results. The
results from Solr are sent to another system to retrieve pricing
information. That system is not yet fully populated, so out of 1000 results
we may return only about 350. The problem is that each time we called the
application with the same query, we would see different results: I saw the
count vary from 351 (which was correct) down to 341 and 346. I believe that
for each "duplicate" found by the application, there is also a result that
was missed.

The numFound in the Solr query response does not vary.

This variability in the same query is unacceptable to the business. For
quite a while I thought it was in our code, or in the call to the other
system. However, we now know that it is Solr.

I created a simple test driver that calls solr and pages through the
results. It maintains a set of all the ids that we've encountered and it
will regularly find 20 or more duplicates depending upon the query.
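The bookkeeping in that driver is nothing more than a set of ids already
seen; a minimal sketch (assuming each page is simply a list of document
ids):

```python
def count_duplicates(pages):
    """Given an iterable of result pages (each a list of doc ids),
    return the ids that appear more than once across the paged set."""
    seen = set()
    duplicates = []
    for page in pages:
        for doc_id in page:
            if doc_id in seen:
                duplicates.append(doc_id)  # returned twice by Solr
            else:
                seen.add(doc_id)
    return duplicates
```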

Some observations:
The unique id really is unique; it's used in other systems for this data.

If we do an optimize on the collection, the duplicates won't show up until
the next data load

I created a second collection that used the TLOG replica type, and we don't
see the problem even with repeated data loads.
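For anyone who wants to reproduce that, a TLOG-only collection can be
requested through the Collections API by setting nrtReplicas to 0. A sketch
of building that request URL (host, collection name, and shard/replica
counts here are placeholders, not our actual values):

```python
from urllib.parse import urlencode

def create_collection_url(base, name, num_shards=2, tlog_replicas=2):
    """Build a Collections API CREATE request asking for TLOG replicas
    only (nrtReplicas=0)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "nrtReplicas": 0,
        "tlogReplicas": tlog_replicas,
    }
    return f"{base}/admin/collections?{urlencode(params)}"
```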


The data in the collection is kept up to date by an ETL process that
completely reindexes the data once a week. That is how it will work in
production anyway; we reload it more frequently now because we're testing
the app.

My boss has lost all confidence in SolrCloud. It seems that it cannot find
the same data in subsequent searches. Returning consistent results from a
search is job #1, and SolrCloud is failing at that.

Using TLOG replicas appears to address the issue; it seems you cannot trust
NRT replicas to return consistent results.

The scores for many searches are fairly flat with not a lot of variability
in them, which means that a small difference in a score can change the
order of results.

We found that upgrading our production servers to 7.2 and using TLOG
replicas worked. The alternative of optimizing after each load, while a
hack, does seem to address the problem too. However, determining when to
optimize would be difficult to automate, since we use CDCR to replicate the
data to a cloud environment and it's not easy to determine when the remote
collections are fully loaded.

The only other thing I can think of is tweaking the Lucene merge policy to
better remove deleted documents from the index.
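If anyone experiments with that, the knob lives in solrconfig.xml via the
merge policy factory. Something along these lines (a sketch only:
reclaimDeletesWeight applies to the TieredMergePolicy in the 7.2 era, and
later releases replace it with different settings, so check the docs for
your exact version):

```xml
<!-- Sketch: weight merges that reclaim deleted docs more heavily.
     Verify the setting name against your Solr/Lucene version. -->
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <double name="reclaimDeletesWeight">4.0</double>
</mergePolicyFactory>
```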

Have others encountered this kind of inconsistency in solrcloud? I cannot
believe that we're the first to have encountered it.

How have you addressed it?

We have settled on using TLOG replicas as they provide consistent results
and don't return duplicate hits, which also means that there are no missing
hits.

Unless you need real time indexing, NRT replicas should be avoided in favor
of TLOG replicas or a mix of TLOG and PULL replicas.

I wrote a test program and verified that we actually have this issue with
all of our collections. We hadn't noticed it before because most of the
time the missing/duplicate results were 5 to 10 pages into the result set.
