I have an application that implements several different searches against a SolrCloud collection. We are using Solr 7.2 and Solr 6.1.
The collection, b2b-catalog-material, was created with the default Near Real Time (NRT) replicas and has 2 shards, each with 2 replicas. The application runs a search and pages through the results up to a configurable maximum, typically about 1000, and returns them to the caller. It pages the standard way, incrementing the start parameter by rows until we either reach the maximum or have returned all of the hits. We typically set rows to 200, so if a search matches 2000 results, the app calls Solr 10 times, retrieving 200 results per call.

The documents in the collection are product SKUs, but the searchable fields are mostly product oriented; we have between 2 and 500 SKUs per product, and about 2,463,442 documents in the collection. We need the results ordered by relevancy, so the application sorts by score descending, with the unique id ascending as the tie breaker.

We discovered that the application often returns duplicate records from a search. I believe this is because the NRT replicas have slightly different index data, due to commit ordering and different numbers of deleted documents. For many queries we see about 20 to 30 duplicated results.

The results from Solr are sent to another system to retrieve pricing information. That system is not yet fully populated, so out of 1000 results we may return only 350 or so. The problem is that each time we called the application with the same query, we would see different counts: I saw it vary between 351 (which was correct), 341, and 346. I believe that for each "duplicate" found by the application, there is also a result that was missed. Notably, numFound in the Solr query response does not vary.

This variability on the same query is unacceptable to the business. For quite a while I thought the problem was in our code or in the call to the other system, but we now know it is Solr. I created a simple test driver that calls Solr and pages through the results.
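The paging loop in the test driver is essentially the following (a minimal Python sketch; `search` is a hypothetical stand-in for the real Solr call, stubbed here with canned pages so the duplicate-detection logic is self-contained and shows how a shifted page boundary both repeats one id and drops another):

```python
def page_all(search, rows=200, max_results=1000):
    """Page through results with start/rows, tracking duplicate ids.

    search(start, rows) returns a list of document ids for that page
    (a stand-in for a real Solr query sorted by score desc, id asc).
    """
    seen = set()
    duplicates = []
    start = 0
    while start < max_results:
        page = search(start, rows)
        if not page:
            break
        for doc_id in page:
            if doc_id in seen:
                duplicates.append(doc_id)  # same id returned on two pages
            else:
                seen.add(doc_id)
        start += rows
    return seen, duplicates

# Simulate two replicas whose indexes disagree slightly: the page
# boundary shifts, so "sku-199" appears on both page 1 and page 2
# while "sku-200" is never returned at all.
pages = [
    ["sku-%03d" % i for i in range(0, 200)],                  # page 1
    ["sku-199"] + ["sku-%03d" % i for i in range(201, 400)],  # page 2
]

def fake_search(start, rows):
    index = start // rows
    return pages[index] if index < len(pages) else []

seen, dups = page_all(fake_search, rows=200, max_results=400)
print(len(seen), dups)  # 399 ['sku-199']  -- and "sku-200" is missing
```

Run against the real collection, the same loop regularly reports 20 or more duplicates, and each duplicate implies a corresponding missed hit, exactly matching the varying counts we saw downstream.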
It maintains a set of all the ids we've encountered, and it will regularly find 20 or more duplicates, depending on the query.

Some observations:
- The unique id really is unique; it is used in other systems for this data.
- If we optimize the collection, the duplicates don't show up until the next data load.
- I created a second collection using the TLOG replica type, and we don't see the problem there, even with repeated data loads.

The data in the collection is kept up to date by an ETL process that completely reindexes it once a week. That is how it will work in production anyway; we are currently reloading more frequently while we test the app.

My boss has lost all confidence in SolrCloud. It seems that it cannot find the same data in subsequent searches, and returning consistent results from a search is job #1; SolrCloud is failing at that. Using TLOG replicas appears to address the issue; it seems you cannot trust NRT replicas to return consistent results. The scores for many searches are fairly flat, without a lot of variability, which means a small difference in a score can change the order of results.

We found that upgrading our production servers to 7.2 and using TLOG replicas worked. The alternative of optimizing after each load, while a hack, does seem to address the problem too, but determining when to optimize would be difficult to automate: we use CDCR to replicate the data to a cloud environment, and it is not easy to tell when the remote collections are fully loaded. The only other thing I can think of is tweaking the Lucene merge policy to remove deleted documents from the index more aggressively.

Have others encountered this kind of inconsistency in SolrCloud? I cannot believe we are the first to have hit it. How have you addressed it? We have settled on TLOG replicas, as they provide consistent results and don't return duplicate hits, which also means there are no missing hits.
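For anyone wanting to try the TLOG route: the replica type is fixed at collection-creation time, so switching means creating a new collection. A minimal sketch of the Collections API CREATE request for an all-TLOG layout matching ours (2 shards, 2 replicas per shard); this only builds the URL rather than sending it, and the host, collection name, and configset name are examples, not our actual values:

```python
from urllib.parse import urlencode

# Collections API CREATE parameters: 2 shards, 2 TLOG replicas per
# shard, and explicitly zero NRT replicas.
params = {
    "action": "CREATE",
    "name": "b2b-catalog-material-tlog",              # example name
    "numShards": 2,
    "nrtReplicas": 0,
    "tlogReplicas": 2,
    "collection.configName": "b2b-catalog-material",  # assumed configset
}
url = "http://localhost:8983/solr/admin/collections?" + urlencode(params)
print(url)
```

After creating the collection, reindex into it (or use an alias to swap it in) rather than trying to convert the existing NRT collection in place.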
Unless you need real-time indexing, NRT replicas should be avoided in favor of TLOG replicas or a mix of TLOG and PULL replicas. I wrote a test program and verified that we actually have this issue with all of our collections. We hadn't noticed it before because most of the time the missing/duplicate results were 5 to 10 pages into the result set.