When duplicate documents (same uniqueID) exist across the shards, the query results can be non-deterministic. This is a known issue.
The consequence shows up when we display search results on our UI page with pagination: if the user clicks 'last page', it can display an empty page, because the total doc count returned by a query with dups is not accurate (it includes the dups).

Is there a known workaround for this problem? We tried the following two approaches, but each of them has a problem:

1) Use a query like:

curl -d "q=*:*&fl=message_id&rows=199999999&start=19999999" http://[hostname]:8080/mywebapp/shards/[coreid]/select?

Since I am using a very large number for 'rows', it returns the accurate doc count, but the query takes about 20 seconds for an average customer with a little over 1 million rows returned, so the performance is not acceptable.

2) Use a facet query:

curl -d "q=*:*&fl=message_id&facet=true&facet.mincount=2&rows=0&facet.field=message_id&indent=on" http://[hostname]:8080/[mywebapp]/shards/[coreid]/select?

Our tests show that this does not always return accurate doc counts.

Any suggestions on the best workaround to get an accurate doc count from a sharded query with dups, one that is also efficient on a large data set?

Thanks,
Jie
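
For reference, the arithmetic behind approach 2 can be sketched as follows. This is only an illustration, not a confirmed fix: it assumes the facet query is sent with wt=json (so facet values arrive in Solr's default flat [value, count, value, count, ...] layout) and with facet.limit=-1 so that every duplicated message_id is listed, since the default facet.limit of 100 would cut the list short. The function name deduplicated_count is hypothetical, and the sketch does not check whether the per-value counts coming back from the shards are themselves exact.

```python
def deduplicated_count(response, field="message_id"):
    """Estimate the number of unique documents from a parsed Solr JSON response.

    Assumptions (not confirmed by the original post):
      - the request used wt=json, facet=true, facet.mincount=2, facet.limit=-1
      - facet values are in the flat [value, count, value, count, ...] layout
    """
    num_found = response["response"]["numFound"]
    flat = response["facet_counts"]["facet_fields"][field]

    # Every second element, starting at index 1, is a count.
    counts = flat[1::2]

    # A value seen n times contributes n - 1 surplus documents to numFound.
    surplus = sum(count - 1 for count in counts if count > 1)
    return num_found - surplus


# Hand-built example response: 10 hits, "a" appears twice, "b" three times.
example = {
    "response": {"numFound": 10, "docs": []},
    "facet_counts": {"facet_fields": {"message_id": ["a", 2, "b", 3]}},
}
print(deduplicated_count(example))
```

The corrected total could then be used to compute the start offset of the last page instead of the raw numFound.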