When we have duplicated documents (same uniqueID) across the shards, the query
results can be non-deterministic. This is a known issue.

The consequence shows up when we display the search results on our UI page
with pagination: if the user clicks 'last page', it can display an empty page,
because the total doc count returned by the query is not accurate (it
apparently includes the dups).
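
To illustrate the symptom, here is a minimal sketch of the paging arithmetic.
The numbers and the helper name last_page_start are made up for illustration,
and it assumes the reported total counts every copy of a duplicated document:

```python
import math

def last_page_start(reported_total, page_size):
    """Offset (the Solr 'start' value) of the last page, computed from the reported total."""
    if reported_total <= 0:
        return 0
    return (math.ceil(reported_total / page_size) - 1) * page_size

# Hypothetical example: 95 unique docs, 10 of them also present on a second shard.
unique_docs = 95
reported_total = 105   # assumed: every copy is counted
page_size = 10

start = last_page_start(reported_total, page_size)
# Only the unique docs can actually be returned, so nothing is left at this offset.
docs_on_page = max(0, min(page_size, unique_docs - start))
print(start, docs_on_page)
```

With these numbers the UI asks for start=100, but only 95 docs can be
returned after the dups are merged, so the last page comes back empty.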

Is there a known workaround for this problem?

We tried the following 2 approaches, but each of them has a problem:
1) Use a query like:
curl -d "q=*:*&fl=message_id&rows=199999999&start=19999999"
http://[hostname]:8080/mywebapp/shards/[coreid]/select?
Because 'rows' is set to a very large number, this returns an accurate doc
count, but the query takes about 20 seconds for an average customer with a
little over 1 million rows returned, so the performance is not acceptable.

2) Use a facet query:
curl -d
"q=*:*&fl=message_id&facet=true&facet.mincount=2&rows=0&facet.field=message_id&indent=on"
http://[hostname]:8080/[mywebapp]/shards/[coreid]/select?
Our tests show that this does not always return an accurate doc count.
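
For reference, this is roughly how a corrected count could be derived from the
facet response in approach 2. It is only a sketch under assumptions: the
response is requested with wt=json in Solr's default flat facet layout
(value, count, value, count, ...), the function name deduplicated_count is
hypothetical, and the facet list is assumed to be complete. Solr's facet.limit
defaults to 100, so the list may be truncated unless it is raised or set to -1.

```python
def deduplicated_count(response):
    """Subtract the extra copies of duplicated IDs from the reported numFound.

    Assumes `response` is the parsed JSON of a query with rows=0,
    facet.field=message_id and facet.mincount=2.
    """
    num_found = response["response"]["numFound"]
    flat = response["facet_counts"]["facet_fields"]["message_id"]
    counts = flat[1::2]  # every second entry is a count
    # An ID seen n times contributes n - 1 extra copies.
    extra_copies = sum(c - 1 for c in counts if c > 1)
    return num_found - extra_copies

# Hypothetical response: 105 docs reported, two duplicated IDs.
sample = {
    "response": {"numFound": 105, "docs": []},
    "facet_counts": {"facet_fields": {"message_id": ["id-1", 2, "id-2", 3]}},
}
print(deduplicated_count(sample))
```

In this made-up sample the two duplicated IDs account for 3 extra copies, so
the corrected count would be 102.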

Any suggestions on the best workaround for getting an accurate doc count from
a sharded query with dups, one that is also efficient on large data sets?

thanks
Jie



