On 5/19/2014 2:05 PM, Bryan Bende wrote:
> Using Solr 4.6.1 and in my schema I have a date field storing the time a
> document was added to Solr.
>
> I have a utility program which:
> - queries for all of the documents in the previous day sorted by create date
> - pages through the results keeping track of the unique document ids
> - compares the total number of unique doc ids to the numFound to see if
> they match
>
> I've noticed that if I use a page size larger than the number of documents
> for the given day (aka get everything in one query), then everything works
> as expected (results sorted correctly, unique doc ids size == numFound).
>
> However, when I use a smaller page size, say 10 rows per page, I randomly
> see cases where the last document of a page will be duplicated as the first
> document of the next page, even though the "start" and "rows" parameters
> increased correctly. So I might see something like numFound=100 but unique
> doc ids is 97, and then I see three occurrences where the last doc id on a
> page was also the first on the next page.
This *sounds* like a sharded index that has the same uniqueKey value in
more than one shard. In that situation Solr behaves in a way that looks
completely unpredictable. Solr cannot deal with this problem without
consuming large amounts of real time, CPU time, and RAM, so the only
thing it does is remove duplicates from the results it actually returns,
which is how the discrepancies occur.

If you are absolutely sure that you are not running into the duplicate
document problem I described, then I am not sure what's going on. It
might be related to the sort: if several documents have the same value
in the sort field, their relative order may not be the same from one
request to the next. If that's what is happening, adding a second sort
clause on your uniqueKey field might be a solution.

Thanks,
Shawn
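
To illustrate the tie-break suggestion, here is a toy model in Python. It is not Solr code and does not claim to show how Solr orders tied documents internally. It only assumes that documents with an identical sort value can come back in a different relative order on each request. The field names `create_date` and `id`, and the helpers `fetch_page` and `page_through`, are made up for the example. In a real request the equivalent change would be a sort parameter along the lines of `sort=create_date asc, id asc`, with your own field names substituted.

```python
import random


def fetch_page(docs, start, rows, sort_key, rng):
    """Simulate one independent request.

    Shuffling before sorting models the assumption that documents with
    equal sort values have no guaranteed relative order between requests.
    """
    shuffled = docs[:]
    rng.shuffle(shuffled)
    ordered = sorted(shuffled, key=sort_key)
    return ordered[start:start + rows]


def page_through(docs, rows, sort_key, seed=0):
    """Page through every document and return the ids in the order seen."""
    rng = random.Random(seed)
    seen_ids = []
    for start in range(0, len(docs), rows):
        page = fetch_page(docs, start, rows, sort_key, rng)
        seen_ids.extend(doc["id"] for doc in page)
    return seen_ids


# 100 hypothetical documents; every group of four shares one timestamp,
# so some page boundaries (rows=10) fall in the middle of a tied group.
docs = [
    {"id": f"doc{i:03d}", "create_date": f"2014-05-18T00:{i // 4:02d}:00Z"}
    for i in range(100)
]

# Sort on the date alone: ties are unordered, so a document can show up
# on two pages while another one is never returned.
date_only = page_through(docs, 10, lambda d: d["create_date"])

# Sort on the date, then on the unique id: the order is total, so each
# document appears on exactly one page.
with_tiebreak = page_through(docs, 10, lambda d: (d["create_date"], d["id"]))

print("date only     :", len(set(date_only)), "unique ids of", len(docs))
print("with tie-break:", len(set(with_tiebreak)), "unique ids of", len(docs))
```

With the date-only key the unique count will typically fall short of the total, because a tied document can land on both sides of a page boundary. With the id as a second key the two numbers always match.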