On 2/11/2013 12:09 PM, devb wrote:
We are running a six-node SolrCloud cluster with 3 shards and 3 replicas. The
version of Solr is 4.0.0.2012.08.06.22.50.47. We use the Python PySolr
client to interact with Solr. Documents that we add to Solr have a unique id,
so there can never be duplicates.
Our use case is to query the index for a given search term and pull all
documents that match the query. Usually our query hits over 40K documents.
While we iterate through all 40K+ documents, after a few iterations we see
the same document ids repeated over and over, and at the end some 20-33% of
the records are duplicates.
In the code snippet below, after some iterations we see a difference between
the lengths of idslist and idsset. Any insight into how to troubleshoot this
issue is greatly appreciated.
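Roughly, the loop looks something like the following (the host, search term,
and page size are simplified here, and the unique id field is assumed to be
named "id"):

import pysolr

# Connection URL copied from the admin UI (note the "#" fragment)
solr = pysolr.Solr('http://host:8983/solr/#/collection1', timeout=60)

idslist = []
start, rows = 0, 1000
while True:
    results = solr.search('searchterm', start=start, rows=rows, fl='id')
    if not results.docs:
        break
    idslist.extend(doc['id'] for doc in results.docs)
    start += rows

idsset = set(idslist)
# After a few pages the two lengths start to diverge
print("%d ids fetched, %d unique" % (len(idslist), len(idsset)))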
For discussion purposes, let's first assume that there are no bugs in
Solr. I don't think we can truly make that assumption, of course.
General note 1: The Solr URL in your code has a # in it. URLs with # in
them are Admin UI URLs. If that's working, I'm amazed... I would take that
part of the URL out so that you are pointing at:
http://host:port/solr/collection1
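With pysolr, that would look something like this (host and port are
placeholders):

import pysolr

# Point the client at the collection itself, not at the admin UI URL
solr = pysolr.Solr('http://host:8983/solr/collection1', timeout=60)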
General note 2: Paging through that many results with a distributed
query (known as deep paging) is SLOW.
http://solr.pl/en/2011/07/18/deep-paging-problem/
The first thing I'd do is ask Solr to sort your results. I can see from
some Google searches that pysolr has sort capability. Once you pick the
sort field, I'd probably sort ascending rather than descending. The
default sort is by relevance score.
The next thing to check is whether you are updating your index while you
are attempting to pull those 40,000 documents. If you are, that alone could
explain what you are seeing. If you are only adding documents when you
update, then you may be able to choose a sort parameter that puts new
documents at the end of the results (see the example below), so pagination
won't get messed up. If you are deleting documents, that won't work; you'll
have to stop your index updates while you pull that many results.
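On that sort parameter: if your schema happens to have an indexed timestamp
field that is populated at index time (I can't tell from here whether yours
does), an ascending sort on it would keep newly added documents at the end
of the result set:

# Assumes an indexed 'timestamp' field with default="NOW" in the schema
results = solr.search('searchterm', sort='timestamp asc', start=start, rows=rows)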
After all that, if the problem persists and you are absolutely sure that
you don't have duplicate document X on two different shards, then you
might be running into a bug.
Thanks,
Shawn