On 2/11/2013 12:09 PM, devb wrote:
We are running a six-node SolrCloud cluster with 3 shards and 3 replicas. The
version of Solr is 4.0.0.2012.08.06.22.50.47. We use the Python PySolr
client to interact with Solr. Documents that we add to Solr have a unique id,
so there can never be duplicates.
Our use case is to query the index for a given search term and pull all
documents that match the query. Usually our query hits over 40K documents.
While we iterate through all 40K+ documents, after a few iterations we see
the same document ids repeated over and over, and at the end some 20-33% of
the records are duplicates.
In the code snippet below, after some iterations we see a difference between
the lengths of idslist and idsset. Any insight into how to troubleshoot this
issue is greatly appreciated.
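Roughly, the loop looks something like the following (the host, search term,
and page size are simplified here, and the unique id field is assumed to be
named "id"):

import pysolr

# Connection URL copied from the admin UI (note the "#" fragment)
solr = pysolr.Solr('http://host:8983/solr/#/collection1', timeout=60)

idslist = []
start, rows = 0, 1000
while True:
    results = solr.search('searchterm', start=start, rows=rows, fl='id')
    if not results.docs:
        break
    idslist.extend(doc['id'] for doc in results.docs)
    start += rows

idsset = set(idslist)
# After a few pages the two lengths start to diverge
print("%d ids fetched, %d unique" % (len(idslist), len(idsset)))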
For discussion purposes, let's first assume that there are no bugs in
Solr. I don't think we can truly make that assumption, of course.
General note 1: The Solr URL in your code has a # in it. URLs with # in
them are Admin UI URLs. If that's working, I'm amazed... I would take that
part of the URL out so that you are pointing at:
http://host:port/solr/collection1
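With pysolr, that would look something like this (host and port are
placeholders):

import pysolr

# Point the client at the collection itself, not at the admin UI URL
solr = pysolr.Solr('http://host:8983/solr/collection1', timeout=60)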
General note 2: Paging through that many results with a distributed
query (known as deep paging) is SLOW.
http://solr.pl/en/2011/07/18/deep-paging-problem/
The first thing I'd do is ask Solr to sort your results. I can see from
some Google searches that pysolr has sort capability. Once you pick the
sort field, I'd probably sort ascending rather than descending. The
default sort is by relevance score.
The next thing to check is whether you are updating your index while you
are attempting to pull those 40,000 documents. If you are, that alone could
explain what you are seeing. If you are only adding documents when you
update, then you may be able to choose a sort parameter that puts new
documents at the end of the results (see the example below), so pagination
won't get messed up. If you are deleting documents, that won't work; you'll
have to stop your index updates while you pull that many results.
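On that sort parameter: if your schema happens to have an indexed timestamp
field that is populated at index time (I can't tell from here whether yours
does), an ascending sort on it would keep newly added documents at the end
of the result set:

# Assumes an indexed 'timestamp' field with default="NOW" in the schema
results = solr.search('searchterm', sort='timestamp asc', start=start, rows=rows)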
After all that, if the problem persists and you are absolutely sure that
you don't have duplicate document X on two different shards, then you
might be running into a bug.
Thanks,
Shawn