: Then I remembered we don't currently allow deep paging in our search 
: indexes, as performance declines the deeper you go.  Is this still 
: the case?

Coincidentally, I'm working on a new cursor-based API that will make this 
much more feasible as we speak...

https://issues.apache.org/jira/browse/SOLR-5463

I did some simple perf testing of the strawman approach and posted the 
results last week...

http://searchhub.org/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/

...current iterations on the patch are aimed at eliminating the 
strawman code to improve performance even more, and at beefing up the 
test cases.

: If so, is there another approach to make all the data in a collection 
: easily available for retrieval?  The only thing I can think of is to 
        ...
: Then I was thinking we could have a field with an incrementing numeric 
: value which could be used to perform range queries as a substitute for 
: paging through everything.  I.e. queries like 'IncrementalField:[1 TO 
: 100]' 'IncrementalField:[101 TO 200]' but this would be difficult to 
: maintain as we update the index unless we reindex the entire collection 
: every time we update any docs at all.

As I mentioned in the blog above, as long as you have a uniqueKey field 
that supports range queries, bulk exporting all documents is fairly 
trivial: sort on your uniqueKey field and use an fq that also filters on 
your uniqueKey field, modifying the fq on each request so that the lower 
bound matches the highest ID you got on the previous "page".  
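For example, here's a minimal SolrJ sketch of that loop (the "id" field 
name, the URL, and the page size are placeholders for whatever your setup 
uses; it assumes "id" is your uniqueKey and is a simple string field):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServerException;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.util.ClientUtils;
  import org.apache.solr.common.SolrDocument;
  import org.apache.solr.common.SolrDocumentList;

  public class BulkExport {
    public static void main(String[] args) throws SolrServerException {
      HttpSolrServer server =
        new HttpSolrServer("http://localhost:8983/solr/collection1");
      String lastId = null;  // highest uniqueKey seen on the previous page
      while (true) {
        SolrQuery q = new SolrQuery("*:*");
        q.setSort("id", SolrQuery.ORDER.asc);  // sort on the uniqueKey
        q.setRows(1000);
        if (lastId != null) {
          // exclusive lower bound: only ids *after* the last one we saw
          q.setFilterQueries(
            "id:{" + ClientUtils.escapeQueryChars(lastId) + " TO *]");
        }
        SolrDocumentList page = server.query(q).getResults();
        if (page.isEmpty()) break;  // no more documents
        for (SolrDocument doc : page) {
          lastId = doc.getFieldValue("id").toString();
          // ... process doc ...
        }
      }
    }
  }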

This approach works really well in simple cases where you want to "fetch 
all" documents matching a query and then process/sort them by some other 
criteria on the client -- but it's not viable if it's important to you 
that the documents come back from Solr in score order before your client 
gets them, because you want to "stop fetching" once some criteria is met 
in your client.  Example: you have billions of documents matching a 
query, you want to fetch all of them sorted by score desc and crunch them 
on your client to compute some stats, and once your client-side stat 
crunching tells you that you have enough results (which might be after 
the 1000th result, or might be after the millionth result) you want to 
stop.

SOLR-5463 will help even in that latter case.  The bulk of the patch 
should be easy to use in the next day or so (having other people try it 
out and test it in their applications would be *very* helpful), and it 
will hopefully show up in Solr 4.7.
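If you want to experiment with the patch, here's a rough sketch of what 
the client-side loop looks like (the "cursorMark" request param and 
"nextCursorMark" response key are the names used in the blog post above; 
details may still change before this is committed):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServerException;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.SolrDocument;

  public class CursorCrunch {
    public static void main(String[] args) throws SolrServerException {
      HttpSolrServer server =
        new HttpSolrServer("http://localhost:8983/solr/collection1");
      String cursor = "*";  // "*" means "start a new cursor at the top"
      boolean done = false;
      while (!done) {
        SolrQuery q = new SolrQuery("*:*");  // your real query here
        // a cursor needs a deterministic sort ending on the uniqueKey
        q.set("sort", "score desc, id asc");
        q.setRows(1000);
        q.set("cursorMark", cursor);
        QueryResponse rsp = server.query(q);
        for (SolrDocument doc : rsp.getResults()) {
          // ... crunch stats; set done = true once you have enough ...
        }
        String next = (String) rsp.getResponse().get("nextCursorMark");
        if (cursor.equals(next)) break;  // cursor didn't move: exhausted
        cursor = next;
      }
    }
  }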

-Hoss
http://www.lucidworks.com/
