You can also implement your own cursor easily enough if you have a unique sort key (not the relevance score). Say you can sort by id: select batch 1 (50k docs, say) and record the last (maximum) id in the batch. For the next batch, restrict the query to id > last_id and fetch the first 50k docs (don't use start= for paging). This scales much better when scanning a large result set: the time per batch stays roughly constant across the whole set instead of increasing as you page deeper.
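
A minimal Python sketch of that loop, in case it helps. The unique key field name "id", the use of a filter query with an exclusive lower bound, and the fetch callable (which would issue the HTTP request to your /select handler and return the list of docs) are all assumptions for illustration, not something taken from your setup. Ids containing spaces or query syntax characters would also need escaping, which is left out here.

```python
from typing import Callable, Dict, Iterator, List, Optional

Doc = Dict[str, object]


def build_params(query: str, last_id: Optional[str], rows: int) -> Dict[str, str]:
    """Build select parameters for one batch. Note: no start= offset is ever set."""
    params = {"q": query, "sort": "id asc", "rows": str(rows), "wt": "json"}
    if last_id is not None:
        # "{" makes the lower bound exclusive: only ids strictly greater than last_id.
        params["fq"] = "id:{%s TO *]" % last_id
    return params


def scan_by_id(
    fetch: Callable[[Dict[str, str]], List[Doc]], query: str, rows: int
) -> Iterator[Doc]:
    """Yield every matching doc, one batch at a time, ordered by id.

    `fetch` is a hypothetical callable that sends the parameters to Solr
    and returns the docs of the response, already sorted by id.
    """
    last_id: Optional[str] = None
    while True:
        docs = fetch(build_params(query, last_id, rows))
        if not docs:
            break
        yield from docs
        # Remember the maximum id of this batch; the next batch starts after it.
        last_id = str(docs[-1]["id"])
        if len(docs) < rows:
            break  # a short batch means the result set is exhausted
```

Since every request asks only for the first rows docs after last_id, each shard returns at most rows entries per request, no matter how far into the result set you are.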

-Mike

On 1/18/2015 7:45 AM, Naresh Yadav wrote:
Hi Toke,

Thanks for sharing the Solr internals relevant to my problem. I will
definitely try the cursor as well, but the only problem is that my current
Solr version is 4.6.1, in which I guess cursor support is not there. Is
there any other option I have for this problem?

Also, as per your suggestion, I will try to avoid regional units in posts.

Thanks
Naresh

On Sun, Jan 18, 2015 at 4:19 PM, Toke Eskildsen <t...@statsbiblioteket.dk>
wrote:

Naresh Yadav [nyadav....@gmail.com] wrote:
In both setups, we are reading in batches of 50k, and each batch is taking:
Setup 1: approx. 7 seconds; completing all batches (10 lakh results in
total) takes 1 to 2 minutes.
Setup 2: approx. 2-3 minutes; completing all batches (10 lakh results in
total) takes 114 minutes.
Deep paging across shards without cursors means that for each request, the
full result set up to that point must be requested from each shard. The
deeper your page, the longer it takes for each request. If you only
extracted 500K results instead of the 1M in setup 2, it would likely take a
lot less than 114/2 minutes.
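
To make that concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a deliberately simplified cost model in which the work per request is proportional to the start + rows entries each shard has to return; real timings depend on much more than that, so treat the numbers as an illustration only. The function name is made up for this example.

```python
def shard_entries_requested(total_docs: int, batch_size: int) -> int:
    """Total entries one shard must return while paging with start= offsets.

    For the page beginning at offset `start`, the shard has to deliver its
    top (start + batch_size) entries, capped at the size of the result set.
    """
    return sum(
        min(start + batch_size, total_docs)
        for start in range(0, total_docs, batch_size)
    )


full = shard_entries_requested(1_000_000, 50_000)  # all 20 batches
half = shard_entries_requested(500_000, 50_000)    # only the first 10 batches
print(full, half, "%.2f" % (half / full))
```

Under this model the full export asks each shard for 10.5 million entries in total, while the first half asks for 2.75 million, so roughly a quarter of the work rather than half of it.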

Since you are exporting the full result set, you should be using a cursor:
https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
This should make your extraction time linear in the number of documents and
hopefully a lot faster than your current setup.
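
A minimal Python sketch of a cursor-based export, assuming a Solr version with cursor support (4.7 or later) and a unique key field named "id". The cursorMark and nextCursorMark names are the ones described on the page above; the fetch callable, which would send the parameters to your /select handler and return the parsed JSON response, is a hypothetical placeholder.

```python
from typing import Callable, Dict, Iterator


def export_with_cursor(
    fetch: Callable[[Dict[str, str]], dict],
    query: str,
    rows: int,
    sort: str = "id asc",
) -> Iterator[dict]:
    """Yield all docs for `query`, following Solr's cursor from batch to batch.

    The sort must include the unique key field so that the order is total.
    `fetch` is a hypothetical callable returning the parsed JSON response.
    """
    cursor = "*"  # "*" asks Solr to start a new cursor
    while True:
        params = {
            "q": query,
            "sort": sort,
            "rows": str(rows),
            "wt": "json",
            "cursorMark": cursor,  # used instead of start=
        }
        result = fetch(params)
        yield from result["response"]["docs"]
        next_cursor = result["nextCursorMark"]
        if next_cursor == cursor:
            break  # an unchanged cursor means there are no more results
        cursor = next_cursor
```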

Also, please refrain from using regional units such as "lakh" in an
international forum. It requires some readers (me, for example) to perform a
search in order to be sure of what you are talking about.

- Toke Eskildsen

