Michael, I tried your idea of implementing our own cursor in Solr 4.6.1 itself, but somehow that test case was taking a huge amount of time. Then I tried the cursor approach by upgrading Solr to 4.10.3 and got better results with it. For Setup 2 the time has now dropped from 114 minutes to 18 minutes, but that is still quite far from Setup 1's 2 minutes. Actually, even the first batch of 50 thousand results takes about a minute, so maybe I need to look at other factors, as the pagination itself seems to be working better now.
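For anyone following the thread, below is a minimal sketch of the cursorMark loop described on the wiki page Toke linked, assuming SolrJ 4.10.x; the core URL, the "id" uniqueKey field and the commented-out filter query are placeholders, not the actual setup discussed here:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorMarkExport {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; point this at the real core/collection.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery q = new SolrQuery("*:*");
        q.setRows(50000);                      // 50k batch size, as in the thread
        q.addSort("id", SolrQuery.ORDER.asc);  // cursorMark requires a sort on the uniqueKey
        // q.addFilterQuery("...");            // complex filter criteria would go here

        String cursorMark = CursorMarkParams.CURSOR_MARK_START;  // "*" for the first request
        boolean done = false;
        while (!done) {
            q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
            QueryResponse rsp = solr.query(q);
            // process rsp.getResults() here

            String nextCursorMark = rsp.getNextCursorMark();
            if (cursorMark.equals(nextCursorMark)) {
                done = true;                   // cursor stopped moving: whole result set scanned
            }
            cursorMark = nextCursorMark;
        }
        solr.shutdown();
    }
}

Each response returns a nextCursorMark to pass into the following request; when it stops changing, the scan is complete, and no start= offset is ever sent, which is what keeps each batch roughly constant cost.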
Thanks for giving valuable suggestions.

On Mon, Jan 19, 2015 at 11:20 AM, Naresh Yadav <nyadav....@gmail.com> wrote:

> Toke, won't be able to use TermsComponent as i had complex filter criteria on other fields.
>
> Michael, i understood your idea of paging without using start=, will prototype it as it is possible in my usecase also and post here results i got with this approach.
>
> On Sun, Jan 18, 2015 at 10:05 PM, Michael Sokolov <msoko...@safaribooksonline.com> wrote:
>
>> You can also implement your own cursor easily enough if you have a unique sortkey (not relevance score). Say you can sort by id, then you select batch 1 (50k docs, say) and record the last (maximum) id in the batch. For the next batch, limit it to id > last_id and get the first 50k docs (don't use start= for paging). This scales much better when scanning a large result set; you'll get constant time across the whole set instead of having it increase as you page deeper.
>>
>> -Mike
>>
>> On 1/18/2015 7:45 AM, Naresh Yadav wrote:
>>
>>> Hi Toke,
>>>
>>> Thanks for sharing solr internal's for my problem. I will definitely try Cursor also but only problem is my current solr version is 4.6.1 in which i guess cursor support is not there. Any other option i have for this problem ??
>>>
>>> Also as per your suggestion i will try to avoid regional units in post.
>>>
>>> Thanks
>>> Naresh
>>>
>>> On Sun, Jan 18, 2015 at 4:19 PM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
>>>
>>>> Naresh Yadav [nyadav....@gmail.com] wrote:
>>>>
>>>>> In both setups, we are reading in batches of 50k and each batch taking
>>>>> Setup1 : approx 7 seconds and for completing all batches of total 10 lakh results takes 1 to 2 minutes.
>>>>> Setup2 : approx 2-3 minutes and for completing all batches of total 10 lakh results takes 114 minutes.
>>>>
>>>> Deep paging across shards without cursors means that for each request, the full result set up to that point must be requested from each shard. The deeper your page, the longer it takes for each request. If you only extracted 500K results instead of the 1M in setup 2, it would likely take a lot less than 114/2 minutes.
>>>>
>>>> Since you are exporting the full result set, you should be using a cursor:
>>>> https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
>>>> This should make your extraction linear to the number of documents and hopefully a lot faster than your current setup.
>>>>
>>>> Also, please refrain from using regional units such as "lakh" in an international forum. It requires some readers (me for example) to perform a search in order to be sure on what you are talking about.
>>>>
>>>> - Toke Eskildsen
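For reference, here is a minimal sketch of the manual cursor Michael describes above, for versions before 4.7 where cursorMark is not available. It assumes SolrJ, a string uniqueKey called "id" whose values need no query escaping, and a 50k batch size; adjust all of those to the real schema:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;

public class ManualCursorExport {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; point this at the real core/collection.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        String lastId = null;  // no lower bound for the first batch
        while (true) {
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(50000);                      // 50k batch, as in the thread
            q.setStart(0);                         // never page with start=
            q.addSort("id", SolrQuery.ORDER.asc);  // unique sort key, not relevance
            if (lastId != null) {
                // Exclusive lower bound: skip everything up to and including the
                // last id seen, so each request only scores/sorts the remainder.
                q.addFilterQuery("id:{" + lastId + " TO *]");
            }
            // other filter criteria on the remaining fields would be added here

            QueryResponse rsp = solr.query(q);
            SolrDocumentList docs = rsp.getResults();
            if (docs.isEmpty()) {
                break;                             // whole result set scanned
            }
            // process docs here

            // Remember the last (maximum) id of this batch for the next request.
            lastId = (String) docs.get(docs.size() - 1).getFieldValue("id");
        }
        solr.shutdown();
    }
}

Because every request uses start=0 and the range filter excludes everything already seen, each batch costs roughly the same no matter how deep the scan is, which is the constant-time behaviour Michael mentions.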