Hi Mark,

Thanks for confirming Dwane's advice from your own experience. I will
shift to a streaming expressions implementation.
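For anyone who finds this thread later, the cursorMark loop Mark
describes below might be sketched roughly like this. It is only a
sketch: fetch_page is a hypothetical stub standing in for a real Solr
request (with pysolr you would pass cursorMark=cursor and a sort that
includes the uniqueKey field, e.g. sort='id asc', since Solr requires
that for cursors). The key point is the termination rule: you are done
when Solr hands back the same cursor you sent.

```python
# Hypothetical stand-ins for an index of 25 documents fetched 10 at a
# time; a real script would query Solr instead of slicing a list.
DOCS = list(range(25))
ROWS = 10

def fetch_page(cursor):
    """Simulate Solr's cursorMark contract: '*' starts the walk, each
    response carries a nextCursorMark, and nextCursorMark equals the
    request cursor once the result set is exhausted."""
    start = 0 if cursor == "*" else int(cursor)
    docs = DOCS[start:start + ROWS]
    next_cursor = str(start + len(docs)) if docs else cursor
    return docs, next_cursor

def fetch_all():
    cursor = "*"          # '*' means "start from the beginning"
    out = []
    while True:
        docs, next_cursor = fetch_page(cursor)
        out.extend(docs)
        if next_cursor == cursor:   # unchanged cursor => no more docs
            break
        cursor = next_cursor        # resume from where Solr left off
    return out
```

Because each request carries the cursor, Solr resumes the walk instead
of re-counting and discarding `start` documents, which is why the cost
per page stays flat.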

Best
Goutham

On Fri, Sep 25, 2020 at 7:03 PM Mark H. Wood <mw...@iupui.edu> wrote:

> On Fri, Sep 25, 2020 at 11:49:22AM +0530, Goutham Tholpadi wrote:
> > I have around 30M documents in Solr, and I am doing repeated *:* queries
> > with rows=10000, and changing start to 0, 10000, 20000, and so on, in a
> > loop in my script (using pysolr).
> >
> > At the start of the iteration, the calls to Solr were taking less than 1
> > sec each. After running for a few hours (with start at around 27M), I
> > found that each call was taking around 30-60 secs.
> >
> > Any pointers on why the same fetch of 10000 records takes much longer
> > now? Does Solr need to load all the 27M before getting the last 10000
> > records?
>
> I and many others have run into the same issue.  Yes, each windowed
> query starts fresh, having to find at least enough records to satisfy
> the query, walking the list to discard the first 'start' worth of
> them, and then returning the next 'rows' worth.  So as 'start' increases,
> the work required of Solr increases and the response time lengthens.
>
> > Is there a better way to do this operation using Solr?
>
> Another answer in this thread gives links to resources for addressing
> the problem, and I can't improve on those.
>
> I can say that when I switched from start= windowing to cursormark, I
> got a very nice improvement in overall speed and did not see the
> progressive slowing anymore.  A query loop that ran for *days* now
> completes in under five minutes.  In some way that I haven't quite
> figured out, a cursormark tells Solr where in the overall document
> sequence to start working.
>
> So yes, there *is* a better way.
>
> --
> Mark H. Wood
> Lead Technology Analyst
>
> University Library
> Indiana University - Purdue University Indianapolis
> 755 W. Michigan Street
> Indianapolis, IN 46202
> 317-274-0749
> www.ulib.iupui.edu
>
