Hi Mark,

Thanks for confirming Dwane's advice from your own experience. I will shift to a streaming expressions implementation.
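For anyone finding this thread later, a streaming-expressions fetch might look roughly like this (a sketch, not anyone's actual code: the host, the collection name "mycore", and the sort field "id" are placeholders; a real export should sort on the collection's uniqueKey and list the fields it needs in fl):

```python
# Hedged sketch: POST a search() streaming expression to the /stream
# handler and yield tuples until the EOF marker tuple. Host, collection
# "mycore", and field "id" are assumptions, not from this thread.
import json
import urllib.parse
import urllib.request

def stream_all(base_url):
    """Yield every tuple returned by a search() streaming expression."""
    expr = 'search(mycore, q="*:*", fl="id", sort="id asc", qt="/export")'
    data = urllib.parse.urlencode({"expr": expr}).encode()
    with urllib.request.urlopen(base_url + "/stream", data=data) as resp:
        payload = json.load(resp)          # read the whole response at once
    for doc in payload["result-set"]["docs"]:
        if "EOF" in doc:                   # Solr appends an EOF marker tuple
            break
        yield doc
```

For 30M documents you would normally read the response incrementally rather than loading it whole; it is kept in one json.load() call here only to keep the sketch short.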
Best,
Goutham

On Fri, Sep 25, 2020 at 7:03 PM Mark H. Wood <mw...@iupui.edu> wrote:
> On Fri, Sep 25, 2020 at 11:49:22AM +0530, Goutham Tholpadi wrote:
> > I have around 30M documents in Solr, and I am doing repeated *:* queries
> > with rows=10000, and changing start to 0, 10000, 20000, and so on, in a
> > loop in my script (using pysolr).
> >
> > At the start of the iteration, the calls to Solr were taking less than 1
> > sec each. After running for a few hours (with start at around 27M) I found
> > that each call was taking around 30-60 secs.
> >
> > Any pointers on why the same fetch of 10000 records takes much longer now?
> > Does Solr need to load all the 27M before getting the last 10000 records?
>
> I and many others have run into the same issue. Yes, each windowed
> query starts fresh, having to find at least enough records to satisfy
> the query, walking the list to discard the first 'start' worth of
> them, and then returning the next 'rows' worth. So as 'start' increases,
> the work required of Solr increases and the response time lengthens.
>
> > Is there a better way to do this operation using Solr?
>
> Another answer in this thread gives links to resources for addressing
> the problem, and I can't improve on those.
>
> I can say that when I switched from start= windowing to cursorMark, I
> got a very nice improvement in overall speed and did not see the
> progressive slowing anymore. A query loop that ran for *days* now
> completes in under five minutes. In some way that I haven't quite
> figured out, a cursorMark tells Solr where in the overall document
> sequence to start working.
>
> So yes, there *is* a better way.
>
> --
> Mark H. Wood
> Lead Technology Analyst
> University Library
> Indiana University - Purdue University Indianapolis
> 755 W. Michigan Street
> Indianapolis, IN 46202
> 317-274-0749
> www.ulib.iupui.edu
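[Editor's note: the cursorMark loop Mark describes can be sketched as below. This is a sketch under assumptions, not code from this thread: the host, the collection name "mycore", and the uniqueKey field "id" are placeholders. cursorMark requires the sort to end on the uniqueKey field so Solr has a stable total ordering to resume from.]

```python
# Hedged sketch of cursorMark deep paging against Solr's JSON API.
# Each request carries the cursorMark from the previous response, so
# Solr resumes from that point instead of re-walking 'start' documents.
import json
import urllib.parse
import urllib.request

def fetch_all(base_url, rows=10000):
    """Yield all documents, one cursorMark page at a time."""
    cursor = "*"                               # "*" opens a fresh cursor
    while True:
        params = urllib.parse.urlencode({
            "q": "*:*",
            "rows": rows,
            "sort": "id asc",                  # must end on the uniqueKey
            "cursorMark": cursor,
            "wt": "json",
        })
        with urllib.request.urlopen(base_url + "/select?" + params) as resp:
            page = json.load(resp)
        yield from page["response"]["docs"]
        next_cursor = page["nextCursorMark"]
        if next_cursor == cursor:              # unchanged mark: no more docs
            break
        cursor = next_cursor
```

Unlike start= windowing, each request here costs roughly the same, which matches the flat response times Mark reports. pysolr passes extra keyword arguments through as query parameters, so the same loop can be written with solr.search() if you prefer to stay with pysolr.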