On Fri, Sep 25, 2020 at 11:49:22AM +0530, Goutham Tholpadi wrote:
> I have around 30M documents in Solr, and I am doing repeated *:* queries
> with rows=10000, and changing start to 0, 10000, 20000, and so on, in a
> loop in my script (using pysolr).
>
> At the start of the iteration, the calls to Solr were taking less than 1
> sec each. After running for a few hours (with start at around 27M) I found
> that each call was taking around 30-60 secs.
>
> Any pointers on why the same fetch of 10000 records takes much longer now?
> Does Solr need to load all the 27M before getting the last 10000 records?
I and many others have run into the same issue. Yes, each windowed query
starts fresh: Solr has to collect at least the top start + rows matching
records, walk the list to discard the first 'start' worth of them, and
then return the next 'rows' worth. So as 'start' increases, the work
required of Solr increases and the response time lengthens.

> Is there a better way to do this operation using Solr?

Another answer in this thread gives links to resources for addressing
the problem, and I can't improve on those. I can say that when I
switched from start= windowing to cursorMark, I got a very nice
improvement in overall speed and did not see the progressive slowing
anymore. A query loop that ran for *days* now completes in under five
minutes. As I understand it, the cursorMark returned with each response
encodes the sort values of the last document on the page, so the next
query tells Solr exactly where in the overall document sequence to
resume, rather than counting off everything it has already returned.
So yes, there *is* a better way.
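For the archives, here is roughly the shape of the cursorMark loop in
pysolr. This is a sketch rather than my exact code: the core URL and the
handle() function are stand-ins, it assumes your uniqueKey field is
"id", and it assumes a pysolr recent enough to expose nextCursorMark on
its result objects (recent versions do).

    import pysolr

    # Placeholder URL; point this at your own core or collection.
    solr = pysolr.Solr("http://localhost:8983/solr/mycore", timeout=120)

    cursor = "*"  # "*" asks Solr to open a fresh cursor
    while True:
        results = solr.search(
            "*:*",
            rows=10000,
            sort="id asc",      # the sort must include the uniqueKey field
            cursorMark=cursor,  # start= must be 0 or omitted with a cursor
        )
        for doc in results:
            handle(doc)  # hypothetical stand-in for your per-record work

        # Solr hands back the same mark once the result set is exhausted.
        if results.nextCursorMark == cursor:
            break
        cursor = results.nextCursorMark

Two details worth noting: the sort must include the uniqueKey field as a
tiebreaker so the ordering is total, and the loop terminates when Solr
returns the same mark it was sent, which is the documented
end-of-results signal.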
--
Mark H. Wood
Lead Technology Analyst
University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu