<1> consider start=100&rows=10. In the absence of cursorMark, Solr has to collect and sort the top 110 documents in order to throw away the first 100, since any document scored could still land in the top 110 and there's no way to know that ahead of time. For 110 that's not very expensive, but when the list is in the 100s of K it gets significantly expensive, both in terms of CPU and memory. Now multiply that by the number of shards (i.e. one replica from each shard would have to return the top 110 document IDs and scores in order for the aggregator to sort out the true top 10) and it gets really expensive.
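To make the buffering cost concrete, here's a toy sketch (plain Python, not Solr internals) of why start=N&rows=M forces each shard to track N+M candidates:

```python
import heapq

# Toy illustration (not Solr code): to serve start=N&rows=M sorted by
# score, a shard must buffer the top N+M documents, because until scoring
# finishes any document could still land in that window. The aggregator
# then discards the first N.
def page_by_start_rows(scores, start, rows):
    top = heapq.nlargest(start + rows, scores)  # buffer of size start + rows
    return top[start:start + rows]              # throw away the first `start`

# start=100&rows=10 buffers 110 entries per shard; start=100000 would
# buffer 100010 per shard, which is where CPU and memory costs blow up.
```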
CursorMark essentially passes the score back (well, the last values of all the sort criteria). I'll skip a lot of details here that make this more complex, but assume cursorMark is the score of the 100th document (yes, there's code in there to handle multiple identical scores and a lot of other stuff; that's why a uniqueKey is required in the sort). Now each node can say "if a doc has a score > cursorMark, I can throw it out immediately 'cause it was returned already", and each shard keeps a list only 10 docs long.

<2> CursorMark is SolrCloud compatible.

<3> First, streaming requests can only return docValues="true" fields. Second, most streaming operations require sorting on something besides score. Within those constraints, streaming will be _much_ faster and more efficient than cursorMark. Without tuning I saw 200K rows/second returned for streaming; the bottleneck will be the speed at which the client can read from the network. First of all, you execute only one query rather than one query per N rows. Second, in the cursorMark case, to return a document (assuming any field you return is docValues=false) Solr has to:

  > read it from disk
  > decompress it
  > fetch the stored fields

With streaming, since we're using docValues fields, the disk seek/read/decompress steps are skipped.

Best,
Erick

On Mon, Mar 12, 2018 at 5:18 PM, S G <sg.online.em...@gmail.com> wrote:
> Hi,
>
> We have use-cases where some queries will return about 100k to 500k records.
> As per https://lucene.apache.org/solr/guide/7_2/pagination-of-results.html,
> it seems that using start=x, rows=y is a bad combination performance wise.
>
> 1) However, it is not clear to me why the alternative: "cursor-query" is
> cheaper or recommended. It would have to run the same kind of workload as
> the normal start=x, rows=y combination, no?
>
> 2) Also, it is not clear if the cursor-query runs on a single shard or
> uses the same scatter-gather as regular queries to read from all the shards?
>
> 3) Lastly, it is not clear the role of export handler. It seems that the
> export handler would also have to do exactly the same kind of thing as
> start=0 and rows=1000,000. And that again means bad performance.
>
> What is the difference between all of the 3?
>
> Thanks
> SG
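P.S. For anyone following along, here's a toy model of the cursorMark idea from <1> above (pure Python, names are mine, not Solr internals): the cursor is the sort tuple of the last document returned, and each subsequent page only considers documents that sort strictly after it, so the server never buffers more than `rows` hits per page.

```python
# Toy model (not Solr code) of cursorMark paging over a single "shard".
def fetch_page(docs, cursor_mark, rows=10):
    """docs: iterable of (score, unique_key). Sort order: score desc,
    then unique_key asc -- the uniqueKey tiebreak is what makes the
    cursor unambiguous when scores collide."""
    ordered = sorted((-score, key) for score, key in docs)
    if cursor_mark is not None:
        # Anything sorting at or before the cursor was already returned.
        ordered = [d for d in ordered if d > cursor_mark]
    page = ordered[:rows]
    next_mark = page[-1] if page else cursor_mark
    return [(-neg, key) for neg, key in page], next_mark

# Drain a 25-doc "index" in pages of 10; scores collide on purpose.
docs = [(i % 5, "id%02d" % i) for i in range(25)]
seen, mark = [], None
while True:
    page, mark = fetch_page(docs, mark, rows=10)
    if not page:
        break
    seen.extend(page)
```

Each iteration carries only the cursor tuple between requests, which is the analogue of passing cursorMark=* on the first request and the returned nextCursorMark on each subsequent one.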