<1> Consider start=100&rows=10. In the absence of cursorMark, Solr has
to sort the top 110 documents in order to throw away the first 100,
since the very last document scored could still land in the top 110
and there's no way to know that ahead of time. For 110 that's not very
expensive, but when the list is in the hundreds of thousands it gets
significantly expensive in both CPU and memory. Now multiply that by
the number of shards (i.e. one replica from each shard has to return
its top 110 document IDs and scores so the aggregator can sort out the
true top 10) and it gets really expensive.
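To make that cost concrete, here's a minimal sketch (hypothetical names, plain Python, docs as (id, score) tuples) of what the aggregator has to do for start=100&rows=10 across shards:

```python
import heapq
import random

def deep_page(shards, start, rows):
    # Every shard must return its own top (start + rows) docs, because any
    # one of them could land in the global window; the aggregator then
    # merges all those candidates and throws away the first `start`.
    per_shard = [sorted(s, key=lambda d: -d[1])[:start + rows] for s in shards]
    merged = heapq.merge(*per_shard, key=lambda d: -d[1])
    return list(merged)[start:start + rows]

# 3 shards x 1000 docs: each shard ships 110 candidates to produce 10 results.
random.seed(0)
shards = [[(f"s{i}d{j}", random.random()) for j in range(1000)]
          for i in range(3)]
page = deep_page(shards, start=100, rows=10)
```

Note that the per-shard work grows with start + rows, which is exactly why deep paging with plain start/rows gets worse the deeper you go.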

CursorMark essentially passes the score back (well, the last values of
all the sort criteria). I'll skip a lot of details here that make this
more complex, but assume cursorMark is the score of the 100th document
(yes, there's code in there to handle multiple identical scores and a
lot of other stuff; that's why a uniqueKey is required in the sort).
Now each node can say "if a doc has a score > cursorMark, I can throw
it out immediately 'cause it was returned already", and each shard
only has to keep a list 10 docs long.
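A rough sketch of that shard-side pruning (hypothetical names; the real Solr code works quite differently internally) where the cursor is the (score, uniqueKey) pair of the last doc already returned, with sort order score desc, uniqueKey asc:

```python
def page_after_cursor(docs, cursor, rows):
    # docs: (uniqueKey, score) tuples; sort order is score desc, key asc.
    # cursor: (score, key) of the last document already returned, or None.
    def already_returned(doc):
        if cursor is None:
            return False
        key, score = doc
        c_score, c_key = cursor
        # Anything that sorts at or before the cursor went out in an
        # earlier page and can be dropped immediately -- the uniqueKey
        # tiebreak is what makes this safe when scores are identical.
        return score > c_score or (score == c_score and key <= c_key)
    live = [d for d in docs if not already_returned(d)]
    # Only `rows` docs ever need to be kept, no matter how deep we page.
    return sorted(live, key=lambda d: (-d[1], d[0]))[:rows]

docs = [("a", 0.9), ("b", 0.9), ("c", 0.5), ("d", 0.4)]
first = page_after_cursor(docs, None, 2)         # ("a", 0.9), ("b", 0.9)
second = page_after_cursor(docs, (0.9, "b"), 2)  # ("c", 0.5), ("d", 0.4)
```

The key point: the memory held per shard is bounded by rows, not by how deep into the result set the client has paged.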

<2> CursorMark is SolrCloud compatible.

<3> First, streaming requests can only return docValues="true"
fields. Second, most streaming operations require sorting on something
besides score. Within those constraints, streaming will be _much_
faster and more efficient than cursorMark. Without tuning I saw 200K
rows/second returned for streaming; the bottleneck will be the speed
at which the client can read from the network. First of all, you only
execute one query rather than one query per N rows. Second, in the
cursorMark case, to return a document (assuming the fields you return
are docValues=false, i.e. stored) Solr has to:
> read it from disk
> decompress it
> fetch the stored fields

With streaming, since we're using docValues fields, the disk
seek/read/decompress steps are skipped.
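For reference, an /export request looks like an ordinary query aimed at the /export handler, with q, fl, and sort all supplied and every field involved docValues="true". A tiny sketch (hypothetical host and collection names) that just builds such a URL:

```python
from urllib.parse import urlencode

def export_url(base, collection, q, fl, sort):
    # /export streams the whole result set in one request; it requires an
    # explicit sort and fl, and every field in both must be docValues="true".
    params = urlencode({"q": q, "fl": fl, "sort": sort})
    return f"{base}/solr/{collection}/export?{params}"

url = export_url("http://localhost:8983", "techproducts",
                 "*:*", "id,price", "price asc")
```

You'd then read the streamed response with any HTTP client; the single request replaces the whole sequence of cursorMark round-trips.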

Best,
Erick

On Mon, Mar 12, 2018 at 5:18 PM, S G <sg.online.em...@gmail.com> wrote:
> Hi,
>
> We have use-cases where some queries will return about 100k to 500k records.
> As per https://lucene.apache.org/solr/guide/7_2/pagination-of-results.html,
> it seems that using start=x, rows=y is a bad combination performance wise.
>
> 1) However, it is not clear to me why the alternative: "cursor-query" is
> cheaper or recommended. It would have to run the same kind of workload as
> the normal start=x, rows=y combination, no?
>
> 2) Also, it is not clear if the cursory-query runs on a single shard or
> uses the same scatter gather as regular queries to read from all the shards?
>
> 3) Lastly, it is not clear the role of export handler. It seems that the
> export handler would also have to do exactly the same kind of thing as
> start=0 and rows=1000,000. And that again means bad performance.
>
> What is the difference between all of the 3?
>
>
> Thanks
> SG