> I can take a stab at this if someone can point me how to update the
> documentation.
Hey SG,

Please do, that'd be awesome. Thanks to some work done by Cassandra
Targett a release or two ago, the Solr Ref Guide documentation now lives
in the same codebase as the Solr/Lucene code itself, and the process for
updating it is the same as suggesting a change to the code:

1. Open a JIRA issue detailing the improvement you'd like to make.
2. Find the relevant ref guide pages and make the changes you're proposing.
3. Upload a patch to your JIRA issue and ask for someone to take a look.
   (You can tag me on the issue if you'd like.)

Some more specific links you might find helpful:

- JIRA: https://issues.apache.org/jira/projects/SOLR/issues
- Pointers on JIRA conventions and creating patches:
  https://wiki.apache.org/solr/HowToContribute
- Root directory for the Solr Ref Guide code:
  https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide
- https://lucene.apache.org/solr/guide/7_2/pagination-of-results.html

Best,

Jason

On Wed, Mar 14, 2018 at 2:53 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> I'm pretty sure you can use Streaming Expressions to get all the rows
> back from a sharded collection without chewing up lots of memory.
>
> Try:
>
>     search(collection,
>            q="id:*",
>            fl="id",
>            sort="id asc",
>            qt="/export")
>
> on a sharded SolrCloud installation, and I believe you'll get all the
> rows back.
>
> NOTE:
> 1> Some while ago you couldn't _stop_ the stream partway through. Down
> in the SolrJ world you could read from a stream for a while and call
> close on it, but it would just spin in the background until it reached
> EOF. Search the JIRA list if you need the details (I can't find the
> JIRA right now; 6.6 IIRC is OK and, of course, 7.3).
>
> This shouldn't chew up memory, since the streams are sorted: what you
> get in the response is the ordered set of tuples.
>
> Some of the join streams _do_ have to hold all the results in memory,
> so check the docs if you wind up using those.
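[Editor's note: Erick's search(...) expression above is sent to Solr's /stream handler, which answers with a single JSON body whose list of tuples ends in an EOF sentinel tuple. As a rough illustration only (the response shape here is an assumption based on the Streaming Expressions docs, and the body below is fabricated), a client can drain such a response like this:]

```python
import json

def read_tuples(stream_response: str):
    """Yield tuples from a /stream JSON response until the EOF tuple.

    Assumption: streaming expression responses wrap their tuples in
    {"result-set": {"docs": [...]}}, with a final {"EOF": true, ...}
    sentinel tuple marking the end of the stream.
    """
    for doc in json.loads(stream_response)["result-set"]["docs"]:
        if doc.get("EOF"):
            break
        yield doc

# A fabricated response body in the shape described above:
body = '{"result-set":{"docs":[{"id":"1"},{"id":"2"},{"EOF":true,"RESPONSE_TIME":5}]}}'
print([t["id"] for t in read_tuples(body)])  # ['1', '2']
```

In a real client you would read the HTTP response incrementally rather than buffering the whole body, which is what keeps memory flat on the client side.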
>
> Best,
> Erick
>
> On Wed, Mar 14, 2018 at 9:20 AM, S G <sg.online.em...@gmail.com> wrote:
>> Thanks everybody. This is a lot of good information.
>> And we should try to update the documentation too, to help users make
>> the right choice.
>> I can take a stab at this if someone can point me how to update the
>> documentation.
>>
>> Thanks
>> SG
>>
>>
>> On Tue, Mar 13, 2018 at 2:04 PM, Chris Hostetter <hossman_luc...@fucit.org>
>> wrote:
>>
>>> : > 3) Lastly, it is not clear the role of the export handler. It seems
>>> : > that the export handler would also have to do exactly the same kind
>>> : > of thing as start=0 and rows=1000000. And that again means bad
>>> : > performance.
>>>
>>> : <3> First, streaming requests can only return docValues="true"
>>> : fields. Second, most streaming operations require sorting on something
>>> : besides score. Within those constraints, streaming will be _much_
>>> : faster and more efficient than cursorMark. Without tuning I saw 200K
>>> : rows/second returned for streaming; the bottleneck will be the speed
>>> : at which the client can read from the network. First of all, you only
>>> : execute one query rather than one query per N rows. Second, in the
>>> : cursorMark case, returning a document means reading its stored fields
>>> : (assuming any field you return is docValues=false).
>>>
>>> Just to clarify, there is a big difference between the /export handler
>>> and "streaming expressions".
>>>
>>> Unless something has changed drastically in the past few releases, the
>>> /export handler does *NOT* support exporting a full *collection* in
>>> SolrCloud -- it only operates on an individual core (aka a
>>> shard/replica).
>>>
>>> Streaming expressions is a feature that does work in cloud mode, and
>>> can make calls to the /export handler on a replica of each shard in
>>> order to process the data of an entire collection -- but when doing so
>>> it has to aggregate *ALL* the results from every shard in memory on
>>> the coordinating node -- meaning that (in addition to the docValues
>>> caveat) streaming expressions requires you to "spend" a lot of RAM on
>>> one node as a trade-off for spending more time & multiple requests to
>>> get the same data via cursorMark...
>>>
>>> https://lucene.apache.org/solr/guide/exporting-result-sets.html
>>> https://lucene.apache.org/solr/guide/streaming-expressions.html
>>>
>>> An additional perk of cursorMark that may be relevant to the OP is
>>> that you can "stop" tailing a cursor at any time (i.e., if you're
>>> post-processing the results client side and decide you have "enough"
>>> results), but a similar feature isn't available (AFAICT) from
>>> streaming expressions...
>>>
>>> https://lucene.apache.org/solr/guide/pagination-of-results.html#tailing-a-cursor
>>>
>>> -Hoss
>>> http://www.lucidworks.com/
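[Editor's note: Hoss's point about stopping a cursor early can be sketched without a live Solr instance. Everything below is hypothetical stand-in code (fake_search fakes the HTTP round trip to /select); only the cursorMark protocol itself, as the ref guide describes it, is mirrored: start at "*", pass the returned nextCursorMark back on the next request, and stop when the cursor repeats or when the client decides it has enough.]

```python
# Fake corpus standing in for an indexed collection.
DOCS = [{"id": str(i)} for i in range(10)]

def fake_search(cursor, rows=3):
    """Stub for a Solr query with sort=id asc & cursorMark=<cursor>."""
    start = 0 if cursor == "*" else int(cursor)
    page = DOCS[start:start + rows]
    next_cursor = str(start + len(page))
    # Like Solr, return the same cursorMark back once results are exhausted.
    return page, (cursor if not page else next_cursor)

def tail_cursor(enough=None):
    """Walk the cursor; optionally stop early after `enough` docs."""
    seen, cursor = [], "*"
    while True:
        page, next_cursor = fake_search(cursor)
        seen.extend(page)
        # Client-side early stop: unlike a stream, we can simply quit here.
        if enough is not None and len(seen) >= enough:
            return seen[:enough]
        if next_cursor == cursor:  # cursor repeated: no more results
            return seen
        cursor = next_cursor

print(len(tail_cursor()))          # 10 (full walk)
print(len(tail_cursor(enough=4)))  # 4  (stopped early)
```

The early-stop branch is the whole point: each page is an independent request, so abandoning the cursor costs nothing, whereas (per the caveat Erick mentioned for older releases) closing a stream mid-flight could leave work spinning server side.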