> I can take a stab at this if someone can point me how to update the
> documentation.
Hey SG,

Please do, that'd be awesome. Thanks to some work done by Cassandra
Targett a release or two ago, the Solr Ref Guide documentation now lives
in the same codebase as the Solr/Lucene code itself, and the process for
updating it is the same as suggesting a change to the code:

1. Open a JIRA issue detailing the improvement you'd like to make.
2. Find the relevant ref guide pages and make the changes you're proposing.
3. Upload a patch to your JIRA issue and ask for someone to take a look.
   (You can tag me on the issue if you'd like.)

Some more specific links you might find helpful:

- JIRA: https://issues.apache.org/jira/projects/SOLR/issues
- Pointers on JIRA conventions and creating patches:
  https://wiki.apache.org/solr/HowToContribute
- Root directory for the Solr Ref Guide code:
  https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide
- https://lucene.apache.org/solr/guide/7_2/pagination-of-results.html

Best,

Jason

On Wed, Mar 14, 2018 at 2:53 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> I'm pretty sure you can use Streaming Expressions to get all the rows
> back from a sharded collection without chewing up lots of memory.
>
> Try:
>
>     search(collection,
>            q="id:*",
>            fl="id",
>            sort="id asc",
>            qt="/export")
>
> on a sharded SolrCloud installation, and I believe you'll get all the
> rows back.
>
> NOTE:
> 1> Some while ago you couldn't _stop_ the stream partway through. Down
> in the SolrJ world you could read from a stream for a while and call
> close on it, but it would just spin in the background until it reached
> EOF. Search the JIRA list if you need the details (I can't find the
> JIRA right now; 6.6 IIRC is OK and, of course, 7.3).
>
> This shouldn't chew up memory, since the streams are sorted: what you
> get in the response is the ordered set of tuples.
>
> Some of the join streams _do_ have to hold all the results in memory,
> so check the docs if you wind up using those.
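[Editor's note: Erick's search(...) expression above is sent to Solr's /stream handler, which answers with a single JSON body whose list of tuples ends in an EOF sentinel tuple. As a rough illustration only (the response shape here is an assumption based on the Streaming Expressions docs, and the body below is fabricated), a client can drain such a response like this:]

```python
import json

def read_tuples(stream_response: str):
    """Yield tuples from a /stream JSON response until the EOF tuple.

    Assumption: streaming expression responses wrap their tuples in
    {"result-set": {"docs": [...]}}, with a final {"EOF": true, ...}
    sentinel tuple marking the end of the stream.
    """
    for doc in json.loads(stream_response)["result-set"]["docs"]:
        if doc.get("EOF"):
            break
        yield doc

# A fabricated response body in the shape described above:
body = '{"result-set":{"docs":[{"id":"1"},{"id":"2"},{"EOF":true,"RESPONSE_TIME":5}]}}'
print([t["id"] for t in read_tuples(body)])  # ['1', '2']
```

In a real client you would read the HTTP response incrementally rather than buffering the whole body, which is what keeps memory flat on the client side.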
>
> Best,
> Erick
>
> On Wed, Mar 14, 2018 at 9:20 AM, S G <sg.online.em...@gmail.com> wrote:
>> Thanks everybody. This is a lot of good information.
>> And we should try to update the documentation too, to help users make
>> the right choice.
>> I can take a stab at this if someone can point me how to update the
>> documentation.
>>
>> Thanks
>> SG
>>
>>
>> On Tue, Mar 13, 2018 at 2:04 PM, Chris Hostetter <hossman_luc...@fucit.org>
>> wrote:
>>
>>> : > 3) Lastly, it is not clear the role of the export handler. It seems
>>> : > that the export handler would also have to do exactly the same kind
>>> : > of thing as start=0 and rows=1000000. And that again means bad
>>> : > performance.
>>>
>>> : <3> First, streaming requests can only return docValues="true"
>>> : fields. Second, most streaming operations require sorting on something
>>> : besides score. Within those constraints, streaming will be _much_
>>> : faster and more efficient than cursorMark. Without tuning I saw 200K
>>> : rows/second returned for streaming; the bottleneck will be the speed
>>> : at which the client can read from the network. First of all, you only
>>> : execute one query rather than one query per N rows. Second, in the
>>> : cursorMark case, returning a document means reading its stored fields
>>> : (assuming any field you return is docValues=false).
>>>
>>> Just to clarify, there is a big difference between the /export handler
>>> and "streaming expressions".
>>>
>>> Unless something has changed drastically in the past few releases, the
>>> /export handler does *NOT* support exporting a full *collection* in
>>> SolrCloud -- it only operates on an individual core (aka a
>>> shard/replica).
>>>
>>> Streaming expressions is a feature that does work in cloud mode, and
>>> can make calls to the /export handler on a replica of each shard in
>>> order to process the data of an entire collection -- but when doing so
>>> it has to aggregate *ALL* the results from every shard in memory on
>>> the coordinating node -- meaning that (in addition to the docValues
>>> caveat) streaming expressions requires you to "spend" a lot of RAM on
>>> one node as a trade-off for spending more time & multiple requests to
>>> get the same data via cursorMark...
>>>
>>> https://lucene.apache.org/solr/guide/exporting-result-sets.html
>>> https://lucene.apache.org/solr/guide/streaming-expressions.html
>>>
>>> An additional perk of cursorMark that may be relevant to the OP is
>>> that you can "stop" tailing a cursor at any time (i.e., if you're
>>> post-processing the results client side and decide you have "enough"
>>> results), but a similar feature isn't available (AFAICT) from
>>> streaming expressions...
>>>
>>> https://lucene.apache.org/solr/guide/pagination-of-results.html#tailing-a-cursor
>>>
>>> -Hoss
>>> http://www.lucidworks.com/
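[Editor's note: Hoss's point about stopping a cursor early can be sketched without a live Solr instance. Everything below is hypothetical stand-in code (fake_search fakes the HTTP round trip to /select); only the cursorMark protocol itself, as the ref guide describes it, is mirrored: start at "*", pass the returned nextCursorMark back on the next request, and stop when the cursor repeats or when the client decides it has enough.]

```python
# Fake corpus standing in for an indexed collection.
DOCS = [{"id": str(i)} for i in range(10)]

def fake_search(cursor, rows=3):
    """Stub for a Solr query with sort=id asc & cursorMark=<cursor>."""
    start = 0 if cursor == "*" else int(cursor)
    page = DOCS[start:start + rows]
    next_cursor = str(start + len(page))
    # Like Solr, return the same cursorMark back once results are exhausted.
    return page, (cursor if not page else next_cursor)

def tail_cursor(enough=None):
    """Walk the cursor; optionally stop early after `enough` docs."""
    seen, cursor = [], "*"
    while True:
        page, next_cursor = fake_search(cursor)
        seen.extend(page)
        # Client-side early stop: unlike a stream, we can simply quit here.
        if enough is not None and len(seen) >= enough:
            return seen[:enough]
        if next_cursor == cursor:  # cursor repeated: no more results
            return seen
        cursor = next_cursor

print(len(tail_cursor()))          # 10 (full walk)
print(len(tail_cursor(enough=4)))  # 4  (stopped early)
```

The early-stop branch is the whole point: each page is an independent request, so abandoning the cursor costs nothing, whereas (per the caveat Erick mentioned for older releases) closing a stream mid-flight could leave work spinning server side.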