Just FYI, I had a project recently where I tried to use cursorMark with SolrCloud on Solr 7.2.0 and found it very unreliable; it couldn't even return consistent numFound values. I posted about it in this forum. Using the start and rows parameters in SolrQuery did work reliably, so I abandoned cursorMark as just too buggy.
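For reference, the start/rows paging that worked for me looks roughly like this in SolrJ. This is only a sketch, not production code: the URL, collection name, and sort field are placeholders, and it assumes solrj on the classpath and a running Solr.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class StartRowsPaging {
  public static void main(String[] args) throws Exception {
    // Placeholder URL/collection -- adjust for your cluster.
    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
      final int rows = 500;
      SolrQuery q = new SolrQuery("*:*");
      q.setRows(rows);
      // A deterministic sort keeps pages stable between requests.
      q.setSort(SolrQuery.SortClause.asc("id"));
      for (int start = 0; ; start += rows) {
        q.setStart(start);
        SolrDocumentList page = client.query(q).getResults();
        for (SolrDocument doc : page) {
          System.out.println(doc.getFieldValue("id"));
        }
        if (page.size() < rows) break;  // last (partial) page reached
      }
    }
  }
}
```

Keep in mind that, unlike cursorMark, the server still has to sort start+rows documents for every page, so very deep paging gets progressively slower; for me it was simply more predictable than cursorMark on 7.2.0.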
I had originally wanted to try streaming expressions, but they don't return results ordered by relevance, which in my opinion is a major limitation for a search engine.

On Tue, Mar 20, 2018 at 11:47 AM, Jason Gerlowski <gerlowsk...@gmail.com> wrote:
>
> > I can take a stab at this if someone can point me how to update the
> > documentation.
>
> Hey SG,
>
> Please do, that'd be awesome.
>
> Thanks to some work done by Cassandra Targett a release or two ago, the
> Solr Ref Guide documentation now lives in the same codebase as the
> Solr/Lucene code itself, and the process for updating it is the same as
> suggesting a change to the code:
>
> 1. Open a JIRA issue detailing the improvement you'd like to make.
> 2. Find the relevant ref guide pages to update, making the changes
>    you're proposing.
> 3. Upload a patch to your JIRA and ask for someone to take a look.
>    (You can tag me on the issue if you'd like.)
>
> Some more specific links you might find helpful:
>
> - JIRA: https://issues.apache.org/jira/projects/SOLR/issues
> - Pointers on JIRA conventions, creating patches:
>   https://wiki.apache.org/solr/HowToContribute
> - Root directory for the Solr Ref Guide code:
>   https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide
> - https://lucene.apache.org/solr/guide/7_2/pagination-of-results.html
>
> Best,
>
> Jason
>
> On Wed, Mar 14, 2018 at 2:53 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> > I'm pretty sure you can use Streaming Expressions to get all the rows
> > back from a sharded collection without chewing up lots of memory.
> >
> > Try:
> >
> > search(collection,
> >        q="id:*",
> >        fl="id",
> >        sort="id asc",
> >        qt="/export")
> >
> > on a sharded SolrCloud installation; I believe you'll get all the rows
> > back.
> >
> > NOTE:
> > 1> Some while ago you couldn't _stop_ the stream part way through;
> >    down in the SolrJ world you could read from a stream for a while and
> >    call close on it, but it would just spin in the background until it
> >    reached EOF. Search the JIRA list if you need the details (can't find
> >    the JIRA right now; 6.6 IIRC is OK and, of course, 7.3).
> >
> > This shouldn't chew up memory, since the streams are sorted; what you
> > get in the response is the ordered set of tuples.
> >
> > Some of the join streams _do_ have to hold all the results in memory,
> > so look at the docs if you wind up using those.
> >
> > Best,
> > Erick
> >
> > On Wed, Mar 14, 2018 at 9:20 AM, S G <sg.online.em...@gmail.com> wrote:
> >> Thanks everybody, this is a lot of good information.
> >> And we should try to update the documentation too, to help users make
> >> the right choice.
> >> I can take a stab at this if someone can point me how to update the
> >> documentation.
> >>
> >> Thanks
> >> SG
> >>
> >> On Tue, Mar 13, 2018 at 2:04 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:
> >>>
> >>> : > 3) Lastly, it is not clear what the role of the export handler is.
> >>> : > It seems that the export handler would also have to do exactly the
> >>> : > same kind of thing as start=0 and rows=1,000,000. And that again
> >>> : > means bad performance.
> >>>
> >>> : <3> First, streaming requests can only return docValues="true"
> >>> : fields. Second, most streaming operations require sorting on
> >>> : something besides score. Within those constraints, streaming will be
> >>> : _much_ faster and more efficient than cursorMark. Without tuning I
> >>> : saw 200K rows/second returned for streaming; the bottleneck will be
> >>> : the speed at which the client can read from the network. First of
> >>> : all, you only execute one query rather than one query per N rows.
> >>> : Second, in the cursorMark case, to return a document [...] and
> >>> : assuming that any field you return is docValues=false [...]
> >>>
> >>> Just to clarify, there is a big difference between the /export handler
> >>> and "streaming expressions".
> >>>
> >>> Unless something has changed drastically in the past few releases, the
> >>> /export handler does *NOT* support exporting a full *collection* in
> >>> SolrCloud -- it only operates on an individual core (aka shard/replica).
> >>>
> >>> Streaming expressions is a feature that does work in cloud mode, and
> >>> can make calls to the /export handler on a replica of each shard in
> >>> order to process the data of an entire collection -- but when doing so
> >>> it has to aggregate *ALL* the results from every shard in memory on
> >>> the coordinating node -- meaning that (in addition to the docValues
> >>> caveat) streaming expressions require you to "spend" a lot of RAM on
> >>> one node as a trade-off for spending more time & multiple requests to
> >>> get the same data via cursorMark...
> >>>
> >>> https://lucene.apache.org/solr/guide/exporting-result-sets.html
> >>> https://lucene.apache.org/solr/guide/streaming-expressions.html
> >>>
> >>> An additional perk of cursorMark that may be relevant to the OP is
> >>> that you can "stop" tailing a cursor at any time (i.e. if you're
> >>> post-processing the results client side and decide you have "enough"
> >>> results), but a similar feature isn't available (AFAICT) from
> >>> streaming expressions...
> >>>
> >>> https://lucene.apache.org/solr/guide/pagination-of-results.html#tailing-a-cursor
> >>>
> >>> -Hoss
> >>> http://www.lucidworks.com/
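For anyone trying Erick's suggestion from SolrJ rather than curl, executing a streaming expression against the /stream handler looks roughly like the sketch below. The URL and collection name are placeholders, and it assumes a running SolrCloud with solrj on the classpath; treat it as an outline, not a verified implementation.

```java
import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.SolrStream;
import org.apache.solr.client.solrj.io.stream.TupleStream;
import org.apache.solr.common.params.ModifiableSolrParams;

public class StreamAllIds {
  public static void main(String[] args) throws Exception {
    ModifiableSolrParams params = new ModifiableSolrParams();
    // Erick's expression: an /export-backed search sorted by id.
    params.set("expr",
        "search(mycollection, q=\"id:*\", fl=\"id\", sort=\"id asc\", qt=\"/export\")");
    params.set("qt", "/stream");
    // Any node hosting the collection will do; placeholder URL.
    TupleStream stream = new SolrStream("http://localhost:8983/solr/mycollection", params);
    try {
      stream.open();
      for (Tuple tuple = stream.read(); !tuple.EOF; tuple = stream.read()) {
        System.out.println(tuple.getString("id"));
      }
    } finally {
      // Per Erick's caveat, closing early didn't fully stop the stream pre-6.6.
      stream.close();
    }
  }
}
```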
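The "stop tailing a cursor at any time" pattern Hoss mentions looks roughly like this in SolrJ. Again a sketch under assumptions: collection, field names, and the 10,000-document cutoff are placeholders, and the sort must include the uniqueKey field as a tie-breaker or Solr will reject the cursor request.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorMarkTail {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
      SolrQuery q = new SolrQuery("*:*");
      q.setRows(500);
      q.setSort(SolrQuery.SortClause.asc("id"));  // uniqueKey tie-breaker is required
      String cursorMark = CursorMarkParams.CURSOR_MARK_START;  // "*"
      long collected = 0;
      while (true) {
        q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
        QueryResponse rsp = client.query(q);
        for (SolrDocument doc : rsp.getResults()) {
          System.out.println(doc.getFieldValue("id"));
          collected++;
        }
        // "Stop tailing" whenever the client decides it has enough results.
        if (collected >= 10_000) break;
        String next = rsp.getNextCursorMark();
        if (cursorMark.equals(next)) break;  // cursor exhausted: no more documents
        cursorMark = next;
      }
    }
  }
}
```

The loop terminates either on the client-side cutoff or when the returned cursorMark equals the one just sent, which is Solr's signal that the result set is exhausted.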