Hi Walter,

Thank you also for your reply. Good to know of your experience. Roughly how many documents were you fetching? Unfortunately, it's possible that some of our users could attempt to "download" many records, meaning we'd need to make a request to Solr where rows >= 150M.

A key challenge for us is that in the life sciences, when more sequencing data comes in, our data sets can grow extremely quickly. Currently they double every 18 months or so (today we have about 200M records, so not super big right now).

Best,
Edd
--------------------
Edward Turner

On Tue, 1 Oct 2019 at 17:33, Walter Underwood <wun...@wunderwood.org> wrote:

> I had to do this recently on a Solr Cloud cluster. I wanted to export all
> the IDs, but they weren’t stored as docvalues.
>
> The fastest approach was to fetch all the IDs in one request. First, I
> make a request for zero rows to get the numFound. Then I fetch
> numFound+1000 (in case docs were added while I wasn’t looking) in one
> request.
>
> I also have a hairy shell script to do /export on each leader after
> parsing cluster status. That might be a little large to post to this list,
> but I can do it if there is general interest.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> > On Oct 1, 2019, at 9:14 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> > First, thanks for taking the time to ask a question with enough
> > supporting details that I can hope to be able to answer in one exchange ;).
> > It’s a pleasure to see.
> >
> > Second, NP with asking on Stack Overflow, they have some excellent
> > answers there. But you’re right, this list gets more Solr-centered eyeballs.
> >
> > On to your question. I think the best answer was that “/export wasn’t
> > designed to deal with scores”, which you’ll find disappointing.
> >
> > You could use the Streaming “search” expression (using qt=/select or
> > just leave qt out) but that’ll sort all of the docs you’re exporting into a
> > huge list, which may perform worse than CursorMark even if it doesn’t blow
> > up memory.
> >
> > The root of this problem is that export can sort in batches since the
> > values it’s sorting on are contained in each document, so it can iterate in
> > batches, send them out, then iterate again on the remaining documents.
> >
> > Score, since it’s dynamic, can’t do that.
> > Solr has to score _all_ the
> > docs to know where a doc lands in the final set relative to any other doc,
> > so if it were going to work it’d have to have enough memory to hold the
> > scores of all the docs in an ordered list, which is very expensive.
> > Conceptually this is an ordered list up to maxDoc long. Not only does there
> > have to be enough memory to hold the entire list, every doc has to be
> > inserted individually which can kill performance. This is the “deep paging”
> > problem.
> >
> > In the usual case of returning, say, 20 docs, the sorted list only has
> > to be 20 long, higher scoring docs evict lower scoring docs.
> >
> > So I think CursorMark is your best bet.
> >
> > Best,
> > Erick
> >
> >> On Oct 1, 2019, at 3:59 AM, Edward Turner <eddtur...@gmail.com> wrote:
> >>
> >> Hi all,
> >>
> >> As far as I understand, SolrCloud currently does not allow the use of
> >> sorting by the pseudofield, score in the /export request handler (i.e., get
> >> the results in relevancy order). If we do attempt this, we get an
> >> exception, "org.apache.solr.search.SyntaxError: Scoring is not currently
> >> supported with xsort". We could use Solr's cursorMark, but this takes a
> >> very long time ...
> >>
> >> Exporting results does work, however, when exporting result sets by a
> >> specific document field that has docValues set to true.
> >>
> >> Question:
> >> Does anyone know if/when it will be possible to sort by score in the
> >> /export handler?
> >>
> >> Research on the problem:
> >> We've seen https://issues.apache.org/jira/browse/SOLR-5244 and
> >> https://issues.apache.org/jira/browse/SOLR-8664, which are related to this
> >> issue, but don't fix it. Maybe I've missed a more relevant issue?
> >>
> >> Our use-case:
> >> We are using Solrcloud in our team and it's added a huge
> >> amount of value to our users.
> >>
> >> We show a table of search results ordered by score (relevancy) that was
> >> obtained from sending a query to the standard /select handler.
> >> We're
> >> working in the life-sciences domain and it is common for our result sets to
> >> contain many millions of results (unfortunately). After users browse their
> >> results, they then may want to download the results that they see, to do
> >> some post-processing. However, to do this, such that the results appear in
> >> the order that the user originally saw them, we'd need to be able to export
> >> results based on score/relevancy.
> >>
> >> Any suggestions or advice on this would be greatly appreciated!
> >>
> >> Many thanks!
> >>
> >> Edd
> >>
> >> PS. apologies for posting also on Stackoverflow (
> >> https://stackoverflow.com/questions/58167152/solrcloud-export-all-results-sorted-by-score )
> >> -- I only discovered the Solr mailing-list afterwards and thought it probably
> >> better to reach out directly to Solr's people (I can share any answer from
> >> this forum on there retrospectively).
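
For anyone finding this thread later, here is a rough sketch of the CursorMark approach Erick suggests in the quoted reply, paging through results in score order. It is only an illustration under some assumptions: the uniqueKey field is called `id`, the function names `solr_fetcher` and `iter_by_score` are made up for this example, and the select URL you pass in is whatever your own collection uses. The `cursorMark` and `nextCursorMark` parameters and the need for the uniqueKey as a sort tie-breaker are standard Solr behaviour; everything else is a guess at a reasonable shape.

```python
import json
import urllib.parse
import urllib.request


def solr_fetcher(select_url):
    """Return a function that GETs one page of JSON from a Solr /select URL."""
    def fetch(params):
        url = select_url + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)
    return fetch


def iter_by_score(fetch, query, rows=1000, id_field="id"):
    """Yield every matching doc in score order using cursorMark paging.

    `fetch` is any callable taking a dict of request parameters and
    returning the parsed JSON response (hypothetical helper interface).
    """
    cursor = "*"  # "*" asks Solr for the first page
    while True:
        page = fetch({
            "q": query,
            "rows": rows,
            "fl": id_field + ",score",
            # cursorMark needs the uniqueKey field as a tie-breaker in the sort
            "sort": "score desc," + id_field + " asc",
            "cursorMark": cursor,
            "wt": "json",
        })
        yield from page["response"]["docs"]
        next_cursor = page["nextCursorMark"]
        if next_cursor == cursor:  # unchanged cursor means no more results
            break
        cursor = next_cursor
```

Usage would look something like `for doc in iter_by_score(solr_fetcher(select_url), "some query"): ...`, where `select_url` points at your collection's /select handler. This still walks the result set one page at a time, so it does not remove the run-time concern raised in the original question; it only avoids holding one huge ordered list in memory.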