Re: Solrcloud export all results sorted by score

Jörn Franke Thu, 03 Oct 2019 00:56:11 -0700

Maybe you can sort later using Spark or something similar. You don't need a
full-blown cluster for that; it also runs on localhost.
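
To illustrate the "sort later" idea without Spark, here is a minimal plain-Python sketch. It assumes the ids and scores have already been fetched in some other order (for example with fl=id,score on /select, since /export was not designed to deal with scores). The record layout and the function name are hypothetical.

```python
def sort_by_score(rows):
    """Return rows ordered by descending score, ties broken by ascending id.

    `rows` is assumed to be an iterable of dicts with "id" and "score" keys.
    """
    # Negating the score gives descending order; the id keeps ties deterministic.
    return sorted(rows, key=lambda r: (-r["score"], r["id"]))
```

This only works while the (id, score) pairs fit in memory on one machine; beyond that, something like Spark or an external sort would be the next step.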


> On 03.10.2019 at 09:49, Edward Turner <eddtur...@gmail.com> wrote:
> 
> Hi Erick,
> 
> Many thanks for your detailed reply. It's really good information for us to
> know, and although not exactly what we wanted to hear (that /export wasn't
> designed to handle ranking), it's much better for us to definitively know
> one way or the other -- and this allows us to move forward. We'll
> experiment by going the cursorMark route. I'm hoping that the bottleneck
> then isn't Solr, but rather the fetching and writing of the full records
> (we use Solr as just a search engine, which gives us IDs of records of
> interest; and we use a separate key-value store to get the actual record
> data). Anyway, we'll see and fingers crossed :).
> 
> Best wishes,
> 
> Edd
> 
> 
> 
>> On Tue, 1 Oct 2019 at 17:15, Erick Erickson <erickerick...@gmail.com> wrote:
>> 
>> First, thanks for taking the time to ask a question with enough supporting
>> details that I can hope to be able to answer in one exchange ;). It’s a
>> pleasure to see.
>> 
>> Second, NP with asking on Stack Overflow, they have some excellent answers
>> there. But you’re right, this list gets more Solr-centered eyeballs.
>> 
>> On to your question. I think the best answer was that “/export wasn’t
>> designed to deal with scores”, which you’ll find disappointing.
>> 
>> You could use the Streaming “search” expression (using qt=/select or just
>> leave qt out) but that’ll sort all of the docs you’re exporting into a huge
>> list, which may perform worse than CursorMark even if it doesn’t blow up
>> memory.
>> 
>> The root of this problem is that export can sort in batches since the
>> values it’s sorting on are contained in each document, so it can iterate in
>> batches, send them out, then iterate again on the remaining documents.
>> 
>> Score, since it’s dynamic, can’t do that. Solr has to score _all_ the docs
>> to know where a doc lands in the final set relative to any other doc, so if
>> it were going to work it’d have to have enough memory to hold the scores of
>> all the docs in an ordered list, which is very expensive. Conceptually this
>> is an ordered list up to maxDoc long. Not only does there have to be enough
>> memory to hold the entire list, every doc has to be inserted individually
>> which can kill performance. This is the “deep paging” problem.
>> 
>> In the usual case of returning, say, 20 docs, the sorted list only has to
>> be 20 long: higher-scoring docs evict lower-scoring docs.
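
The bounded list described above can be sketched with a small min-heap. This is only an illustration of the idea, not Solr's actual collector code; the function name and the (score, doc_id) tuple layout are assumptions.

```python
import heapq


def top_n(scored_docs, n):
    """Keep only the n highest-scoring (score, doc_id) pairs from an iterable."""
    if n <= 0:
        return []
    heap = []  # min-heap: heap[0] is the lowest-scoring pair currently kept
    for score, doc_id in scored_docs:
        if len(heap) < n:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            # A higher-scoring doc evicts the current lowest-scoring one.
            heapq.heapreplace(heap, (score, doc_id))
    # Return the survivors in descending score order.
    return sorted(heap, reverse=True)
```

Memory stays proportional to n, not to the number of matching docs. Exporting everything in score order amounts to making n as large as the whole result set, which is where the cost comes from.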
>> 
>> So I think CursorMark is your best bet.
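
A minimal sketch of a cursorMark loop in Python follows. It assumes the uniqueKey field is called "id" and takes a hypothetical fetch_page callable that sends the given parameters to the /select handler and returns the parsed JSON response. The transport is left out on purpose so the loop itself stays easy to test.

```python
def iter_cursor(fetch_page, query, rows=1000, unique_key="id"):
    """Yield every matching doc in score order by following cursorMark.

    `fetch_page(params)` is a caller-supplied (hypothetical) function that
    performs the request and returns the decoded JSON response as a dict.
    """
    cursor = "*"  # "*" starts a new cursor
    while True:
        params = {
            "q": query,
            "fl": f"{unique_key},score",
            # cursorMark needs the uniqueKey field in the sort as a tiebreaker.
            "sort": f"score desc,{unique_key} asc",
            "rows": rows,
            "cursorMark": cursor,
        }
        data = fetch_page(params)
        yield from data["response"]["docs"]
        next_cursor = data["nextCursorMark"]
        if next_cursor == cursor:
            # Solr returns the same mark again once there is nothing left.
            break
        cursor = next_cursor
```

Because only ids and scores are requested, each page stays small. The full records can then be fetched from the separate key-value store as the ids stream out.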
>> 
>> Best,
>> Erick
>> 
>>>> On Oct 1, 2019, at 3:59 AM, Edward Turner <eddtur...@gmail.com> wrote:
>>> 
>>> Hi all,
>>> 
>>> As far as I understand, SolrCloud currently does not allow sorting by the
>>> pseudo-field score in the /export request handler (i.e., getting the
>>> results in relevancy order). If we do attempt this, we get an
>>> exception, "org.apache.solr.search.SyntaxError: Scoring is not currently
>>> supported with xsort". We could use Solr's cursorMark, but this takes a
>>> very long time ...
>>> 
>>> Exporting results does work, however, when exporting result sets by a
>>> specific document field that has docValues set to true.
>>> 
>>> Question:
>>> Does anyone know if/when it will be possible to sort by score in the
>>> /export handler?
>>> 
>>> Research on the problem:
>>> We've seen https://issues.apache.org/jira/browse/SOLR-5244 and
>>> https://issues.apache.org/jira/browse/SOLR-8664, which are related to this
>>> issue, but don't fix it. Maybe I've missed a more relevant issue?
>>> 
>>> Our use-case:
>>> We are using SolrCloud in our team and it's added a huge amount of value
>>> to our users.
>>> 
>>> We show a table of search results ordered by score (relevancy) that was
>>> obtained from sending a query to the standard /select handler. We're
>>> working in the life-sciences domain and it is common for our result sets to
>>> contain many millions of results (unfortunately). After users browse their
>>> results, they then may want to download the results that they see, to do
>>> some post-processing. However, to do this, such that the results appear in
>>> the order that the user originally saw them, we'd need to be able to export
>>> results based on score/relevancy.
>>> 
>>> Any suggestions or advice on this would be greatly appreciated!
>>> 
>>> Many thanks!
>>> 
>>> Edd
>>> 
>>> PS. Apologies for also posting on Stack Overflow
>>> (https://stackoverflow.com/questions/58167152/solrcloud-export-all-results-sorted-by-score)
>>> -- I only discovered the Solr mailing list afterwards and thought it probably
>>> better to reach out directly to Solr's people (I can share any answer from
>>> this forum on there retrospectively).
>> 
>>
