Hi Arthur,

I'm facing a similar issue with an LTR query over multiple collections in SolrCloud. The problem is that the documents returned and merged into a single page have scores that don't look sorted at all.
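To make the setup concrete, the query I'm sending is roughly shaped like the SolrJ snippet below. The model name, collection names, ZooKeeper address and reRankDocs value are placeholders rather than my actual configuration, so please read it only as a sketch of the query shape:

import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MultiCollectionLtrQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble; collection and model names are made up.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("zk1:2181"), Optional.empty()).build()) {

            SolrQuery q = new SolrQuery("ipod");                         // first-pass (non-LTR) query
            q.set("collection", "collection1,collection2,collection3");  // fan out over several collections
            q.set("rows", "10");
            q.add("rq", "{!ltr model=myModel reRankDocs=60 efi.user_query=ipod}"); // LTR rerank of the top docs
            q.setFields("id", "score");

            QueryResponse rsp = client.query("collection1", q);
            rsp.getResults().forEach(doc -> System.out.println(doc.get("id") + " " + doc.get("score")));
        }
    }
}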
For example, this is a single page of results (each document's score, grouped by the collection it came from):

// collection1
-2.1818457
-2.1818457
...
4.2359614
// collection2
-2.224318
// collection1
2.7780528
// collection3
2.807676
// collection1
-1.3967791

The expectation I had while testing against a single collection: the reranked N documents are placed at the top of the page and the tail of documents is sorted by the original non-LTR scoring model (like TF-IDF or BM25). And this is how a single shard returned the results.

The expectation for multiple queried collections: all reranked documents form the top of the page (and the question here is whether this top should be of size N or N * number of collections), with the tail of the documents interleaved and sorted by the non-LTR scoring model.
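If I read the merge trace in your mail below correctly, the comparison made while merging the shard responses effectively boils down to the toy sketch below. This is not the actual Solr merge code, just a self-contained model of the behavior you describe; the class and field names are made up for illustration:

public class MergeOrderSketch {

    // A document as the coordinating node sees it during the merge:
    // its rank within the shard response (after rerank), the first-pass
    // sort value, and the LTR (second-pass) score.
    static final class Doc {
        final String shard;
        final int orderInShard;
        final float firstPassScore;
        final float secondPassScore;

        Doc(String shard, int orderInShard, float firstPassScore, float secondPassScore) {
            this.shard = shard;
            this.orderInShard = orderInShard;
            this.firstPassScore = firstPassScore;
            this.secondPassScore = secondPassScore;
        }
    }

    // "a ends up ahead of b" in the merged page:
    // same shard       -> keep the shard's own (already reranked) order,
    // different shards -> compare only by the first-pass sort value.
    static boolean ranksAhead(Doc a, Doc b) {
        if (a.shard.equals(b.shard)) {
            return a.orderInShard < b.orderInShard;
        }
        return a.firstPassScore > b.firstPassScore;
    }

    public static void main(String[] args) {
        Doc doc1 = new Doc("shard1", 1, 1f, 2f);
        Doc doc2 = new Doc("shard1", 2, 4f, 1f);
        Doc doc4 = new Doc("shard2", 1, 3f, 4f);
        Doc doc3 = new Doc("shard2", 2, 2f, 3f);

        // The three comparisons from the trace in the quoted mail:
        System.out.println(ranksAhead(doc1, doc2)); // true: same shard, orderInShard 1 < 2
        System.out.println(ranksAhead(doc2, doc4)); // true: different shards, first pass 4 > 3
        System.out.println(ranksAhead(doc4, doc3)); // true: same shard, orderInShard 1 < 2
        // -> page order doc1, doc2, doc4, doc3:
        //    neither the first-pass sort order nor the LTR score order.
    }
}

The point is just that same-shard pairs and cross-shard pairs are compared by different keys, which seems to be why the merged page (like the one I pasted above) ends up ordered by neither of them.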
Would you mind sharing the details of your component, provided you're still interested in contributing your implementation to the community?

Thanks!

On Tue, Jul 21, 2020 at 11:33 PM Arthur Gavlyukovskiy <agavlyukovs...@gmail.com> wrote:

> Hi.
>
> We're using LTR, and after switching to multiple shards we found that rerank
> happens on individual shards and during the merge phase the first pass
> score isn't used. Currently our LTR model doesn't use textual match and
> assumes that reranked documents are already more or less good in terms of
> textual score, which is not always the case when documents are distributed
> across shards.
>
> To avoid it I've tried to use sort by a function that replicates the actual
> query, and the results I get are somewhat interesting - on individual shards
> the first pass happens by my sorting, then documents are reranked, and
> during the merge documents from the same shard are compared by
> "orderInShard" and documents from different shards by the value from sort,
> so that the final order is neither the sort value nor the score.
>
> For example, let's assume that the documents coming from shard 1 are:
> doc1(first_pass_score = 1, second_pass_score = 2)
> doc2(first_pass_score = 4, second_pass_score = 1)
> and the documents coming from shard 2 are:
> doc4(first_pass_score = 3, second_pass_score = 4)
> doc3(first_pass_score = 2, second_pass_score = 3)
> where first_pass_score is doc.sort_values[0] and second_pass_score is
> doc.score.
>
> When we try to merge all documents this will happen:
> queue.insertWithOverflow(doc1)
> queue.insertWithOverflow(doc2)
> queue.lessThan(doc1, doc2) -> false (doc1.orderInShard = 1 < doc2.orderInShard = 2)
> queue.insertWithOverflow(doc4)
> queue.lessThan(doc2, doc4) -> false (doc2.first_pass_score = 4 > doc4.first_pass_score = 3)
> queue.insertWithOverflow(doc3)
> queue.lessThan(doc4, doc3) -> false (doc4.orderInShard = 1 < doc3.orderInShard = 2)
>
> and the final documents result will be:
> doc1(first_pass_score = 1, second_pass_score = 2)
> doc2(first_pass_score = 4, second_pass_score = 1)
> doc4(first_pass_score = 3, second_pass_score = 4)
> doc3(first_pass_score = 2, second_pass_score = 3)
>
> Ideally I would want to see rerank happening based on global order across
> all shards. I've implemented a custom component that asks shards to return
> Math.max(reRankDocs, offset + rows) documents, which are first sorted by
> first pass score, and then only the top reRankDocs are sorted by second
> pass score. I understand that it might not be the best way in terms of
> performance (we rerank only the top 60 documents, so it's not that big of a
> deal), but it's functionally equivalent to the single shard behavior.
>
> I'm curious whether the current behavior is intended or not; typically I
> would expect either something like what I described above, or at least
> ignoring sort during the merge and using only the doc.score that was
> generated by the LTR rescorer. Maybe the community would be interested in
> the approach I've implemented? Or is it considered bad design to rely on
> the first pass score, and should our LTR model instead use fields from the
> first pass / use OriginalScoreFeature?

--
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: https://semanticanalyzer.info