Hi.

We're using LTR, and after switching to multiple shards we found that the
rerank happens on individual shards and that the first pass score isn't used
during the merge phase. Currently our LTR model doesn't use textual match and
assumes that the reranked documents are already reasonably good in terms of
textual score, which is not always the case when documents are distributed
across shards.

To avoid this I tried sorting by a function that replicates the actual query,
and the results are somewhat interesting. On individual shards the first pass
uses my sort, then documents are reranked, and during the merge documents from
the same shard are compared by "orderInShard" while documents from different
shards are compared by the sort value, so the final order follows neither the
sort value nor the score.
For example, let's assume that documents coming from shard 1 are:
    doc1(first_pass_score = 1, second_pass_score = 2)
    doc2(first_pass_score = 4, second_pass_score = 1)
and documents coming from shard 2 are:
    doc4(first_pass_score = 3, second_pass_score = 4)
    doc3(first_pass_score = 2, second_pass_score = 3)
where first_pass_score is doc.sort_values[0] and second_pass_score is
doc.score

When we try to merge all documents, this happens:
    queue.insertWithOverflow(doc1)
    queue.insertWithOverflow(doc2)
        queue.lessThan(doc1, doc2) -> false (doc1.orderInShard = 1 < doc2.orderInShard = 2)
    queue.insertWithOverflow(doc4)
        queue.lessThan(doc2, doc4) -> false (doc2.first_pass_score = 4 > doc4.first_pass_score = 3)
    queue.insertWithOverflow(doc3)
        queue.lessThan(doc4, doc3) -> false (doc4.orderInShard = 1 < doc3.orderInShard = 2)

The final order will be:
    doc1(first_pass_score = 1, second_pass_score = 2)
    doc2(first_pass_score = 4, second_pass_score = 1)
    doc4(first_pass_score = 3, second_pass_score = 4)
    doc3(first_pass_score = 2, second_pass_score = 3)
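
To make the comparisons above easier to check, here is a small Python sketch
of the merge rule as I understand it. It is only a model of the behavior
described in this mail, not Solr's actual code: the ShardDoc class and the
less_than function are hypothetical names, and the rule (same shard compares
by orderInShard, different shards compare by the sort value) is an assumption
taken from the trace.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardDoc:
    name: str
    shard: int
    order_in_shard: int        # position after the shard-local rerank (1 = best)
    first_pass_score: float    # doc.sort_values[0]
    second_pass_score: float   # doc.score produced by the rescorer


def less_than(a: ShardDoc, b: ShardDoc) -> bool:
    """Hypothetical model: True when `a` ranks below `b` during the merge."""
    if a.shard == b.shard:
        # Same shard: trust the order the shard produced after reranking.
        return a.order_in_shard > b.order_in_shard
    # Different shards: fall back to the sort value (first pass score).
    return a.first_pass_score < b.first_pass_score


doc1 = ShardDoc("doc1", shard=1, order_in_shard=1, first_pass_score=1, second_pass_score=2)
doc2 = ShardDoc("doc2", shard=1, order_in_shard=2, first_pass_score=4, second_pass_score=1)
doc4 = ShardDoc("doc4", shard=2, order_in_shard=1, first_pass_score=3, second_pass_score=4)
doc3 = ShardDoc("doc3", shard=2, order_in_shard=2, first_pass_score=2, second_pass_score=3)

# The merged order reported in the example.
merged = [doc1, doc2, doc4, doc3]

# What a pure first pass or pure second pass ordering would look like.
by_first_pass = sorted(merged, key=lambda d: d.first_pass_score, reverse=True)
by_second_pass = sorted(merged, key=lambda d: d.second_pass_score, reverse=True)

# Every adjacent pair satisfies the comparator, yet the list matches neither ordering.
adjacent_ok = all(not less_than(a, b) for a, b in zip(merged, merged[1:]))
print(adjacent_ok, merged == by_first_pass, merged == by_second_pass)
```

Note that a rule like this mixes two different keys, so it does not define a
single consistent ordering over all documents.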

Ideally I would want the rerank to be based on the global order across all
shards. I've implemented a custom component that asks each shard to return
*Math.max(reRankDocs, offset + rows)* documents, which are first sorted by
first pass score, and then only the top *reRankDocs* are sorted by second pass
score. I understand that it might not be the best way in terms of performance
(we rerank only the top 60 documents, so it's not that big of a deal), but it's
functionally equivalent to the single shard behavior.
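
The idea can be sketched roughly as follows in Python. This is not the actual
component, only an illustration under some assumptions: global_rerank and the
dict keys are hypothetical names, and I assume every candidate already carries
both scores when it reaches the coordinator.

```python
def global_rerank(shard_results, re_rank_docs, offset, rows):
    """Merge shard results by first pass score, then rerank only the global top."""
    # Each shard contributes at most max(reRankDocs, offset + rows) candidates.
    per_shard = max(re_rank_docs, offset + rows)

    candidates = []
    for docs in shard_results:
        top = sorted(docs, key=lambda d: d["first_pass_score"], reverse=True)
        candidates.extend(top[:per_shard])

    # Global first pass order across all shards.
    candidates.sort(key=lambda d: d["first_pass_score"], reverse=True)

    # Only the global top reRankDocs are reordered by the second pass score;
    # the remaining documents keep their first pass order.
    head = sorted(candidates[:re_rank_docs],
                  key=lambda d: d["second_pass_score"], reverse=True)
    tail = candidates[re_rank_docs:]

    return (head + tail)[offset:offset + rows]


shard1 = [
    {"name": "doc1", "first_pass_score": 1, "second_pass_score": 2},
    {"name": "doc2", "first_pass_score": 4, "second_pass_score": 1},
]
shard2 = [
    {"name": "doc4", "first_pass_score": 3, "second_pass_score": 4},
    {"name": "doc3", "first_pass_score": 2, "second_pass_score": 3},
]

result = global_rerank([shard1, shard2], re_rank_docs=3, offset=0, rows=4)
print([d["name"] for d in result])  # ['doc4', 'doc3', 'doc2', 'doc1']
```

With reRankDocs = 3, doc1 stays last because it never enters the rerank
window, which is what a single shard would do with the same four documents.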

I'm curious whether the current behavior is intended. I would expect either
what I described above or, at least, the merge ignoring the sort and using
only the doc.score generated by the LTR rescorer. Would the community be
interested in the approach I've implemented? Or is it considered bad design to
rely on the first pass score, and should our LTR model instead use fields from
the first pass or OriginalScoreFeature?
