hellosunil commented on issue #14769: URL: https://github.com/apache/lucene/issues/14769#issuecomment-2963050158
I agree that using internal doc ID as a tie-breaker for sorting documents with identical scores within a single query result is reasonable. However, I'm concerned about a specific scenario in multi-query RRF fusion.

Consider a case with **keyword search + vector search**. If all documents from the keyword search have identical scores, but the vector search provides different rankings, shouldn't the RRF scores reflect the vector search rankings rather than arbitrary positional rankings from the keyword search?

**Example:**

Let's say we have 4 documents (A, B, C, D) and two queries:

**Query 1 (Keyword Search) - All tied but sorted by doc ID:**

- Doc A: score = 0.16, position = 0 → rank = 1
- Doc B: score = 0.16, position = 1 → rank = 2
- Doc C: score = 0.16, position = 2 → rank = 3
- Doc D: score = 0.16, position = 3 → rank = 4 *(pushed down due to doc ID)*

**Query 2 (Vector Search):**

- Doc D: score = 0.9, position = 0 → rank = 1 *(best match!)*
- Doc A: score = 0.8, position = 1 → rank = 2
- Doc B: score = 0.7, position = 2 → rank = 3
- Doc C: score = 0.6, position = 3 → rank = 4

**Current RRF Calculation (k=60):**

```other
Document | Query1 Rank | Query2 Rank | RRF Score Calculation                 | Final Score
---------|-------------|-------------|---------------------------------------|------------
Doc A    | 1           | 2           | 1/(60+1) + 1/(60+2) = 0.0164 + 0.0161 | 0.0325
Doc B    | 2           | 3           | 1/(60+2) + 1/(60+3) = 0.0161 + 0.0159 | 0.0320
Doc C    | 3           | 4           | 1/(60+3) + 1/(60+4) = 0.0159 + 0.0156 | 0.0315
Doc D    | 4           | 1           | 1/(60+4) + 1/(60+1) = 0.0156 + 0.0164 | 0.0320
```

**Expected RRF Calculation (if tied scores got equal rank=1):**

```other
Document | Query1 Rank | Query2 Rank | RRF Score Calculation                 | Final Score
---------|-------------|-------------|---------------------------------------|------------
Doc D    | 1           | 1           | 1/(60+1) + 1/(60+1) = 0.0164 + 0.0164 | 0.0328 ← Should be #1!
Doc A    | 1           | 2           | 1/(60+1) + 1/(60+2) = 0.0164 + 0.0161 | 0.0325
Doc B    | 1           | 3           | 1/(60+1) + 1/(60+3) = 0.0164 + 0.0159 | 0.0323
Doc C    | 1           | 4           | 1/(60+1) + 1/(60+4) = 0.0164 + 0.0156 | 0.0320
```

In this example, **Doc D should be ranked #1** since it's the best vector match while being tied in keyword search. However, the current implementation ranks Doc A highest due to arbitrary positional ranking, even though Doc D is clearly the better overall match.

This demonstrates how positional ranking for tied scores can lead to suboptimal RRF results that don't properly reflect the true relevance across multiple query types.

Am I missing something or misunderstanding anything here?
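The two calculations above can be reproduced with a minimal Python sketch. It is standalone and does not use Lucene's API; the helper names `positional_ranks`, `tie_aware_ranks`, and `rrf` are hypothetical. The tie-aware variant assumes standard competition ranking (tied documents share the best rank of their group), which matches the "rank = 1 for all tied docs" case in the example but is only one possible way to handle ties.

```python
from collections import defaultdict


def positional_ranks(hits):
    """Rank = position + 1, so tied scores still receive distinct ranks."""
    return {doc: pos + 1 for pos, (doc, _score) in enumerate(hits)}


def tie_aware_ranks(hits):
    """Assumed alternative: docs with equal scores share the best rank of their group."""
    ranks = {}
    prev_score, prev_rank = None, 0
    for pos, (doc, score) in enumerate(hits):
        if score != prev_score:
            # A new score group starts here; its rank is its 1-based position.
            prev_rank = pos + 1
            prev_score = score
        ranks[doc] = prev_rank
    return ranks


def rrf(result_lists, rank_fn, k=60):
    """Sum 1 / (k + rank) per document across all result lists."""
    fused = defaultdict(float)
    for hits in result_lists:
        for doc, rank in rank_fn(hits).items():
            fused[doc] += 1.0 / (k + rank)
    # Highest fused score first; doc name breaks exact ties deterministically.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Result lists from the example, already sorted as each query returned them.
keyword = [("A", 0.16), ("B", 0.16), ("C", 0.16), ("D", 0.16)]
vector = [("D", 0.9), ("A", 0.8), ("B", 0.7), ("C", 0.6)]

positional = rrf([keyword, vector], positional_ranks)
tie_aware = rrf([keyword, vector], tie_aware_ranks)

for name, fused in (("positional", positional), ("tie-aware", tie_aware)):
    print(name, [(doc, round(score, 4)) for doc, score in fused])
```

With positional ranks Doc A comes out on top; with tie-aware ranks Doc D does, because the keyword list no longer contributes any ordering among the tied documents.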