benwtrent commented on issue #12342: URL: https://github.com/apache/lucene/issues/12342#issuecomment-1589218572
> To me the main concern is more that unbounded scores make it hard to combine scores with another query via a disjunction as it's hard to know ahead of time whether the vector may completely dominate scores. I say we have that problem now. Vector scores and BM25 are nowhere near on the same scale. Folks need to adjust their boosts accordingly, regardless. > And also bw compat as MikeS raised. I agree, BWC is a big deal here. And I suggest we create a new similarity that just uses dot product under the hood. Call it `maximum-inner-product`. > As for producing a D.P. that is scaled for use with arbitrary vectors I don't see the point really. If what you want is to handle arbitrary scaled vectors, EUCLIDEAN is a better choice. Quoting a SLACK conversation with @nreimers: >Wow, what a bad implementation by Elastic. Models with unnormalized vectors and dot product work better for search than models with normalized vectors / cosine similarity. Models with cosine similarity have the issue that they often retrieve noise when your dataset gets noisier >...The best match for a query (e.g. What is the capital of the US ) with cosine similarity is the query itself, as cossim(query, query)=1. So when your corpus gets bigger and is not carefully cleaned, it contains many short documents that look like queries. These are preferably retrieved by the model, so the user asks a questions and gets as a response a doc that is paraphrase of the query (e.g. query="What is the capital of the US" top-1 hit: Capital of the US). Dot product has the tendency to work better when your corpus gets larger / noisy. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org