benwtrent commented on issue #12342:
URL: https://github.com/apache/lucene/issues/12342#issuecomment-1589218572

   > To me the main concern is more that unbounded scores make it hard to 
combine scores with another query via a disjunction as it's hard to know ahead 
of time whether the vector may completely dominate scores.
   
   I'd say we have that problem already. Vector scores and BM25 scores are 
nowhere near the same scale, so folks need to adjust their boosts accordingly 
regardless.
   
   > And also bw compat as MikeS raised.
   
   I agree, BWC is a big deal here. And I suggest we create a new similarity 
that just uses dot product under the hood. Call it `maximum-inner-product`.
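
   To make the idea concrete: a raw dot product over unnormalized vectors is 
unbounded and can be negative, so a similarity built on it would need some 
order-preserving mapping into non-negative scores. The sketch below shows one 
possible mapping. The function name `scale_inner_product` and the particular 
piecewise formula are assumptions for illustration only, not something settled 
in this thread.

```python
def scale_inner_product(dot: float) -> float:
    """Map an unbounded dot product to a strictly positive score.

    Hypothetical mapping, chosen only because it is monotonic:
    a larger dot product always yields a larger score.
    """
    if dot < 0:
        # Negative dot products are squeezed into the interval (0, 1).
        return 1.0 / (1.0 - dot)
    # Non-negative dot products are shifted into [1, +inf).
    return dot + 1.0


# Ranking by scaled score matches ranking by raw dot product.
raw = [-4.0, -1.0, 0.0, 0.5, 3.0]
scaled = [scale_inner_product(d) for d in raw]
```

   Because the mapping is monotonic, it changes the score values but not the 
order of the hits, and existing users of the current dot product similarity 
would be unaffected.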
   
   >  As for producing a D.P. that is scaled for use with arbitrary vectors I 
don't see the point really. If what you want is to handle arbitrary scaled 
vectors, EUCLIDEAN is a better choice. 
   
   Quoting a Slack conversation with @nreimers:
   > Wow, what a bad implementation by Elastic.
   > Models with unnormalized vectors and dot product work better for search 
than models with normalized vectors / cosine similarity.
   > Models with cosine similarity have the issue that they often retrieve 
noise when your dataset gets noisier.
   > ...The best match for a query (e.g. "What is the capital of the US") with 
cosine similarity is the query itself, as cossim(query, query)=1.
   > So when your corpus gets bigger and is not carefully cleaned, it contains 
many short documents that look like queries. These are preferentially retrieved 
by the model, so the user asks a question and gets as a response a doc that is 
a paraphrase of the query (e.g. query="What is the capital of the US", top-1 
hit: "Capital of the US").
   > Dot product tends to work better when your corpus gets larger / 
noisier.
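
   The effect described in the quote can be seen in a toy example. The 
vectors below are made up for illustration (they do not come from any real 
embedding model), and the names `paraphrase` and `answer` are hypothetical: 
one short document points almost exactly in the query's direction, the other 
is less aligned but has a larger magnitude.

```python
import math


def dot(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: the dot product of the length-normalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy 2-d vectors, invented for this sketch.
query = [1.0, 1.0]
docs = {
    "paraphrase": [0.9, 1.0],  # nearly the same direction as the query, small norm
    "answer": [3.0, 2.0],      # less aligned, but larger norm
}


def rank(score):
    """Return document names ordered from best to worst under `score`."""
    return sorted(docs, key=lambda name: score(query, docs[name]), reverse=True)


by_cosine = rank(cosine)
by_dot = rank(dot)
```

   With these numbers, cosine similarity puts the paraphrase first because it 
only measures direction, while the dot product puts the larger-magnitude 
answer first. Whether real models encode usefulness in vector magnitude this 
way depends on how they were trained.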


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

