nreimers commented on issue #12342:
URL: https://github.com/apache/lucene/issues/12342#issuecomment-1658640222

   @msokolov In our BEIR paper we talked about this:
   https://arxiv.org/abs/2104.08663
   
   The issue with cosine similarity is that it only encodes the topic. Take the 
query `What is PyTorch` and two docs:
   A:
   ``Pytorch is a framework``
   
   B:
   ``PyTorch is a machine learning framework based on the Torch library, used 
for applications such as computer vision and natural language processing, 
originally developed by Meta AI and now part of the Linux Foundation umbrella. 
It is free and open-source software released under the modified BSD license. 
Although the Python interface is more polished and the primary focus of 
development, PyTorch also has a C++ interface``
   
   A cosine similarity model would retrieve A (it is more on-topic). A 
dot-product model would retrieve B, as it is on-topic as well and contains more 
information on the topic (represented as a longer vector, i.e. one with a larger norm).
   
   So dot-product models can encode topic + quantity/quality of information, 
while cosine similarity models only encode topic match (which isn't ideal for 
retrieval). 
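
   A toy sketch of that difference. The 2-d vectors below are made up for 
illustration (they are not real embeddings from any model): doc A points exactly 
in the query direction but is short, while doc B is slightly off-direction but 
has a larger norm.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Dot product divided by both norms, so vector length is ignored."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Hypothetical toy vectors, chosen only to illustrate the point.
query = [1.0, 0.0]
docs = {
    "A": [1.0, 0.0],  # perfectly on-topic, small norm
    "B": [3.0, 1.0],  # slightly off-direction, larger norm
}

best_by_cosine = max(docs, key=lambda name: cosine(query, docs[name]))
best_by_dot = max(docs, key=lambda name: dot(query, docs[name]))

print("cosine picks:", best_by_cosine)
print("dot product picks:", best_by_dot)
```

   Cosine similarity divides the norm out, so only the direction (topic) counts 
and A wins. The dot product keeps the norm, so B's longer vector outweighs its 
slightly worse direction.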


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

