nreimers commented on issue #12342: URL: https://github.com/apache/lucene/issues/12342#issuecomment-1658640222
@msokolov We discussed this in our BEIR paper: https://arxiv.org/abs/2104.08663

The issue with cosine similarity is that it encodes only the topic. Take the query `What is Pytorch` and two docs:

A: ``Pytorch is a framework``

B: ``PyTorch is a machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing, originally developed by Meta AI and now part of the Linux Foundation umbrella. It is free and open-source software released under the modified BSD license. Although the Python interface is more polished and the primary focus of development, PyTorch also has a C++ interface``

A cosine-similarity model would retrieve A (it is more on-topic). A dot-product model would retrieve B, since it is on-topic as well and contains more information on the topic (represented as a longer vector, i.e. one with a larger norm).

So dot-product models can encode topic + quantity/quality of information, while cosine-similarity models encode only topic match (which isn't ideal for retrieval).
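To make the cosine vs. dot-product difference concrete, here is a toy sketch in plain Python. The 2-d vectors are made-up numbers chosen for illustration, not embeddings from any real model: doc A points in exactly the query's direction but is short, while doc B is slightly off-direction but has a larger norm, standing in for "more information on the topic".

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Dot product of the two vectors after normalizing each to unit length."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical toy embeddings (assumed values, not produced by a model).
query = [1.0, 0.0]
docs = {
    "A": [1.0, 0.0],  # perfectly aligned with the query, small norm
    "B": [3.0, 1.0],  # slightly off-direction, larger norm
}

def rank(score):
    """Return doc names sorted from best to worst under the given score."""
    return sorted(docs, key=lambda name: score(query, docs[name]), reverse=True)

cosine_ranking = rank(cosine)
dot_ranking = rank(dot)

print("cosine:", cosine_ranking)
print("dot:   ", dot_ranking)
```

With these assumed vectors, cosine puts A first (1.0 vs. roughly 0.95 for B), while dot product puts B first (3.0 vs. 1.0). The only thing that changed between the two rankings is whether vector length is allowed to count.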