Let me answer point by point:

1) "Similarity" is misleading here if you interpret it as a probabilistic
measure.
Given a query, there is no such thing as an "ideal document". With both
TF-IDF and BM25 (which handles this problem better) you are scoring the
document: the higher the score, the more relevant the document is to the
given query. BM25 does a better job here because its relevance function
reaches a saturation point, so it is closer to your expectation. This blog
post from Doug should help [1].
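
To make the saturation point concrete, here is a minimal sketch in plain
Python (not Lucene code). It assumes the term-frequency component of classic
TF-IDF is sqrt(freq) and uses the textbook BM25 term-frequency formula with
k1 = 1.2 and b = 0.75. The function names and the document lengths are
hypothetical and only there for illustration.

```python
import math


def classic_tf(freq: int) -> float:
    # Assumed classic TF-IDF term-frequency component: grows without bound.
    return math.sqrt(freq)


def bm25_tf(freq: int, k1: float = 1.2, b: float = 0.75,
            doc_len: float = 100.0, avg_doc_len: float = 100.0) -> float:
    # Textbook BM25 term-frequency component; it can never exceed k1 + 1.
    # doc_len and avg_doc_len are made-up values (an average-length document).
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return freq * (k1 + 1.0) / (freq + norm)


if __name__ == "__main__":
    for freq in (1, 3, 6, 12, 100):
        print(f"freq={freq:>3}  classic={classic_tf(freq):.3f}  "
              f"bm25={bm25_tf(freq):.3f}")
```

In this sketch, going from 3 to 6 occurrences of a term does not double
either component, and the BM25 one flattens out as the frequency grows.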

2) "if document vector A is at a 
distance of 5 and 10 units from document vectors B and C respectively then 
can't we say that B is twice as relevant to A as C is to A? Or in terms of 
distance, C is twice as distant to  A and B is to A?"

Not in Lucene, at least not strictly.
MLT currently uses TF-IDF as its scoring formula.
When the score of B is double the score of C, you can say that, as far as
Lucene is concerned, B is twice as relevant to A as C is.
From a user's perspective this can be different (quoting Doug: "If an
article mentions “dog” six times is it twice as relevant as an article
mentioning “dog” 3 times? Most users say no").

3) Under the hood, MLT builds a Lucene query and retrieves documents from
the index.
When building the MLT query, to keep it simple, it extracts from the seed
document a subset of terms that are considered representative of the seed
document (let's call them relevant terms).
This is controlled through a parameter, but usually, and by default, you
collect a limited set of relevant terms (not all the terms).
When retrieving similar documents you score them using TF-IDF (and, in the
future, BM25).
So, first of all, you can have documents with higher scores than the
original (this doesn't make sense in a probabilistic world, but this is how
Lucene works).
Swapping the documents, i.e. applying MLT to document B as the seed, you
could build a slightly different query.
So:
score(b) given seed(a) != score(a) given seed(b)

I understand you think it doesn't make sense, but this is how Lucene works.
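
The asymmetry from point 3 can be reproduced with a toy model in plain
Python. This is only a sketch, not the actual MLT implementation: the
corpus, the function names, the max_terms cut-off and the simplified
scoring (sum of sqrt(tf) * idf over the query terms) are all assumptions
made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus: document id -> tokens.
CORPUS = {
    "a": "lucene lucene scoring scoring scoring relevance".split(),
    "b": "lucene lucene lucene scoring index".split(),
    "c": "pasta tomato recipe basil".split(),
}


def idf(term: str) -> float:
    # Simplified inverse document frequency over the toy corpus.
    df = sum(1 for tokens in CORPUS.values() if term in tokens)
    return 1.0 + math.log(len(CORPUS) / (df + 1))


def relevant_terms(seed_id: str, max_terms: int = 2) -> list:
    # Keep only the top terms of the seed document, ranked by tf * idf
    # (ties broken alphabetically so the result is deterministic).
    tf = Counter(CORPUS[seed_id])
    ranked = sorted(tf, key=lambda t: (-tf[t] * idf(t), t))
    return ranked[:max_terms]


def mlt_score(seed_id: str, candidate_id: str, max_terms: int = 2) -> float:
    # Score the candidate against the query built from the seed's terms.
    tf = Counter(CORPUS[candidate_id])
    query = relevant_terms(seed_id, max_terms)
    return sum(math.sqrt(tf[t]) * idf(t) for t in query)


if __name__ == "__main__":
    print("seed a:", relevant_terms("a"), "score(b) =", mlt_score("a", "b"))
    print("seed b:", relevant_terms("b"), "score(a) =", mlt_score("b", "a"))
```

Each seed yields a different set of relevant terms, so the two queries
differ and the two scores are not equal.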

I also understand that users often want a percentage out of an MLT query.
I will certainly work in that direction, step by step; first I need to
have the MLT refactor approved and committed :)




[1]
https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/



-----
---------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io