So let me answer point by point:

1) "Similarity" is misleading here if you interpret it as a probabilistic measure. Given a query, there is no such thing as the "ideal document". With both TF-IDF and BM25 (which handles this problem better) you are scoring the document: the higher the score, the more relevant that document is for the given query. BM25 does a better job here because its relevance function hits a saturation point, so it is closer to your expectation. This blog post from Doug should help [1].
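To make the saturation point concrete, here is a minimal sketch of just the term-frequency part of the two formulas. It is not Lucene code: the function names are mine, idf is ignored, and I assume the document has exactly the average length so that length normalization drops out. The sqrt(freq) factor is the tf used by Lucene's ClassicSimilarity, and k1=1.2, b=0.75 are the BM25Similarity defaults.

```python
import math


def classic_tf(freq):
    # tf component of classic TF-IDF in Lucene: grows without bound.
    return math.sqrt(freq)


def bm25_tf(freq, k1=1.2, b=0.75, doc_len=100.0, avg_doc_len=100.0):
    # tf component of BM25: approaches (k1 + 1) as freq grows.
    # doc_len == avg_doc_len is an assumption made to keep the sketch simple.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return freq * (k1 + 1.0) / (freq + norm)


if __name__ == "__main__":
    # Print both contributions side by side for growing term frequencies.
    for freq in (1, 3, 6, 12, 100):
        print(freq, round(classic_tf(freq), 3), round(bm25_tf(freq), 3))
```

Running it shows the classic tf contribution still climbing at 100 occurrences, while the BM25 contribution flattens out below k1 + 1. In neither case does doubling the occurrences double the contribution.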
2) "if document vector A is at a distance of 5 and 10 units from document vectors B and C respectively then can't we say that B is twice as relevant to A as C is to A? Or in terms of distance, C is twice as distant to A and B is to A?"

Not in Lucene, at least not strictly. The current MLT uses TF-IDF as its scoring formula. When the score of B is double the score of C, you can say that, for Lucene, B is twice as relevant to A as C is. From a user perspective this can be different (quoting Doug: "If an article mentions “dog” six times is it twice as relevant as an article mentioning “dog” 3 times? Most users say no").

3) Under the hood, MLT builds a Lucene query and retrieves documents from the index. When building the MLT query, to keep it simple, it extracts from the seed document a subset of terms that are considered representative of the seed document (let's call them relevant terms). This is managed through a parameter, but usually, and by default, you collect a limited set of relevant terms (not all the terms). When retrieving similar documents, you score them using TF-IDF (and in the future BM25).

So, first of all, you can have documents with higher scores than the original (it doesn't make sense in a probabilistic world, but this is how Lucene works). Reversing the roles, i.e. applying MLT to document B, you could build a slightly different query. So: given seed(a), score(b) != score(a) given seed(b).

I understand you think it doesn't make sense, but this is how Lucene works. I also understand that users often want a percentage out of an MLT query. I will work in that direction for sure, step by step; first I need to have the MLT refactor approved and patched :)

[1] https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/

-----
---------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd.
- www.sease.io