Thanks for the reply, Alessandro. Could you please elaborate on the point that "a document which has a score 50% of the original doc score, it doesn't mean it is 50% similar"? I did not understand it, for two reasons:
1. In the end, when we solve the search problem we are calculating a similarity score between documents, with the search query treated as a small document. Similarity inherently means how similar one thing is to another.

2. If we think about the vector representations of documents in a multidimensional space, we are basically calculating the "distance" between these documents, and we interpret that distance as "similarity". The farther apart the document vectors are in that space, the less similar the documents are to each other. How we calculate the distance is one thing (e.g. cosine distance, Euclidean distance, etc.), but once we agree on a distance/similarity calculation method, if document vector A is at a distance of 5 and 10 units from document vectors B and C respectively, can't we say that B is twice as relevant to A as C is to A? Or, in terms of distance, that C is twice as distant from A as B is from A?

I found this response from jlman in the following thread very similar to my solution:
http://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=print_post&node=561671

He also warns that the scores between two documents are not bidirectional. If all else remains constant (relevancy algorithm, number of documents in the index, etc.), why is the relevancy between two documents, calculated with the approach I mentioned, not bidirectional? That is, why is it possible for document A to be more similar to B than B is to A? When I think in terms of a multidimensional vector space, this does not make sense at all, because the distance between A and B in that space is not going to change provided all else remains constant. If A is at a distance of 5 units from B, then B is also at a distance of 5 units from A, isn't it?
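To make the second point concrete, here is a minimal Python sketch of what I have in mind. The vectors are made-up term weights and the helper names are my own; this is not what Lucene or MoreLikeThis actually computes. It only illustrates that a plain vector-space distance is symmetric and that 5 versus 10 units is a 2x ratio.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical term-weight vectors, chosen so the distances come out as 5 and 10.
doc_a = [1.0, 1.0, 1.0]
doc_b = [4.0, 5.0, 1.0]
doc_c = [7.0, 9.0, 1.0]

dist_ab = euclidean_distance(doc_a, doc_b)  # 5.0
dist_ac = euclidean_distance(doc_a, doc_c)  # 10.0

# Swapping the arguments gives the same value for both measures.
dist_ba = euclidean_distance(doc_b, doc_a)
cos_ab = cosine_similarity(doc_a, doc_b)
cos_ba = cosine_similarity(doc_b, doc_a)

print(dist_ab, dist_ac, dist_ac / dist_ab)
print(dist_ab == dist_ba, math.isclose(cos_ab, cos_ba))
```

In this toy model the distance from A to B equals the distance from B to A, which is why the non-bidirectional scores surprise me.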
Thanks,
Arnold

On Thu, Feb 8, 2018 at 7:02 AM, Alessandro Benedetti <a.benede...@sease.io> wrote:

> Hi,
> I have been personally working a lot with the MoreLikeThis and I am close to
> contribute a refactor of that module ( to break up the monolithic giant
> facade class mostly) .
>
> First of all the MoreLikeThis handler will return the original document (
> not scored) + the similar documents(scored).
> The original document is not considered by the MoreLikeThis query, so it is
> not returned as part of the results of the MLT lucene query, it is just
> added to the response in the beginning.
>
> if I remember well, but I am unable to check at the moment, you should be
> able to get the original document in the response set ( with max score)
> using the More Like This query parser.
> Please double check that
>
> Generally speaking at the moment TF-IDF is used under the hood, which means
> that sometime the score is not probabilistic.
> So a document which has a score 50% of the original doc score, it doesn't
> mean it is 50% similar, but for your use case it may be a feasible
> approximation.
>
> -----
> ---------------
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html