Thanks for the reply, Alessandro.

Could you please elaborate on the point "a document which has a score 50% of
the original doc score, it doesn't mean it is 50% similar"? I did not
understand this for two reasons:

1. In the end, we are calculating a similarity score between documents when
we solve the search problem, where the search query is also treated as a
small document. Similarity inherently means how similar one thing is to
another.

2. If we think about the vector representations of documents in a
multidimensional space, we are basically calculating the "distance" between
these documents, and we interpret that distance as "similarity". The farther
apart the document vectors are in that space, the less similar those
documents are to each other. How we calculate the distance is one thing
(e.g. cosine distance, Euclidean distance, etc.), but once we agree upon a
distance/similarity calculation method, if document vector A is at a
distance of 5 and 10 units from document vectors B and C respectively, then
can't we say that B is twice as relevant to A as C is to A? Or, in terms of
distance, that C is twice as distant from A as B is from A?
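
To make the vector-space picture concrete, here is a small sketch. The
vectors A, B and C are made-up term weights chosen only so that the
Euclidean distances come out as 5 and 10; they are not anything Solr
produces. The sketch checks that both measures are symmetric, and it also
lets you compare how the "twice as far" ratio looks under each measure.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical term-weight vectors (illustrative assumption only).
A = [1.0, 1.0]
B = [4.0, 5.0]   # Euclidean distance 5 from A
C = [7.0, 9.0]   # Euclidean distance 10 from A

print("euclidean A-B:", euclidean(A, B), " B-A:", euclidean(B, A))
print("euclidean A-C:", euclidean(A, C))
print("cosine    A-B:", cosine_similarity(A, B), " B-A:", cosine_similarity(B, A))
print("cosine    A-C:", cosine_similarity(A, C))
```

With these particular vectors, the Euclidean distance says C is twice as far
from A as B is, while the two cosine similarities are almost equal, so the
ratio seems to be a property of the chosen measure rather than of the
documents alone.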


I found this response from jlman in the following thread very similar to my
solution:

http://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=print_post&node=561671


He also warns about the scores between two documents not being
bidirectional.

If all else remains constant (relevancy algorithm, number of documents in
the index, etc.), why is the relevancy between two documents, calculated
with the approach I mentioned, not bidirectional? That is, why is it
possible that document A is more similar to B than B is to A?
When I think in terms of a multidimensional vector space, this does not make
sense at all, because the distance between A and B in that space is not
going to change provided all else remains constant. If A is at a distance
of 5 units from B, then B is also at a distance of 5 units from A, isn't it?
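
One way I can imagine such an asymmetry arising is sketched below, purely as
a toy. It assumes the "query" is built from only the top few weighted terms
of the source document and then scored against the target document. The
documents, the `max_terms` cutoff, the idf formula and the function names
are all my own illustrative assumptions; this is not Lucene's actual
MoreLikeThis or scoring code.

```python
import math
from collections import Counter

# Hypothetical tokenized documents (illustrative assumption only).
docs = {
    "A": ["solr", "solr", "solr", "lucene", "search"],
    "B": ["lucene", "lucene", "lucene", "solr", "index"],
    "C": ["search", "index", "ranking"],
}

def idf(term, docs):
    """A simple smoothed inverse document frequency (toy formula)."""
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / df) + 1.0

def top_terms(doc_id, docs, max_terms):
    """Keep only the max_terms highest tf*idf terms of the source document."""
    tf = Counter(docs[doc_id])
    weighted = {t: tf[t] * idf(t, docs) for t in tf}
    # Sort by weight descending, then alphabetically to break ties.
    ranked = sorted(weighted, key=lambda t: (-weighted[t], t))
    return {t: weighted[t] for t in ranked[:max_terms]}

def score(source_id, target_id, docs, max_terms=2):
    """Score the target against a query built from the source's top terms."""
    query = top_terms(source_id, docs, max_terms)
    target_tf = Counter(docs[target_id])
    return sum(weight * target_tf[t] for t, weight in query.items())

print("A -> B (2 terms):", score("A", "B", docs))
print("B -> A (2 terms):", score("B", "A", docs))
print("A -> B (all terms):", score("A", "B", docs, max_terms=10))
print("B -> A (all terms):", score("B", "A", docs, max_terms=10))
```

In this toy, the two directions differ when the query is cut down to two
terms, because A and B each keep a different subset of terms, and they agree
again when every term is kept. If something similar happens in the real
implementation, the score would behave less like a distance between two
fixed vectors and more like a query-to-document score.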

Thanks,
Arnold

On Thu, Feb 8, 2018 at 7:02 AM, Alessandro Benedetti <a.benede...@sease.io>
wrote:

> Hi,
> I have been personally working a lot with the MoreLikeThis and I am close to
> contribute a refactor of that module (to break up the monolithic giant
> facade class mostly).
>
> First of all the MoreLikeThis handler will return the original document (
> not scored) + the similar documents(scored).
> The original document is not considered by the MoreLikeThis query, so it is
> not returned as part of the results of the MLT lucene query, it is just
> added to the response in the beginning.
>
> if I remember well, but I am unable to check at the moment, you should be
> able to get the original document in the response set ( with max score)
> using the More Like This query parser.
> Please double check that
>
> Generally speaking at the moment TF-IDF is used under the hood, which means
> that sometime the score is not probabilistic.
> So a document which has a score 50% of the original doc score, it doesn't
> mean it is 50% similar, but for your use case it may be a feasible
> approximation.
>
>
>
> -----
> ---------------
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>
