Solr mlt doesn't return documents with "exactly the same" contents
I have two documents with ids "aaa" and "bbb", and both have the title "a black fox jumps over a red flower". I imported both documents, along with several other test documents, into a core named "test".

I want Solr to return documents similar to document "aaa", so I submitted the following request:

http://localhost:8983/solr/test/select?q=id:aaa&mlt=true&mlt.fl=title

Solr returned some similar documents. However, document "bbb", which should be the most similar document to "aaa", was not in the returned mlt list.

Any ideas how this could happen?

Thanks!

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-mlt-doesn-t-return-documents-with-exactly-the-same-contents-tp4171284.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr mlt doesn't return documents with "exactly the same" contents
Hi Nishant,

Thank you for the reply. I believe that Solr removes the first document from the mlt list because a document is most similar to itself and thus should be excluded. In my case, "aaa" and "bbb" are two different documents. When searching for documents similar to "aaa", the document "aaa" should be removed from the list, but "bbb" should be kept.

I did the experiment you suggested. Unfortunately, the document "ccc" was not in the mlt list. I then changed the title of "ccc" to a slightly different sentence, "a black fox jumps over a yellow flower", but the document "ccc" was still not in the list. :-(

Does anyone have any clues on this?

Thanks!
Re: Solr mlt doesn't return documents with "exactly the same" contents
After carefully reading the mlt parameters at https://wiki.apache.org/solr/MoreLikeThis I found that I can specify the following parameters to get "bbb" returned when searching for documents similar to "aaa":

mlt.mintf=1
mlt.mindf=2

Details:

mlt.mintf: Minimum Term Frequency. Terms that occur fewer than this many times in the source doc are ignored. DEFAULT_MIN_TERM_FREQ = 2

mlt.mindf: Minimum Document Frequency. Terms that do not occur in at least this many docs are ignored. DEFAULT_MIN_DOC_FREQ = 5

My understanding of why the defaults hid "bbb": in a short title most terms occur only once, which is below the default mintf of 2, and in a small test index few terms appear in 5 or more docs, so almost no terms are left to build the mlt query from.

Hope this is helpful to those who are confused about the mlt results.
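For anyone who wants to see the two thresholds in action, here is a rough Python sketch. It is not Solr's actual code: `select_terms` is a hypothetical toy filter that applies only the mintf and mindf checks to whitespace-split tokens (no analyzer, no stopwords, no other mlt limits), and the document frequencies are made up on the assumption that only "aaa" and "bbb" contain the title terms. `mlt_url` is likewise a hypothetical helper that just builds the request URL from this thread with the two extra parameters.

```python
from collections import Counter
from urllib.parse import urlencode


def select_terms(source_tokens, doc_freq, min_tf=2, min_df=5):
    """Toy filter: keep terms that pass both thresholds.

    min_tf / min_df default to the values quoted above
    (DEFAULT_MIN_TERM_FREQ = 2, DEFAULT_MIN_DOC_FREQ = 5).
    """
    tf = Counter(source_tokens)
    return sorted(
        term
        for term, count in tf.items()
        if count >= min_tf and doc_freq.get(term, 0) >= min_df
    )


def mlt_url(doc_id, base="http://localhost:8983/solr/test/select",
            min_tf=1, min_df=2):
    """Build the mlt request from this thread with explicit thresholds."""
    params = {
        "q": f"id:{doc_id}",
        "mlt": "true",
        "mlt.fl": "title",
        "mlt.mintf": min_tf,
        "mlt.mindf": min_df,
    }
    return base + "?" + urlencode(params)


title = "a black fox jumps over a red flower".split()
# Assumed for illustration: each title term occurs in exactly two docs
# ("aaa" and "bbb") of the small test index.
doc_freq = {term: 2 for term in set(title)}

with_defaults = select_terms(title, doc_freq)
with_overrides = select_terms(title, doc_freq, min_tf=1, min_df=2)
url = mlt_url("aaa")

print(with_defaults)   # terms left with the default thresholds
print(with_overrides)  # terms left with mintf=1, mindf=2
print(url)
```

Under these assumptions the default thresholds leave no terms at all, while mintf=1 and mindf=2 keep every distinct title term, which matches what I saw with "bbb". Note that `urlencode` escapes the colon in `id:aaa`; Solr accepts either form.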