Hi,

If I run a MoreLikeThis query like the following:

http://www.cathdb.info/solr/mlt?q=id:3.40.50.720&rows=0&mlt.interestingTerms=list&mlt.match.include=false&mlt.fl=keywords&mlt.mintf=1&mlt.mindf=1

one of the terms listed in the results is "and" (I don't do any stopword removal on this field). However, if I look inside that document with the TermVectorComponent:

http://www.cathdb.info/solr/select/?q=id:3.40.50.720&tv=true&tv.all=true&tv.fl=keywords

I see that "and" has a measly tf.idf of 7.46E-4. But there are other terms with *much* higher tf.idf scores, e.g.:

    <lst name="aquaspirillum">
      <int name="tf">1</int>
      <int name="df">10</int>
      <double name="tf-idf">0.1</double>
    </lst>

that *don't* appear in the MoreLikeThis list. (I tried adding &mlt.maxwl=999 to the end of the MLT query, but it makes no difference.)

What's going on? Surely a term with tf.idf = 0.1 is a far better candidate for a MoreLikeThis query than "and", whose tf.idf is so much lower? Or does MoreLikeThis use some other heuristic to select good candidates, and sometimes get it wrong?

BTW, the keywords field is indexed, stored, multi-valued and term-vectored.

Thanks,
Andrew.

--
:: http://biotext.org.uk/ ::
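
One thing that may be worth checking when comparing the two outputs is whether they rank terms by the same formula. The quoted "aquaspirillum" entry (tf = 1, df = 10, tf-idf = 0.1) is consistent with a plain tf/df ratio. The sketch below compares that ratio with a log-damped tf * idf of the kind used by Lucene's classic similarity. It is only an illustration: that MoreLikeThis scores terms this way is an assumption, and NUM_DOCS and the "common_term" entry are hypothetical values, not figures from this index.

```python
import math


def ratio_tf_idf(tf, df):
    # Plain ratio; reproduces the quoted entry (tf=1, df=10 gives 0.1).
    return tf / df


def log_tf_idf(tf, df, num_docs):
    # Assumed classic Lucene-style weight: tf * (1 + ln(num_docs / (df + 1))).
    return tf * (1.0 + math.log(num_docs / (df + 1)))


NUM_DOCS = 2000  # hypothetical index size

# term -> (tf, df)
terms = {
    "aquaspirillum": (1, 10),   # values quoted in the message
    "common_term": (6, 1500),   # hypothetical frequent term
}

scores = {
    term: (ratio_tf_idf(tf, df), log_tf_idf(tf, df, NUM_DOCS))
    for term, (tf, df) in terms.items()
}

# Rank the same terms under each measure, best first.
by_ratio = sorted(scores, key=lambda t: scores[t][0], reverse=True)
by_log = sorted(scores, key=lambda t: scores[t][1], reverse=True)

print("ranked by tf/df:      ", by_ratio)
print("ranked by log tf*idf: ", by_log)
```

With these made-up numbers the two orderings disagree: the rare term wins on tf/df, while the frequent term wins on the log-damped score because its higher tf outweighs its lower idf. Whether that is what happens for "and" in this index depends on its actual tf and df, which the message does not give.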