Any ideas on this? Is it worth sending a bug report? Those links are live, by the way, in case anyone wants to verify that MLT is returning suggestions with very low tf.idf.
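One thing worth ruling out before filing a report: the two components may not be computing the same quantity. The sketch below is only an illustration under assumptions, not a reading of the Solr source. It assumes the TermVectorComponent "tf-idf" is the plain ratio tf/df (which matches the quoted aquaspirillum entry, 1/10 = 0.1) and that MoreLikeThis ranks terms by tf * (1 + ln(numDocs / (df + 1))), a log-damped idf. The corpus size and the tf/df pair for "and" are hypothetical, picked so that tf/df comes out near 7.46E-4.

```python
import math


def ratio_tf_idf(tf: int, df: int) -> float:
    """Plain tf/df ratio (assumed to be what TermVectorComponent reports)."""
    return tf / df


def log_idf_score(tf: int, df: int, num_docs: int) -> float:
    """tf times a log-damped idf (assumed MoreLikeThis-style term score)."""
    return tf * (1.0 + math.log(num_docs / (df + 1)))


NUM_DOCS = 100_000               # hypothetical corpus size
RARE_TERM = (1, 10)              # tf, df from the quoted aquaspirillum entry
COMMON_TERM = (50, 67_000)       # hypothetical tf, df for "and"

for label, (tf, df) in (("rare", RARE_TERM), ("common", COMMON_TERM)):
    print(label, ratio_tf_idf(tf, df), log_idf_score(tf, df, NUM_DOCS))
```

Under these assumptions the rare term wins on tf/df, but the common term wins on the log-damped score because its high in-document frequency outweighs the weaker idf penalty. If that is what is happening, the two rankings would disagree without either component being wrong.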

Cheers,

Andrew.


Andrew Clegg wrote:
>
> Hi,
>
> If I run a MoreLikeThis query like the following:
>
> http://www.cathdb.info/solr/mlt?q=id:3.40.50.720&rows=0&mlt.interestingTerms=list&mlt.match.include=false&mlt.fl=keywords&mlt.mintf=1&mlt.mindf=1
>
> one of the hits in the results is "and" (I don't do any stopword removal
> on this field).
>
> However if I look inside that document with the TermVectorComponent:
>
> http://www.cathdb.info/solr/select/?q=id:3.40.50.720&tv=true&tv.all=true&tv.fl=keywords
>
> I see that "and" has a measly tf.idf of 7.46E-4. But there are other terms
> with *much* higher tf.idf scores, e.g.:
>
> <lst name="aquaspirillum">
>   <int name="tf">1</int>
>   <int name="df">10</int>
>   <double name="tf-idf">0.1</double>
> </lst>
>
> that *don't* appear in the MoreLikeThis list. (I tried adding
> &mlt.maxwl=999 to the end of the MLT query but it makes no difference.)
>
> What's going on? Surely something with tf.idf = 0.1 is a far better
> candidate for a MoreLikeThis query than something with tf.idf = 1.46E-4?
> Or does MoreLikeThis do some other heuristic magic to select good
> candidates, and sometimes get it wrong?
>
> BTW the keywords field is indexed, stored, multi-valued and term-vectored.
>
> Thanks,
>
> Andrew.
>
> --
> :: http://biotext.org.uk/ ::
>
>

--
View this message in context: http://old.nabble.com/Selection-of-terms-for-MoreLikeThis-tp26286005p26335061.html
Sent from the Solr - User mailing list archive at Nabble.com.