Hi,

If I run a MoreLikeThis query like the following:

http://www.cathdb.info/solr/mlt?q=id:3.40.50.720&rows=0&mlt.interestingTerms=list&mlt.match.include=false&mlt.fl=keywords&mlt.mintf=1&mlt.mindf=1

one of the interesting terms returned is "and" (I don't do any stopword
removal on this field).
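
For anyone wanting to reproduce this, here is a minimal Python sketch that
assembles the same MLT request from its parameters. The helper name
build_mlt_url is hypothetical (it is not part of Solr); the parameter names
and values are simply copied from the URL above.

```python
from urllib.parse import urlencode

# Handler URL taken from the query above.
MLT_HANDLER = "http://www.cathdb.info/solr/mlt"

def build_mlt_url(doc_id, field="keywords", extra=None):
    """Build a MoreLikeThis URL for one document (hypothetical helper)."""
    params = {
        "q": f"id:{doc_id}",
        "rows": "0",                       # only the term list is wanted
        "mlt.interestingTerms": "list",
        "mlt.match.include": "false",
        "mlt.fl": field,
        "mlt.mintf": "1",
        "mlt.mindf": "1",
    }
    if extra:
        params.update(extra)               # e.g. {"mlt.maxwl": "999"}
    # safe=":" keeps the colon in "id:..." unescaped, matching the URL above.
    return MLT_HANDLER + "?" + urlencode(params, safe=":")

url = build_mlt_url("3.40.50.720")
```

Passing extra={"mlt.maxwl": "999"} appends that parameter to the end of the
query string.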

However, if I look inside that document with the TermVectorComponent:

http://www.cathdb.info/solr/select/?q=id:3.40.50.720&tv=true&tv.all=true&tv.fl=keywords

I see that "and" has a measly tf.idf of 7.46E-4. But there are other terms
with *much* higher tf.idf scores, e.g.:

<lst name="aquaspirillum">
<int name="tf">1</int>
<int name="df">10</int>
<double name="tf-idf">0.1</double>
</lst>

that *don't* appear in the MoreLikeThis list. (I tried adding &mlt.maxwl=999
to the end of the MLT query, but it made no difference.)
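
To make the gap concrete, here is a small sketch of the arithmetic. It
assumes the TermVectorComponent's "tf-idf" value is simply tf divided by df,
which is what the numbers above suggest (1 / 10 = 0.1); the function name
tv_tf_idf is my own, and the score for "and" is the 7.46E-4 quoted earlier.

```python
def tv_tf_idf(tf, df):
    """tf-idf as it appears in the term vector output (assumed to be tf / df)."""
    return tf / df

# Values read off the term vector output for "aquaspirillum".
aquaspirillum = tv_tf_idf(tf=1, df=10)

# Score reported for "and" in the same document.
and_score = 7.46e-4

# How many times larger the "aquaspirillum" score is.
ratio = aquaspirillum / and_score
print(aquaspirillum, ratio)
```

On these figures "aquaspirillum" scores more than a hundred times higher than
"and", yet only "and" shows up in the interesting terms.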

What's going on? Surely something with tf.idf = 0.1 is a far better
candidate for a MoreLikeThis query than something with tf.idf = 7.46E-4? Or
does MoreLikeThis do some other heuristic magic to select good candidates,
and sometimes get it wrong?

BTW the keywords field is indexed, stored, multi-valued and term-vectored.

Thanks,

Andrew.

-- 
:: http://biotext.org.uk/ ::

