Re: Selection of terms for MoreLikeThis

Andrew Clegg Fri, 13 Nov 2009 06:52:30 -0800


Chantal Ackermann wrote:
> 
> your URL does not include the parameter mlt.boost. Setting that to 
> "true" made a noticeable difference for my queries.
>


Hmm, I'm really not sure if this is doing the right thing either. When I add
it I get:

 <float name="keywords:dehydrogenase">1.0</float>
 <float name="keywords:reductase">0.60737264</float>
 <float name="keywords:metabolism">0.27599618</float>
 <float name="keywords:activity">0.2476748</float>
 <float name="keywords:process">0.24487767</float>
 <float name="keywords:alcohol">0.23969446</float>
 <float name="keywords:and">0.1990452</float>
 <float name="keywords:malate">0.18447271</float>
 <float name="keywords:biosynthesis">0.13297324</float>
 <float name="keywords:biosynthetic">0.1233415</float>
 <float name="keywords:degradation">0.11993817</float>
 <float name="keywords:precursor">0.11789705</float>
 <float name="keywords:metabolic">0.117194556</float>
 <float name="keywords:protein">0.11164951</float>
 <float name="keywords:synthase">0.10744005</float>
 <float name="keywords:acid">0.09943076</float>
 <float name="keywords:enzyme">0.097062066</float>
 <float name="keywords:succinyl-coa">0.09287166</float>
 <float name="keywords:putative">0.0877542</float>
 <float name="keywords:(nadp+)">0.0864609</float>
 <float name="keywords:4,6-dehydratase">0.08362857</float>
 <float name="keywords:fatty">0.07988805</float>
 <float name="keywords:chloroplast">0.079598725</float>
 <float name="keywords:lactobacillus">0.07747293</float>
 <float name="keywords:glyoxylate">0.075560644</float>

"and" scores far more highly than much more discriminative words like
"chloroplast" and "glyoxylate", both of which have *much* higher tf.idf
scores than "and" according to the TermVectorComponent:

<lst name="chloroplast">
<int name="tf">8</int>
<int name="df">1887</int>
<double name="tf-idf">0.0042395336512983575</double>
</lst>

<lst name="glyoxylate">
<int name="tf">7</int>
<int name="df">1111</int>
<double name="tf-idf">0.0063006300630063005</double>
</lst>

<lst name="and">
<int name="tf">45</int>
<int name="df">60316</int>
<double name="tf-idf">7.460706943431262E-4</double>
</lst>

In fact, close to an order of magnitude higher (roughly 6 and 8 times
respectively).
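
As a quick sanity check on the figures above, here is a minimal Python sketch.
It assumes nothing beyond the numbers quoted in this message; the helper name
tf_over_df is made up for illustration. The "tf-idf" values reported by the
TermVectorComponent here appear to be plain tf divided by df, and the sketch
checks that and computes how far each term sits above "and":

```python
# (tf, df, reported tf-idf) copied from the TermVectorComponent output above.
terms = {
    "chloroplast": (8, 1887, 0.0042395336512983575),
    "glyoxylate": (7, 1111, 0.0063006300630063005),
    "and": (45, 60316, 7.460706943431262E-4),
}


def tf_over_df(tf, df):
    """Hypothetical helper: the ratio that seems to match the reported value."""
    return tf / df


# Compare the reported value against tf/df for every term.
matches = {
    name: abs(tf_over_df(tf, df) - reported) < 1e-12
    for name, (tf, df, reported) in terms.items()
}

# How many times higher each term's reported value is than that of "and".
and_value = terms["and"][2]
ratio_to_and = {
    name: reported / and_value for name, (_, _, reported) in terms.items()
}

for name in terms:
    print(name, matches[name], round(ratio_to_and[name], 2))
```

If the check holds, the value is a linear tf/df ratio rather than a
log-damped idf, which matters when comparing it with whatever MLT computes.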


Chantal Ackermann wrote:
> 
> If not, there is also the parameter
>   mlt.minwl
> "minimum word length below which words will be ignored."
> 
> All your other terms seem longer than 3, so it would help in this case? 
> But seems a bit like work around.
> 
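
For reference, a request combining the two suggested parameters could be
assembled roughly as below. This is only a sketch: the host, the /mlt handler
path and the document id are hypothetical placeholders (the original URL is
not shown in this message); the field name "keywords" comes from the output
above, and the mlt.* names are the standard MoreLikeThis parameters.

```python
from urllib.parse import urlencode

# ASSUMPTION: host and handler path are placeholders, not taken from the thread.
BASE_URL = "http://localhost:8983/solr/mlt"

params = {
    "q": "id:EXAMPLE_DOC",             # hypothetical source document
    "mlt.fl": "keywords",              # field used for term selection
    "mlt.boost": "true",               # boost query terms by their score
    "mlt.minwl": 4,                    # ignore words shorter than 4 characters
    "mlt.interestingTerms": "details", # return the selected terms and boosts
}

url = BASE_URL + "?" + urlencode(params)
print(url)
```

With mlt.minwl=4, a three-letter term such as "and" would be skipped, but
longer common terms would still get through.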

Yeah, I could do that, or add a stopword list to that field. But there are
some other common terms in the list like "protein" or "enzyme" that are long
and not really stopwords, but have a similarly low tf.idf to "and":

<lst name="protein">
<int name="tf">43</int>
<int name="df">189541</int>
<double name="tf-idf">2.2686384476181933E-4</double>
</lst>

<lst name="enzyme">
<int name="tf">15</int>
<int name="df">16712</int>
<double name="tf-idf">8.975586404978459E-4</double>
</lst>

Plus, of course, I'm curious to know exactly how MLT is identifying those
terms as important, and if it's a bug or my fault...
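
One possible explanation, offered as a sketch rather than a confirmed answer:
as far as I can tell, Lucene's MoreLikeThis class scores each candidate term
as raw tf multiplied by the Similarity's idf, which in the default
implementation is ln(numDocs / (df + 1)) + 1, and with boosting enabled each
term's boost is its score divided by the best score. The Python below tests
that idea against the numbers in this message. NUM_DOCS is an assumption (the
index size is not stated anywhere here; the value was picked because it makes
the ratios line up), and the function names are made up for illustration.

```python
import math

# ASSUMPTION: index size is not given in the post; this value is a guess.
NUM_DOCS = 355_000


def classic_idf(df, num_docs):
    """Default-Similarity-style idf: ln(numDocs / (df + 1)) + 1."""
    return math.log(num_docs / (df + 1)) + 1.0


def mlt_score(tf, df, num_docs=NUM_DOCS):
    """Raw term frequency times log-damped idf."""
    return tf * classic_idf(df, num_docs)


# (tf, df, boost reported with mlt.boost=true), all copied from this message.
terms = {
    "and": (45, 60316, 0.1990452),
    "protein": (43, 189541, 0.11164951),
    "enzyme": (15, 16712, 0.097062066),
    "chloroplast": (8, 1887, 0.079598725),
    "glyoxylate": (7, 1111, 0.075560644),
}

# Compare predicted and observed values relative to "and", so the unknown
# best score used for normalisation cancels out.
ref_tf, ref_df, ref_boost = terms["and"]
ref_score = mlt_score(ref_tf, ref_df)

comparison = {}
for name, (tf, df, boost) in terms.items():
    predicted = mlt_score(tf, df) / ref_score
    observed = boost / ref_boost
    comparison[name] = (predicted, observed)
    print(f"{name:12s} predicted={predicted:.4f} observed={observed:.4f}")
```

If the predicted and observed columns agree, the ranking would be explained
by the formula itself: tf enters linearly while idf is only logarithmic, so a
term occurring 45 times can outrank much rarer terms that occur 7 or 8 times.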

Thanks for your help though! Do any of the Solr devs have an idea of the
mechanism at work here?

Andrew.

-- 
View this message in context: 
http://old.nabble.com/Selection-of-terms-for-MoreLikeThis-tp26286005p26337677.html
Sent from the Solr - User mailing list archive at Nabble.com.
