Chantal Ackermann wrote:
>
> your URL does not include the parameter mlt.boost. Setting that to
> "true" made a noticeable difference for my queries.
>

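For anyone following along, a request with mlt.boost switched on could be assembled roughly as sketched below. The host, the /solr/mlt handler path, and the document query are placeholders (assumptions, not the actual setup in this thread); the parameter names mlt.fl, mlt.boost, mlt.minwl and mlt.interestingTerms are the standard MoreLikeThis ones, and "keywords" is the field from the output in this thread.

```python
from urllib.parse import urlencode


def build_mlt_url(base_url, doc_query, field, boost=True, min_word_len=None):
    """Build a MoreLikeThis request URL.

    base_url and doc_query are hypothetical placeholders; adjust them to
    wherever a MoreLikeThisHandler is actually registered.
    """
    params = {
        "q": doc_query,                     # selects the source document
        "mlt.fl": field,                    # field to mine for terms
        "mlt.boost": "true" if boost else "false",
        "mlt.interestingTerms": "details",  # ask for the term/score list
    }
    if min_word_len is not None:
        # Optional: ignore words shorter than this length.
        params["mlt.minwl"] = str(min_word_len)
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Assumed endpoint and document id, for illustration only.
    print(build_mlt_url("http://localhost:8983/solr/mlt", "id:12345", "keywords"))
```

Passing min_word_len=4 would add mlt.minwl=4 to the same request.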
Hmm, I'm really not sure if this is doing the right thing either. When I add it I get:

<float name="keywords:dehydrogenase">1.0</float>
<float name="keywords:reductase">0.60737264</float>
<float name="keywords:metabolism">0.27599618</float>
<float name="keywords:activity">0.2476748</float>
<float name="keywords:process">0.24487767</float>
<float name="keywords:alcohol">0.23969446</float>
<float name="keywords:and">0.1990452</float>
<float name="keywords:malate">0.18447271</float>
<float name="keywords:biosynthesis">0.13297324</float>
<float name="keywords:biosynthetic">0.1233415</float>
<float name="keywords:degradation">0.11993817</float>
<float name="keywords:precursor">0.11789705</float>
<float name="keywords:metabolic">0.117194556</float>
<float name="keywords:protein">0.11164951</float>
<float name="keywords:synthase">0.10744005</float>
<float name="keywords:acid">0.09943076</float>
<float name="keywords:enzyme">0.097062066</float>
<float name="keywords:succinyl-coa">0.09287166</float>
<float name="keywords:putative">0.0877542</float>
<float name="keywords:(nadp+)">0.0864609</float>
<float name="keywords:4,6-dehydratase">0.08362857</float>
<float name="keywords:fatty">0.07988805</float>
<float name="keywords:chloroplast">0.079598725</float>
<float name="keywords:lactobacillus">0.07747293</float>
<float name="keywords:glyoxylate">0.075560644</float>

"and" scores far more highly than much more discriminative words like "chloroplast" and "glyoxylate", both of which have *much* higher tf.idf scores than "and" according to the TermVectorComponent:

<lst name="chloroplast">
  <int name="tf">8</int>
  <int name="df">1887</int>
  <double name="tf-idf">0.0042395336512983575</double>
</lst>
<lst name="glyoxylate">
  <int name="tf">7</int>
  <int name="df">1111</int>
  <double name="tf-idf">0.0063006300630063005</double>
</lst>
<lst name="and">
  <int name="tf">45</int>
  <int name="df">60316</int>
  <double name="tf-idf">7.460706943431262E-4</double>
</lst>

In fact an order of magnitude
higher.

Chantal Ackermann wrote:
>
> If not, there is also the parameter
> mlt.minwl
> "minimum word length below which words will be ignored."
>
> All your other terms seem longer than 3, so it would help in this case?
> But seems a bit like work around.
>

Yeah, I could do that, or add a stopword list to that field. But there are some other common terms in the list like "protein" or "enzyme" that are long and not really stopwords, but have a similarly low tf.idf to "and":

<lst name="protein">
  <int name="tf">43</int>
  <int name="df">189541</int>
  <double name="tf-idf">2.2686384476181933E-4</double>
</lst>
<lst name="enzyme">
  <int name="tf">15</int>
  <int name="df">16712</int>
  <double name="tf-idf">8.975586404978459E-4</double>
</lst>

Plus, of course, I'm curious to know exactly how MLT is identifying those terms as important, and if it's a bug or my fault...

Thanks for your help though! Do any of the Solr devs have an idea of the mechanism at work here?

Andrew.

-- 
View this message in context: http://old.nabble.com/Selection-of-terms-for-MoreLikeThis-tp26286005p26337677.html
Sent from the Solr - User mailing list archive at Nabble.com.
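
As a sanity check on the figures quoted in this message, here is a small sketch that recomputes the TermVectorComponent values. It assumes the reported "tf-idf" is simply tf divided by df, which is an inference from the numbers in this thread (they match), not something confirmed from the Solr source. The helper names are made up for illustration.

```python
# (tf, df, reported tf-idf) copied from the TermVectorComponent output
# quoted in this thread.
REPORTED = {
    "chloroplast": (8, 1887, 0.0042395336512983575),
    "glyoxylate": (7, 1111, 0.0063006300630063005),
    "and": (45, 60316, 7.460706943431262e-4),
    "protein": (43, 189541, 2.2686384476181933e-4),
    "enzyme": (15, 16712, 8.975586404978459e-4),
}


def tf_over_df(tf, df):
    """Assumed formula: term frequency divided by document frequency."""
    return tf / df


def ratios_to(baseline, table=REPORTED):
    """Return each term's tf/df value relative to the baseline term's."""
    base_tf, base_df, _ = table[baseline]
    base = tf_over_df(base_tf, base_df)
    return {
        term: tf_over_df(tf, df) / base
        for term, (tf, df, _) in table.items()
    }


if __name__ == "__main__":
    for term, (tf, df, reported) in REPORTED.items():
        print(f"{term:12s} computed={tf_over_df(tf, df):.6e} reported={reported:.6e}")
    for term, ratio in sorted(ratios_to("and").items(), key=lambda kv: -kv[1]):
        print(f"{term:12s} {ratio:.2f}x 'and'")
```

On these numbers "chloroplast" and "glyoxylate" come out at roughly 5.7 and 8.4 times the value for "and", while "protein" is lower than "and" and "enzyme" only slightly higher, so the ordering in the MLT interesting-terms list does not follow this measure.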