Hi guys,
While exploring how we build the More Like This query, I
found a part I am not convinced by:



Let's see how we build the query:
org.apache.lucene.queries.mlt.MoreLikeThis#retrieveTerms(int)

1) We extract the terms from the interesting fields and add them to a map:

Map<String, Int> termFreqMap = new HashMap<>();

*(We lose the field -> term relation here: we no longer know which field
the term came from!)*
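
To make the loss concrete, here is a rough standalone sketch (plain Java, no
Lucene). The class name FlatTermFreq and the sample fields are made up for
illustration, and I use Integer where MoreLikeThis uses its own Int counter,
so this is only an approximation of what retrieveTerms does:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FlatTermFreq {
    // Hypothetical stand-in for the term extraction step: every field's
    // terms are counted into one shared map keyed by the term text only.
    static Map<String, Integer> collect(Map<String, List<String>> termsByField) {
        Map<String, Integer> termFreqMap = new HashMap<>();
        for (List<String> terms : termsByField.values()) {
            for (String term : terms) {
                termFreqMap.merge(term, 1, Integer::sum);
            }
        }
        return termFreqMap;
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new HashMap<>();
        doc.put("description", List.of("pool", "view"));
        doc.put("facilities", List.of("pool", "parking"));
        // "pool" ends up with a single count of 2; the field it came from is gone.
        System.out.println(collect(doc));
    }
}
```

Once the counts are merged like this, nothing downstream can tell a
description term from a facilities term.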

org.apache.lucene.queries.mlt.MoreLikeThis#createQueue

2) We build the queue that will contain the query terms. At this point we
associate the terms with a field again, but:

> ...
> // go through all the fields and find the largest document frequency
> String topField = fieldNames[0];
> int docFreq = 0;
> for (String fieldName : fieldNames) {
>   int freq = ir.docFreq(new Term(fieldName, word));
>   topField = (freq > docFreq) ? fieldName : topField;
>   docFreq = (freq > docFreq) ? freq : docFreq;
> }
> ...
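
To show the effect of that loop in isolation, here is a small sketch under
some assumptions: the index lookup ir.docFreq(new Term(fieldName, word)) is
replaced by a plain nested map (field -> term -> document frequency), and the
class name TopFieldDemo and the numbers are invented for the example:

```java
import java.util.Map;

public class TopFieldDemo {
    // Hypothetical replacement for the index lookup: docFreqs maps
    // field -> term -> number of documents containing that term.
    static String topField(String word, String[] fieldNames,
                           Map<String, Map<String, Integer>> docFreqs) {
        String topField = fieldNames[0];
        int docFreq = 0;
        for (String fieldName : fieldNames) {
            int freq = docFreqs.getOrDefault(fieldName, Map.of()).getOrDefault(word, 0);
            if (freq > docFreq) {
                topField = fieldName;
                docFreq = freq;
            }
        }
        return topField;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> docFreqs = Map.of(
            "description", Map.of("pool", 900),
            "facilities", Map.of("pool", 40));
        // Prints "description", even if the input document only had "pool" in facilities.
        System.out.println(topField("pool", new String[] {"description", "facilities"}, docFreqs));
    }
}
```

The chosen field depends only on index statistics, not on where the term
appeared in the document we started from.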


We identify topField as the field with the highest document frequency
for the term.
Then we build the term query:

queue.add(new ScoreTerm(word, *topField*, score, idf, docFreq, tf));

This way we lose a lot of precision, because each term is queried only
against the field where it is most frequent in the index, not the field it
actually came from.
I am not sure why we do that.
I would prefer to keep the relation between terms and fields; I think it
could improve the quality of the MLT query a lot.
For example, if I run MLT on 2 fields, *description* and *facilities*,
I most likely want to find documents with similar terms in the description
and similar terms in the facilities, without mixing things up and
losing the semantics of the terms.
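
As a rough idea of what I mean (plain Java, not a patch), the term
frequencies could be kept per field. The class name PerFieldTermFreq and the
nested map layout are only my assumption of one possible shape, and Integer
again stands in for the MLT Int counter:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PerFieldTermFreq {
    // Hypothetical alternative: field -> (term -> frequency), so each term
    // stays attached to the field it was extracted from.
    static Map<String, Map<String, Integer>> collect(Map<String, List<String>> termsByField) {
        Map<String, Map<String, Integer>> perField = new HashMap<>();
        for (Map.Entry<String, List<String>> e : termsByField.entrySet()) {
            Map<String, Integer> freqs =
                perField.computeIfAbsent(e.getKey(), k -> new HashMap<>());
            for (String term : e.getValue()) {
                freqs.merge(term, 1, Integer::sum);
            }
        }
        return perField;
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new HashMap<>();
        doc.put("description", List.of("pool", "view"));
        doc.put("facilities", List.of("pool", "parking"));
        // "pool" is counted once under description and once under facilities.
        System.out.println(collect(doc));
    }
}
```

With this shape the queue could hold one entry per (field, term) pair, and
the document frequency would be looked up in that term's own field.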

Let me know your opinion,

Cheers


-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
