Sure, I will proceed tomorrow with the Jira and the simple patch + tests.

In the meantime let's try to collect some additional feedback.

Cheers

On 29 December 2015 at 12:43, Anshum Gupta <ans...@anshumgupta.net> wrote:

> Feel free to create a JIRA and put up a patch if you can.
>
> On Tue, Dec 29, 2015 at 4:26 PM, Alessandro Benedetti <abenede...@apache.org> wrote:
>
> > Hi guys,
> > While exploring how we build the More Like This query, I found a part
> > I am not convinced by.
> >
> > Let's see how we build the query:
> > org.apache.lucene.queries.mlt.MoreLikeThis#retrieveTerms(int)
> >
> > 1) we extract the terms from the interesting fields, adding them to a
> > map:
> >
> > Map<String, Int> termFreqMap = new HashMap<>();
> >
> > *(we lose the field -> term relation: we no longer know which field the
> > term came from!)*
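
To make step 1 concrete, here is a minimal sketch in plain Java (no Lucene dependency) of what a field-agnostic map does to the terms. The class name FlatTermMapDemo, the flatten helper and the sample terms are made up for illustration; the real code works on Lucene's own types.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; not part of Lucene.
final class FlatTermMapDemo {

  // Merges the terms of every field into one term -> frequency map.
  // The field name is only used to reach the terms and is then discarded.
  static Map<String, Integer> flatten(Map<String, List<String>> termsByField) {
    Map<String, Integer> termFreq = new HashMap<>();
    for (List<String> terms : termsByField.values()) {
      for (String term : terms) {
        termFreq.merge(term, 1, Integer::sum);
      }
    }
    return termFreq;
  }

  public static void main(String[] args) {
    // Illustrative input: "pool" occurs once in each field.
    Map<String, List<String>> doc = Map.of(
        "description", List.of("quiet", "pool"),
        "facilities", List.of("pool", "gym"));
    // "pool" ends up with frequency 2 and no record of its source fields.
    System.out.println(flatten(doc));
  }
}
```

After flattening, the two occurrences of "pool" are a single entry, so nothing downstream can tell the description occurrence from the facilities one.
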
> >
> > org.apache.lucene.queries.mlt.MoreLikeThis#createQueue
> >
> > 2) we build the queue that will contain the query terms; at this point we
> > connect these terms to a field again, but:
> >
> > ...
> >> // go through all the fields and find the largest document frequency
> >> String topField = fieldNames[0];
> >> int docFreq = 0;
> >> for (String fieldName : fieldNames) {
> >>   int freq = ir.docFreq(new Term(fieldName, word));
> >>   topField = (freq > docFreq) ? fieldName : topField;
> >>   docFreq = (freq > docFreq) ? freq : docFreq;
> >> }
> >> ...
> >
> >
> > We pick as topField the field in which the term has the highest
> > document frequency.
> > Then we build the termQuery:
> >
> > queue.add(new ScoreTerm(word, *topField*, score, idf, docFreq, tf));
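
The effect of that loop can be reproduced without an index. The sketch below is an assumption-laden stand-in: the nested docFreqs map replaces ir.docFreq(new Term(fieldName, word)), and the class name, field names and frequencies are invented for the example.

```java
import java.util.Map;

// Hypothetical demo class; docFreqs stands in for IndexReader.docFreq.
final class TopFieldDemo {

  // Returns the field in which `word` has the largest document frequency,
  // mirroring the quoted loop; the first field wins when nothing is larger.
  static String topField(String[] fieldNames, String word,
                         Map<String, Map<String, Integer>> docFreqs) {
    String topField = fieldNames[0];
    int docFreq = 0;
    for (String fieldName : fieldNames) {
      int freq = docFreqs.getOrDefault(fieldName, Map.of()).getOrDefault(word, 0);
      if (freq > docFreq) {
        topField = fieldName;
        docFreq = freq;
      }
    }
    return topField;
  }

  public static void main(String[] args) {
    // Invented document frequencies, for illustration only.
    Map<String, Map<String, Integer>> docFreqs = Map.of(
        "description", Map.of("pool", 120),
        "facilities", Map.of("pool", 40, "gym", 15));
    String[] fields = {"description", "facilities"};
    System.out.println(topField(fields, "pool", docFreqs));
    System.out.println(topField(fields, "gym", docFreqs));
  }
}
```

With these numbers "pool" is always assigned to description, even when it was extracted from the facilities field of the input document, because only the index-wide document frequency decides.
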
> >
> > This way we lose a lot of precision, and I am not sure why we do it.
> > I would prefer to keep the relation between terms and fields, which
> > could improve the quality of the MLT query considerably.
> > If I run MLT on 2 fields, *description* and *facilities* for example,
> > it is likely that I want documents with similar terms in the description
> > and similar terms in the facilities, without mixing the two and losing
> > the semantics of the terms.
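
One possible shape for keeping the relation, as a rough sketch only: keep one frequency map per field and emit query terms bound to the field they came from. Everything here (PerFieldTermsDemo, FieldTerm, perFieldFreqs, queryTerms) is a hypothetical name in plain Java; it is not the Lucene API and not necessarily what the patch will look like.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; names are illustrative, not the Lucene API.
final class PerFieldTermsDemo {

  // A query term that remembers the field it was extracted from.
  static final class FieldTerm {
    final String field;
    final String word;
    final int tf;

    FieldTerm(String field, String word, int tf) {
      this.field = field;
      this.word = word;
      this.tf = tf;
    }
  }

  // Builds field -> (term -> frequency), so the same word in two
  // different fields stays as two separate entries.
  static Map<String, Map<String, Integer>> perFieldFreqs(
      Map<String, List<String>> termsByField) {
    Map<String, Map<String, Integer>> result = new TreeMap<>();
    for (Map.Entry<String, List<String>> e : termsByField.entrySet()) {
      Map<String, Integer> freqs =
          result.computeIfAbsent(e.getKey(), k -> new TreeMap<>());
      for (String term : e.getValue()) {
        freqs.merge(term, 1, Integer::sum);
      }
    }
    return result;
  }

  // Emits one query term per (field, word) pair, bound to its source field.
  static List<FieldTerm> queryTerms(Map<String, Map<String, Integer>> perField) {
    List<FieldTerm> terms = new ArrayList<>();
    perField.forEach((field, freqs) ->
        freqs.forEach((word, tf) -> terms.add(new FieldTerm(field, word, tf))));
    return terms;
  }
}
```

With a document whose description contains "quiet pool" and whose facilities contain "pool gym", this yields four terms, and "pool" appears once for each field instead of being collapsed into one. Scoring per field (idf, docFreq) is left out of the sketch on purpose.
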
> >
> > Let me know your opinion,
> >
> > Cheers
> >
> >
> > --
> > --------------------------
> >
> > Benedetti Alessandro
> > Visiting card : http://about.me/alessandro_benedetti
> >
> > "Tyger, tyger burning bright
> > In the forests of the night,
> > What immortal hand or eye
> > Could frame thy fearful symmetry?"
> >
> > William Blake - Songs of Experience -1794 England
> >
>
>
>
> --
> Anshum Gupta
>



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti
