Hi guys, the patch seems fine to me. I didn't spend much more time on the code, but I checked the tests and the pre-commit checks. Let me know what you think,
Cheers

On 31 December 2015 at 18:40, Alessandro Benedetti <abenede...@apache.org> wrote:

> https://issues.apache.org/jira/browse/LUCENE-6954
>
> First draft patch available; I will check the tests more thoroughly in the new year!
>
> On 29 December 2015 at 13:43, Alessandro Benedetti <abenede...@apache.org>
> wrote:
>
>> Sure, I will proceed tomorrow with the Jira and the simple patch + tests.
>>
>> In the meantime let's try to collect some additional feedback.
>>
>> Cheers
>>
>> On 29 December 2015 at 12:43, Anshum Gupta <ans...@anshumgupta.net>
>> wrote:
>>
>>> Feel free to create a JIRA and put up a patch if you can.
>>>
>>> On Tue, Dec 29, 2015 at 4:26 PM, Alessandro Benedetti <
>>> abenede...@apache.org
>>> > wrote:
>>>
>>> > Hi guys,
>>> > While I was exploring the way we build the More Like This query, I
>>> > discovered a part I am not convinced of:
>>> >
>>> > Let's see how we build the query:
>>> > org.apache.lucene.queries.mlt.MoreLikeThis#retrieveTerms(int)
>>> >
>>> > 1) We extract the terms from the interesting fields, adding them to a
>>> > map:
>>> >
>>> > Map<String, Int> termFreqMap = new HashMap<>();
>>> >
>>> > *(we lose the relation field -> term; we no longer know which field the
>>> > term came from!)*
>>> >
>>> > org.apache.lucene.queries.mlt.MoreLikeThis#createQueue
>>> >
>>> > 2) We build the queue that will contain the query terms. At this point
>>> > we connect these terms to a field again, but:
>>> >
>>> > ...
>>> >> // go through all the fields and find the largest document frequency
>>> >> String topField = fieldNames[0];
>>> >> int docFreq = 0;
>>> >> for (String fieldName : fieldNames) {
>>> >>   int freq = ir.docFreq(new Term(fieldName, word));
>>> >>   topField = (freq > docFreq) ? fieldName : topField;
>>> >>   docFreq = (freq > docFreq) ? freq : docFreq;
>>> >> }
>>> >> ...
>>> >
>>> > We identify the topField as the field with the highest document
>>> > frequency for the term t.
>>> > Then we build the termQuery:
>>> >
>>> > queue.add(new ScoreTerm(word, *topField*, score, idf, docFreq, tf));
>>> >
>>> > In this way we lose a lot of precision.
>>> > Not sure why we do that.
>>> > I would prefer to keep the relation between terms and fields.
>>> > That could improve the quality of the MLT query a lot.
>>> > Say I run the MLT on 2 fields, *description* and *facilities* for
>>> > example.
>>> > It is likely I want to find documents with similar terms in the
>>> > description and similar terms in the facilities, without mixing things
>>> > up and losing the semantics of the terms.
>>> >
>>> > Let me know your opinion,
>>> >
>>> > Cheers
>>> >
>>> > --
>>> > --------------------------
>>> >
>>> > Benedetti Alessandro
>>> > Visiting card : http://about.me/alessandro_benedetti
>>> >
>>> > "Tyger, tyger burning bright
>>> > In the forests of the night,
>>> > What immortal hand or eye
>>> > Could frame thy fearful symmetry?"
>>> >
>>> > William Blake - Songs of Experience -1794 England
>>> >
>>>
>>> --
>>> Anshum Gupta
>>>
>>
>

--
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
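
For anyone following the thread, here is a rough, self-contained sketch of the difference discussed above. It is not the Lucene code and not the LUCENE-6954 patch: the class `PerFieldTermsSketch`, its two methods, the `"field:term"` string output and the document frequencies are all made up for illustration, and the index lookup (`ir.docFreq` in the quoted snippet) is replaced by a plain function so the example runs without Lucene.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntBiFunction;

// Hypothetical sketch: not the Lucene MoreLikeThis code and not the actual patch.
final class PerFieldTermsSketch {

    // Behaviour quoted in the thread: terms from all fields are merged into one
    // map, then each word is assigned to the single field with the largest
    // document frequency.
    static List<String> topFieldOnly(Map<String, Integer> termFreqMap,
                                     String[] fieldNames,
                                     ToIntBiFunction<String, String> docFreqOf) {
        List<String> queryTerms = new ArrayList<>();
        for (String word : termFreqMap.keySet()) {
            String topField = fieldNames[0];
            int docFreq = 0;
            for (String fieldName : fieldNames) {
                int freq = docFreqOf.applyAsInt(fieldName, word);
                if (freq > docFreq) {
                    topField = fieldName;
                    docFreq = freq;
                }
            }
            queryTerms.add(topField + ":" + word);
        }
        return queryTerms;
    }

    // Idea proposed in the thread: keep one term map per field, so every term
    // stays tied to the field it was extracted from.
    static List<String> perField(Map<String, Map<String, Integer>> perFieldTermFreq) {
        List<String> queryTerms = new ArrayList<>();
        for (Map.Entry<String, Map<String, Integer>> entry : perFieldTermFreq.entrySet()) {
            for (String word : entry.getValue().keySet()) {
                queryTerms.add(entry.getKey() + ":" + word);
            }
        }
        return queryTerms;
    }

    public static void main(String[] args) {
        // Made-up document frequencies, keyed as "field:term".
        Map<String, Integer> df = new HashMap<>();
        df.put("description:pool", 10);
        df.put("facilities:pool", 3);
        df.put("facilities:wifi", 5);
        ToIntBiFunction<String, String> docFreqOf =
                (field, term) -> df.getOrDefault(field + ":" + term, 0);

        // Merged map: the field of origin is already lost here.
        Map<String, Integer> merged = new LinkedHashMap<>();
        merged.put("pool", 2);
        merged.put("wifi", 1);
        // prints [description:pool, facilities:wifi]
        System.out.println(topFieldOnly(merged, new String[] {"description", "facilities"}, docFreqOf));

        // Per-field maps: "pool" is kept for both fields.
        Map<String, Map<String, Integer>> byField = new LinkedHashMap<>();
        Map<String, Integer> description = new LinkedHashMap<>();
        description.put("pool", 1);
        Map<String, Integer> facilities = new LinkedHashMap<>();
        facilities.put("pool", 1);
        facilities.put("wifi", 1);
        byField.put("description", description);
        byField.put("facilities", facilities);
        // prints [description:pool, facilities:pool, facilities:wifi]
        System.out.println(perField(byField));
    }
}
```

With the merged map, `pool` ends up only in *description* because its document frequency is higher there, so the *facilities* occurrence is dropped from the query; with per-field maps both occurrences survive. Scoring, frequency thresholds and the priority queue are deliberately left out of this sketch.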