Hi - In this case Walter, iirc, was looking for two things: no length normalization and a flat TF (tf(float freq) returning 1f for any freq > 0). We know that k1 controls TF saturation, but in BM25Similarity you can see that k1 is multiplied by the encoded norm value, which also takes b into account. So setting k1 to zero effectively disables length normalization and results in a flat, or binary, TF.
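To make the arithmetic concrete, here is a minimal sketch of the tfNorm term, assuming the textbook BM25 form freq * (k1 + 1) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)). It is a plain re-implementation for illustration, not Lucene's actual code; the function name bm25_tf_norm is made up, and the input values are copied from the explain output in this message.

```python
def bm25_tf_norm(freq, k1, b, field_length, avg_field_length):
    """Textbook BM25 term-frequency component (illustrative, not Lucene's code)."""
    if freq <= 0:
        # No match, no contribution; also avoids 0/0 when k1 == 0.
        return 0.0
    # Length normalization factor; b = 0 would make this exactly 1.
    length_norm = 1.0 - b + b * field_length / avg_field_length
    # k1 multiplies the length factor, so k1 = 0 removes it entirely
    # and the expression collapses to freq / freq = 1.
    return freq * (k1 + 1.0) / (freq + k1 * length_norm)


# Values taken from the explain output in this message.
params = dict(freq=1.5, b=0.75, field_length=16.0, avg_field_length=8.721312)
boost, idf = 6.4, 4.406719

flat = bm25_tf_norm(k1=0.0, **params)        # exactly 1.0
saturating = bm25_tf_norm(k1=0.2, **params)  # roughly 0.9862

print(flat, boost * idf * flat)
print(saturating, boost * idf * saturating)
```

With k1 = 0 both b and the field length drop out of the formula, which is why the TF becomes binary for any matching document.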
Here's an example output of k1 = 0 and k1 = 0.2. Norms are enabled on the field, term occurs three times in the field:

28.203003 = score(doc=0,freq=1.5 = phraseFreq=1.5 ), product of:
  6.4 = boost
  4.406719 = idf(docFreq=1, docCount=122)
  1.0 = tfNorm, computed from:
    1.5 = phraseFreq=1.5
    0.0 = parameter k1
    0.75 = parameter b
    8.721312 = avgFieldLength
    16.0 = fieldLength

27.813797 = score(doc=0,freq=1.5 = phraseFreq=1.5 ), product of:
  6.4 = boost
  4.406719 = idf(docFreq=1, docCount=122)
  0.98619986 = tfNorm, computed from:
    1.5 = phraseFreq=1.5
    0.2 = parameter k1
    0.75 = parameter b
    8.721312 = avgFieldLength
    16.0 = fieldLength

You can clearly see that with k1 = 0 the final TF norm is 1, regardless of the term frequency and field length.

Please correct my wrongs :)
Markus

-----Original message-----
> From: Tom Burton-West <tburt...@umich.edu>
> Sent: Thursday 3rd April 2014 20:18
> To: solr-user@lucene.apache.org
> Subject: Re: tf and very short text fields
>
> Hi Markus and Wunder,
>
> I'm missing the original context, but I don't think BM25 will solve this
> particular problem.
>
> The k1 parameter sets how quickly the contribution of tf to the score falls
> off with increasing tf. It would be helpful for making sure really long
> documents don't get too high a score, but I don't think it would help for
> very short documents without messing up its original design purpose.
>
> For BM25, if you want to turn off length normalization, you set "b" to 0.
> However, I don't think that will do what you want, since turning off
> normalization will mean that the score for "new york, new york" will be
> twice that of the score for "new york" since without normalization the tf
> in "new york new york" is twice that of "new york".
>
> I think the earlier suggestion to "override tfidfsimilarity and emit 1f in
> tf()" is probably the best way to eliminate using tf counts,
> assuming that is really what you want.
>
> Tom
>
> On Tue, Apr 1, 2014 at 4:17 PM, Walter Underwood <wun...@wunderwood.org> wrote:
>
> > Thanks! We'll try that out and report back. I keep forgetting that I want
> > to try BM25, so this is a good excuse.
> >
> > wunder
> >
> > On Apr 1, 2014, at 12:30 PM, Markus Jelsma <markus.jel...@openindex.io>
> > wrote:
> >
> > > Also, if I remember correctly, k1 set to zero for bm25 automatically
> > > omits norms in the calculation. So that's easy to play with without
> > > reindexing.
> > >
> > > Markus Jelsma <markus.jel...@openindex.io> wrote:
> > > Yes, override tfidfsimilarity and emit 1f in tf(). You can also use
> > > bm25 with k1 set to zero in your schema.
> > >
> > > Walter Underwood <wun...@wunderwood.org> wrote:
> > > And here is another peculiarity of short text fields.
> > >
> > > The movie "New York, New York" should not be twice as relevant for the
> > > query "new york". Is there a way to use a binary term frequency rather
> > > than a count?
> > >
> > > wunder
> > > --
> > > Walter Underwood
> > > wun...@wunderwood.org
> >
> > --
> > Walter Underwood
> > wun...@wunderwood.org