RE: tf and very short text fields

Markus Jelsma Fri, 04 Apr 2014 04:39:07 -0700

Hi - In this case Walter, iirc, was looking for two things: no normalization 
and flat TF (1f for tf(float freq) > 0). We know that k1 controls TF 
saturation, but in BM25Similarity you can see that k1 is multiplied by the 
encoded norm value, taking b into account as well. So setting k1 to zero 
effectively disables length normalization and results in flat or binary TF.


Here's example output for k1 = 0 and k1 = 0.2. Norms are enabled on the field, 
and the term occurs three times in the field:

        28.203003 = score(doc=0,freq=1.5 = phraseFreq=1.5), product of:
          6.4 = boost
          4.406719 = idf(docFreq=1, docCount=122)
          1.0 = tfNorm, computed from:
            1.5 = phraseFreq=1.5
            0.0 = parameter k1
            0.75 = parameter b
            8.721312 = avgFieldLength
            16.0 = fieldLength




        27.813797 = score(doc=0,freq=1.5 = phraseFreq=1.5), product of:
          6.4 = boost
          4.406719 = idf(docFreq=1, docCount=122)
          0.98619986 = tfNorm, computed from:
            1.5 = phraseFreq=1.5
            0.2 = parameter k1
            0.75 = parameter b
            8.721312 = avgFieldLength
            16.0 = fieldLength
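
As a rough cross-check, the tfNorm values in both explain outputs can be 
reproduced with the BM25 term-frequency formula as I understand it from 
BM25Similarity in Lucene 4.x. This is only a sketch, not the library's code: 
the function name bm25_tf_norm is made up for illustration, and it takes the 
fieldLength printed by explain at face value instead of modelling the lossy 
norm encoding.

```python
def bm25_tf_norm(freq, k1, b, avg_field_length, field_length):
    """Hypothetical helper: BM25 tfNorm as freq*(k1+1) / (freq + k1*lengthFactor).

    Assumption: this mirrors the formula used by Lucene 4.x BM25Similarity;
    parameter names follow the labels in the explain output.
    """
    # Length normalization factor; b only matters when k1 > 0.
    length_factor = 1.0 - b + b * field_length / avg_field_length
    return freq * (k1 + 1.0) / (freq + k1 * length_factor)


# Values copied from the two explain outputs.
flat = bm25_tf_norm(1.5, 0.0, 0.75, 8.721312, 16.0)
damped = bm25_tf_norm(1.5, 0.2, 0.75, 8.721312, 16.0)

print(flat)    # k1 = 0: the length factor is multiplied away and freq cancels
print(damped)  # k1 = 0.2: slightly below 1 because the field is longer than average
```

With k1 = 0 the expression reduces to freq / freq, so neither the frequency 
nor the field length can influence the result.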


With k1 = 0 you can clearly see the final tfNorm is 1, regardless of term 
frequency and field length. Please correct my wrongs :)
Markus

 
 
-----Original message-----
> From:Tom Burton-West <tburt...@umich.edu>
> Sent: Thursday 3rd April 2014 20:18
> To: solr-user@lucene.apache.org
> Subject: Re: tf and very short text fields
> 
> Hi Markus and Wunder,
> 
> I'm missing the original context, but I don't think BM25 will solve this
> particular problem.
> 
> The k1 parameter sets how quickly the contribution of tf to the score falls
> off with increasing tf. It would be helpful for making sure really long
> documents don't get too high a score, but I don't think it would help for
> very short documents without messing up its original design purpose.
> 
> For BM25, if you want to turn off length normalization, you set "b" to 0.
> However, I don't think that will do what you want, since turning off
> normalization will mean that the score for "new york, new york" will be
> twice that of the score for "new york" since without normalization the tf
> in "new york new york" is twice that of "new york".
> 
> I think the earlier suggestion to "override tfidfsimilarity and emit 1f in
> tf()" is probably the best way to eliminate the use of tf counts,
> assuming that is really what you want.
> 
> Tom
> 
> 
> 
> 
> 
> 
> 
> 
> On Tue, Apr 1, 2014 at 4:17 PM, Walter Underwood <wun...@wunderwood.org> wrote:
> 
> > Thanks! We'll try that out and report back. I keep forgetting that I want
> > to try BM25, so this is a good excuse.
> >
> > wunder
> >
> > On Apr 1, 2014, at 12:30 PM, Markus Jelsma <markus.jel...@openindex.io>
> > wrote:
> >
> > > Also, if I remember correctly, k1 set to zero for bm25 automatically
> > omits norms in the calculation. So that's easy to play with without
> > reindexing.
> > >
> > >
> > > Markus Jelsma <markus.jel...@openindex.io> wrote: Yes, override
> > tfidfsimilarity and emit 1f in tf(). You can also use bm25 with k1 set to
> > zero in your schema.
> > >
> > >
> > > Walter Underwood <wun...@wunderwood.org> wrote: And here is another
> > peculiarity of short text fields.
> > >
> > > The movie "New York, New York" should not be twice as relevant for the
> > query "new york". Is there a way to use a binary term frequency rather than
> > a count?
> > >
> > > wunder
> > > --
> > > Walter Underwood
> > > wun...@wunderwood.org
> > >
> > >
> > >
> >
> > --
> > Walter Underwood
> > wun...@wunderwood.org
> >
> >
> >
> >
>
