Inline below
On Nov 3, 2009, at 2:30 AM, Markus Jelsma - Buyways B.V. wrote:
Hello list,
I have a question about Lucene's calculation of the tf*idf value. I first
noticed that Solr's tf does not match tf values calculated elsewhere,
such as
http://odin.himinbi.org/idf_to_item:item/comparing_tf%3Aidf_to_item%3Aitem_similarity.xhtml
or http://en.wikipedia.org/wiki/Tf%E2%80%93idf
The tf values returned by Solr are always integers and are not normalized
against the length of the document, even though the field in which they
reside does not have omitNorms="true".
Consider the following documents where the field subject is of the
standard text_ws type:
<result name="response" numFound="6" start="0">
<doc>
<str name="subject">a b c</str>
</doc>
<doc>
<str name="subject">d e f</str>
</doc>
<doc>
<str name="subject">x y z</str>
</doc>
<doc>
<str name="subject">a d x</str>
</doc>
<doc>
<str name="subject">a e z</str>
</doc>
<doc>
<str name="subject">c f z</str>
</doc>
</result>
Now, Solr's TermVector results for the first document:
<lst name="doc-0">
<str name="uniqueKey">0</str>
<lst name="subject">
<lst name="a">
<int name="tf">1</int>
<lst name="positions">
<int name="position">0</int>
</lst>
<int name="df">3</int>
<double name="tf-idf">0.3333333333333333</double>
</lst>
<lst name="b">
<int name="tf">1</int>
<lst name="positions">
<int name="position">1</int>
</lst>
<int name="df">1</int>
<double name="tf-idf">1.0</double>
</lst>
<lst name="c">
<int name="tf">1</int>
<lst name="positions">
<int name="position">2</int>
</lst>
<int name="df">2</int>
<double name="tf-idf">0.5</double>
</lst>
</lst>
</lst>
According to different algorithms, the tf for term c would be 3 / 1 =
0.33 instead of 1 returned by Solr.
I don't follow. The TF (term frequency) is the number of times the
term c occurs in that particular document, i.e. 1 time.
Also, the tf*idf value I get is 0.5 for term c, and I get 0.333 for
term a. It looks like tf*idf is the quotient of term frequency and
document frequency.
Yes, indeed. IDF == Inverse Document Frequency, in other words, 1/DF.
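To make that concrete, here is a minimal Python sketch (not Solr code) that recomputes the raw numbers in the TermVector output from the six example documents. The names `docs`, `tokenized` and `raw_stats` are made up for illustration, and plain whitespace splitting is assumed to stand in for the text_ws analyzer.

```python
from collections import Counter

# The six "subject" values from the example; splitting on whitespace is an
# assumption standing in for the text_ws field type.
docs = ["a b c", "d e f", "x y z", "a d x", "a e z", "c f z"]
tokenized = [d.split() for d in docs]

# df: number of documents that contain each term at least once.
df = Counter(term for tokens in tokenized for term in set(tokens))

def raw_stats(doc_index):
    """Return {term: (tf, df, tf/df)} for one document (hypothetical helper)."""
    tf = Counter(tokenized[doc_index])  # raw count of each term in this doc
    return {term: (n, df[term], n / df[term]) for term, n in tf.items()}

stats = raw_stats(0)
for term, (tf, d, ratio) in sorted(stats.items()):
    print(term, "tf =", tf, "df =", d, "tf/df =", ratio)
```

For the first document this gives tf = 1 for each of a, b and c, with df values of 3, 1 and 2, so tf/df lines up with the 0.333, 1.0 and 0.5 reported above.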
If I calculate tf*idf for term c in the first document, according to
other algorithms it would be:
tf = 3 / 1 = 0.333
3/1 = 3, no? I don't see where in your docs above you could even get
a 3 for the letter c.
idf = ln(6 / 2) = 1.0986
tf*idf = 0.333 * 1.0986 = 0.3658
I think the formulas you are looking at normalize the values, whereas
the Solr/Lucene output above shows the raw values. Note that
Lucene/Solr does length normalization, etc. too; it just isn't encoded
into the TF or DF. For more on Lucene's scoring, see
http://lucene.apache.org/java/2_9_0/scoring.html
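As a rough side-by-side for term c in the first document, the sketch below computes the textbook (length-normalized tf, log idf) value, the raw tf/df value from the TermVector output, and the separate factors Lucene applies at scoring time. The Lucene-style formulas are my simplified reading of DefaultSimilarity (sqrt tf, 1 + ln(numDocs / (docFreq + 1)) idf, 1/sqrt(numTerms) length norm), so treat them as an assumption and check the scoring page for the details. All variable names are illustrative.

```python
import math

num_docs = 6      # documents in the example index
doc_len = 3       # terms in the first document ("a b c")
freq, df = 1, 2   # term c: occurs once in doc 0, appears in 2 documents

# Textbook variant: tf normalized by document length, log-scaled idf.
tf_norm = freq / doc_len
idf_log = math.log(num_docs / df)
textbook = tf_norm * idf_log

# Raw values as reported in the TermVector output: plain tf divided by df.
raw = freq / df

# DefaultSimilarity-style factors (assumed, simplified; the real length norm
# is also stored with reduced precision in the index).
lucene_tf = math.sqrt(freq)
lucene_idf = 1.0 + math.log(num_docs / (df + 1))
length_norm = 1.0 / math.sqrt(doc_len)

print("textbook tf*idf:", round(textbook, 4))
print("raw tf/df:", raw)
print("lucene tf, idf, lengthNorm:", lucene_tf, lucene_idf, length_norm)
```

The point is only that the normalization lives in separate scoring factors, not in the tf and df numbers themselves.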
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/