Hello list,

I have a question about Lucene's calculation of the tf*idf value. I first
noticed that Solr's tf does not match tf values calculated as described
elsewhere, such as
http://odin.himinbi.org/idf_to_item:item/comparing_tf%3Aidf_to_item%3Aitem_similarity.xhtml
or http://en.wikipedia.org/wiki/Tf%E2%80%93idf

The tf values returned by Solr are always integers and are not normalized
against the length of the document, even though the field in which the
term resides does not have omitNorms="true".

Consider the following documents where the field subject is of the
standard text_ws type:

<result name="response" numFound="6" start="0">
    <doc>
        <str name="subject">a b c</str>
    </doc>
    <doc>
        <str name="subject">d e f</str>
    </doc>
    <doc>
        <str name="subject">x y z</str>
    </doc>
    <doc>
        <str name="subject">a d x</str>
    </doc>
    <doc>
        <str name="subject">a e z</str>
    </doc>
    <doc>
        <str name="subject">c f z</str>
    </doc>
</result>

Now, Solr's TermVector results for the first document:

<lst name="doc-0">
    <str name="uniqueKey">0</str>
    <lst name="subject">
        <lst name="a">
            <int name="tf">1</int>
            <lst name="positions">
                <int name="position">0</int>
            </lst>
            <int name="df">3</int>
            <double name="tf-idf">0.3333333333333333</double>
        </lst>
        <lst name="b">
            <int name="tf">1</int>
            <lst name="positions">
                <int name="position">1</int>
            </lst>
            <int name="df">1</int>
            <double name="tf-idf">1.0</double>
        </lst>
        <lst name="c">
            <int name="tf">1</int>
            <lst name="positions">
                <int name="position">2</int>
            </lst>
            <int name="df">2</int>
            <double name="tf-idf">0.5</double>
        </lst>
    </lst>
</lst>


According to other algorithms, the tf for term c would be 1 / 3 =
0.33 instead of the 1 returned by Solr. Also, the tf*idf value I get is 0.5
for term c and 0.333 for term a. It looks like the reported tf*idf is simply
the term frequency divided by the document frequency (tf / df).
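That guess can be checked against the numbers in the response above. A minimal Python sketch, assuming nothing about Solr internals: the values are hand-copied from the TermVector output, and the names `reported` and `tf_over_df` are made up for this illustration.

```python
# Values hand-copied from the TermVector output for doc-0:
# term -> (tf, df, reported "tf-idf")
reported = {
    "a": (1, 3, 0.3333333333333333),
    "b": (1, 1, 1.0),
    "c": (1, 2, 0.5),
}

def tf_over_df(tf, df):
    """Hypothesis only: the reported value is tf divided by df."""
    return tf / df

# True for a term when the hypothesis reproduces the reported value.
matches = {
    term: abs(tf_over_df(tf, df) - value) < 1e-12
    for term, (tf, df, value) in reported.items()
}
print(matches)
```

If every entry comes out True, the reported figure is consistent with tf / df for this document, which says nothing yet about why it is computed that way.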

If I calculate tf*idf for term c in the first document according to
other algorithms, it would be:

tf = 1 / 3 = 0.333
idf = ln(6 / 2) = 1.0986
tf*idf = 0.333 * 1.0986 = 0.3658
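For comparison, the same textbook calculation can be run over all six documents. This is only a sketch of the length-normalized tf and ln(N / df) variant used in this post, not Lucene's scoring formula; plain whitespace splitting stands in for the text_ws field type, and the function names are my own.

```python
import math

# The six example documents from the response above.
docs = ["a b c", "d e f", "x y z", "a d x", "a e z", "c f z"]
# Assumption: whitespace splitting approximates the text_ws analysis.
tokenized = [d.split() for d in docs]

def tf(term, tokens):
    """Term count normalized by the number of tokens in the document."""
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    """Natural log of (number of documents / documents containing the term)."""
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df)

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

first = tokenized[0]
for term in first:
    print(term, round(tf_idf(term, first, tokenized), 4))
```

With unrounded intermediate values, term c comes out at roughly 0.366, so none of the three terms lands on the 0.333, 1.0 and 0.5 that Solr reports.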

Can someone either explain the difference demonstrated above or tell me
what I might be doing wrong?



Cheers,

-  
Markus Jelsma          Buyways B.V.            
Technisch Architect    Friesestraatweg 215c    
http://www.buyways.nl  9743 AD Groningen       


Alg. 050-853 6600      KvK  01074105
Tel. 050-853 6620      Fax. 050-3118124
Mob. 06-5025 8350      In: http://www.linkedin.com/in/markus17
