Hello list,
I have a question about Lucene's calculation of the tf*idf value. I first noticed that Solr's tf does not match tf values calculated elsewhere, such as in
http://odin.himinbi.org/idf_to_item:item/comparing_tf%3Aidf_to_item%3Aitem_similarity.xhtml
or
http://en.wikipedia.org/wiki/Tf%E2%80%93idf

The tf values returned by Solr are always integers and are not normalized against the length of the document, even though the field in which the term resides does not have omitNorms="true".

Consider the following documents, where the field subject is of the standard text_ws type:

<result name="response" numFound="6" start="0">
  <doc>
    <str name="subject">a b c</str>
  </doc>
  <doc>
    <str name="subject">d e f</str>
  </doc>
  <doc>
    <str name="subject">x y z</str>
  </doc>
  <doc>
    <str name="subject">a d x</str>
  </doc>
  <doc>
    <str name="subject">a e z</str>
  </doc>
  <doc>
    <str name="subject">c f z</str>
  </doc>
</result>

Now, Solr's TermVector results for the first document:

<lst name="doc-0">
  <str name="uniqueKey">0</str>
  <lst name="subject">
    <lst name="a">
      <int name="tf">1</int>
      <lst name="positions">
        <int name="position">0</int>
      </lst>
      <int name="df">3</int>
      <double name="tf-idf">0.3333333333333333</double>
    </lst>
    <lst name="b">
      <int name="tf">1</int>
      <lst name="positions">
        <int name="position">1</int>
      </lst>
      <int name="df">1</int>
      <double name="tf-idf">1.0</double>
    </lst>
    <lst name="c">
      <int name="tf">1</int>
      <lst name="positions">
        <int name="position">2</int>
      </lst>
      <int name="df">2</int>
      <double name="tf-idf">0.5</double>
    </lst>
  </lst>
</lst>

According to other algorithms, the tf for term c would be 1 / 3 = 0.33 instead of the 1 returned by Solr. Also, the tf*idf value I get is 0.5 for term c and 0.333 for term a. It looks like the reported tf*idf is simply the term frequency divided by the document frequency (tf / df).
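For comparison, here is a small Python sketch of the two calculations over the six documents above. It is not Solr's or Lucene's code: the function names are made up for illustration, the tf / df formula is only my assumption based on the reported numbers, and the "textbook" variant (tf divided by document length, idf = ln(N / df)) is just one common definition.

```python
import math

# The six example documents (field "subject", whitespace-tokenized).
docs = ["a b c", "d e f", "x y z", "a d x", "a e z", "c f z"]
tokenized = [d.split() for d in docs]

def df(term):
    """Number of documents that contain the term."""
    return sum(1 for tokens in tokenized if term in tokens)

def raw_tf(term, tokens):
    """Raw count of the term in one document (an integer, like Solr's tf)."""
    return tokens.count(term)

def observed_ratio(term, tokens):
    """tf / df: an assumption about what the reported "tf-idf" value is."""
    return raw_tf(term, tokens) / df(term)

def textbook_tf_idf(term, tokens):
    """Length-normalized tf times ln(N / df), one common textbook variant."""
    tf = raw_tf(term, tokens) / len(tokens)
    idf = math.log(len(tokenized) / df(term))
    return tf * idf

# Compare both values for every term of the first document.
first = tokenized[0]
for term in first:
    print(term, df(term), observed_ratio(term, first),
          round(textbook_tf_idf(term, first), 4))
```

For the first document, tf / df comes out as 1/3 for a, 1.0 for b and 0.5 for c, which matches the tf-idf values in the TermVector output, while the textbook variant gives roughly 0.366 for term c.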
If I calculate tf*idf for term c in the first document according to the other algorithms, it would be:

  tf     = 1 / 3 = 0.333
  idf    = ln(6 / 2) = 1.0986
  tf*idf = 0.333 * 1.0986 = 0.3658

Can someone explain the difference demonstrated, or tell me what I am possibly doing wrong?

Cheers,

-
Markus Jelsma            Buyways B.V.
Technical Architect      Friesestraatweg 215c
http://www.buyways.nl    9743 AD Groningen

Alg. 050-853 6600        KvK 01074105
Tel. 050-853 6620        Fax. 050-3118124
Mob. 06-5025 8350        In: http://www.linkedin.com/in/markus17