> > According to different algorithms, the tf for term c would be 3 / 1 =
> > 0.33 instead of 1 returned by Solr.
>
> I don't follow. The TF (term frequency) is the number of times the
> term c occurs in that particular document, i.e. 1 time.

I see that above, and below, I made some typos: I wrote 3 / 1 = 0.33
where I meant 1 / 3 = 0.33. Term c has a #occurrences of 1, which the
other algorithms normalize by dividing by the number of terms in the
document. So instead of tf = #occurrences (1), other algorithms use
tf = #occurrences / #terms (0.33).

> > Also, the tf*idf value i get is 0.5
> > for term c and i get 0.333 for term a. It looks like tf*idf is
> > quotient
> > of document frequency and term frequency.
>
> Yes, indeed. IDF == Inverse Document Frequency, in other words, 1/DF.

Indeed, but most algorithms I have seen on this topic calculate idf as
ln(#docs / df). This is also true for Lucene, as I read at
http://lucene.apache.org/java/2_9_0/api/core/org/apache/lucene/search/Similarity.html

idf(t) = 1 + log(numDocs / (df + 1))

> > If i calculate tf*idf, for term c in the first document, according to
> > other algorithms it would be:
> >
> > tf = 3 / 1 = 0.333
>
> 3/1 = 3, no? I don't see where in your docs above you could even get
> a 3 for the letter c.

Here's the other typo: I again wrote 3 / 1 = 0.33 where it should have
been 1 / 3 = 0.33, of course. The differences I see are:

tf (solr)      = #occurrences_of_term_T in document_D
tf (other)     = #occurrences_of_term_T in document_D / #terms_document_D
df (solr)      = #occurrences_of_term_T in all_documents
df (other)     = #occurrences_of_term_T in all_documents
idf (solr)     = 1 / df
idf (other)    = ln(#documents / df)
tf*idf (solr)  = tf / df
tf*idf (other) = tf * idf

> > idf = ln(6 / 3) = 1.0986
> > tf*idf = 0.333 * 1.0986 = 0.3658
>
> I think the formulas you are looking at are doing operations to
> normalize the values, whereas the Solr/Lucene stuff above is telling
> you their raw values. Note, Lucene/Solr does length normalization,
> etc. too, it just isn't encoded into the TF or DF. For more on
> Lucene's scoring, see http://lucene.apache.org/java/2_9_0/scoring.html

I see, but why not return the true values of Lucene?
I did not reconfigure Solr's schema to use another similarity
algorithm, and the Lucene Similarity docs linked above state that
DefaultSimilarity uses calculations similar to the ones I describe.

> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> using Solr/Lucene:
> http://www.lucidimagination.com/search
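
For reference, the calculations compared above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the six-document corpus is made up (it is not the set of documents from this thread) and was chosen so that term c has df = 2 and term a has df = 3, which reproduces the 0.5 and 0.333 values mentioned above. All function names are my own, and idf_lucene_default merely mirrors the formula quoted from the Similarity javadoc; it is not Lucene's code.

```python
import math

# Hypothetical corpus (an assumption, not the thread's documents):
# six documents, each given as a list of terms.
docs = [
    ["a", "b", "c"],
    ["a", "c"],
    ["a", "d"],
    ["b"],
    ["d"],
    ["e"],
]


def tf_raw(term, doc):
    """Raw term frequency: how often the term occurs in the document."""
    return doc.count(term)


def tf_normalized(term, doc):
    """Term frequency divided by the number of terms in the document."""
    return doc.count(term) / len(doc)


def doc_freq(term, corpus):
    """Document frequency: number of documents containing the term."""
    return sum(1 for doc in corpus if term in doc)


def solr_tf_over_df(term, doc, corpus):
    """The quotient reported as tf*idf in the output discussed above."""
    return tf_raw(term, doc) / doc_freq(term, corpus)


def tfidf_textbook(term, doc, corpus):
    """Normalized tf times ln(#docs / df), as in the 'other' algorithms."""
    idf = math.log(len(corpus) / doc_freq(term, corpus))
    return tf_normalized(term, doc) * idf


def idf_lucene_default(num_docs, df):
    """idf(t) = 1 + log(numDocs / (df + 1)), per the Similarity javadoc."""
    return 1.0 + math.log(num_docs / (df + 1))


first = docs[0]
print(solr_tf_over_df("c", first, docs))   # 1 / 2
print(solr_tf_over_df("a", first, docs))   # 1 / 3
print(tfidf_textbook("c", first, docs))    # (1 / 3) * ln(6 / 2)
print(idf_lucene_default(len(docs), doc_freq("c", docs)))  # 1 + ln(6 / 3)
```

With this corpus the first two values match the raw tf / df quotients, while the third is the length-normalized variant; the difference between them is the normalization, not the underlying counts.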