I think there is a problem with the weightTfIdf function in R's tm package.

The manual says that the idf is calculated as

idf(term) = log(|D| / number of documents that contain the term)

When a dictionary is passed in the control list, as below:

dtm = DocumentTermMatrix(myCorpus,
        control = list(dictionary = myDict,
                       weighting = function(x) weightTfIdf(x, normalize = FALSE)))

it is possible that no document contains a given dictionary term. In that
case the denominator in the idf is 0, so the idf is log(|D|/0) = Inf, and
because the term's frequency is also 0, the weighted value 0 * Inf is NaN.
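
To show the effect without depending on tm, here is a minimal base-R sketch
of the same arithmetic. The matrix `counts` and its term names are made up
for illustration, and the natural log is an assumption taken from the formula
as written above; the base does not change the outcome.

```r
# Toy counts: rows are documents, columns are dictionary terms.
# "zebra" is in the dictionary but occurs in no document.
counts <- matrix(c(2, 0, 0,
                   1, 3, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"),
                                 c("apple", "pear", "zebra")))

n_docs   <- nrow(counts)
doc_freq <- colSums(counts > 0)   # number of documents containing each term
idf      <- log(n_docs / doc_freq)  # for zebra: log(2 / 0) = Inf

# Multiply each column by its idf; for zebra this is 0 * Inf = NaN.
tfidf <- sweep(counts, 2, idf, `*`)
print(tfidf)
```

The zebra column is NaN in every row, while the other columns stay finite.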

How can I raise this issue and actually fix the code so that we use

idf(term) = log(|D| / (number of documents that contain the term + 1))
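
As a rough sketch of what I mean, the smoothed formula could look like this in
base R. `smoothed_tfidf` is a hypothetical helper of my own, not a tm
function, and it works on a plain count matrix rather than a
DocumentTermMatrix; the toy data is made up.

```r
# Hypothetical helper (not part of tm): tf-idf with +1 in the idf denominator.
smoothed_tfidf <- function(counts) {
  n_docs   <- nrow(counts)
  doc_freq <- colSums(counts > 0)          # documents containing each term
  idf      <- log(n_docs / (doc_freq + 1)) # denominator >= 1, so idf is finite
  sweep(counts, 2, idf, `*`)               # scale each term column by its idf
}

# Toy counts: "zebra" is a dictionary term that appears in no document.
counts <- matrix(c(2, 0, 0,
                   1, 3, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"),
                                 c("apple", "pear", "zebra")))

weights <- smoothed_tfidf(counts)
print(weights)
```

One side effect of the +1: a term that occurs in every document gets
idf = log(|D| / (|D| + 1)), which is negative, so its weights come out
slightly below zero instead of exactly zero.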

Any help on this would be appreciated!

Shivani


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
