[R] Retaining the original document id in #topicmodels in R

张伦 Tue, 24 Jun 2014 06:54:38 -0700

Hi all,
I am using the "topicmodels" package to find the topics in a collection of
texts.
The dataset contains 8523 documents. I would like to see which documents
belong to which topic.


Here is my code:
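######################## get the DocumentTermMatrix #########
library("tm")           # provides DocumentTermMatrix() and nDocs()
library("slam")         # provides row_sums() and col_sums()
library("topicmodels")  # provides LDA()

# 'corpus' and 'control' are defined elsewhere (not shown here)
tdm <- DocumentTermMatrix(corpus, control)
length(tdm$dimnames$Terms)
dim(tdm)   # the dimension of tdm is "[1]  8513 21135"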

# mean tf-idf of each term over the documents that contain it
term_tfidf <- tapply(tdm$v / row_sums(tdm)[tdm$i], tdm$j, mean) *
  log2(nDocs(tdm) / col_sums(tdm > 0))
summary(term_tfidf)
summary(col_sums(tdm))

# keep terms with mean tf-idf >= 0.15, then drop documents left with no terms
tdm <- tdm[, term_tfidf >= 0.15]
tdm2 <- tdm[row_sums(tdm) > 0, ]
dim(tdm2)  # now the dim of tdm2 is 8513 10091

###################topic modeling analysis######################

k <- 30
lda <- LDA(tdm2, k = k, control = list(alpha = 0.1))

###### cell values as posterior topic distribution for each document #####
gammaDF <- as.data.frame(lda@gamma)
names(gammaDF) <- 1:k
# inspect...
gammaDF
# most probable topic per document; which.max() returns one index even on ties
toptopics <- data.frame(document = row.names(gammaDF),
  topic = apply(gammaDF, 1, which.max))
sapply(toptopics, class)
write.csv(toptopics, "topicdistribution.csv")


Some of the documents (in this case, 10) were excluded because their rows
in the matrix contain only zero entries. As a result, the row numbers in
the output no longer match the original document IDs.

My question is: how can I keep the original document IDs and match them
with the topics?
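
To make the goal concrete, here is a minimal base-R sketch of the matching
being asked for. It is only an illustration under stated assumptions: the
names doc_ids, counts, kept_ids and gamma are hypothetical toy stand-ins, not
objects from the code above. With the real objects, rownames(tdm) and
rownames(tdm2) would play the roles of doc_ids and kept_ids, and lda@gamma
the role of gamma, assuming the corpus documents carry IDs that
DocumentTermMatrix() stored as the row names.

```r
# Toy stand-ins (hypothetical data, not from the post above):
# 'doc_ids' plays the role of rownames(tdm), 'counts' of row_sums(tdm).
doc_ids <- c("doc1", "doc2", "doc3", "doc4", "doc5")
counts  <- c(12, 0, 7, 3, 0)   # doc2 and doc5 are empty after term filtering

# Same filter as tdm[row_sums(tdm) > 0, ]: record which IDs survive.
kept_ids <- doc_ids[counts > 0]

# Stand-in for lda@gamma: one row per kept document, one column per topic.
k <- 3
gamma <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.1, 0.8,
                  0.3, 0.6, 0.1),
                nrow = length(kept_ids), ncol = k, byrow = TRUE)

# Rows of gamma follow the order of the rows given to the model,
# so the surviving IDs can be attached directly.
toptopics <- data.frame(document = kept_ids,
                        topic    = max.col(gamma, ties.method = "first"),
                        stringsAsFactors = FALSE)

# Optionally merge back onto the full ID list; dropped documents get NA.
all_docs <- merge(data.frame(document = doc_ids, stringsAsFactors = FALSE),
                  toptopics, by = "document", all.x = TRUE)
print(all_docs)
```

The idea is to carry the IDs through the row filter instead of relying on
row positions, so a document that was dropped shows up with a missing topic
rather than shifting every later row.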

ZHANG Lun


______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
